pandas - vectorized formula computation with nans










I have a DataFrame (called signal) that is a simple time series with 5 columns. This is what its .describe() output looks like:



                ES           NK           NQ           YM
count  5294.000000  6673.000000  4798.000000  3415.000000
mean     -0.000340     0.000074    -0.000075    -0.000420
std       0.016726     0.018401     0.023868     0.015399
min      -0.118724    -0.156342    -0.144667    -0.103101
25%      -0.008862    -0.010297    -0.011481    -0.008162
50%      -0.001422    -0.000590    -0.001747    -0.001324
75%       0.007069     0.009163     0.009841     0.006304
max       0.156365     0.192686     0.181245     0.132630


I want to apply a simple function on every single row, and receive back a matrix with the same dimensions:



weights = -2*signal.subtract( signal.mean(axis=1), axis=0).divide( signal.sub( signal.mean(axis=1), axis=0).abs().sum(axis=1), axis=0 )


However, when I run this line, the program gets stuck. I believe the issue comes from the columns having different lengths, i.e. the presence of NaNs. Dropping or filling the NaNs is not an option: for any row that contains a NaN, I want that NaN to simply be excluded from the computation. A temporary workaround would be to iterate with .iterrows(), but that is not efficient.



Are there any smart solutions to this problem?











  • What are you trying to do?
    – coldspeed
    Nov 9 at 22:12










  • For every row, I want to apply a formula on that row that gives me a new value for each column. I want to expand that operation downwards without looping over every row.
    – Évariste Galois
    Nov 9 at 22:13










  • My question is: what is that operation? Perhaps a scikit-learn package will do it better.
    – coldspeed
    Nov 9 at 22:14










  • The formula for 1 row is: w = (-1/N) * (r_i - r_mean), where N is the number of non-nan values in the row, r_i is each column value, and r_mean is the average across the columns.
    – Évariste Galois
    Nov 9 at 22:24










  • When you say "the program gets stuck", do you mean that it raises an error? If so, could you please include the text of the error message in your question?
    – tel
    Nov 10 at 2:06














python pandas numpy






asked Nov 9 at 22:09









Évariste Galois

1 Answer

accepted










The thing is, the pandas mean and sum methods already exclude NaN values by default (see the description of the skipna keyword in the linked docs). Additionally, subtract and divide allow for the use of a fill_value keyword arg:




fill_value : None or float value, default None



Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing




So you may be able to get what you want by setting fill_value=0 in the calls to subtract, and fill_value=1 in the calls to divide.
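
As a rough illustration of what fill_value does, here is a minimal sketch on made-up data. The frames a and b and their values are hypothetical, not taken from the question. The sketch uses two DataFrames of the same shape; whether fill_value is accepted when the other operand is a Series (as in the question's expression) may depend on the pandas version, so treat that part as something to check rather than a given.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; the column names are borrowed from the question.
a = pd.DataFrame({"ES": [1.0, np.nan], "NK": [np.nan, np.nan]})
b = pd.DataFrame({"ES": [0.5, 0.5], "NK": [0.5, np.nan]})

# A NaN on one side is replaced by fill_value before the operation;
# a location that is NaN on both sides stays NaN.
diff = a.sub(b, fill_value=0)
ratio = a.div(b, fill_value=1)

print(diff)
print(ratio)
```

Note that with fill_value=0 a missing entry in a is treated as 0 before subtracting, so it ends up with a non-NaN result. That may or may not be what "excluded from the computation" is meant to be.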



However, I suspect that the default behavior (NaN is ignored in mean and sum, NaN - anything = NaN, NaN / anything = NaN) is what you actually want. In that case, your problem isn't directly related to NaNs, and you're going to have to clarify your statement "when I run this line, the program gets stuck" in order to get a useful answer.
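
To make the default behavior concrete, here is a small sketch that runs the question's expression on made-up data, followed by the (-1/N) formula from the comments. The toy values and the intermediate names row_mean, demeaned, scale and n are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with NaNs in different places per row.
signal = pd.DataFrame(
    {
        "ES": [0.01, np.nan, 0.03],
        "NK": [-0.02, 0.02, np.nan],
        "NQ": [0.04, -0.01, np.nan],
        "YM": [np.nan, 0.05, np.nan],
    }
)

# mean() and sum() skip NaN by default (skipna=True).
row_mean = signal.mean(axis=1)
# NaN - anything = NaN, so missing cells stay missing.
demeaned = signal.sub(row_mean, axis=0)
scale = demeaned.abs().sum(axis=1)

# Same expression as in the question, split into steps.
weights = -2 * demeaned.div(scale, axis=0)

# Variant from the comments: w = (-1/N) * (r_i - r_mean),
# with N the number of non-NaN values in each row.
n = signal.count(axis=1)
weights_n = demeaned.mul(-1.0 / n, axis=0)

print(weights)
print(weights_n)
```

The result has the same shape as signal, and cells that were NaN in the input remain NaN. One edge case to be aware of: a row with a single non-NaN value (the last row here) has a zero denominator in the first variant, so its weight comes out as NaN as well.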






        edited Nov 10 at 2:18

























        answered Nov 10 at 2:09









        tel
