Filtering pandas dataframe by day





















I have a pandas DataFrame holding one year of minute-by-minute forex data (371635 rows):



                           O        H        L        C
0
2017-01-02 02:00:00  1.05155  1.05197  1.05155  1.05190
2017-01-02 02:01:00  1.05209  1.05209  1.05177  1.05179
2017-01-02 02:02:00  1.05177  1.05198  1.05177  1.05178
2017-01-02 02:03:00  1.05188  1.05200  1.05188  1.05200
2017-01-02 02:04:00  1.05196  1.05204  1.05196  1.05203


I want to filter daily data to get an hour range:



from datetime import datetime

dt = datetime(2017, 1, 1)
df_day = df[df.index.date == dt.date()]
df_day_t = df_day.between_time('08:30', '09:30')


If I do a for loop with 200 days, it takes minutes. I suspect that at every step this line



df_day = df[df.index.date == dt.date()]


is testing every row in the data set for equality (even though the data set is ordered).

Is there any way to speed up the filtering, or should I just write an old-fashioned imperative for loop from January to December?










      python pandas performance datetime algorithmic-trading






      edited Nov 10 at 0:09 by jpp

      asked Nov 9 at 23:57 by Stefano Piovesan

          1 Answer
          accepted










          Avoid Python datetime



          First, avoid combining Python datetime objects with Pandas operations. Pandas and NumPy provide their own ways to create datetime values for comparison, e.g. pd.Timestamp and pd.to_datetime. Your performance issues here are partly due to this behaviour, described in the docs:




          pd.Series.dt.date returns an array of python datetime.date objects




          Using object dtype in this way removes vectorisation benefits, as operations then require Python-level loops.
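
To make the dtype point concrete, here is a minimal sketch. The frame is synthetic (the index range and the single made-up column `C` are assumptions, not the asker's data), and `DatetimeIndex.normalize()` is used as one possible way to keep the comparison in pandas datetime dtype:

```python
import numpy as np
import pandas as pd

# Synthetic minute-level frame standing in for the forex data (values are made up).
idx = pd.date_range("2017-01-02 02:00", periods=3 * 24 * 60, freq="min")
df = pd.DataFrame({"C": np.linspace(1.05, 1.06, len(idx))}, index=idx)

# DatetimeIndex.date builds one Python datetime.date object per row,
# so the result is an object-dtype NumPy array.
dates_as_objects = df.index.date
print(dates_as_objects.dtype)

# normalize() sets every timestamp to midnight and stays in datetime dtype,
# so comparing against a pd.Timestamp is a vectorised operation.
day = pd.Timestamp("2017-01-03")
mask = df.index.normalize() == day
df_day = df[mask]
print(len(df_day))
```

Both approaches select the same rows; the difference is only in how the comparison is carried out.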



          Use groupby operations for aggregating by date



          Pandas already has functionality to group by date via normalizing time:



          for day, df_day in df.groupby(df.index.floor('d')):
              df_day_t = df_day.between_time('08:30', '09:30')
              # do something with df_day_t
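
As a possible variation on the loop above (a sketch, not part of the original answer), the time window can be applied once to the whole frame before grouping, so that each group already contains only the 08:30 to 09:30 rows. The synthetic data and the `rows_per_day` dictionary are illustrative assumptions; `normalize()` is used here in place of `floor('d')` and yields the same midnight keys:

```python
import numpy as np
import pandas as pd

# Synthetic three-day, minute-level frame (values are made up).
idx = pd.date_range("2017-01-02 00:00", periods=3 * 24 * 60, freq="min")
df = pd.DataFrame({"C": np.arange(len(idx), dtype=float)}, index=idx)

# Filter the intraday window once over the whole frame...
window = df.between_time("08:30", "09:30")

# ...then split the remaining rows into one sub-frame per calendar day.
rows_per_day = {}
for day, df_day_t in window.groupby(window.index.normalize()):
    # "do something" placeholder: here we only count the rows per day.
    rows_per_day[day] = len(df_day_t)

print(rows_per_day)
```

Whether this ordering is faster than filtering inside the loop depends on the data, so it is worth timing both on the real frame.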


          As another example, you can access a slice for a particular day in this way:



          g = df.groupby(df.index.floor('d'))
          my_day = pd.Timestamp('2017-01-01')
          df_slice = g.get_group(my_day)
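
Two caveats may be worth sketching here, on synthetic data with assumed values. First, `get_group` raises a `KeyError` when the requested day has no rows (for example a weekend), so checking `g.groups` beforehand can help. Second, as an alternative that is not part of the answer above, a `DatetimeIndex` also supports selecting one day by passing a date string to `.loc`:

```python
import numpy as np
import pandas as pd

# Synthetic three-day, minute-level frame starting on 2017-01-02 (values are made up).
idx = pd.date_range("2017-01-02 00:00", periods=3 * 24 * 60, freq="min")
df = pd.DataFrame({"C": np.arange(len(idx), dtype=float)}, index=idx)

# Guard against days that are absent from the data before calling get_group.
g = df.groupby(df.index.normalize())
my_day = pd.Timestamp("2017-01-01")
has_day = my_day in g.groups
df_slice = g.get_group(my_day) if has_day else df.iloc[0:0]

# Alternative: a date string passed to .loc selects every row of that day.
df_one_day = df.loc["2017-01-03"]
df_one_day_t = df_one_day.between_time("08:30", "09:30")

print(has_day, len(df_slice), len(df_one_day), len(df_one_day_t))
```

The empty-frame fallback `df.iloc[0:0]` is just one convention for a missing day; skipping the day or raising an error may suit other workflows better.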





            I didn't know about groupby in pandas... It takes milliseconds now!
            – Stefano Piovesan
            Nov 10 at 8:05











          answered Nov 10 at 0:05 by jpp
