Filtering pandas dataframe by day

up vote
2
down vote

favorite

I have a pandas data frame with forex data by minutes, one year long (371635 rows):

 O H L C
0 
2017-01-02 02:00:00 1.05155 1.05197 1.05155 1.05190
2017-01-02 02:01:00 1.05209 1.05209 1.05177 1.05179
2017-01-02 02:02:00 1.05177 1.05198 1.05177 1.05178
2017-01-02 02:03:00 1.05188 1.05200 1.05188 1.05200
2017-01-02 02:04:00 1.05196 1.05204 1.05196 1.05203

I want to filter daily data to get an hour range:

dt = datetime(2017,1,1)
df_day = df1[df.index.date == dt.date()]
df_day_t = df_day.between_time('08:30', '09:30')

If I do a for loop with 200 days, it takes minutes. I suspect that at every step this line

df_day = df1[df.index.date == dt.date()]

is looking for the equality with every row in the data set (even if it is an ordered data set).

Is there any way I could speed up the filtering or I should just do some old imperative for loop from January to December...?

edited Nov 10 at 0:09

jpp

84.5k194897

asked Nov 9 at 23:57

Stefano Piovesan

4252824

add a comment |

up vote
2
down vote

favorite

I have a pandas data frame with forex data by minutes, one year long (371635 rows):

 O H L C
0 
2017-01-02 02:00:00 1.05155 1.05197 1.05155 1.05190
2017-01-02 02:01:00 1.05209 1.05209 1.05177 1.05179
2017-01-02 02:02:00 1.05177 1.05198 1.05177 1.05178
2017-01-02 02:03:00 1.05188 1.05200 1.05188 1.05200
2017-01-02 02:04:00 1.05196 1.05204 1.05196 1.05203

I want to filter daily data to get an hour range:

dt = datetime(2017,1,1)
df_day = df1[df.index.date == dt.date()]
df_day_t = df_day.between_time('08:30', '09:30')

If I do a for loop with 200 days, it takes minutes. I suspect that at every step this line

df_day = df1[df.index.date == dt.date()]

edited Nov 10 at 0:09

jpp

84.5k194897

asked Nov 9 at 23:57

Stefano Piovesan

4252824

add a comment |

up vote
2
down vote

favorite

I have a pandas data frame with forex data by minutes, one year long (371635 rows):

 O H L C
0 
2017-01-02 02:00:00 1.05155 1.05197 1.05155 1.05190
2017-01-02 02:01:00 1.05209 1.05209 1.05177 1.05179
2017-01-02 02:02:00 1.05177 1.05198 1.05177 1.05178
2017-01-02 02:03:00 1.05188 1.05200 1.05188 1.05200
2017-01-02 02:04:00 1.05196 1.05204 1.05196 1.05203

I want to filter daily data to get an hour range:

dt = datetime(2017,1,1)
df_day = df1[df.index.date == dt.date()]
df_day_t = df_day.between_time('08:30', '09:30')

If I do a for loop with 200 days, it takes minutes. I suspect that at every step this line

df_day = df1[df.index.date == dt.date()]

edited Nov 10 at 0:09

jpp

84.5k194897

asked Nov 9 at 23:57

Stefano Piovesan

4252824

I have a pandas data frame with forex data by minutes, one year long (371635 rows):

 O H L C
0 
2017-01-02 02:00:00 1.05155 1.05197 1.05155 1.05190
2017-01-02 02:01:00 1.05209 1.05209 1.05177 1.05179
2017-01-02 02:02:00 1.05177 1.05198 1.05177 1.05178
2017-01-02 02:03:00 1.05188 1.05200 1.05188 1.05200
2017-01-02 02:04:00 1.05196 1.05204 1.05196 1.05203

I want to filter daily data to get an hour range:

dt = datetime(2017,1,1)
df_day = df1[df.index.date == dt.date()]
df_day_t = df_day.between_time('08:30', '09:30')

If I do a for loop with 200 days, it takes minutes. I suspect that at every step this line

df_day = df1[df.index.date == dt.date()]

python pandas performance datetime algorithmic-trading

edited Nov 10 at 0:09

jpp

84.5k194897

asked Nov 9 at 23:57

Stefano Piovesan

4252824

edited Nov 10 at 0:09

jpp

84.5k194897

asked Nov 9 at 23:57

Stefano Piovesan

4252824

edited Nov 10 at 0:09

jpp

84.5k194897

edited Nov 10 at 0:09

jpp

84.5k194897

edited Nov 10 at 0:09

jpp

84.5k194897

asked Nov 9 at 23:57

Stefano Piovesan

4252824

asked Nov 9 at 23:57

Stefano Piovesan

4252824

asked Nov 9 at 23:57

Stefano Piovesan

4252824

add a comment |

1 Answer
1

active

oldest

votes

up vote
2
down vote

accepted

Avoid Python `datetime`

First you should avoid combining Python datetime with Pandas operations. There are many Pandas / NumPy friendly methods to create datetime objects for comparison, e.g. pd.Timestamp and pd.to_datetime. Your performance issues here are partly due to this behaviour described in the docs:

pd.Series.dt.date returns an array of python datetime.date objects

Using object dtype in this way removes vectorisation benefits, as operations then require Python-level loops.

Use `groupby` operations for aggregating by date

Pandas already has functionality to group by date via normalizing time:

for day, df_day in df.groupby(df.index.floor('d')):
 df_day_t = df_day.between_time('08:30', '09:30')
 # do something

As another example, you can access a slice for a particular day in this way:

g = df.groupby(df.index.floor('d'))
my_day = pd.Timestamp('2017-01-01')
df_slice = g.get_group(my_day)

answered Nov 10 at 0:05

jpp

84.5k194897

1

I didn't know about groupby in pandas... It takes milliseconds now!
– Stefano Piovesan
Nov 10 at 8:05

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53234770%2ffiltering-pandas-dataframe-by-day%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
2
down vote

accepted

Avoid Python `datetime`

pd.Series.dt.date returns an array of python datetime.date objects

Using object dtype in this way removes vectorisation benefits, as operations then require Python-level loops.

Use `groupby` operations for aggregating by date

Pandas already has functionality to group by date via normalizing time:

for day, df_day in df.groupby(df.index.floor('d')):
 df_day_t = df_day.between_time('08:30', '09:30')
 # do something

As another example, you can access a slice for a particular day in this way:

g = df.groupby(df.index.floor('d'))
my_day = pd.Timestamp('2017-01-01')
df_slice = g.get_group(my_day)

answered Nov 10 at 0:05

jpp

84.5k194897

1

I didn't know about groupby in pandas... It takes milliseconds now!
– Stefano Piovesan
Nov 10 at 8:05

add a comment |

up vote
2
down vote

accepted

Avoid Python `datetime`

pd.Series.dt.date returns an array of python datetime.date objects

Using object dtype in this way removes vectorisation benefits, as operations then require Python-level loops.

Use `groupby` operations for aggregating by date

Pandas already has functionality to group by date via normalizing time:

for day, df_day in df.groupby(df.index.floor('d')):
 df_day_t = df_day.between_time('08:30', '09:30')
 # do something

As another example, you can access a slice for a particular day in this way:

g = df.groupby(df.index.floor('d'))
my_day = pd.Timestamp('2017-01-01')
df_slice = g.get_group(my_day)

answered Nov 10 at 0:05

jpp

84.5k194897

1

I didn't know about groupby in pandas... It takes milliseconds now!
– Stefano Piovesan
Nov 10 at 8:05

add a comment |

up vote
2
down vote

accepted

Avoid Python `datetime`

pd.Series.dt.date returns an array of python datetime.date objects

Using object dtype in this way removes vectorisation benefits, as operations then require Python-level loops.

Use `groupby` operations for aggregating by date

Pandas already has functionality to group by date via normalizing time:

for day, df_day in df.groupby(df.index.floor('d')):
 df_day_t = df_day.between_time('08:30', '09:30')
 # do something

As another example, you can access a slice for a particular day in this way:

g = df.groupby(df.index.floor('d'))
my_day = pd.Timestamp('2017-01-01')
df_slice = g.get_group(my_day)

answered Nov 10 at 0:05

jpp

84.5k194897

Avoid Python `datetime`

pd.Series.dt.date returns an array of python datetime.date objects

Using object dtype in this way removes vectorisation benefits, as operations then require Python-level loops.

Use `groupby` operations for aggregating by date

Pandas already has functionality to group by date via normalizing time:

for day, df_day in df.groupby(df.index.floor('d')):
 df_day_t = df_day.between_time('08:30', '09:30')
 # do something

As another example, you can access a slice for a particular day in this way:

g = df.groupby(df.index.floor('d'))
my_day = pd.Timestamp('2017-01-01')
df_slice = g.get_group(my_day)

answered Nov 10 at 0:05

jpp

84.5k194897

answered Nov 10 at 0:05

jpp

84.5k194897

answered Nov 10 at 0:05

jpp

84.5k194897

answered Nov 10 at 0:05

jpp

84.5k194897

1

I didn't know about groupby in pandas... It takes milliseconds now!
– Stefano Piovesan
Nov 10 at 8:05

add a comment |

1

I didn't know about groupby in pandas... It takes milliseconds now!
– Stefano Piovesan
Nov 10 at 8:05

I didn't know about groupby in pandas... It takes milliseconds now!
– Stefano Piovesan
Nov 10 at 8:05

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

cQLkA9YkE9Mf2LiSjW9YeKMHlgcV,oA,6TUihY6ud V uRzu

搜尋此網誌

Pfthb