Filtering pandas dataframe by day
up vote
2
down vote
favorite
I have a pandas data frame with forex data by minutes, one year long (371635 rows):
O H L C
0
2017-01-02 02:00:00 1.05155 1.05197 1.05155 1.05190
2017-01-02 02:01:00 1.05209 1.05209 1.05177 1.05179
2017-01-02 02:02:00 1.05177 1.05198 1.05177 1.05178
2017-01-02 02:03:00 1.05188 1.05200 1.05188 1.05200
2017-01-02 02:04:00 1.05196 1.05204 1.05196 1.05203
I want to filter daily data to get an hour range:
dt = datetime(2017,1,1)
df_day = df1[df.index.date == dt.date()]
df_day_t = df_day.between_time('08:30', '09:30')
If I do a for
loop with 200 days, it takes minutes. I suspect that at every step this line
df_day = df1[df.index.date == dt.date()]
is looking for the equality with every row in the data set (even if it is an ordered data set).
Is there any way I could speed up the filtering or I should just do some old imperative for
loop from January to December...?
python pandas performance datetime algorithmic-trading
add a comment |
up vote
2
down vote
favorite
I have a pandas data frame with forex data by minutes, one year long (371635 rows):
O H L C
0
2017-01-02 02:00:00 1.05155 1.05197 1.05155 1.05190
2017-01-02 02:01:00 1.05209 1.05209 1.05177 1.05179
2017-01-02 02:02:00 1.05177 1.05198 1.05177 1.05178
2017-01-02 02:03:00 1.05188 1.05200 1.05188 1.05200
2017-01-02 02:04:00 1.05196 1.05204 1.05196 1.05203
I want to filter daily data to get an hour range:
dt = datetime(2017,1,1)
df_day = df1[df.index.date == dt.date()]
df_day_t = df_day.between_time('08:30', '09:30')
If I do a for
loop with 200 days, it takes minutes. I suspect that at every step this line
df_day = df1[df.index.date == dt.date()]
is looking for the equality with every row in the data set (even if it is an ordered data set).
Is there any way I could speed up the filtering or I should just do some old imperative for
loop from January to December...?
python pandas performance datetime algorithmic-trading
add a comment |
up vote
2
down vote
favorite
up vote
2
down vote
favorite
I have a pandas data frame with forex data by minutes, one year long (371635 rows):
O H L C
0
2017-01-02 02:00:00 1.05155 1.05197 1.05155 1.05190
2017-01-02 02:01:00 1.05209 1.05209 1.05177 1.05179
2017-01-02 02:02:00 1.05177 1.05198 1.05177 1.05178
2017-01-02 02:03:00 1.05188 1.05200 1.05188 1.05200
2017-01-02 02:04:00 1.05196 1.05204 1.05196 1.05203
I want to filter daily data to get an hour range:
dt = datetime(2017,1,1)
df_day = df1[df.index.date == dt.date()]
df_day_t = df_day.between_time('08:30', '09:30')
If I do a for
loop with 200 days, it takes minutes. I suspect that at every step this line
df_day = df1[df.index.date == dt.date()]
is looking for the equality with every row in the data set (even if it is an ordered data set).
Is there any way I could speed up the filtering or I should just do some old imperative for
loop from January to December...?
python pandas performance datetime algorithmic-trading
I have a pandas data frame with forex data by minutes, one year long (371635 rows):
O H L C
0
2017-01-02 02:00:00 1.05155 1.05197 1.05155 1.05190
2017-01-02 02:01:00 1.05209 1.05209 1.05177 1.05179
2017-01-02 02:02:00 1.05177 1.05198 1.05177 1.05178
2017-01-02 02:03:00 1.05188 1.05200 1.05188 1.05200
2017-01-02 02:04:00 1.05196 1.05204 1.05196 1.05203
I want to filter daily data to get an hour range:
dt = datetime(2017,1,1)
df_day = df1[df.index.date == dt.date()]
df_day_t = df_day.between_time('08:30', '09:30')
If I do a for
loop with 200 days, it takes minutes. I suspect that at every step this line
df_day = df1[df.index.date == dt.date()]
is looking for the equality with every row in the data set (even if it is an ordered data set).
Is there any way I could speed up the filtering or I should just do some old imperative for
loop from January to December...?
python pandas performance datetime algorithmic-trading
python pandas performance datetime algorithmic-trading
edited Nov 10 at 0:09
jpp
84.5k194897
84.5k194897
asked Nov 9 at 23:57
Stefano Piovesan
4252824
4252824
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
up vote
2
down vote
accepted
Avoid Python datetime
First you should avoid combining Python datetime
with Pandas operations. There are many Pandas / NumPy friendly methods to create datetime
objects for comparison, e.g. pd.Timestamp
and pd.to_datetime
. Your performance issues here are partly due to this behaviour described in the docs:
pd.Series.dt.date
returns an array of pythondatetime.date
objects
Using object
dtype in this way removes vectorisation benefits, as operations then require Python-level loops.
Use groupby
operations for aggregating by date
Pandas already has functionality to group by date via normalizing time:
for day, df_day in df.groupby(df.index.floor('d')):
df_day_t = df_day.between_time('08:30', '09:30')
# do something
As another example, you can access a slice for a particular day in this way:
g = df.groupby(df.index.floor('d'))
my_day = pd.Timestamp('2017-01-01')
df_slice = g.get_group(my_day)
1
I didn't know aboutgroupby
in pandas... It takes milliseconds now!
– Stefano Piovesan
Nov 10 at 8:05
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
2
down vote
accepted
Avoid Python datetime
First you should avoid combining Python datetime
with Pandas operations. There are many Pandas / NumPy friendly methods to create datetime
objects for comparison, e.g. pd.Timestamp
and pd.to_datetime
. Your performance issues here are partly due to this behaviour described in the docs:
pd.Series.dt.date
returns an array of pythondatetime.date
objects
Using object
dtype in this way removes vectorisation benefits, as operations then require Python-level loops.
Use groupby
operations for aggregating by date
Pandas already has functionality to group by date via normalizing time:
for day, df_day in df.groupby(df.index.floor('d')):
df_day_t = df_day.between_time('08:30', '09:30')
# do something
As another example, you can access a slice for a particular day in this way:
g = df.groupby(df.index.floor('d'))
my_day = pd.Timestamp('2017-01-01')
df_slice = g.get_group(my_day)
1
I didn't know aboutgroupby
in pandas... It takes milliseconds now!
– Stefano Piovesan
Nov 10 at 8:05
add a comment |
up vote
2
down vote
accepted
Avoid Python datetime
First you should avoid combining Python datetime
with Pandas operations. There are many Pandas / NumPy friendly methods to create datetime
objects for comparison, e.g. pd.Timestamp
and pd.to_datetime
. Your performance issues here are partly due to this behaviour described in the docs:
pd.Series.dt.date
returns an array of pythondatetime.date
objects
Using object
dtype in this way removes vectorisation benefits, as operations then require Python-level loops.
Use groupby
operations for aggregating by date
Pandas already has functionality to group by date via normalizing time:
for day, df_day in df.groupby(df.index.floor('d')):
df_day_t = df_day.between_time('08:30', '09:30')
# do something
As another example, you can access a slice for a particular day in this way:
g = df.groupby(df.index.floor('d'))
my_day = pd.Timestamp('2017-01-01')
df_slice = g.get_group(my_day)
1
I didn't know aboutgroupby
in pandas... It takes milliseconds now!
– Stefano Piovesan
Nov 10 at 8:05
add a comment |
up vote
2
down vote
accepted
up vote
2
down vote
accepted
Avoid Python datetime
First you should avoid combining Python datetime
with Pandas operations. There are many Pandas / NumPy friendly methods to create datetime
objects for comparison, e.g. pd.Timestamp
and pd.to_datetime
. Your performance issues here are partly due to this behaviour described in the docs:
pd.Series.dt.date
returns an array of pythondatetime.date
objects
Using object
dtype in this way removes vectorisation benefits, as operations then require Python-level loops.
Use groupby
operations for aggregating by date
Pandas already has functionality to group by date via normalizing time:
for day, df_day in df.groupby(df.index.floor('d')):
df_day_t = df_day.between_time('08:30', '09:30')
# do something
As another example, you can access a slice for a particular day in this way:
g = df.groupby(df.index.floor('d'))
my_day = pd.Timestamp('2017-01-01')
df_slice = g.get_group(my_day)
Avoid Python datetime
First you should avoid combining Python datetime
with Pandas operations. There are many Pandas / NumPy friendly methods to create datetime
objects for comparison, e.g. pd.Timestamp
and pd.to_datetime
. Your performance issues here are partly due to this behaviour described in the docs:
pd.Series.dt.date
returns an array of pythondatetime.date
objects
Using object
dtype in this way removes vectorisation benefits, as operations then require Python-level loops.
Use groupby
operations for aggregating by date
Pandas already has functionality to group by date via normalizing time:
for day, df_day in df.groupby(df.index.floor('d')):
df_day_t = df_day.between_time('08:30', '09:30')
# do something
As another example, you can access a slice for a particular day in this way:
g = df.groupby(df.index.floor('d'))
my_day = pd.Timestamp('2017-01-01')
df_slice = g.get_group(my_day)
answered Nov 10 at 0:05
jpp
84.5k194897
84.5k194897
1
I didn't know aboutgroupby
in pandas... It takes milliseconds now!
– Stefano Piovesan
Nov 10 at 8:05
add a comment |
1
I didn't know aboutgroupby
in pandas... It takes milliseconds now!
– Stefano Piovesan
Nov 10 at 8:05
1
1
I didn't know about
groupby
in pandas... It takes milliseconds now!– Stefano Piovesan
Nov 10 at 8:05
I didn't know about
groupby
in pandas... It takes milliseconds now!– Stefano Piovesan
Nov 10 at 8:05
add a comment |
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53234770%2ffiltering-pandas-dataframe-by-day%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown