pad a data frame according to a frequency for each group
I have a pandas.DataFrame df with a pandas.DatetimeIndex and a column named group_column.
I need df to have a minutely frequency (meaning there is a row for every minute). However, this needs to be the case for every value in group_column, so every minute can potentially have several rows.
NOTE:
- group_column can have hundreds of unique values.
- some groups can "last" several minutes and others can last for days; the edges are determined by the first and last appearances of each value in group_column.
example
Input:
dates = [pd.Timestamp('2018-01-01 12:00'), pd.Timestamp('2018-01-01 12:01'),
         pd.Timestamp('2018-01-01 12:01'), pd.Timestamp('2018-01-01 12:03'),
         pd.Timestamp('2018-01-01 12:04')]
df = pd.DataFrame({'group_column': ['a', 'a', 'b', 'a', 'b'],
                   'data_column': [1.2, 2.2, 4, 1, 2]}, index=dates)
group_column data_column
2018-01-01 12:00:00 a 1.2
2018-01-01 12:01:00 a 2.2
2018-01-01 12:01:00 b 4.0
2018-01-01 12:03:00 a 1.0
2018-01-01 12:04:00 b 2.0
desired output:
group_column data_column
2018-01-01 12:00:00 a 1.2
2018-01-01 12:01:00 a 2.2
2018-01-01 12:02:00 a 2.2
2018-01-01 12:03:00 a 1.0
2018-01-01 12:01:00 b 4.0
2018-01-01 12:02:00 b 4.0
2018-01-01 12:03:00 b 4.0
2018-01-01 12:04:00 b 2.0
my attempt
I have done this; however, it seems highly inefficient (it scans the whole frame once per group):

def group_resample(df, group_column_name):
    values = df[group_column_name].unique()
    for value in values:
        # select the rows of one group, then pad its own datetime
        # range to minutely frequency, forward-filling the gaps
        df_g = df.loc[df[group_column_name] == value]
        df_g = df_g.asfreq('min', 'pad')
        yield df_g

df_padded = pd.concat(group_resample(df, 'group_column'))
python pandas
asked Nov 11 at 10:19 – moshevi
I think you should change group_column to be part of the index (df.set_index('group_column', inplace=True, drop=False)), then try to figure out how to make resampling work, either (ideally) on the whole DataFrame, or perhaps on one group at a time. See stackoverflow.com/questions/15799162/…
– John Zwinck, Nov 11 at 10:22
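A minimal sketch of what this suggestion could look like (my illustration, not code from the comment; the answers below end up simpler):

import pandas as pd

# Keep group_column both as a column (drop=False) and as an index
# level (append=True), as the comment suggests.
df_idx = df.set_index('group_column', drop=False, append=True)

# Resampling the whole frame in one shot is not straightforward with a
# MultiIndex, so this falls back to one group at a time: strip the
# group level, pad that group's own datetime range, and let groupby
# re-concatenate the pieces.
padded = (df_idx.groupby(level='group_column', group_keys=False)
                .apply(lambda g: g.reset_index('group_column', drop=True)
                                  .asfreq('min', method='pad')))
print(padded)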
2 Answers
Use GroupBy.apply with asfreq:

df1 = (df.groupby('group_column')
         .apply(lambda x: x.asfreq('min', 'pad'))  # pad each group to minutely frequency
         .reset_index(level=0, drop=True))         # drop the group key that apply prepends
print(df1)
group_column data_column
2018-01-01 12:00:00 a 1.2
2018-01-01 12:01:00 a 2.2
2018-01-01 12:02:00 a 2.2
2018-01-01 12:03:00 a 1.0
2018-01-01 12:01:00 b 4.0
2018-01-01 12:02:00 b 4.0
2018-01-01 12:03:00 b 4.0
2018-01-01 12:04:00 b 2.0
answered Nov 11 at 10:22 – jezrael
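A side note (my addition, not part of the answer): when the per-group results have different indexes, groupby().apply() prepends the group key as an outer index level, which is exactly what reset_index(level=0, drop=True) strips off again. The intermediate shape can be inspected like this:

# Without the reset_index step the result carries a MultiIndex of
# (group key, timestamp); level 0 is the group key that gets dropped.
inter = df.groupby('group_column').apply(lambda x: x.asfreq('min', 'pad'))
print(inter.index.names)  # ['group_column', None] - the datetime level is unnamed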
My approach would be:

df2 = df.groupby('group_column').resample('min').ffill().reset_index(level=0, drop=True)
print(df2)
data_column group_column
2018-01-01 12:00:00 1.2 a
2018-01-01 12:01:00 2.2 a
2018-01-01 12:02:00 2.2 a
2018-01-01 12:03:00 1.0 a
2018-01-01 12:01:00 4.0 b
2018-01-01 12:02:00 4.0 b
2018-01-01 12:03:00 4.0 b
2018-01-01 12:04:00 2.0 b
answered Nov 11 at 15:43 – SpghttCd

Why is this better than jezrael's answer? Faster?
– moshevi, Nov 11 at 16:38
I didn't mean to imply this is a better solution; it's just my own approach. I didn't time it, and from my experience here it's generally a good idea to examine whatever @jezrael posts, at least with regard to pandas. However, in this case he uses apply, which my approach does not, so it might be worth comparing performance...
– SpghttCd, Nov 11 at 16:48
In the meantime I was able to time it, and @jezrael's code is about 25% faster. I always thought "no apply is better", but obviously that is wrong here. The "why" would perhaps be better asked of jezrael...
– SpghttCd, Nov 11 at 22:12
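For reference, a minimal sketch of how such a timing comparison could be run (my own illustration with synthetic data; the ~25% figure above will vary with data shape and pandas version):

import timeit

import numpy as np
import pandas as pd

# Synthetic data: 1000 minutes spread over 8 groups (an assumption
# purely for illustration).
rng = pd.date_range('2018-01-01', periods=1000, freq='min')
big = pd.DataFrame({
    'group_column': np.random.choice(list('abcdefgh'), size=len(rng)),
    'data_column': np.random.rand(len(rng)),
}, index=rng)

def apply_way():
    # jezrael's GroupBy.apply + asfreq variant
    return (big.groupby('group_column')
               .apply(lambda x: x.asfreq('min', 'pad'))
               .reset_index(level=0, drop=True))

def resample_way():
    # the groupby + resample + ffill variant from this answer
    return (big.groupby('group_column')
               .resample('min').ffill()
               .reset_index(level=0, drop=True))

print('apply   :', timeit.timeit(apply_way, number=20))
print('resample:', timeit.timeit(resample_way, number=20))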