Pad a data frame according to a frequency for each group

I have a pandas.DataFrame df with a pandas.DatetimeIndex and a column named group_column.
I need df to have a minutely frequency (meaning there is a row for every minute).

However, this needs to be the case for every value in group_column, so any given minute can potentially have several rows, one per group.

NOTE:

  1. group_column can have hundreds of unique values.

  2. Some groups can "last" several minutes and others can last for days; the edges are determined by the first and last appearances of each value in group_column.

Example

Input:

import pandas as pd

dates = [pd.Timestamp('2018-01-01 12:00'), pd.Timestamp('2018-01-01 12:01'),
         pd.Timestamp('2018-01-01 12:01'), pd.Timestamp('2018-01-01 12:03'),
         pd.Timestamp('2018-01-01 12:04')]
df = pd.DataFrame({'group_column': ['a', 'a', 'b', 'a', 'b'],
                   'data_column': [1.2, 2.2, 4, 1, 2]}, index=dates)

                    group_column  data_column
2018-01-01 12:00:00            a          1.2
2018-01-01 12:01:00            a          2.2
2018-01-01 12:01:00            b          4.0
2018-01-01 12:03:00            a          1.0
2018-01-01 12:04:00            b          2.0

Desired output:

                    group_column  data_column
2018-01-01 12:00:00            a          1.2
2018-01-01 12:01:00            a          2.2
2018-01-01 12:02:00            a          2.2
2018-01-01 12:03:00            a          1.0
2018-01-01 12:01:00            b          4.0
2018-01-01 12:02:00            b          4.0
2018-01-01 12:03:00            b          4.0
2018-01-01 12:04:00            b          2.0

My attempt

I have done this, but it seems highly inefficient:

def group_resample(df, group_column_name):
    # iterate over the groups one by one and pad each to minutely frequency
    values = df[group_column_name].unique()
    for value in values:
        df_g = df.loc[df[group_column_name] == value]
        df_g = df_g.asfreq('min', 'pad')
        yield df_g

df_padded = pd.concat(group_resample(df, 'group_column'))
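
For what it's worth, the per-group edges that the padding has to respect can be inspected directly; a minimal sketch using the example df above:

# Each group's padding range runs from its first to its last timestamp.
print(df.groupby('group_column')
        .apply(lambda g: (g.index.min(), g.index.max())))
# a -> first 12:00, last 12:03; b -> first 12:01, last 12:04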

python pandas

asked Nov 11 at 10:19 – moshevi

  • I think you should change group_column to be part of the index (df.set_index('group_column', inplace=True, drop=False)), then try to figure out how to make resampling work, either (ideally) on the whole DataFrame, or perhaps on one group at a time. See stackoverflow.com/questions/15799162/…
    – John Zwinck, Nov 11 at 10:22

2 Answers

Use GroupBy.apply with asfreq:

df1 = (df.groupby('group_column')
         .apply(lambda x: x.asfreq('min', 'pad'))
         .reset_index(level=0, drop=True))
print(df1)

                    group_column  data_column
2018-01-01 12:00:00            a          1.2
2018-01-01 12:01:00            a          2.2
2018-01-01 12:02:00            a          2.2
2018-01-01 12:03:00            a          1.0
2018-01-01 12:01:00            b          4.0
2018-01-01 12:02:00            b          4.0
2018-01-01 12:03:00            b          4.0
2018-01-01 12:04:00            b          2.0

– jezrael, answered Nov 11 at 10:22
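
A note on the reset_index(level=0, drop=True) step: GroupBy.apply prepends the group key as the outermost index level, and dropping that level restores the plain DatetimeIndex. A minimal sketch, assuming the example df from the question:

tmp = df.groupby('group_column').apply(lambda x: x.asfreq('min', 'pad'))
print(tmp.index.names)  # ['group_column', None] -> apply added a MultiIndex level
print(tmp.reset_index(level=0, drop=True).index.names)  # [None] -> DatetimeIndex again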

My approach would be

df2 = df.groupby('group_column').resample('min').ffill().reset_index(level=0, drop=True)
print(df2)

                     data_column group_column
2018-01-01 12:00:00          1.2            a
2018-01-01 12:01:00          2.2            a
2018-01-01 12:02:00          2.2            a
2018-01-01 12:03:00          1.0            a
2018-01-01 12:01:00          4.0            b
2018-01-01 12:02:00          4.0            b
2018-01-01 12:03:00          4.0            b
2018-01-01 12:04:00          2.0            b

– SpghttCd, answered Nov 11 at 15:43
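
Both answers pad within each group's own first/last timestamps. A quick sanity check of that property (a sketch, again assuming the example df):

# Each group should contribute one row per minute between its edges:
# 'a' spans 12:00-12:03 and 'b' spans 12:01-12:04, i.e. 4 rows each.
print(df2.groupby('group_column').size().to_dict())  # {'a': 4, 'b': 4}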

  • Why is this better than jezrael's answer? Faster?
    – moshevi, Nov 11 at 16:38

  • I didn't mean to have a better solution; it's simply mine in the first place. I didn't time it, and from my experience here it's generally a good idea to examine whatever @jezrael posts, at least with regard to pandas. However, in this case he uses apply, which my approach does not, so it might be worth comparing performance...
    – SpghttCd, Nov 11 at 16:48

  • In the meantime I was able to timeit them, and @jezrael's code is about 25% faster. I always thought "no apply is better", but obviously that is wrong here. The "why" should perhaps be asked of jezrael...
    – SpghttCd, Nov 11 at 22:12
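
For reference, a minimal sketch of how the two approaches could be timed against each other (the wrapper names f_apply and f_resample are hypothetical; absolute numbers depend on the data and the pandas version):

import timeit

def f_apply():
    # jezrael's approach: GroupBy.apply + asfreq
    return (df.groupby('group_column')
              .apply(lambda x: x.asfreq('min', 'pad'))
              .reset_index(level=0, drop=True))

def f_resample():
    # SpghttCd's approach: groupby + resample + ffill
    return (df.groupby('group_column')
              .resample('min').ffill()
              .reset_index(level=0, drop=True))

print(timeit.timeit(f_apply, number=100))
print(timeit.timeit(f_resample, number=100))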