Pandas dataframe split string into multiple columns with conditions and missing data










0















So I have a DataFrame which looks like this:



df = pd.DataFrame('feature1':[34,45,52],'feature2':[1,0,1],'unparsed_features':["neoclassical, heavy, $2, old, bronze", "romanticism, gold, $5", "baroque, xs, $3, new"])

df
feature1 feature2 unparsed_features
0 34 1 neoclassical, heavy, $2, old, bronze
1 45 0 romanticism, gold, $5
2 52 1 baroque, xs, $3, new


I am trying to split the column unparsed_features into 6 columns (weight, age, colour, size, price and period) but as you can see the order is jumbled up and not only that, some fields are missing too.



I have a general idea of what each column can possibly be as shown below:



main_dict = 
'weight': ['heavy','light'],
'age': ['new','old'],
'colour': ['gold','silver','bronze'],
'size': ['xs','s','m','l','xl','xxl','xxxl'],
'price': ['$'],
'period': ['renaissance','baroque','rococo','neoclassical','romanticism']



Ideally I would like my Dataframe to look like the following:



df
feature1 feature2 unparsed_features weight price age
0 34 1 neoclassical, heavy, $2, old, bronze heavy $2 old
1 45 0 romanticism, gold, $5 $5
2 52 1 baroque, xs, $3, new $3 new

size colour period
0 bronze neoclassical
1 gold romanticism
2 xs baroque


I know the first step would be to split the string by comma but I am lost after that.



df['unparsed_features'].str.split(',')



Thank you for your help.










share|improve this question






















  • You may need create you own dict , for those certain item belong to which field

    – Wen-Ben
    Nov 13 '18 at 15:11











  • Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit): df.unparsed_features.str.split(",", expand=True) Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.

    – Evan
    Nov 13 '18 at 15:27







  • 1





    Possible duplicate of pandas: How do I split text in a column into multiple rows?

    – Evan
    Nov 13 '18 at 15:28















0















So I have a DataFrame which looks like this:



df = pd.DataFrame('feature1':[34,45,52],'feature2':[1,0,1],'unparsed_features':["neoclassical, heavy, $2, old, bronze", "romanticism, gold, $5", "baroque, xs, $3, new"])

df
feature1 feature2 unparsed_features
0 34 1 neoclassical, heavy, $2, old, bronze
1 45 0 romanticism, gold, $5
2 52 1 baroque, xs, $3, new


I am trying to split the column unparsed_features into 6 columns (weight, age, colour, size, price and period) but as you can see the order is jumbled up and not only that, some fields are missing too.



I have a general idea of what each column can possibly be as shown below:



main_dict = 
'weight': ['heavy','light'],
'age': ['new','old'],
'colour': ['gold','silver','bronze'],
'size': ['xs','s','m','l','xl','xxl','xxxl'],
'price': ['$'],
'period': ['renaissance','baroque','rococo','neoclassical','romanticism']



Ideally I would like my Dataframe to look like the following:



df
feature1 feature2 unparsed_features weight price age
0 34 1 neoclassical, heavy, $2, old, bronze heavy $2 old
1 45 0 romanticism, gold, $5 $5
2 52 1 baroque, xs, $3, new $3 new

size colour period
0 bronze neoclassical
1 gold romanticism
2 xs baroque


I know the first step would be to split the string by comma but I am lost after that.



df['unparsed_features'].str.split(',')



Thank you for your help.










share|improve this question






















  • You may need create you own dict , for those certain item belong to which field

    – Wen-Ben
    Nov 13 '18 at 15:11











  • Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit): df.unparsed_features.str.split(",", expand=True) Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.

    – Evan
    Nov 13 '18 at 15:27







  • 1





    Possible duplicate of pandas: How do I split text in a column into multiple rows?

    – Evan
    Nov 13 '18 at 15:28













0












0








0








So I have a DataFrame which looks like this:



df = pd.DataFrame('feature1':[34,45,52],'feature2':[1,0,1],'unparsed_features':["neoclassical, heavy, $2, old, bronze", "romanticism, gold, $5", "baroque, xs, $3, new"])

df
feature1 feature2 unparsed_features
0 34 1 neoclassical, heavy, $2, old, bronze
1 45 0 romanticism, gold, $5
2 52 1 baroque, xs, $3, new


I am trying to split the column unparsed_features into 6 columns (weight, age, colour, size, price and period) but as you can see the order is jumbled up and not only that, some fields are missing too.



I have a general idea of what each column can possibly be as shown below:



main_dict = 
'weight': ['heavy','light'],
'age': ['new','old'],
'colour': ['gold','silver','bronze'],
'size': ['xs','s','m','l','xl','xxl','xxxl'],
'price': ['$'],
'period': ['renaissance','baroque','rococo','neoclassical','romanticism']



Ideally I would like my Dataframe to look like the following:



df
feature1 feature2 unparsed_features weight price age
0 34 1 neoclassical, heavy, $2, old, bronze heavy $2 old
1 45 0 romanticism, gold, $5 $5
2 52 1 baroque, xs, $3, new $3 new

size colour period
0 bronze neoclassical
1 gold romanticism
2 xs baroque


I know the first step would be to split the string by comma but I am lost after that.



df['unparsed_features'].str.split(',')



Thank you for your help.










share|improve this question














So I have a DataFrame which looks like this:



df = pd.DataFrame('feature1':[34,45,52],'feature2':[1,0,1],'unparsed_features':["neoclassical, heavy, $2, old, bronze", "romanticism, gold, $5", "baroque, xs, $3, new"])

df
feature1 feature2 unparsed_features
0 34 1 neoclassical, heavy, $2, old, bronze
1 45 0 romanticism, gold, $5
2 52 1 baroque, xs, $3, new


I am trying to split the column unparsed_features into 6 columns (weight, age, colour, size, price and period) but as you can see the order is jumbled up and not only that, some fields are missing too.



I have a general idea of what each column can possibly be as shown below:



main_dict = 
'weight': ['heavy','light'],
'age': ['new','old'],
'colour': ['gold','silver','bronze'],
'size': ['xs','s','m','l','xl','xxl','xxxl'],
'price': ['$'],
'period': ['renaissance','baroque','rococo','neoclassical','romanticism']



Ideally I would like my Dataframe to look like the following:



df
feature1 feature2 unparsed_features weight price age
0 34 1 neoclassical, heavy, $2, old, bronze heavy $2 old
1 45 0 romanticism, gold, $5 $5
2 52 1 baroque, xs, $3, new $3 new

size colour period
0 bronze neoclassical
1 gold romanticism
2 xs baroque


I know the first step would be to split the string by comma but I am lost after that.



df['unparsed_features'].str.split(',')



Thank you for your help.







python string pandas dataframe






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 13 '18 at 14:56









user3374113user3374113

128416




128416












  • You may need create you own dict , for those certain item belong to which field

    – Wen-Ben
    Nov 13 '18 at 15:11











  • Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit): df.unparsed_features.str.split(",", expand=True) Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.

    – Evan
    Nov 13 '18 at 15:27







  • 1





    Possible duplicate of pandas: How do I split text in a column into multiple rows?

    – Evan
    Nov 13 '18 at 15:28

















  • You may need create you own dict , for those certain item belong to which field

    – Wen-Ben
    Nov 13 '18 at 15:11











  • Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit): df.unparsed_features.str.split(",", expand=True) Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.

    – Evan
    Nov 13 '18 at 15:27







  • 1





    Possible duplicate of pandas: How do I split text in a column into multiple rows?

    – Evan
    Nov 13 '18 at 15:28
















You may need create you own dict , for those certain item belong to which field

– Wen-Ben
Nov 13 '18 at 15:11





You may need create you own dict , for those certain item belong to which field

– Wen-Ben
Nov 13 '18 at 15:11













Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit): df.unparsed_features.str.split(",", expand=True) Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.

– Evan
Nov 13 '18 at 15:27






Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit): df.unparsed_features.str.split(",", expand=True) Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.

– Evan
Nov 13 '18 at 15:27





1




1





Possible duplicate of pandas: How do I split text in a column into multiple rows?

– Evan
Nov 13 '18 at 15:28





Possible duplicate of pandas: How do I split text in a column into multiple rows?

– Evan
Nov 13 '18 at 15:28












2 Answers
2






active

oldest

votes


















0














Not sure there is an easy way to do it as the data in 'unparsed_features' does not have the same structure in each row. One way could be to use the dictionary main_dict you defined, loop over each item and use str.extract with the parameter pat a bit different for price:



for key, list_item in main_dict.items():
if key =='price':
df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
else:
df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')


$d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.



And you get as expected:



 feature1 feature2 unparsed_features weight age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy old
1 45 0 romanticism, gold, $5
2 52 1 baroque, xs, $3, new new

colour size price period
0 bronze $2 neoclassical
1 gold $5 romanticism
2 xs $3 baroque





share|improve this answer






























    0














    Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach



    for keys in main_dict:
    data_list =
    for value in df.unparsed_features: # for every row
    for l_data in main_dict[keys]:
    if keys == 'price':
    matching = [v for v in value.split(',') if l_data in v]
    else:
    matching = [v for v in value.split(',') if l_data == v.strip()]

    if matching:
    break

    if matching:
    data_list.append(matching[0])
    else:
    data_list.append(None)

    matching = ''
    df[keys] = data_list


    Output



     feature1 feature2 unparsed_features weight age 
    0 34 1 neoclassical, heavy, $2, old, bronze heavy old
    1 45 0 romanticism, gold, $5 None None
    2 52 1 baroque, xs, $3, new None new

    colour size price period
    0 bronze None $2 neoclassical
    1 gold None $5 romanticism
    2 None xs $3 baroque





    share|improve this answer






















      Your Answer






      StackExchange.ifUsing("editor", function ()
      StackExchange.using("externalEditor", function ()
      StackExchange.using("snippets", function ()
      StackExchange.snippets.init();
      );
      );
      , "code-snippets");

      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "1"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53283760%2fpandas-dataframe-split-string-into-multiple-columns-with-conditions-and-missing%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      2 Answers
      2






      active

      oldest

      votes








      2 Answers
      2






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      0














      Not sure there is an easy way to do it as the data in 'unparsed_features' does not have the same structure in each row. One way could be to use the dictionary main_dict you defined, loop over each item and use str.extract with the parameter pat a bit different for price:



      for key, list_item in main_dict.items():
      if key =='price':
      df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
      else:
      df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')


      $d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.



      And you get as expected:



       feature1 feature2 unparsed_features weight age 
      0 34 1 neoclassical, heavy, $2, old, bronze heavy old
      1 45 0 romanticism, gold, $5
      2 52 1 baroque, xs, $3, new new

      colour size price period
      0 bronze $2 neoclassical
      1 gold $5 romanticism
      2 xs $3 baroque





      share|improve this answer



























        0














        Not sure there is an easy way to do it as the data in 'unparsed_features' does not have the same structure in each row. One way could be to use the dictionary main_dict you defined, loop over each item and use str.extract with the parameter pat a bit different for price:



        for key, list_item in main_dict.items():
        if key =='price':
        df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
        else:
        df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')


        $d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.



        And you get as expected:



         feature1 feature2 unparsed_features weight age 
        0 34 1 neoclassical, heavy, $2, old, bronze heavy old
        1 45 0 romanticism, gold, $5
        2 52 1 baroque, xs, $3, new new

        colour size price period
        0 bronze $2 neoclassical
        1 gold $5 romanticism
        2 xs $3 baroque





        share|improve this answer

























          0












          0








          0







          Not sure there is an easy way to do it as the data in 'unparsed_features' does not have the same structure in each row. One way could be to use the dictionary main_dict you defined, loop over each item and use str.extract with the parameter pat a bit different for price:



          for key, list_item in main_dict.items():
          if key =='price':
          df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
          else:
          df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')


          $d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.



          And you get as expected:



           feature1 feature2 unparsed_features weight age 
          0 34 1 neoclassical, heavy, $2, old, bronze heavy old
          1 45 0 romanticism, gold, $5
          2 52 1 baroque, xs, $3, new new

          colour size price period
          0 bronze $2 neoclassical
          1 gold $5 romanticism
          2 xs $3 baroque





          share|improve this answer













          Not sure there is an easy way to do it as the data in 'unparsed_features' does not have the same structure in each row. One way could be to use the dictionary main_dict you defined, loop over each item and use str.extract with the parameter pat a bit different for price:



          for key, list_item in main_dict.items():
          if key =='price':
          df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
          else:
          df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')


          $d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.



          And you get as expected:



           feature1 feature2 unparsed_features weight age 
          0 34 1 neoclassical, heavy, $2, old, bronze heavy old
          1 45 0 romanticism, gold, $5
          2 52 1 baroque, xs, $3, new new

          colour size price period
          0 bronze $2 neoclassical
          1 gold $5 romanticism
          2 xs $3 baroque






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 13 '18 at 15:40









          Ben.TBen.T

          6,1002825




          6,1002825























              0














              Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach



              for keys in main_dict:
              data_list =
              for value in df.unparsed_features: # for every row
              for l_data in main_dict[keys]:
              if keys == 'price':
              matching = [v for v in value.split(',') if l_data in v]
              else:
              matching = [v for v in value.split(',') if l_data == v.strip()]

              if matching:
              break

              if matching:
              data_list.append(matching[0])
              else:
              data_list.append(None)

              matching = ''
              df[keys] = data_list


              Output



               feature1 feature2 unparsed_features weight age 
              0 34 1 neoclassical, heavy, $2, old, bronze heavy old
              1 45 0 romanticism, gold, $5 None None
              2 52 1 baroque, xs, $3, new None new

              colour size price period
              0 bronze None $2 neoclassical
              1 gold None $5 romanticism
              2 None xs $3 baroque





              share|improve this answer



























                0














                Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach



                for keys in main_dict:
                data_list =
                for value in df.unparsed_features: # for every row
                for l_data in main_dict[keys]:
                if keys == 'price':
                matching = [v for v in value.split(',') if l_data in v]
                else:
                matching = [v for v in value.split(',') if l_data == v.strip()]

                if matching:
                break

                if matching:
                data_list.append(matching[0])
                else:
                data_list.append(None)

                matching = ''
                df[keys] = data_list


                Output



                 feature1 feature2 unparsed_features weight age 
                0 34 1 neoclassical, heavy, $2, old, bronze heavy old
                1 45 0 romanticism, gold, $5 None None
                2 52 1 baroque, xs, $3, new None new

                colour size price period
                0 bronze None $2 neoclassical
                1 gold None $5 romanticism
                2 None xs $3 baroque





                share|improve this answer

























                  0












                  0








                  0







                  Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach



                  for keys in main_dict:
                  data_list =
                  for value in df.unparsed_features: # for every row
                  for l_data in main_dict[keys]:
                  if keys == 'price':
                  matching = [v for v in value.split(',') if l_data in v]
                  else:
                  matching = [v for v in value.split(',') if l_data == v.strip()]

                  if matching:
                  break

                  if matching:
                  data_list.append(matching[0])
                  else:
                  data_list.append(None)

                  matching = ''
                  df[keys] = data_list


                  Output



                   feature1 feature2 unparsed_features weight age 
                  0 34 1 neoclassical, heavy, $2, old, bronze heavy old
                  1 45 0 romanticism, gold, $5 None None
                  2 52 1 baroque, xs, $3, new None new

                  colour size price period
                  0 bronze None $2 neoclassical
                  1 gold None $5 romanticism
                  2 None xs $3 baroque





                  share|improve this answer













                  Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach



                  for keys in main_dict:
                  data_list =
                  for value in df.unparsed_features: # for every row
                  for l_data in main_dict[keys]:
                  if keys == 'price':
                  matching = [v for v in value.split(',') if l_data in v]
                  else:
                  matching = [v for v in value.split(',') if l_data == v.strip()]

                  if matching:
                  break

                  if matching:
                  data_list.append(matching[0])
                  else:
                  data_list.append(None)

                  matching = ''
                  df[keys] = data_list


                  Output



                   feature1 feature2 unparsed_features weight age 
                  0 34 1 neoclassical, heavy, $2, old, bronze heavy old
                  1 45 0 romanticism, gold, $5 None None
                  2 52 1 baroque, xs, $3, new None new

                  colour size price period
                  0 bronze None $2 neoclassical
                  1 gold None $5 romanticism
                  2 None xs $3 baroque






                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Nov 13 '18 at 15:42









                  iamklausiamklaus

                  1,110149




                  1,110149



























                      draft saved

                      draft discarded
















































                      Thanks for contributing an answer to Stack Overflow!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid


                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.

                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function ()
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53283760%2fpandas-dataframe-split-string-into-multiple-columns-with-conditions-and-missing%23new-answer', 'question_page');

                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Kleinkühnau

                      Makov (Slowakei)

                      Deutsches Schauspielhaus