Pandas dataframe split string into multiple columns with conditions and missing data

So I have a DataFrame which looks like this:

df = pd.DataFrame('feature1':[34,45,52],'feature2':[1,0,1],'unparsed_features':["neoclassical, heavy, $2, old, bronze", "romanticism, gold, $5", "baroque, xs, $3, new"])

df
 feature1 feature2 unparsed_features
 0 34 1 neoclassical, heavy, $2, old, bronze
 1 45 0 romanticism, gold, $5
 2 52 1 baroque, xs, $3, new

I am trying to split the column unparsed_features into 6 columns (weight, age, colour, size, price and period) but as you can see the order is jumbled up and not only that, some fields are missing too.

I have a general idea of what each column can possibly be as shown below:

main_dict = 
 'weight': ['heavy','light'],
 'age': ['new','old'],
 'colour': ['gold','silver','bronze'],
 'size': ['xs','s','m','l','xl','xxl','xxxl'],
 'price': ['$'],
 'period': ['renaissance','baroque','rococo','neoclassical','romanticism']

Ideally I would like my Dataframe to look like the following:

df
 feature1 feature2 unparsed_features weight price age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy $2 old 
1 45 0 romanticism, gold, $5 $5 
2 52 1 baroque, xs, $3, new $3 new 

 size colour period 
0 bronze neoclassical 
1 gold romanticism 
2 xs baroque

I know the first step would be to split the string by comma but I am lost after that.

df['unparsed_features'].str.split(',')

Thank you for your help.

asked Nov 13 '18 at 14:56

user3374113

128416

You may need create you own dict , for those certain item belong to which field

– Wen-Ben
Nov 13 '18 at 15:11

Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit): df.unparsed_features.str.split(",", expand=True) Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.

– Evan
Nov 13 '18 at 15:27

1

Possible duplicate of pandas: How do I split text in a column into multiple rows?

– Evan
Nov 13 '18 at 15:28

add a comment |

So I have a DataFrame which looks like this:

df = pd.DataFrame('feature1':[34,45,52],'feature2':[1,0,1],'unparsed_features':["neoclassical, heavy, $2, old, bronze", "romanticism, gold, $5", "baroque, xs, $3, new"])

df
 feature1 feature2 unparsed_features
 0 34 1 neoclassical, heavy, $2, old, bronze
 1 45 0 romanticism, gold, $5
 2 52 1 baroque, xs, $3, new

I have a general idea of what each column can possibly be as shown below:

main_dict = 
 'weight': ['heavy','light'],
 'age': ['new','old'],
 'colour': ['gold','silver','bronze'],
 'size': ['xs','s','m','l','xl','xxl','xxxl'],
 'price': ['$'],
 'period': ['renaissance','baroque','rococo','neoclassical','romanticism']

Ideally I would like my Dataframe to look like the following:

df
 feature1 feature2 unparsed_features weight price age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy $2 old 
1 45 0 romanticism, gold, $5 $5 
2 52 1 baroque, xs, $3, new $3 new 

 size colour period 
0 bronze neoclassical 
1 gold romanticism 
2 xs baroque

I know the first step would be to split the string by comma but I am lost after that.

df['unparsed_features'].str.split(',')

Thank you for your help.

asked Nov 13 '18 at 14:56

user3374113

128416

You may need create you own dict , for those certain item belong to which field

– Wen-Ben
Nov 13 '18 at 15:11

Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit): df.unparsed_features.str.split(",", expand=True) Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.

– Evan
Nov 13 '18 at 15:27

1

Possible duplicate of pandas: How do I split text in a column into multiple rows?

– Evan
Nov 13 '18 at 15:28

add a comment |

So I have a DataFrame which looks like this:

df = pd.DataFrame('feature1':[34,45,52],'feature2':[1,0,1],'unparsed_features':["neoclassical, heavy, $2, old, bronze", "romanticism, gold, $5", "baroque, xs, $3, new"])

df
 feature1 feature2 unparsed_features
 0 34 1 neoclassical, heavy, $2, old, bronze
 1 45 0 romanticism, gold, $5
 2 52 1 baroque, xs, $3, new

I have a general idea of what each column can possibly be as shown below:

main_dict = 
 'weight': ['heavy','light'],
 'age': ['new','old'],
 'colour': ['gold','silver','bronze'],
 'size': ['xs','s','m','l','xl','xxl','xxxl'],
 'price': ['$'],
 'period': ['renaissance','baroque','rococo','neoclassical','romanticism']

Ideally I would like my Dataframe to look like the following:

df
 feature1 feature2 unparsed_features weight price age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy $2 old 
1 45 0 romanticism, gold, $5 $5 
2 52 1 baroque, xs, $3, new $3 new 

 size colour period 
0 bronze neoclassical 
1 gold romanticism 
2 xs baroque

I know the first step would be to split the string by comma but I am lost after that.

df['unparsed_features'].str.split(',')

Thank you for your help.

asked Nov 13 '18 at 14:56

user3374113

128416

So I have a DataFrame which looks like this:

df = pd.DataFrame('feature1':[34,45,52],'feature2':[1,0,1],'unparsed_features':["neoclassical, heavy, $2, old, bronze", "romanticism, gold, $5", "baroque, xs, $3, new"])

df
 feature1 feature2 unparsed_features
 0 34 1 neoclassical, heavy, $2, old, bronze
 1 45 0 romanticism, gold, $5
 2 52 1 baroque, xs, $3, new

I have a general idea of what each column can possibly be as shown below:

main_dict = 
 'weight': ['heavy','light'],
 'age': ['new','old'],
 'colour': ['gold','silver','bronze'],
 'size': ['xs','s','m','l','xl','xxl','xxxl'],
 'price': ['$'],
 'period': ['renaissance','baroque','rococo','neoclassical','romanticism']

Ideally I would like my Dataframe to look like the following:

df
 feature1 feature2 unparsed_features weight price age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy $2 old 
1 45 0 romanticism, gold, $5 $5 
2 52 1 baroque, xs, $3, new $3 new 

 size colour period 
0 bronze neoclassical 
1 gold romanticism 
2 xs baroque

I know the first step would be to split the string by comma but I am lost after that.

df['unparsed_features'].str.split(',')

Thank you for your help.

python string pandas dataframe

asked Nov 13 '18 at 14:56

user3374113

128416

asked Nov 13 '18 at 14:56

user3374113

128416

asked Nov 13 '18 at 14:56

user3374113

128416

asked Nov 13 '18 at 14:56

user3374113

128416

asked Nov 13 '18 at 14:56

user3374113

128416

You may need create you own dict , for those certain item belong to which field

– Wen-Ben
Nov 13 '18 at 15:11

Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit): df.unparsed_features.str.split(",", expand=True) Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.

– Evan
Nov 13 '18 at 15:27

1

Possible duplicate of pandas: How do I split text in a column into multiple rows?

– Evan
Nov 13 '18 at 15:28

add a comment |

You may need create you own dict , for those certain item belong to which field

– Wen-Ben
Nov 13 '18 at 15:11

Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit): df.unparsed_features.str.split(",", expand=True) Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.

– Evan
Nov 13 '18 at 15:27

1

Possible duplicate of pandas: How do I split text in a column into multiple rows?

– Evan
Nov 13 '18 at 15:28

You may need create you own dict , for those certain item belong to which field

– Wen-Ben
Nov 13 '18 at 15:11

Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit): df.unparsed_features.str.split(",", expand=True) Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.

– Evan
Nov 13 '18 at 15:27

Possible duplicate of pandas: How do I split text in a column into multiple rows?

– Evan
Nov 13 '18 at 15:28

add a comment |

2 Answers
2

active

oldest

votes

Not sure there is an easy way to do it as the data in 'unparsed_features' does not have the same structure in each row. One way could be to use the dictionary main_dict you defined, loop over each item and use str.extract with the parameter pat a bit different for price:

for key, list_item in main_dict.items():
 if key =='price':
 df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
 else:
 df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')

$d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.

And you get as expected:

 feature1 feature2 unparsed_features weight age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy old 
1 45 0 romanticism, gold, $5 
2 52 1 baroque, xs, $3, new new 

 colour size price period 
0 bronze $2 neoclassical 
1 gold $5 romanticism 
2 xs $3 baroque

answered Nov 13 '18 at 15:40

Ben.T

6,1002825

add a comment |

Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach

for keys in main_dict:
 data_list = 
 for value in df.unparsed_features: # for every row
 for l_data in main_dict[keys]:
 if keys == 'price':
 matching = [v for v in value.split(',') if l_data in v]
 else:
 matching = [v for v in value.split(',') if l_data == v.strip()]

 if matching:
 break

 if matching:
 data_list.append(matching[0])
 else:
 data_list.append(None)

 matching = '' 
 df[keys] = data_list

Output

 feature1 feature2 unparsed_features weight age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy old 
1 45 0 romanticism, gold, $5 None None 
2 52 1 baroque, xs, $3, new None new 

 colour size price period 
0 bronze None $2 neoclassical 
1 gold None $5 romanticism 
2 None xs $3 baroque

answered Nov 13 '18 at 15:42

iamklaus

1,110149

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53283760%2fpandas-dataframe-split-string-into-multiple-columns-with-conditions-and-missing%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

for key, list_item in main_dict.items():
 if key =='price':
 df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
 else:
 df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')

$d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.

And you get as expected:

 feature1 feature2 unparsed_features weight age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy old 
1 45 0 romanticism, gold, $5 
2 52 1 baroque, xs, $3, new new 

 colour size price period 
0 bronze $2 neoclassical 
1 gold $5 romanticism 
2 xs $3 baroque

answered Nov 13 '18 at 15:40

Ben.T

6,1002825

add a comment |

for key, list_item in main_dict.items():
 if key =='price':
 df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
 else:
 df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')

$d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.

And you get as expected:

 feature1 feature2 unparsed_features weight age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy old 
1 45 0 romanticism, gold, $5 
2 52 1 baroque, xs, $3, new new 

 colour size price period 
0 bronze $2 neoclassical 
1 gold $5 romanticism 
2 xs $3 baroque

answered Nov 13 '18 at 15:40

Ben.T

6,1002825

add a comment |

for key, list_item in main_dict.items():
 if key =='price':
 df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
 else:
 df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')

$d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.

And you get as expected:

 feature1 feature2 unparsed_features weight age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy old 
1 45 0 romanticism, gold, $5 
2 52 1 baroque, xs, $3, new new 

 colour size price period 
0 bronze $2 neoclassical 
1 gold $5 romanticism 
2 xs $3 baroque

answered Nov 13 '18 at 15:40

Ben.T

6,1002825

for key, list_item in main_dict.items():
 if key =='price':
 df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
 else:
 df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')

$d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.

And you get as expected:

 feature1 feature2 unparsed_features weight age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy old 
1 45 0 romanticism, gold, $5 
2 52 1 baroque, xs, $3, new new 

 colour size price period 
0 bronze $2 neoclassical 
1 gold $5 romanticism 
2 xs $3 baroque

answered Nov 13 '18 at 15:40

Ben.T

6,1002825

answered Nov 13 '18 at 15:40

Ben.T

6,1002825

answered Nov 13 '18 at 15:40

Ben.T

6,1002825

answered Nov 13 '18 at 15:40

Ben.T

6,1002825

add a comment |

Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach

for keys in main_dict:
 data_list = 
 for value in df.unparsed_features: # for every row
 for l_data in main_dict[keys]:
 if keys == 'price':
 matching = [v for v in value.split(',') if l_data in v]
 else:
 matching = [v for v in value.split(',') if l_data == v.strip()]

 if matching:
 break

 if matching:
 data_list.append(matching[0])
 else:
 data_list.append(None)

 matching = '' 
 df[keys] = data_list

Output

 feature1 feature2 unparsed_features weight age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy old 
1 45 0 romanticism, gold, $5 None None 
2 52 1 baroque, xs, $3, new None new 

 colour size price period 
0 bronze None $2 neoclassical 
1 gold None $5 romanticism 
2 None xs $3 baroque

answered Nov 13 '18 at 15:42

iamklaus

1,110149

add a comment |

Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach

for keys in main_dict:
 data_list = 
 for value in df.unparsed_features: # for every row
 for l_data in main_dict[keys]:
 if keys == 'price':
 matching = [v for v in value.split(',') if l_data in v]
 else:
 matching = [v for v in value.split(',') if l_data == v.strip()]

 if matching:
 break

 if matching:
 data_list.append(matching[0])
 else:
 data_list.append(None)

 matching = '' 
 df[keys] = data_list

Output

 feature1 feature2 unparsed_features weight age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy old 
1 45 0 romanticism, gold, $5 None None 
2 52 1 baroque, xs, $3, new None new 

 colour size price period 
0 bronze None $2 neoclassical 
1 gold None $5 romanticism 
2 None xs $3 baroque

answered Nov 13 '18 at 15:42

iamklaus

1,110149

add a comment |

Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach

for keys in main_dict:
 data_list = 
 for value in df.unparsed_features: # for every row
 for l_data in main_dict[keys]:
 if keys == 'price':
 matching = [v for v in value.split(',') if l_data in v]
 else:
 matching = [v for v in value.split(',') if l_data == v.strip()]

 if matching:
 break

 if matching:
 data_list.append(matching[0])
 else:
 data_list.append(None)

 matching = '' 
 df[keys] = data_list

Output

 feature1 feature2 unparsed_features weight age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy old 
1 45 0 romanticism, gold, $5 None None 
2 52 1 baroque, xs, $3, new None new 

 colour size price period 
0 bronze None $2 neoclassical 
1 gold None $5 romanticism 
2 None xs $3 baroque

answered Nov 13 '18 at 15:42

iamklaus

1,110149

Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach

for keys in main_dict:
 data_list = 
 for value in df.unparsed_features: # for every row
 for l_data in main_dict[keys]:
 if keys == 'price':
 matching = [v for v in value.split(',') if l_data in v]
 else:
 matching = [v for v in value.split(',') if l_data == v.strip()]

 if matching:
 break

 if matching:
 data_list.append(matching[0])
 else:
 data_list.append(None)

 matching = '' 
 df[keys] = data_list

Output

 feature1 feature2 unparsed_features weight age 
0 34 1 neoclassical, heavy, $2, old, bronze heavy old 
1 45 0 romanticism, gold, $5 None None 
2 52 1 baroque, xs, $3, new None new 

 colour size price period 
0 bronze None $2 neoclassical 
1 gold None $5 romanticism 
2 None xs $3 baroque

answered Nov 13 '18 at 15:42

iamklaus

1,110149

answered Nov 13 '18 at 15:42

iamklaus

1,110149

answered Nov 13 '18 at 15:42

iamklaus

1,110149

answered Nov 13 '18 at 15:42

iamklaus

1,110149

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Pfthb