Pandas dataframe split string into multiple columns with conditions and missing data
So I have a DataFrame which looks like this:
df = pd.DataFrame('feature1':[34,45,52],'feature2':[1,0,1],'unparsed_features':["neoclassical, heavy, $2, old, bronze", "romanticism, gold, $5", "baroque, xs, $3, new"])
df
feature1 feature2 unparsed_features
0 34 1 neoclassical, heavy, $2, old, bronze
1 45 0 romanticism, gold, $5
2 52 1 baroque, xs, $3, new
I am trying to split the column unparsed_features into 6 columns (weight, age, colour, size, price and period) but as you can see the order is jumbled up and not only that, some fields are missing too.
I have a general idea of what each column can possibly be as shown below:
main_dict =
'weight': ['heavy','light'],
'age': ['new','old'],
'colour': ['gold','silver','bronze'],
'size': ['xs','s','m','l','xl','xxl','xxxl'],
'price': ['$'],
'period': ['renaissance','baroque','rococo','neoclassical','romanticism']
Ideally I would like my Dataframe to look like the following:
df
feature1 feature2 unparsed_features weight price age
0 34 1 neoclassical, heavy, $2, old, bronze heavy $2 old
1 45 0 romanticism, gold, $5 $5
2 52 1 baroque, xs, $3, new $3 new
size colour period
0 bronze neoclassical
1 gold romanticism
2 xs baroque
I know the first step would be to split the string by comma but I am lost after that.
df['unparsed_features'].str.split(',')
Thank you for your help.
python string pandas dataframe
add a comment |
So I have a DataFrame which looks like this:
df = pd.DataFrame('feature1':[34,45,52],'feature2':[1,0,1],'unparsed_features':["neoclassical, heavy, $2, old, bronze", "romanticism, gold, $5", "baroque, xs, $3, new"])
df
feature1 feature2 unparsed_features
0 34 1 neoclassical, heavy, $2, old, bronze
1 45 0 romanticism, gold, $5
2 52 1 baroque, xs, $3, new
I am trying to split the column unparsed_features into 6 columns (weight, age, colour, size, price and period) but as you can see the order is jumbled up and not only that, some fields are missing too.
I have a general idea of what each column can possibly be as shown below:
main_dict =
'weight': ['heavy','light'],
'age': ['new','old'],
'colour': ['gold','silver','bronze'],
'size': ['xs','s','m','l','xl','xxl','xxxl'],
'price': ['$'],
'period': ['renaissance','baroque','rococo','neoclassical','romanticism']
Ideally I would like my Dataframe to look like the following:
df
feature1 feature2 unparsed_features weight price age
0 34 1 neoclassical, heavy, $2, old, bronze heavy $2 old
1 45 0 romanticism, gold, $5 $5
2 52 1 baroque, xs, $3, new $3 new
size colour period
0 bronze neoclassical
1 gold romanticism
2 xs baroque
I know the first step would be to split the string by comma but I am lost after that.
df['unparsed_features'].str.split(',')
Thank you for your help.
python string pandas dataframe
You may need create you own dict , for those certain item belong to which field
– Wen-Ben
Nov 13 '18 at 15:11
Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit):df.unparsed_features.str.split(",", expand=True)Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.
– Evan
Nov 13 '18 at 15:27
1
Possible duplicate of pandas: How do I split text in a column into multiple rows?
– Evan
Nov 13 '18 at 15:28
add a comment |
So I have a DataFrame which looks like this:
df = pd.DataFrame('feature1':[34,45,52],'feature2':[1,0,1],'unparsed_features':["neoclassical, heavy, $2, old, bronze", "romanticism, gold, $5", "baroque, xs, $3, new"])
df
feature1 feature2 unparsed_features
0 34 1 neoclassical, heavy, $2, old, bronze
1 45 0 romanticism, gold, $5
2 52 1 baroque, xs, $3, new
I am trying to split the column unparsed_features into 6 columns (weight, age, colour, size, price and period) but as you can see the order is jumbled up and not only that, some fields are missing too.
I have a general idea of what each column can possibly be as shown below:
main_dict =
'weight': ['heavy','light'],
'age': ['new','old'],
'colour': ['gold','silver','bronze'],
'size': ['xs','s','m','l','xl','xxl','xxxl'],
'price': ['$'],
'period': ['renaissance','baroque','rococo','neoclassical','romanticism']
Ideally I would like my Dataframe to look like the following:
df
feature1 feature2 unparsed_features weight price age
0 34 1 neoclassical, heavy, $2, old, bronze heavy $2 old
1 45 0 romanticism, gold, $5 $5
2 52 1 baroque, xs, $3, new $3 new
size colour period
0 bronze neoclassical
1 gold romanticism
2 xs baroque
I know the first step would be to split the string by comma but I am lost after that.
df['unparsed_features'].str.split(',')
Thank you for your help.
python string pandas dataframe
So I have a DataFrame which looks like this:
df = pd.DataFrame('feature1':[34,45,52],'feature2':[1,0,1],'unparsed_features':["neoclassical, heavy, $2, old, bronze", "romanticism, gold, $5", "baroque, xs, $3, new"])
df
feature1 feature2 unparsed_features
0 34 1 neoclassical, heavy, $2, old, bronze
1 45 0 romanticism, gold, $5
2 52 1 baroque, xs, $3, new
I am trying to split the column unparsed_features into 6 columns (weight, age, colour, size, price and period) but as you can see the order is jumbled up and not only that, some fields are missing too.
I have a general idea of what each column can possibly be as shown below:
main_dict =
'weight': ['heavy','light'],
'age': ['new','old'],
'colour': ['gold','silver','bronze'],
'size': ['xs','s','m','l','xl','xxl','xxxl'],
'price': ['$'],
'period': ['renaissance','baroque','rococo','neoclassical','romanticism']
Ideally I would like my Dataframe to look like the following:
df
feature1 feature2 unparsed_features weight price age
0 34 1 neoclassical, heavy, $2, old, bronze heavy $2 old
1 45 0 romanticism, gold, $5 $5
2 52 1 baroque, xs, $3, new $3 new
size colour period
0 bronze neoclassical
1 gold romanticism
2 xs baroque
I know the first step would be to split the string by comma but I am lost after that.
df['unparsed_features'].str.split(',')
Thank you for your help.
python string pandas dataframe
python string pandas dataframe
asked Nov 13 '18 at 14:56
user3374113user3374113
128416
128416
You may need create you own dict , for those certain item belong to which field
– Wen-Ben
Nov 13 '18 at 15:11
Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit):df.unparsed_features.str.split(",", expand=True)Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.
– Evan
Nov 13 '18 at 15:27
1
Possible duplicate of pandas: How do I split text in a column into multiple rows?
– Evan
Nov 13 '18 at 15:28
add a comment |
You may need create you own dict , for those certain item belong to which field
– Wen-Ben
Nov 13 '18 at 15:11
Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit):df.unparsed_features.str.split(",", expand=True)Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.
– Evan
Nov 13 '18 at 15:27
1
Possible duplicate of pandas: How do I split text in a column into multiple rows?
– Evan
Nov 13 '18 at 15:28
You may need create you own dict , for those certain item belong to which field
– Wen-Ben
Nov 13 '18 at 15:11
You may need create you own dict , for those certain item belong to which field
– Wen-Ben
Nov 13 '18 at 15:11
Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit):
df.unparsed_features.str.split(",", expand=True) Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.– Evan
Nov 13 '18 at 15:27
Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit):
df.unparsed_features.str.split(",", expand=True) Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.– Evan
Nov 13 '18 at 15:27
1
1
Possible duplicate of pandas: How do I split text in a column into multiple rows?
– Evan
Nov 13 '18 at 15:28
Possible duplicate of pandas: How do I split text in a column into multiple rows?
– Evan
Nov 13 '18 at 15:28
add a comment |
2 Answers
2
active
oldest
votes
Not sure there is an easy way to do it as the data in 'unparsed_features' does not have the same structure in each row. One way could be to use the dictionary main_dict you defined, loop over each item and use str.extract with the parameter pat a bit different for price:
for key, list_item in main_dict.items():
if key =='price':
df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
else:
df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')
$d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.
And you get as expected:
feature1 feature2 unparsed_features weight age
0 34 1 neoclassical, heavy, $2, old, bronze heavy old
1 45 0 romanticism, gold, $5
2 52 1 baroque, xs, $3, new new
colour size price period
0 bronze $2 neoclassical
1 gold $5 romanticism
2 xs $3 baroque
add a comment |
Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach
for keys in main_dict:
data_list =
for value in df.unparsed_features: # for every row
for l_data in main_dict[keys]:
if keys == 'price':
matching = [v for v in value.split(',') if l_data in v]
else:
matching = [v for v in value.split(',') if l_data == v.strip()]
if matching:
break
if matching:
data_list.append(matching[0])
else:
data_list.append(None)
matching = ''
df[keys] = data_list
Output
feature1 feature2 unparsed_features weight age
0 34 1 neoclassical, heavy, $2, old, bronze heavy old
1 45 0 romanticism, gold, $5 None None
2 52 1 baroque, xs, $3, new None new
colour size price period
0 bronze None $2 neoclassical
1 gold None $5 romanticism
2 None xs $3 baroque
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53283760%2fpandas-dataframe-split-string-into-multiple-columns-with-conditions-and-missing%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
Not sure there is an easy way to do it as the data in 'unparsed_features' does not have the same structure in each row. One way could be to use the dictionary main_dict you defined, loop over each item and use str.extract with the parameter pat a bit different for price:
for key, list_item in main_dict.items():
if key =='price':
df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
else:
df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')
$d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.
And you get as expected:
feature1 feature2 unparsed_features weight age
0 34 1 neoclassical, heavy, $2, old, bronze heavy old
1 45 0 romanticism, gold, $5
2 52 1 baroque, xs, $3, new new
colour size price period
0 bronze $2 neoclassical
1 gold $5 romanticism
2 xs $3 baroque
add a comment |
Not sure there is an easy way to do it as the data in 'unparsed_features' does not have the same structure in each row. One way could be to use the dictionary main_dict you defined, loop over each item and use str.extract with the parameter pat a bit different for price:
for key, list_item in main_dict.items():
if key =='price':
df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
else:
df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')
$d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.
And you get as expected:
feature1 feature2 unparsed_features weight age
0 34 1 neoclassical, heavy, $2, old, bronze heavy old
1 45 0 romanticism, gold, $5
2 52 1 baroque, xs, $3, new new
colour size price period
0 bronze $2 neoclassical
1 gold $5 romanticism
2 xs $3 baroque
add a comment |
Not sure there is an easy way to do it as the data in 'unparsed_features' does not have the same structure in each row. One way could be to use the dictionary main_dict you defined, loop over each item and use str.extract with the parameter pat a bit different for price:
for key, list_item in main_dict.items():
if key =='price':
df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
else:
df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')
$d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.
And you get as expected:
feature1 feature2 unparsed_features weight age
0 34 1 neoclassical, heavy, $2, old, bronze heavy old
1 45 0 romanticism, gold, $5
2 52 1 baroque, xs, $3, new new
colour size price period
0 bronze $2 neoclassical
1 gold $5 romanticism
2 xs $3 baroque
Not sure there is an easy way to do it as the data in 'unparsed_features' does not have the same structure in each row. One way could be to use the dictionary main_dict you defined, loop over each item and use str.extract with the parameter pat a bit different for price:
for key, list_item in main_dict.items():
if key =='price':
df[key] = df.unparsed_features.str.extract('($d+)').fillna('')
else:
df[key] = df.unparsed_features.str.extract('((^|W)' +'|(^|W)'.join(list_item) + ')').fillna('')
$d+ allows to look for any digit after the symbol $ and (^|W) look for a space or the beginning of the line before any word in list_item.
And you get as expected:
feature1 feature2 unparsed_features weight age
0 34 1 neoclassical, heavy, $2, old, bronze heavy old
1 45 0 romanticism, gold, $5
2 52 1 baroque, xs, $3, new new
colour size price period
0 bronze $2 neoclassical
1 gold $5 romanticism
2 xs $3 baroque
answered Nov 13 '18 at 15:40
Ben.TBen.T
6,1002825
6,1002825
add a comment |
add a comment |
Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach
for keys in main_dict:
data_list =
for value in df.unparsed_features: # for every row
for l_data in main_dict[keys]:
if keys == 'price':
matching = [v for v in value.split(',') if l_data in v]
else:
matching = [v for v in value.split(',') if l_data == v.strip()]
if matching:
break
if matching:
data_list.append(matching[0])
else:
data_list.append(None)
matching = ''
df[keys] = data_list
Output
feature1 feature2 unparsed_features weight age
0 34 1 neoclassical, heavy, $2, old, bronze heavy old
1 45 0 romanticism, gold, $5 None None
2 52 1 baroque, xs, $3, new None new
colour size price period
0 bronze None $2 neoclassical
1 gold None $5 romanticism
2 None xs $3 baroque
add a comment |
Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach
for keys in main_dict:
data_list =
for value in df.unparsed_features: # for every row
for l_data in main_dict[keys]:
if keys == 'price':
matching = [v for v in value.split(',') if l_data in v]
else:
matching = [v for v in value.split(',') if l_data == v.strip()]
if matching:
break
if matching:
data_list.append(matching[0])
else:
data_list.append(None)
matching = ''
df[keys] = data_list
Output
feature1 feature2 unparsed_features weight age
0 34 1 neoclassical, heavy, $2, old, bronze heavy old
1 45 0 romanticism, gold, $5 None None
2 52 1 baroque, xs, $3, new None new
colour size price period
0 bronze None $2 neoclassical
1 gold None $5 romanticism
2 None xs $3 baroque
add a comment |
Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach
for keys in main_dict:
data_list =
for value in df.unparsed_features: # for every row
for l_data in main_dict[keys]:
if keys == 'price':
matching = [v for v in value.split(',') if l_data in v]
else:
matching = [v for v in value.split(',') if l_data == v.strip()]
if matching:
break
if matching:
data_list.append(matching[0])
else:
data_list.append(None)
matching = ''
df[keys] = data_list
Output
feature1 feature2 unparsed_features weight age
0 34 1 neoclassical, heavy, $2, old, bronze heavy old
1 45 0 romanticism, gold, $5 None None
2 52 1 baroque, xs, $3, new None new
colour size price period
0 bronze None $2 neoclassical
1 gold None $5 romanticism
2 None xs $3 baroque
Frankly, W-B was correct, you need to modify your dict, but to solve with available data below is my approach
for keys in main_dict:
data_list =
for value in df.unparsed_features: # for every row
for l_data in main_dict[keys]:
if keys == 'price':
matching = [v for v in value.split(',') if l_data in v]
else:
matching = [v for v in value.split(',') if l_data == v.strip()]
if matching:
break
if matching:
data_list.append(matching[0])
else:
data_list.append(None)
matching = ''
df[keys] = data_list
Output
feature1 feature2 unparsed_features weight age
0 34 1 neoclassical, heavy, $2, old, bronze heavy old
1 45 0 romanticism, gold, $5 None None
2 52 1 baroque, xs, $3, new None new
colour size price period
0 bronze None $2 neoclassical
1 gold None $5 romanticism
2 None xs $3 baroque
answered Nov 13 '18 at 15:42
iamklausiamklaus
1,110149
1,110149
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53283760%2fpandas-dataframe-split-string-into-multiple-columns-with-conditions-and-missing%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
You may need create you own dict , for those certain item belong to which field
– Wen-Ben
Nov 13 '18 at 15:11
Hi, this question may help: stackoverflow.com/questions/17116814/… Answer (scroll down a bit):
df.unparsed_features.str.split(",", expand=True)Do you need more detail than that? You'll need to rename the columns manually after, but if you only have a couple of columns, that may be ok.– Evan
Nov 13 '18 at 15:27
1
Possible duplicate of pandas: How do I split text in a column into multiple rows?
– Evan
Nov 13 '18 at 15:28