How to recode string variables with multiple values in one cell in Python
I have a very large data set in which I need to convert most of the columns to dummy variables. Some cells contain more than one value, for example:
system/component failure
bird/Medical/other
I tried encoding them with pd.get_dummies(), which works, but it gives me a large and very sparse data set, because every distinct cell that holds more than one value is turned into its own column.
Here is an example of one of the columns that needs to be encoded:
Security related,
Other,
Undershoot/overshoot,
Airprox/TCAS alert/loss of separation/(near) midair collisions,
Undershoot/overshoot,
OtherSystem/component failure or malfunction [non-powerplant],
Airprox/TCAS alert/loss of separation/(near) midair collisions,
Security related,
System/component failure or malfunction [non-powerplant]Undershoot/overshoot,
Navigation Errors,
AerodromeOther,
.
.
.
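My code is as follows:
import pandas as pd
import numpy as np

# Load the data set, reading the file as ISO-8859-1
mydata = pd.read_csv("C:/Users/AP22630/Desktop/Project Dataset/Final_Data_Frame.csv", encoding="ISO-8859-1")
# Select the column at position 18 and one-hot encode it
b = mydata.iloc[:, 18]
b = pd.get_dummies(b)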
When I run pd.get_dummies() on this column, I get 277 columns back. For the two example cells above, the result looks like this:
                          system/component failure  bird/Medical/other
system/component failure                         1                   0
bird/Medical/other                               0                   1
This happens because I end up with the columns system, component and system/component failure, as well as bird, medical, other and bird/medical/other. Each of these values (system, component, bird, medical, other) appears on its own in most cells, but in some cells they are merged.
Is it possible to get this output instead:
                          system  component  bird  medical  other
system/component failure       1          1     0        0      0
bird/Medical/other             0          0     1        1      1
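For the two-cell example above, one possible sketch uses pandas' Series.str.get_dummies, which splits each cell on a separator and builds one indicator column per part. It assumes that "/" really is the separator between values and that case differences ("Medical" vs "medical") should be ignored; the names cells and dummies are only illustrative.

```python
import pandas as pd

# Toy column copied from the question; the real data would come from the CSV.
cells = pd.Series(["system/component failure", "bird/Medical/other"])

# Lowercase first so "Medical" and "medical" share one column (an assumption),
# then split every cell on "/" and build one 0/1 column per distinct part.
dummies = cells.str.lower().str.get_dummies(sep="/")

print(dummies)
```

Note that this yields a column named "component failure" rather than "component", because the split happens only on "/". It also would not suit labels that contain "/" themselves.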
Or, if this is completely the wrong approach, how can I fix this problem?
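In the sample column, some labels contain "/" themselves and merged cells seem to join labels with no separator at all (for example "AerodromeOther"), so splitting on "/" may not be enough. A minimal sketch of an alternative, assuming the full set of base labels is known in advance: test each cell for each label with a literal substring match. KNOWN_LABELS and encode_known_labels are hypothetical names, and the label list is only guessed from the sample rows.

```python
import pandas as pd

# Hypothetical master list of base labels, guessed from the sample rows above.
KNOWN_LABELS = [
    "Security related",
    "Other",
    "Undershoot/overshoot",
    "Airprox/TCAS alert/loss of separation/(near) midair collisions",
    "System/component failure or malfunction [non-powerplant]",
    "Navigation Errors",
    "Aerodrome",
]


def encode_known_labels(column, labels):
    """Return one 0/1 column per label, set where the cell contains that label."""
    text = column.fillna("")
    return pd.DataFrame(
        # regex=False makes "/", "[" and "(" match literally.
        {label: text.str.contains(label, regex=False).astype(int) for label in labels},
        index=column.index,
    )


# A few cells from the sample column, without the trailing commas.
sample = pd.Series([
    "Security related",
    "OtherSystem/component failure or malfunction [non-powerplant]",
    "System/component failure or malfunction [non-powerplant]Undershoot/overshoot",
    "AerodromeOther",
])
encoded = encode_known_labels(sample, KNOWN_LABELS)
print(encoded.sum(axis=1).tolist())  # number of labels found in each cell
```

This gives one column per base label instead of one per distinct combination. Substring matching is case-sensitive and would misfire if one label were contained in another, so the label list would need checking against the real data.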
python python-3.x pandas
Welcome to Stack Overflow. Please take the tour and read the help pages. How to ask good questions and this question checklist may be helpful. Users here are much more ready to help if you provide a minimal, complete, and verifiable example with some input and the desired output.
– hiro protagonist
Nov 9 at 20:14
I've edited your question to be a bit better formatted, and to have a few more relevant tags. Please add more information though, as we still don't know nearly enough about your situation to help you. I had to guess that you're using pandas, but you should spell out in more detail what libraries you're using to process your data, and you should show some of the code you are using. We can't read your mind to understand what your data looks like or how you're getting it; you need to tell us!
– Blckknght
Nov 9 at 22:34
Thanks for your comment. I will try to improve the quality of my question.
– homa mohammadpour sadigh
Nov 12 at 15:26
edited Nov 13 at 17:10
asked Nov 9 at 20:10
homa mohammadpour sadigh