How to recode string variables with multiple values in one cell in Python










I have a very large data set in which I need to convert most of the columns to dummy variables. Some cells contain more than one value, for example:



system/component failure
bird/Medical/other


I tried encoding them with pd.get_dummies(), which works, but it gives me a large and very sparse data set, because every distinct cell that holds more than one value is turned into its own column.
Here is an example of one of the columns that needs to be encoded:



Security related,
Other,
Undershoot/overshoot,
Airprox/TCAS alert/loss of separation/(near) midair collisions,
Undershoot/overshoot,
OtherSystem/component failure or malfunction [non-powerplant],
Airprox/TCAS alert/loss of separation/(near) midair collisions,
Security related,
System/component failure or malfunction [non-powerplant]Undershoot/overshoot,
Navigation Errors,
AerodromeOther,
.
.
.



My code is as follows:



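import pandas as pd
import numpy as np

# Load the data set; the file is ISO-8859-1 (Latin-1) encoded.
mydata = pd.read_csv("C:/Users/AP22630/Desktop/Project Dataset/Final_Data_Frame.csv", encoding="ISO-8859-1")
# Select the column at position 18 and one-hot encode it.
b = mydata.iloc[:, 18]
b = pd.get_dummies(b)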


When I run pd.get_dummies() on that column, I get 277 columns in return. For the two example cells, the result looks like this:



                          system/component failure  bird/Medical/other
system/component failure                         1                   0
bird/Medical/other                               0                   1


In this case I end up with the columns (system, component, system/component failure) and (bird, medical, other, bird/medical/other), because each of the values (system, component, bird, medical, other) appears on its own in most cells and merged with others in some cells.
Is it possible to get this output instead?



                          system  component  bird  medical  other
system/component failure       1          1     0        0      0
bird/Medical/other             0          0     1        1      1


Or, if this is a completely wrong approach, how can I fix this problem?
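For the two example cells, the kind of output asked for can be sketched with pandas' Series.str.get_dummies, which splits every cell on a separator and builds one indicator column per part. This is only a minimal sketch under assumptions: it treats "/" as the separator, lower-cases the text so that "Medical" and "medical" share a column, and uses a variable name (cells) that is made up for illustration. Note that it splits "system/component failure" into "system" and "component failure", so the column names differ slightly from the table above.

```python
import pandas as pd

# Toy data: the two example cells from the question.
cells = pd.Series(["system/component failure", "bird/Medical/other"])

# Lower-case first so differently cased spellings share one column,
# then split each cell on "/" and create one 0/1 column per part.
dummies = cells.str.lower().str.get_dummies(sep="/")

# Label the rows with the original cell text for readability.
dummies.index = cells

print(dummies)
```

Each row now has a 1 in every column whose value occurs in that cell, so the number of columns grows with the number of distinct parts rather than with the number of distinct combinations.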










  • Welcome to Stack Overflow. Please take the tour and read the help pages. How to ask good questions and this question checklist may be helpful. Users here are far more ready to help if you provide a minimal, complete, and verifiable example with some input and the desired output.
    – hiro protagonist
    Nov 9 at 20:14










  • I've edited your question to be a bit better formatted and to have a few more relevant tags. Please add more information, though, as we still don't know nearly enough about your situation to help you. I had to guess that you're using pandas, but you should spell out in more detail which libraries you're using to process your data, and you should show some of the code you are using. We can't read your mind to understand what your data looks like or how you're getting it; you need to tell us!
    – Blckknght
    Nov 9 at 22:34










  • Thanks for your comment. I will try to improve the quality of my question.
    – homa mohammadpour sadigh
    Nov 12 at 15:26
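In the sample column from the question, some labels contain "/" themselves (for example "Undershoot/overshoot") and merged cells join labels without any separator (for example "AerodromeOther"), so splitting on "/" would not recover the labels there. One possible alternative, sketched below, is to test each cell for every known label. The list of labels is an assumption inferred from the sample only, and the names cells, labels and multi_hot are hypothetical.

```python
import pandas as pd

# Sample cells copied from the column shown in the question.
cells = pd.Series([
    "Security related",
    "Other",
    "Undershoot/overshoot",
    "Airprox/TCAS alert/loss of separation/(near) midair collisions",
    "OtherSystem/component failure or malfunction [non-powerplant]",
    "System/component failure or malfunction [non-powerplant]Undershoot/overshoot",
    "Navigation Errors",
    "AerodromeOther",
])

# ASSUMPTION: the base labels are known in advance. This list is inferred
# from the sample above and is hypothetical for the full data set.
labels = [
    "Security related",
    "Other",
    "Undershoot/overshoot",
    "Airprox/TCAS alert/loss of separation/(near) midair collisions",
    "System/component failure or malfunction [non-powerplant]",
    "Navigation Errors",
    "Aerodrome",
]

# One 0/1 column per label: 1 if the label occurs anywhere in the cell.
# regex=False treats "[", "(" and "/" as literal characters;
# na=False maps missing cells to 0 instead of NaN.
multi_hot = pd.DataFrame(
    {
        label: cells.str.contains(label, regex=False, na=False).astype(int)
        for label in labels
    }
)

print(multi_hot)
```

This gives one column per base label instead of one per distinct combination. Plain substring matching is case-sensitive and would over-match if one label were contained in another, so the label list should be checked against the real data first.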














python python-3.x pandas






edited Nov 13 at 17:10
asked Nov 9 at 20:10
homa mohammadpour sadigh
























