Accessing unicode content from DataFrame returns unicode content with additional backslash in Python3





















I have a CSV file that contains tweets downloaded through an API. The tweets contain some Unicode characters, and I have a fairly good idea of how to decode them.



I load the CSV file into a DataFrame:



import pandas as pd

df = pd.read_csv('sample.csv', header=None)
columns = ['time', 'tweet']
df.columns = columns


One of the tweets is:



b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'


But when I access this tweet with the command

df['tweet'][0]



the output is returned in the following format:



"b'RT : This little girl dressed as her father for Halloween, a employee \\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x91\\x8c (via ) '"


I cannot figure out why this extra backslash is being added to the tweet. As a result, the content cannot be decoded. Below are the first few rows of the DataFrame.



                  time  tweet
0  2018-11-02 05:55:46  b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
1  2018-11-02 05:46:41  b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
2  2018-11-02 03:44:35  b'Like, you could use a line map that just shows the whole thing instead of showing a truncated map that\xe2\x80\x99s confusing.\xe2\x80\xa6 (via )
3  2018-11-02 03:37:03  b' service is a joke. No service northbound No service northbound from Navy Yard after a playoff game at 11:30pm. And they\xe2\x80\xa6'


[Screenshot of 'sample.csv']



As I mentioned before, if any of these tweets is accessed directly, an extra backslash appears in the output.



Can anyone please explain why this is happening and how to avoid it?



Thanks.

































  • Show some sample lines from the original .CSV. It looks like it was written incorrectly in the first place. If you wrote the CSV, you might ask a new question about how to read from the API and write it to the CSV correctly. This looks like an XY Problem.
    – Mark Tolonen
    Nov 10 at 5:38















python-3.x pandas dataframe twitter unicode














edited Nov 10 at 8:17

























asked Nov 9 at 21:37









Nakul Sharma

























1 Answer
accepted










You did not show the contents of your CSV file, but it looks like whoever created it recorded the string representation of the bytes object as it came from Twitter. That is, inside the CSV file itself you will find the literal b'\xff...' characters.
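A minimal sketch of how such a file could come about, assuming the writer called str() on the raw bytes (the question does not show the writing code, so this is a guess). The sample text, the timestamp, and the in-memory buffer standing in for the file are all hypothetical.

```python
import csv
import io

# Bytes as an API client might hand them over (hypothetical sample).
raw = "a employee \U0001F602".encode("utf-8")

buffer = io.StringIO()  # stands in for the CSV file on disk
writer = csv.writer(buffer)
writer.writerow(["2018-11-02 05:55:46", str(raw)])             # stores the repr text
writer.writerow(["2018-11-02 05:55:46", raw.decode("utf-8")])  # stores the real text

rows = list(csv.reader(io.StringIO(buffer.getvalue())))
bad_cell, good_cell = rows[0][1], rows[1][1]

print(bad_cell)          # the b'...' text, backslashes included
print(ascii(good_cell))  # ascii() keeps the emoji printable on any terminal
```

Decoding before writing avoids the problem at the source; the first row shows the kind of cell this answer repairs afterwards.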



So when you read it in Python, what looks like a bytes object when printed (the kind represented as b'...') is actually a string whose content is that representation.
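The following sketch illustrates where the "extra" backslash comes from, using a hypothetical cell value. The data holds one backslash per escape; echoing the value in the interpreter shows its repr(), which escapes each backslash again.

```python
# A str that merely looks like a bytes literal (hypothetical sample value).
cell = r"b'employee \xf0\x9f\x98\x82'"

print(type(cell).__name__)  # the cell is a str, not bytes
print(cell)                 # print() shows the data as stored
print(repr(cell))           # repr() escapes every backslash, doubling them

# Only the display differs; the stored data is unchanged.
backslashes_in_data = cell.count("\\")
backslashes_in_repr = repr(cell).count("\\")
print(backslashes_in_data, backslashes_in_repr)
```

So df['tweet'][0] in an interactive session displays the repr, while print(df['tweet'][0]) would show the single backslashes that are really in the cell.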



One way to turn these back into proper strings is to let Python evaluate their content; they then become valid bytes objects, which can be decoded into text. It is always a good idea to use ast.literal_eval for this, as eval is too arbitrary: it can execute any code found in the data.



So, after you have your data loaded into your dataframe, this could fix your tweets column:



import ast

df['tweet'] = df['tweet'].map(lambda x: ast.literal_eval(x).decode('utf-8') if x.startswith("b'") else x)
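If the column may also contain missing values or truncated rows, a slightly more defensive variant could look like the sketch below. The helper name decode_bytes_repr and the sample cells are hypothetical, and replacing invalid UTF-8 rather than raising is an assumption about what is wanted.

```python
import ast


def decode_bytes_repr(value):
    """Turn text shaped like b'...' back into the str it encodes."""
    # Pass through anything that is not such text (NaN, numbers, clean tweets).
    if not isinstance(value, str) or not value.startswith(("b'", 'b"')):
        return value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # malformed or truncated literal: leave it untouched
    if not isinstance(parsed, bytes):
        return value
    # errors="replace" substitutes U+FFFD for invalid UTF-8 instead of raising.
    return parsed.decode("utf-8", errors="replace")


# Hypothetical cells standing in for the 'tweet' column.
cells = [r"b'a employee \xf0\x9f\x98\x82 (via )'", "plain text", float("nan"), "b'cut off"]
cleaned = [decode_bytes_repr(cell) for cell in cells]
print([ascii(c) for c in cleaned])  # ascii() keeps the emoji printable anywhere
```

The same function can be passed to the column directly, as in df['tweet'] = df['tweet'].map(decode_bytes_repr).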



























  • Thank you so much @jsbueno. Your solution worked like a charm.
    – Nakul Sharma
    Nov 10 at 8:11










  • Since you asked, I wanted to mention that the content was the same as in the CSV file; please see my edited post. It now includes a screenshot of the sample.csv file. Thank you once again.
    – Nakul Sharma
    Nov 10 at 8:16










edited Nov 10 at 12:59

























answered Nov 9 at 23:03









jsbueno











