Accessing unicode content from DataFrame returns unicode content with additional backslash in Python3

I have a CSV file containing some tweets downloaded through the API. The tweets contain some Unicode characters, and I have a fairly good idea of how to decode them.

I loaded the CSV file into a DataFrame:

import pandas as pd

df = pd.read_csv('sample.csv', header=None)
columns = ['time', 'tweet']
df.columns = columns

One of the tweets is:

b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'

But when I access this tweet with the command

df['tweet'][0]

the output is returned in the following format:

"b'RT : This little girl dressed as her father for Halloween, a employee \\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x91\\x8c (via ) '"

I cannot figure out why this extra backslash is being added to the tweet. As a result, the content cannot be decoded. Below are a few rows from the DataFrame.

   time                 tweet
0  2018-11-02 05:55:46  b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
1  2018-11-02 05:46:41  b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
2  2018-11-02 03:44:35  b'Like, you could use a line map that just shows the whole thing instead of showing a truncated map that\xe2\x80\x99s confusing.\xe2\x80\xa6 (via )
3  2018-11-02 03:37:03  b' service is a joke. No service northbound No service northbound from Navy Yard after a playoff game at 11:30pm. And they\xe2\x80\xa6'

[Image: screenshot of 'sample.csv']

As I mentioned before, whenever any of these tweets is accessed directly, an extra backslash appears in the output.

Can anyone please explain why this is happening and how to avoid it?

Thanks.

edited Nov 10 at 8:17

asked Nov 9 at 21:37

Nakul Sharma

265

Show some sample lines from the original .CSV. It looks like it was written incorrectly in the first place. If you wrote the CSV, you might ask a new question about how to read from the API and write it to the CSV correctly. This looks like an XY Problem.
– Mark Tolonen
Nov 10 at 5:38
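
To illustrate the comment's point about how the CSV was written, here is a minimal sketch of what probably happened and how it can be avoided. It assumes the API delivered each tweet as UTF-8 bytes; `raw_tweet` and `write_rows` are hypothetical names made up for this example, and the CSV is written to memory rather than to sample.csv.

```python
import csv
import io

# Hypothetical payload standing in for UTF-8 bytes received from an API.
raw_tweet = b"a employee \xf0\x9f\x98\x82 (via )"


def write_rows(rows):
    """Write (time, tweet) rows to an in-memory CSV and return its text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()


# csv.writer calls str() on non-string values, so a bytes object is stored
# as the text of its repr, escape sequences and all.
broken = write_rows([("2018-11-02 05:55:46", raw_tweet)])

# Decoding before writing stores the actual text instead.
fixed = write_rows([("2018-11-02 05:55:46", raw_tweet.decode("utf-8"))])

print(broken)
print(fixed)
```

If the file is produced this way, decoding with `.decode("utf-8")` before writing means nothing needs to be repaired after `pd.read_csv`.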

python-3.x pandas dataframe twitter unicode


1 Answer
accepted

You did not show the contents of your CSV file, but it looks like whoever created it recorded the string representation of the bytes object as it came from Twitter. That is, inside the CSV file itself, you will find the literal b'\xff...' characters.

So, when you read it from Python, the values look like bytes objects when printed (the ones represented with b'...'), but they are actually strings whose content is that representation.
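
A small sketch of that distinction, using a made-up four-byte value (the UTF-8 encoding of one of the emoji in the tweets) instead of the real file. It also suggests where the "extra" backslash comes from: it is only how Python displays a string that contains a literal backslash.

```python
# A real bytes object: four bytes of UTF-8.
real_bytes = b"\xf0\x9f\x98\x82"

# What the CSV most likely holds: the *text* of that object's repr.
stored = str(real_bytes)

print(type(stored).__name__)  # str
print(len(real_bytes))        # 4
print(len(stored))            # 19: b, two quotes, and four 4-character escapes
print(stored)                 # b'\xf0\x9f\x98\x82'
print(repr(stored))           # "b'\\xf0\\x9f\\x98\\x82'"
```

The string itself contains a single backslash before each `x`; the doubled form only shows up in the `repr`, which is what the interactive prompt displays for `df['tweet'][0]`.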

One way to turn these back into proper strings is to let Python evaluate their content: they then become valid bytes objects, which can be decoded into text. It is always a good idea to use ast.literal_eval for this, as eval will execute arbitrary code.
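
As a rough illustration of why ast.literal_eval is the safer choice, the sketch below parses one bytes representation and then tries a string that contains a function call. The sample strings are invented for the example.

```python
import ast

# A str holding the repr of a bytes object, as it would come out of the CSV.
text = "b'they\\xe2\\x80\\xa6'"

value = ast.literal_eval(text)
print(type(value).__name__)   # bytes
print(value.decode("utf-8"))  # they… (ends with U+2026, horizontal ellipsis)

# literal_eval only accepts Python literals, so an expression such as a
# function call is rejected instead of being executed.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True
print("call rejected:", rejected)
```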

So, after you have your data loaded into your dataframe, this could fix your tweets column:

import ast

df['tweet'] = df['tweet'].map(lambda x: ast.literal_eval(x).decode('utf-8') if x.startswith("b'") else x)
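
The one-liner above assumes every cell is a string. A slightly more defensive variant, offered only as a sketch, is shown here on a tiny hand-made DataFrame; `decode_bytes_repr` is a hypothetical helper name, and the third row (an empty tweet) is invented to show how missing values could be handled.

```python
import ast

import pandas as pd


def decode_bytes_repr(value):
    """Return decoded text if value looks like the repr of a bytes object."""
    if not isinstance(value, str):
        return value  # e.g. a missing value from an empty CSV cell
    if not value.startswith(("b'", 'b"')):
        return value  # already plain text
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # truncated or otherwise unparseable row
    if not isinstance(parsed, bytes):
        return value
    try:
        return parsed.decode("utf-8")
    except UnicodeDecodeError:
        return value  # bytes that are not valid UTF-8


# Hand-made sample data; the third row is hypothetical.
df = pd.DataFrame(
    {
        "time": ["2018-11-02 05:55:46", "2018-11-02 03:37:03", "2018-11-02 03:30:00"],
        "tweet": [
            "b'a employee \\xf0\\x9f\\x98\\x82 (via )'",
            "b'And they\\xe2\\x80\\xa6'",
            None,
        ],
    }
)
df["tweet"] = df["tweet"].map(decode_bytes_repr)
print(df["tweet"].tolist())
```

Rows that cannot be parsed are left unchanged rather than raising, which may or may not be what you want; dropping the `try` blocks makes bad rows fail loudly instead.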

edited Nov 10 at 12:59

answered Nov 9 at 23:03

jsbueno

54.4k673124

Thank you so much @jsbueno. Your solution worked like a charm.
– Nakul Sharma
Nov 10 at 8:11

Since you asked, I wanted to mention that the content was the same as in the CSV file; please see my edited post. It now includes a screenshot of the sample.csv file. Thank you once again.
– Nakul Sharma
Nov 10 at 8:16


