Accessing unicode content from DataFrame returns unicode content with additional backslash in Python3
I have a CSV file containing tweets downloaded through the API. The tweets contain some Unicode characters, and I have a fairly good idea of how to decode them.
I load the CSV file into a DataFrame:
import pandas as pd
df = pd.read_csv('sample.csv', header=None)
columns = ['time', 'tweet']
df.columns = columns
One of the tweets is:
b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
But when I access this tweet with the command:
df['tweet'][0]
the output is returned in the following format:
"b'RT : This little girl dressed as her father for Halloween, a employee \\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x91\\x8c (via ) '"
I cannot figure out why this extra backslash is being added to the tweet. As a result, the content is not being decoded. Below are a few rows from the DataFrame:
   time                 tweet
0  2018-11-02 05:55:46  b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
1  2018-11-02 05:46:41  b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
2  2018-11-02 03:44:35  b'Like, you could use a line map that just shows the whole thing instead of showing a truncated map that\xe2\x80\x99s confusing.\xe2\x80\xa6 (via )
3  2018-11-02 03:37:03  b' service is a joke. No service northbound No service northbound from Navy Yard after a playoff game at 11:30pm. And they\xe2\x80\xa6'
[Screenshot of 'sample.csv']
As I mentioned above, if any of these tweets is accessed directly, an extra backslash appears in the output.
Can anyone please explain why this is happening and how to avoid it?
Thanks.
python-3.x pandas dataframe twitter unicode
Show some sample lines from the original .CSV. It looks like it was written incorrectly in the first place. If you wrote the CSV, you might ask a new question about how to read from the API and write it to the CSV correctly. This looks like an XY Problem.
– Mark Tolonen
Nov 10 at 5:38
edited Nov 10 at 8:17
asked Nov 9 at 21:37
Nakul Sharma
265
1 Answer (accepted)
You did not show the contents of your CSV file, but it looks like whoever created it recorded the string representation of each bytes object as it came from Twitter. That is, inside the CSV file itself you will find the literal b'\xff...' characters.
So when you read the file from Python, the values look like bytes objects when printed (the ones represented as b'...'), but they are actually strings whose content is that representation. The extra backslash is only how Python displays a literal backslash when it shows the repr of a string; it is not added to the data.
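To make the point concrete, here is a minimal sketch using only the standard library. The value of `cell` is a made-up stand-in for what `read_csv` would hand back for one tweet, assuming the file really stores the text of a bytes repr; it is not taken from your file.

```python
# Hypothetical cell value: the text stored in the CSV, i.e. the characters
# b, ', backslash, x, f, 0, ... and not the four raw bytes of an emoji.
cell = r"b'\xf0\x9f\x98\x82'"

# The cell is an ordinary str, even though it looks like a bytes literal.
print(type(cell).__name__)

# print() shows the characters exactly as stored: one backslash per escape.
print(cell)

# repr() is what the interactive prompt shows for df['tweet'][0]. It escapes
# every literal backslash, so each one is displayed doubled.
print(repr(cell))

# The number of backslashes actually stored in the string does not change.
stored_backslashes = cell.count("\\")
shown_backslashes = repr(cell).count("\\")
```

Comparing `stored_backslashes` with `shown_backslashes` shows that the doubling exists only in the displayed form.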
One way to get these back as proper strings is to let Python evaluate their content: they then become valid bytes objects, which can be decoded into text. Prefer ast.literal_eval for this, since eval will execute arbitrary code.
So, after you have loaded your data into the DataFrame, this could fix your tweet column:
import ast
df['tweet'] = df['tweet'].map(lambda x: ast.literal_eval(x).decode('utf-8') if x.startswith("b'") else x)
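If the column may also contain missing values or rows that are not well-formed literals (for instance a row whose closing quote was lost), a slightly more defensive variant could look like the sketch below. The helper name `decode_bytes_repr` and the choice of `errors="replace"` are my own assumptions for illustration, not part of pandas or of the one-liner above.

```python
import ast


def decode_bytes_repr(value):
    """Return the text encoded by a bytes-literal string, or the value unchanged."""
    # Leave missing values (NaN) and ordinary text untouched.
    if not isinstance(value, str) or not value.startswith(("b'", 'b"')):
        return value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a well-formed literal, e.g. a row with a missing closing quote.
        return value
    if not isinstance(parsed, bytes):
        return value
    # Assumption: the original bytes were UTF-8; undecodable bytes are replaced.
    return parsed.decode("utf-8", errors="replace")


# Hypothetical sample in the same shape as the tweets in the question.
sample = r"b'employee \xf0\x9f\x98\x82\xf0\x9f\x91\x8c'"
decoded = decode_bytes_repr(sample)
print(decoded)
```

With the DataFrame from the question, it would be applied as `df['tweet'] = df['tweet'].map(decode_bytes_repr)`.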
Thank you so much @jsbueno. Your solution worked like a charm.
– Nakul Sharma
Nov 10 at 8:11
Since you asked, I wanted to mention that the content was the same as that of the CSV file; please see my edited post. It now includes a screenshot of the sample.csv file. Thank you once again.
– Nakul Sharma
Nov 10 at 8:16
edited Nov 10 at 12:59
answered Nov 9 at 23:03
jsbueno
54.4k673124