How do I handle new line characters in my sentences? - spacy NER

I'm trying to train spaCy NER. I collected all my sentences, but many contain embedded newline characters ('\n'), so when I paste the training data into my Jupyter notebook it fails with an error:

 TRAIN_DATA = [('Who is 
 ^
SyntaxError: EOL when scanning string literal

What should I do with these?

The data looks like this:

TRAIN_DATA = [('Who is 
Shaka Khan?', {'entities': [(7, 17, 'PERSON')]}),

edited Nov 10 at 7:50 by Yaman Jain
asked Nov 10 at 4:32 by erotavlas

python spacy


1 Answer
accepted

Jupyter

If the problem is in Jupyter, a string that spans several lines needs to be wrapped in triple quotes (''' or """), like this:

string=""" This string has many lines
 that continues here
 and here """

In your case that would be

TRAIN_DATA = [('''Who is 
Shaka Khan?''', {'entities': [(7, 17, 'PERSON')]})]

Correct me if I'm wrong, but it looks like you copy-pasted the data, which is how this can happen. You could resolve the issue within Jupyter simply by deleting the newline. Alternatively, I would suggest importing the data into Jupyter rather than copy-pasting it.
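As a rough illustration of importing the data instead of pasting it, here is a minimal sketch that assumes the exporting program can write JSON. The file name train_data.json, the record layout, and the offsets (8, 18) are made up for this example and are not taken from the question.

```python
import json
import os
import tempfile

# Hypothetical export: the file name and record layout are assumptions.
records = [
    {"text": "Who is \nShaka Khan?", "entities": [[8, 18, "PERSON"]]},
]

# Write the file here only so the example is self-contained;
# in practice the exporting program would produce it.
path = os.path.join(tempfile.mkdtemp(), "train_data.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f)  # the newline is stored as the escape sequence \n

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

# Rebuild the (text, {"entities": [...]}) tuples used in the question.
TRAIN_DATA = [
    (rec["text"], {"entities": [tuple(ent) for ent in rec["entities"]]})
    for rec in loaded
]
```

Because the newline never appears inside a Python string literal, the SyntaxError cannot occur, and the text reaches the notebook unchanged.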

Remove newline character

If you want to remove the newlines within a string, there are many options. Here is one:

import re
string = re.sub('\n', '', string)

Explanation

Line 1: imports the regex module.

Line 2: uses the method 'sub', which substitutes every match of the first argument '\n' with '' in string.

out:
' This string has many lines that continues here and here '
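Two of the other options use only built-in string methods. This is a small sketch; the "\r\n" in the sample text is an assumption about what a Windows program might emit, not something stated in the question.

```python
text = " This string has many lines\r\n that continues here\n and here "

# str.replace needs no regex; remove Windows-style "\r\n" before bare "\n".
no_newlines = text.replace("\r\n", "").replace("\n", "")

# splitlines() recognises "\n", "\r\n" and other line boundaries in one pass.
joined = "".join(text.splitlines())
```

Both variables end up holding the same single-line string as the regex version above.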

I'm guessing that you might be using pandas, so to do this on a column you can do the following:

df[col_name] = df[col_name].str.replace('\n', '')

edited Nov 10 at 5:55
answered Nov 10 at 5:47 by Philip

The original text contains the newline characters. I can easily remove them, but if I do so I would have to replace each one with another character of equivalent length, probably whitespace, in order to maintain the start and end index of the entity. Will removing the original newline have any impact on the model generated? Also, I generated the train data from a C# program, so I may have to tweak it a bit to output the triple quotes if I go that route.
– erotavlas
Nov 10 at 17:33

I think it depends on what you are trying to achieve. If you are trying to analyze sentences, I would suggest replacing newline characters with whitespace, unless the newline carries a meaning. In that case, replace it with an end-of-sentence character that can be used when you are tokenizing the sentence. My guess is that removing the newline character will not have an impact on the generated model. However, you can easily test it afterwards.
– Philip
Nov 12 at 13:27
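The length-preserving replacement discussed in these comments could be sketched as follows. The helper name replace_newlines_keep_offsets and the sample offsets (8, 18) are hypothetical and chosen for this example only.

```python
def replace_newlines_keep_offsets(text, entities):
    """Swap each "\r" or "\n" for one space so character offsets stay valid."""
    cleaned = text.replace("\r", " ").replace("\n", " ")
    # One character in, one character out: the length must not change.
    assert len(cleaned) == len(text)
    return cleaned, {"entities": list(entities)}


text = "Who is \nShaka Khan?"
entities = [(8, 18, "PERSON")]  # offsets computed for this sample string

cleaned, annotations = replace_newlines_keep_offsets(text, entities)

# Slice the cleaned text with the unchanged offsets to check the spans.
spans = [cleaned[start:end] for start, end, _ in annotations["entities"]]
```

Since every replaced character becomes exactly one space, the existing start and end indices still point at the same entity text. Whether the resulting runs of spaces affect training is something to check empirically, as suggested above.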
