How do I handle new line characters in my sentences? - spacy NER

I'm trying to train spaCy NER and have collected all my sentences, but many contain embedded newline characters ('\n'). When I enter the training data into my Jupyter notebook, it fails with an error:
TRAIN_DATA = [('Who is
^
SyntaxError: EOL when scanning string literal
What should I do with these?
The data looks like this:
TRAIN_DATA = [('Who is
Shaka Khan?', {'entities': [(7, 17, 'PERSON')]}),
python spacy
edited Nov 10 at 7:50 by Yaman Jain
asked Nov 10 at 4:32 by erotavlas
1 Answer
accepted
Jupyter
If the problem is in Jupyter, you need to wrap strings that span several lines in triple quotes (''' or """), like this:
string=""" This string has many lines
that continues here
and here """
In your case that would be:
TRAIN_DATA = [('''Who is
Shaka Khan?''', {'entities': [(7, 17, 'PERSON')]})]
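As a quick check (a sketch added for illustration, not part of the original answer), the triple-quoted literal should be the same string as a one-line literal that uses the \n escape sequence, and the newline counts as one character in the entity offsets. The variable names below are made up for this example.

```python
# Triple-quoted literal with a real line break inside it.
text_triple = '''Who is
Shaka Khan?'''

# One-line literal that spells the line break with the \n escape.
text_escaped = 'Who is\nShaka Khan?'

# Both literals produce the same string.
same = text_triple == text_escaped

# The newline sits at index 6, so the entity span (7, 17) is unchanged.
entity = text_triple[7:17]
print(same, repr(entity))
```

If the generator of the training data can emit \n escapes, the one-line form avoids the SyntaxError without needing triple quotes.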
Correct me if I'm wrong, but it looks like you've copy-pasted the data, which is why this can happen. You could resolve the issue within Jupyter by simply deleting the newline. Alternatively, I would suggest loading the data into Jupyter from a file instead of copy-pasting it.
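One way to avoid copy-pasting might be to have the exporting program write the data as JSON and load it in the notebook. The sketch below assumes a hypothetical file name (train_data.json) and a hypothetical layout of [text, {"entities": [[start, end, label]]}] pairs; neither comes from the question.

```python
import json
import os
import tempfile

# Hypothetical export: JSON encodes the line break as \n inside the string.
exported = '[["Who is\\nShaka Khan?", {"entities": [[7, 17, "PERSON"]]}]]'

# Write it to a temporary file to stand in for the exporter's output.
path = os.path.join(tempfile.mkdtemp(), "train_data.json")
with open(path, "w", encoding="utf-8") as f:
    f.write(exported)

# Load the file; json.load turns the \n escape back into a real newline.
with open(path, encoding="utf-8") as f:
    raw = json.load(f)

# Convert to the (text, annotations) tuple layout used in the question.
TRAIN_DATA = [
    (text, {"entities": [tuple(ent) for ent in ann["entities"]]})
    for text, ann in raw
]
print(TRAIN_DATA)
```

Because the text never passes through a Python string literal, no quoting or escaping has to be done by hand.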
Remove newline character
If you want to remove the newlines within a string, there are many options. Here is one:
import re
string = re.sub('\n', '', string)
Explanation
- Line 1: imports the regex module.
- Line 2: uses the method 'sub', which replaces every match of the first argument '\n' with the second argument '' in string.
out:
' This string has many linesthat continues hereand here '
I'm guessing that you might be using pandas, so to do this on a column you can do the following:
df[col_name] = df[col_name].str.replace('\n', '')
The original text contains the newline characters. I can easily remove them, but if I do, I would have to replace each one with another character of equivalent length, probably whitespace, in order to maintain the start and end indices of the entities. Will removing the original newlines have any impact on the generated model? Also, I generated the training data from a C# program, so I may have to tweak it a bit to output the triple quotes if I go that route.
– erotavlas
Nov 10 at 17:33
I think it depends on what you are trying to achieve. If you are trying to analyze sentences, I would suggest replacing newline characters with whitespace, unless the newline carries meaning; in that case, replace it with an end-of-sentence character that can be used when you tokenize the sentence. My guess is that removing the newline characters will not have an impact on the generated model. However, you can easily test it afterwards.
– Philip
Nov 12 at 13:27
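The whitespace replacement discussed in these comments could be sketched as follows. It assumes the (text, {'entities': [...]}) layout from the question, and flatten_newlines is a hypothetical helper name. Each newline is swapped for exactly one space, so the text length and the entity offsets stay the same.

```python
TRAIN_DATA = [('Who is\nShaka Khan?', {'entities': [(7, 17, 'PERSON')]})]

def flatten_newlines(example):
    """Replace each newline with one space so the text length is unchanged."""
    text, annotations = example
    return (text.replace('\n', ' '), annotations)

cleaned = [flatten_newlines(example) for example in TRAIN_DATA]

# Check that every entity span still selects the same characters.
for (old_text, ann), (new_text, _) in zip(TRAIN_DATA, cleaned):
    assert len(old_text) == len(new_text)
    for start, end, label in ann['entities']:
        assert old_text[start:end].replace('\n', ' ') == new_text[start:end]

print(cleaned)
```

If the source text uses Windows line endings ('\r\n'), the '\r' would need the same one-for-one replacement; deleting it would shift every later offset by one.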
edited Nov 10 at 5:55
answered Nov 10 at 5:47
Philip