How do I handle new line characters in my sentences? - spacy NER





















I'm trying to train spaCy NER. I collected all my sentences, but many have embedded newline characters ('\n'), so when I paste the training data into my Jupyter notebook, it fails with an error:




 TRAIN_DATA = [('Who is 
^
SyntaxError: EOL when scanning string literal



What should I do with these?



The data looks like this:



TRAIN_DATA = [('Who is 
Shaka Khan?', {'entities': [(7, 17, 'PERSON')]}),









      python spacy














      edited Nov 10 at 7:50 by Yaman Jain

      asked Nov 10 at 4:32 by erotavlas






















          1 Answer



          accepted










          Jupyter



          If the problem is in Jupyter, strings that span several lines need to be wrapped in triple quotes (''' or """), like this:



          string=""" This string has many lines
          that continues here
          and here """


          In your case that would be:

          TRAIN_DATA = [('''Who is 
          Shaka Khan?''', {'entities': [(7, 17, 'PERSON')]})]
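The triple-quoted form keeps a real line break inside the literal. The same string can also be written on one line with the \n escape sequence, which may be easier to generate from another program. A small sketch (the variable names are only illustrative):

```python
# The same text written two ways: a triple-quoted literal with a real line
# break, and a single-line literal that uses the \n escape sequence.
multi_line = '''Who is
Shaka Khan?'''
single_line = 'Who is\nShaka Khan?'

# Both literals produce an identical string object value.
print(multi_line == single_line)  # prints True
```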


          Correct me if I'm wrong, but it looks like you copy-pasted the data, which is how this can happen. You could resolve the issue within Jupyter by simply deleting the line break. Alternatively, I would suggest getting the data into Jupyter some other way than copy-paste.
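One way to avoid pasting literals altogether is to read the examples from a file. Below is a minimal sketch, assuming the data was exported as JSON where each item is [text, {"entities": [[start, end, label], ...]}]. The file layout and the helper name load_train_data are assumptions made for illustration; they are not part of spaCy.

```python
import json


def load_train_data(path):
    """Read (text, annotations) pairs from a JSON file (hypothetical layout)."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    # JSON has no tuples, so turn each [start, end, label] list back into a tuple.
    return [
        (text, {"entities": [tuple(ent) for ent in ann["entities"]]})
        for text, ann in raw
    ]
```

JSON writes a newline inside a string as the two characters \n, so embedded newlines survive the round trip and never end up as raw line breaks in a notebook cell.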



          Remove newline character



          If you want to remove the newlines within a string, there are many options. Here is one, which puts a space in place of each newline:

          import re
          string = re.sub(r'\n', ' ', string)

          Explanation

          1. Line: imports the regex module.

          2. Line: uses the function 're.sub' to replace every '\n' in string with a space.

            out:
            ' This string has many lines that continues here and here '



          I'm guessing that you might be using pandas, so to do this on a column you can do the following:

          df[col_name] = df[col_name].str.replace('\n', ' ', regex=False)





          edited Nov 10 at 5:55

          answered Nov 10 at 5:47 by Philip











          • The original text contains the newline characters. I can easily remove them, but if I do so I would have to replace each one with another character of equivalent length (probably whitespace) in order to maintain the start and end index of the entity. Will removing the original newline have any impact on the generated model? Also, I generated the training data from a C# program, so I may have to tweak it a bit to output the triple quotes if I go that route.
            – erotavlas
            Nov 10 at 17:33







          • I think it depends on what you are trying to achieve. If you are trying to analyze sentences, I would suggest replacing newline characters with whitespace, unless the newline carries meaning; in that case, replace it with an end-of-sentence character that can be used when you tokenize the sentence. My guess is that removing the newline character will not have an impact on the generated model. However, you can easily test it afterwards.
            – Philip
            Nov 12 at 13:27
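The offset concern raised in the comments can be checked directly. Here is a small sketch, assuming a sample text in which a single '\n' stands where a space would normally be; the sample string and variable names are made up for illustration:

```python
text = "Who is\nShaka Khan?"
entities = [(7, 17, "PERSON")]

# One character out, one character in: the string keeps its length.
cleaned = text.replace("\n", " ")

# Because the length is unchanged, every (start, end) offset still points
# at the same span of characters.
assert len(cleaned) == len(text)
for start, end, label in entities:
    print(label, repr(cleaned[start:end]))  # prints PERSON 'Shaka Khan'
```

If the text uses Windows line endings ('\r\n'), replacing the pair with a single space would shorten the string and shift every later offset; replacing each of the two characters with its own space keeps the offsets intact.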







































