Unable to match XML element using Python regular expression









I have an XML document with the following structure:



> <?xml version="1.0" encoding="UTF-8"?> <!-- generated by CLiX/Wiki2XML
> [MPI-Inf, MMCI@UdS] $LastChangedRevision: 93 $ on 17.04.2009
> 12:50:48[mciao0826] --> <!DOCTYPE article SYSTEM "../article.dtd">
> <article xmlns:xlink="http://www.w3.org/1999/xlink"> <header>
> <title>Postmodern art</title> <id>192127</id> <revision>
> <id>244517133</id> <timestamp>2008-10-11T05:26:50Z</timestamp>
> <contributor> <username>FairuseBot</username> <id>1022055</id>
> </contributor> </revision> <categories> <category>Contemporary
> art</category> <category>Modernism</category> <category>Art
> movements</category> <category>Postmodern art</category> </categories>
> </header> <bdy> Postmodernism preceded by Modernism '' Postmodernity
> Postchristianity Postmodern philosophy Postmodern architecture
> Postmodern art Postmodernist film Postmodern literature Postmodern
> music Postmodern theater Critical theory Globalization Consumerism
> </bdy>


I am interested in capturing the text contained within <bdy>...</bdy>, and for that I wrote the following Python 3 regex code:



import re

# Read the whole XML file into a single string.
with open("sample_xml.xml", "r") as file:
    xml_doc = file.read()

# This is the call that returns an empty list.
body_text = re.findall(r'<bdy>(.+)</bdy>', xml_doc)


But body_text is always an empty list. However, when I try to capture the text of the <category>...</category> tags using this code:

category_text = re.findall(r'<category>(.+)</category>', xml_doc)




This works as expected.
Any idea why the <bdy>...</bdy> pattern does not match?



Thanks!










      regex python-3.x






asked Nov 9 at 23:55 by Arun

          2 Answers
                  accepted






By default, the special character . matches any character except a newline, so (.+) cannot span the line breaks inside the <bdy> element and the pattern finds nothing.

You can change this behavior with the DOTALL flag. To set it inline, put (?s) at the start of your regular expression.

More information on Python's regular expression syntax can be found here: https://docs.python.org/3/library/re.html#regular-expression-syntax






answered Nov 10 at 0:23 by TheGreatGeek
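The behavior described in this answer can be checked with a short sketch. The sample string below is a made-up stand-in for the asker's file (an assumption, not the real document); it only needs a <title> on one line and a <bdy> element whose text spans several lines.

```python
import re

# Hypothetical stand-in for the file contents (not the asker's real file):
# <title> sits on one line, while the <bdy> text spans several lines.
xml_doc = (
    "<header><title>Postmodern art</title></header>\n"
    "<bdy>\nPostmodernism\nModernism\n</bdy>"
)

# Single-line elements match fine, because '.' never has to cross a newline.
title_matches = re.findall(r'<title>(.+)</title>', xml_doc)

# Default behavior: '.' stops at newlines, so the multi-line <bdy> is missed.
default_matches = re.findall(r'<bdy>(.+)</bdy>', xml_doc)

# Inline (?s) at the start of the pattern switches on DOTALL.
dotall_matches = re.findall(r'(?s)<bdy>(.+)</bdy>', xml_doc)

print(title_matches)    # ['Postmodern art']
print(default_matches)  # []
print(dotall_matches)   # ['\nPostmodernism\nModernism\n']
```

This would also explain why the <category> pattern in the question appeared to work: each of those elements presumably sits on a single line in the real file.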

You can use re.DOTALL:

body_text = re.findall(r'<bdy>(.+)</bdy>', xml_doc, re.DOTALL)

Output:

[" Postmodernism preceded by Modernism '' Postmodernity\n> Postchristianity Postmodern philosophy Postmodern architecture\n> Postmodern art Postmodernist film Postmodern literature Postmodern\n> music Postmodern theater Critical theory Globalization Consumerism\n> "]






answered Nov 10 at 1:26 by Ashok KS
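One thing to watch once DOTALL is on: (.+) is greedy, so if a document ever contained more than one <bdy> element, a single match would run from the first opening tag to the last closing tag. A minimal sketch of the difference, using a made-up two-element string (an assumption for illustration; the asker's file has only one <bdy>):

```python
import re

# Hypothetical sample with two <bdy> elements to show the effect of greediness.
xml_doc = "<bdy>\nfirst\nbody\n</bdy>\n<bdy>\nsecond\n</bdy>"

# Greedy: one match spanning from the first <bdy> to the last </bdy>.
greedy = re.findall(r'<bdy>(.+)</bdy>', xml_doc, re.DOTALL)

# Lazy (.+?): stops at the nearest </bdy>, giving one match per element.
lazy = re.findall(r'<bdy>(.+?)</bdy>', xml_doc, re.DOTALL)

# Collapse runs of whitespace (including the newlines) into single spaces.
cleaned = [" ".join(text.split()) for text in lazy]

print(len(greedy))  # 1
print(cleaned)      # ['first body', 'second']
```

The whitespace cleanup at the end is optional; it just removes the embedded newlines that show up in the output above.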
