Unable to match XML element using Python regular expression
I have an XML document with the following structure:
> <?xml version="1.0" encoding="UTF-8"?> <!-- generated by CLiX/Wiki2XML
> [MPI-Inf, MMCI@UdS] $LastChangedRevision: 93 $ on 17.04.2009
> 12:50:48[mciao0826] --> <!DOCTYPE article SYSTEM "../article.dtd">
> <article xmlns:xlink="http://www.w3.org/1999/xlink"> <header>
> <title>Postmodern art</title> <id>192127</id> <revision>
> <id>244517133</id> <timestamp>2008-10-11T05:26:50Z</timestamp>
> <contributor> <username>FairuseBot</username> <id>1022055</id>
> </contributor> </revision> <categories> <category>Contemporary
> art</category> <category>Modernism</category> <category>Art
> movements</category> <category>Postmodern art</category> </categories>
> </header> <bdy> Postmodernism preceded by Modernism '' Postmodernity
> Postchristianity Postmodern philosophy Postmodern architecture
> Postmodern art Postmodernist film Postmodern literature Postmodern
> music Postmodern theater Critical theory Globalization Consumerism
> </bdy>
I want to capture the text contained within <bdy>...</bdy>, and for that I wrote the following Python 3 regex code:
import re

with open("sample_xml.xml", "r") as file:
    xml_doc = file.read()
body_text = re.findall(r'<bdy>(.+)</bdy>', xml_doc)
But body_text is always an empty list. However, when I try to capture the text for the <category>...</category> tags using this code:
category_text = re.findall(r'<category>(.+)</category>', xml_doc)
it does the job.
Any idea why the code for the <bdy>...</bdy> XML element is not working?
Thanks!
regex python-3.x
asked Nov 9 at 23:55
Arun
2 Answers
accepted
The special character . does not match a newline by default, so the pattern <bdy>(.+)</bdy> cannot match when the opening and closing tags are on different lines.
You can change this behavior with the DOTALL flag, either by passing re.DOTALL as the flags argument or by putting the inline flag (?s) at the start of your regular expression.
More information on Python's regular expression syntax can be found here: https://docs.python.org/3/library/re.html#regular-expression-syntax
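As a minimal sketch of the behavior described above, the snippet below compares the default pattern with the two ways of enabling DOTALL. The `sample` string is made up for illustration and is not the asker's file.

```python
import re

# Hypothetical two-line sample; the opening and closing tags sit on different lines.
sample = "<bdy> first line\nsecond line </bdy>"

# By default '.' stops at '\n', so nothing between the tags can match.
default_match = re.findall(r"<bdy>(.+)</bdy>", sample)

# The inline flag (?s) at the start of the pattern turns on DOTALL.
inline_match = re.findall(r"(?s)<bdy>(.+)</bdy>", sample)

# Passing re.DOTALL as the flags argument has the same effect.
flag_match = re.findall(r"<bdy>(.+)</bdy>", sample, re.DOTALL)

print(default_match)  # empty list
print(inline_match)   # one captured string that contains the newline
print(flag_match)     # same result as the inline flag
```

Both DOTALL variants should give the same result, so the choice between them is mostly a matter of whether you prefer the flag inside the pattern or in the call.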
answered Nov 10 at 0:23
TheGreatGeek
You can use re.DOTALL:
category_text = re.findall(r'<bdy>(.+)</bdy>', xml_doc, re.DOTALL)
Output:
[" Postmodernism preceded by Modernism '' Postmodernity\n> Postchristianity Postmodern philosophy Postmodern architecture\n> Postmodern art Postmodernist film Postmodern literature Postmodern\n> music Postmodern theater Critical theory Globalization Consumerism\n> "]
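One thing to watch once DOTALL is on: `.+` is greedy, so if a document contained more than one `<bdy>` element, a single match would run from the first opening tag to the last closing tag. The sketch below, which uses a made-up `doc` string rather than the asker's file, shows the difference with the non-greedy `.+?`, plus an optional whitespace cleanup of the captured text.

```python
import re

# Hypothetical document with two <bdy> elements (not the asker's file).
doc = "<bdy>one\ntwo</bdy>\n<x/>\n<bdy>three</bdy>"

# Greedy: '.+' runs to the LAST </bdy>, swallowing everything in between.
greedy = re.findall(r"<bdy>(.+)</bdy>", doc, re.DOTALL)

# Non-greedy: '.+?' stops at the first </bdy> after each <bdy>.
lazy = re.findall(r"<bdy>(.+?)</bdy>", doc, re.DOTALL)

# Optional cleanup: collapse runs of whitespace, including newlines.
cleaned = [" ".join(text.split()) for text in lazy]

print(len(greedy), len(lazy))
print(cleaned)
```

With a single `<bdy>` element, as in the question, both patterns capture the same text.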
answered Nov 10 at 1:26
Ashok KS