Unable to match XML element using Python regular expression

I have an XML document with the following structure:

> <?xml version="1.0" encoding="UTF-8"?> <!-- generated by CLiX/Wiki2XML
> [MPI-Inf, MMCI@UdS] $LastChangedRevision: 93 $ on 17.04.2009
> 12:50:48[mciao0826] --> <!DOCTYPE article SYSTEM "../article.dtd">
> <article xmlns:xlink="http://www.w3.org/1999/xlink"> <header>
> <title>Postmodern art</title> <id>192127</id> <revision>
> <id>244517133</id> <timestamp>2008-10-11T05:26:50Z</timestamp>
> <contributor> <username>FairuseBot</username> <id>1022055</id>
> </contributor> </revision> <categories> <category>Contemporary
> art</category> <category>Modernism</category> <category>Art
> movements</category> <category>Postmodern art</category> </categories>
> </header> <bdy> Postmodernism preceded by Modernism '' Postmodernity
> Postchristianity Postmodern philosophy Postmodern architecture
> Postmodern art Postmodernist film Postmodern literature Postmodern
> music Postmodern theater Critical theory Globalization Consumerism
> </bdy>

I am interested in capturing the text contained within <bdy>...</bdy>, and for that I wrote the following Python 3 regex code:

import re

with open("sample_xml.xml", "r") as file:
    xml_doc = file.read()

body_text = re.findall(r'<bdy>(.+)</bdy>', xml_doc)

But body_text always comes back as an empty list. However, when I try to capture the text of the <category>...</category> tags using this code:

category_text = re.findall(r'<category>(.+)</category>', xml_doc)

it does the job.
Any idea why the <bdy>...</bdy> pattern is not working?

Thanks!

asked Nov 9 at 23:55

Arun

regex python-3.x

2 Answers

accepted

By default, the special character . does not match a newline, so your regex cannot match text that spans multiple lines.

You can change this behavior by specifying the DOTALL flag, which makes . match any character, including a newline. To set it inline, put (?s) at the start of your regular expression.

More information on Python's regular expression syntax can be found here: https://docs.python.org/3/library/re.html#regular-expression-syntax
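
To make the difference concrete, here is a minimal sketch. The xml_doc string is a made-up stand-in for the contents of sample_xml.xml, not the real file; it only reproduces the relevant shape (a single-line element and a multi-line <bdy> element).

```python
import re

# Hypothetical stand-in for the file contents (assumption, not the real document)
xml_doc = (
    "<header>\n<title>Postmodern art</title>\n</header>\n"
    "<bdy>\nPostmodernism\nModernism\n</bdy>\n"
)

# Default behavior: "." stops at a newline, so the multi-line element never matches
no_flag = re.findall(r'<bdy>(.+)</bdy>', xml_doc)

# Inline (?s) at the start of the pattern switches on DOTALL
inline_flag = re.findall(r'(?s)<bdy>(.+)</bdy>', xml_doc)

# An element whose text sits on one line matches without any flag
title = re.findall(r'<title>(.+)</title>', xml_doc)

print(no_flag)      # empty list
print(inline_flag)  # one match, newlines included
print(title)
```

This would also explain why a pattern can appear to work for elements whose text fits on a single line while failing for <bdy>.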

answered Nov 10 at 0:23

TheGreatGeek

You can use re.DOTALL

category_text = re.findall(r'<bdy>(.+)</bdy>', xml_doc, re.DOTALL)

Output:

[" Postmodernism preceded by Modernism '' Postmodernity\n> Postchristianity Postmodern philosophy Postmodern architecture\n> Postmodern art Postmodernist film Postmodern literature Postmodern\n> music Postmodern theater Critical theory Globalization Consumerism\n> "]
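
One caveat, as a hedged sketch rather than part of the answer above: with re.DOTALL the greedy .+ runs to the last </bdy> in the input, so if a document ever held more than one <bdy> element you would probably want the non-greedy .+? instead. The doc string below is invented for illustration, and the whitespace cleanup is just one possible way to flatten the captured newlines.

```python
import re

# Made-up input with two <bdy> elements (assumption for illustration only)
doc = "<bdy>\nfirst\nbody\n</bdy>\n<bdy>\nsecond body\n</bdy>\n"

# Greedy: a single match spanning from the first <bdy> to the last </bdy>
greedy = re.findall(r'<bdy>(.+)</bdy>', doc, re.DOTALL)

# Non-greedy: one match per element
lazy = re.findall(r'<bdy>(.+?)</bdy>', doc, re.DOTALL)

# Collapse newlines and runs of spaces into single spaces
cleaned = [" ".join(text.split()) for text in lazy]

print(len(greedy))
print(cleaned)
```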

answered Nov 10 at 1:26

Ashok KS
