Skipping XML elements using Regular Expressions in Python 3





















I have an XML document from which I want to extract the text contained in specific tags, such as:



<title>Four-minute warning</title>
<categories>
<category>Nuclear warfare</category>
<category>Cold War</category>
<category>Cold War military history of the United Kingdom</category>
<category>disaster preparedness in the United Kingdom</category>
<category>History of the United Kingdom</category>
</categories>

<bdy>
some text
</bdy>


In this toy example, I want to extract all the text contained in the <title> tags, using the following regular expression code in Python 3:



# Python 3 code using the re module
import re

# Read the whole file into a single string
with open("some_xml_file.xml", "r") as file:
    xml_doc = file.read()

title_text = re.findall(r'<title>.+</title>', xml_doc)

if title_text:
    print("\nMatches found!\n")
    for title in title_text:
        print(title)
else:
    print("\nNo matches found!\n\n")


This gives me the text within the XML tags along with the tags themselves. A single output looks like:



<title>Four-minute warning</title>


My question is: how should I write the pattern passed to re.findall() or re.search() so that the <title> and </title> tags are skipped and all I get is the text between them?



Thanks for your help!



























  • 1




    Don't use regex to parse XML.
    – Johan Wentholt
    Nov 9 at 22:12











  • I think I am forced to use regex to parse the XML file, as it contains more than a single root node/element (document root). As a result, ElementTree throws an error.
    – Arun
    Nov 9 at 23:27










  • You could read the file as a string and wrap the contents in a root tag: valid_xml = f'<document>{xml_file_contents}</document>'. Then use the result as input for ElementTree.
    – Johan Wentholt
    Nov 9 at 23:51
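
The wrapping idea from the comment above can be sketched roughly as follows. This is only an illustration under assumptions: the sample string stands in for the file contents, and the <document> root name is arbitrary (taken from the comment), not something the file itself defines.

```python
import xml.etree.ElementTree as ET

# Assumed sample input: several top-level elements and no single root,
# standing in for the contents read from the real file.
xml_file_contents = """<title>Four-minute warning</title>
<categories>
<category>Nuclear warfare</category>
<category>Cold War</category>
</categories>
<bdy>
some text
</bdy>"""

# Wrap the fragment in an arbitrary root element so it becomes well-formed XML.
valid_xml = f"<document>{xml_file_contents}</document>"
root = ET.fromstring(valid_xml)

# Collect the text of every <title> and <category> element.
titles = [el.text for el in root.iter("title")]
categories = [el.text for el in root.iter("category")]

print(titles)
print(categories)
```

One caveat: if the file starts with an XML declaration (<?xml ...?>), it would have to be removed before wrapping, because a declaration is only allowed at the very start of a document.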











  • @Arun, Johan is telling you not to use regular expressions to parse XML because XML is not a regular language. You can treat your input as regular (and get a working regexp) only if a <title> tag never appears inside a pair of <title>...</title> tags, which XML permits. On the other hand, XML syntax is too complex for a simple regexp to cover every possible form of a <title> tag (e.g. <title xmlns:blabla="...">).
    – Luis Colorado
    Nov 15 at 10:06
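
To illustrate the attribute case mentioned in the comment above, here is a small sketch. The lang="en" attribute and the variable names are made up for the example; the point is only that a pattern written for a bare <title> tag silently finds nothing once an attribute appears.

```python
import re

# Hypothetical input: the same title, but the opening tag carries an attribute.
s = '<title lang="en">Four-minute warning</title>'

# A pattern that expects a bare <title> tag misses this element entirely.
strict = re.findall(r'<title>(.+?)</title>', s)

# Allowing anything up to the closing '>' of the opening tag handles simple attributes.
tolerant = re.findall(r'<title\b[^>]*>(.+?)</title>', s)

print(strict)
print(tolerant)
```

Even the tolerant pattern is fragile: an attribute value may legally contain a '>' character, which would end the [^>]* part too early. That is one reason a real XML parser is the safer choice when the input format is not fully under your control.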















python regex
















asked Nov 9 at 21:47









Arun





















1 Answer




















accepted










Just use a capture group in your regex. When the pattern contains a single capture group, re.findall() returns only the captured text rather than the whole match. For example:



import re

s = '<title>Four-minute warning</title>'

title_text = re.findall(r'<title>(.+)</title>', s)

print(title_text[0])
# OUTPUT
# Four-minute warning
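
As a hedged extension of the example above (not part of the original answer): `.+` is greedy, so if two <title> elements sit on the same line it captures everything between the first opening tag and the last closing tag. A non-greedy `.+?` avoids that, and the re.DOTALL flag lets `.` cross line breaks for elements such as <bdy> that span several lines. The sample strings below are assumptions made up for the illustration.

```python
import re

# Assumed input: two <title> elements on one line.
s = '<title>Four-minute warning</title><title>Cold War</title>'

# Greedy: one match that runs from the first <title> to the last </title>.
greedy = re.findall(r'<title>(.+)</title>', s)

# Non-greedy: one match per element.
lazy = re.findall(r'<title>(.+?)</title>', s)

# re.DOTALL makes '.' match newlines, so multi-line element text is captured too.
body = re.findall(r'<bdy>(.*?)</bdy>', '<bdy>\nsome text\n</bdy>', re.DOTALL)
body_text = [b.strip() for b in body]

print(greedy)
print(lazy)
print(body_text)
```

This still assumes the tags carry no attributes and are never nested, so it only suits input whose structure is known in advance.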





















  • 1




    Hopefully OP's incoming data doesn't ever contain any tags with attributes...
    – mypetlion
    Nov 9 at 21:58






  • 2




    @mypetlion you are right to comment for OP's benefit or future readers that regex is often not the best tool for parsing XML unless you have a fairly complete knowledge of how the input XML is constructed. Otherwise, check out ElementTree or something similar.
    – benvc
    Nov 9 at 22:02










answered Nov 9 at 21:53









benvc







