Skipping XML elements using Regular Expressions in Python 3
I have an XML document from which I want to extract the text contained in specific tags, such as:
<title>Four-minute warning</title>
<categories>
<category>Nuclear warfare</category>
<category>Cold War</category>
<category>Cold War military history of the United Kingdom</category>
<category>disaster preparedness in the United Kingdom</category>
<category>History of the United Kingdom</category>
</categories>
<bdy>
some text
</bdy>
In this toy example, I want to extract all the text contained in the <title> tags, using the following regular expression code in Python 3:
# Python 3 code using re
import re

# Read the whole file into a single string
with open("some_xml_file.xml", "r") as file:
    xml_doc = file.read()

title_text = re.findall(r'<title>.+</title>', xml_doc)

if title_text:
    print("\nMatches found!\n")
    for title in title_text:
        print(title)
else:
    print("\nNo matches found!\n\n")
This gives me the text within the XML tags along with the tags themselves. A single output looks like this:
<title>Four-minute warning</title>
My question is: how should I write the pattern passed to re.findall() or re.search() so that the <title> and </title> tags are skipped and all I get is the text between them?
Thanks for your help!
python regex
Don't use regex to parse XML.
– Johan Wentholt
Nov 9 at 22:12
I think I am forced to use regex to parse the XML file because it contains more than one root node/element (document root). As a result, ElementTree throws an error.
– Arun
Nov 9 at 23:27
You could read the file as a string and wrap the contents in a root tag: valid_xml = f'<document>{xml_file_contents}</document>'. Then use the result as input for ElementTree.
– Johan Wentholt
Nov 9 at 23:51
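A minimal sketch of the wrapping approach suggested in the comment above, assuming the file holds several top-level elements like the sample in the question. The root name "document" is arbitrary, and the inline string stands in for the contents read from the file.

```python
import xml.etree.ElementTree as ET

# Stand-in for file.read(): several top-level elements, so the text
# is not a well-formed XML document on its own.
xml_file_contents = """<title>Four-minute warning</title>
<categories>
<category>Nuclear warfare</category>
<category>Cold War</category>
</categories>
<bdy>
some text
</bdy>"""

# Wrap everything in a single root element; the name is arbitrary.
valid_xml = f"<document>{xml_file_contents}</document>"
root = ET.fromstring(valid_xml)

# iter() walks the whole tree, so nesting depth does not matter.
titles = [el.text for el in root.iter("title")]
categories = [el.text for el in root.iter("category")]

print(titles)
print(categories)
```

One caveat: if the file starts with an XML declaration (<?xml ...?>), it would have to be removed before wrapping, because a declaration is only allowed at the very start of a document.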
@Arun, Johan is telling you not to use regular expressions to parse XML because XML is not a regular language. You can treat your input as regular (and get a valid regexp) only if you never have to process a <title> tag nested inside a <title>...</title> pair, which XML permits. On the other hand, XML syntax is too complex for a simple regexp to isolate every possible form of the <title> tag (e.g. <title xmlns:blabla="...">).
– Luis Colorado
Nov 15 at 10:06
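To illustrate the point about attributes, here is a small sketch comparing a fixed pattern, a more lenient pattern, and a parser on one made-up input. The sample string and the namespace URL are assumptions for demonstration only, and the lenient pattern is still not a general XML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: the <title> tag carries an attribute.
sample = (
    '<doc><title xmlns:blabla="http://example.com/ns">'
    "Four-minute warning</title></doc>"
)

# A pattern that expects a bare <title> tag finds nothing here.
strict = re.findall(r"<title>(.+?)</title>", sample)

# Allowing optional attributes works for this input, but would still
# break on cases such as a ">" inside an attribute value.
lenient = re.findall(r"<title\b[^>]*>(.+?)</title>", sample, flags=re.DOTALL)

# A parser reads the element regardless of its attributes.
parsed = [el.text for el in ET.fromstring(sample).iter("title")]

print(strict)
print(lenient)
print(parsed)
```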
asked Nov 9 at 21:47
Arun
1 Answer
accepted
Just use a capture group in your regex (re.findall() takes care of the rest in this case). For example:
import re
s = '<title>Four-minute warning</title>'
title_text = re.findall(r'<title>(.+)</title>', s)
print(title_text[0])
# OUTPUT
# Four-minute warning
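As a possible extension of the example above, the sketch below applies the same capture-group idea to a shortened copy of the document from the question. The helper name extract_text is hypothetical. It assumes the tags carry no attributes and are not nested inside themselves, and it swaps in a non-greedy group (.+?) with re.DOTALL so that several elements, or text spanning multiple lines, are matched separately.

```python
import re

# Shortened copy of the sample from the question; a real script
# would read this from the file instead.
xml_doc = """<title>Four-minute warning</title>
<categories>
<category>Nuclear warfare</category>
<category>Cold War</category>
</categories>
<bdy>
some text
</bdy>"""


def extract_text(tag, text):
    """Return the text between every <tag>...</tag> pair (hypothetical helper)."""
    name = re.escape(tag)
    # With one capture group, findall returns only the captured text.
    # The non-greedy (.+?) stops at the first closing tag, and DOTALL
    # lets "." match newlines for multi-line content.
    return re.findall(rf"<{name}>(.+?)</{name}>", text, flags=re.DOTALL)


titles = extract_text("title", xml_doc)
categories = extract_text("category", xml_doc)
body = [chunk.strip() for chunk in extract_text("bdy", xml_doc)]

print(titles)
print(categories)
print(body)
```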
Hopefully OP's incoming data doesn't ever contain any tags with attributes...
– mypetlion
Nov 9 at 21:58
@mypetlion you are right to comment for OP's benefit or future readers that regex is often not the best tool for parsing XML unless you have a fairly complete knowledge of how the input XML is constructed. Otherwise, check out ElementTree or something similar.
– benvc
Nov 9 at 22:02
answered Nov 9 at 21:53
benvc