Extracting specific page links from a <a href tag using BeautifulSoup

1 vote
I am using BeautifulSoup to extract all the links from this page: http://kern.humdrum.org/search?s=t&keyword=Haydn

I am getting all these links this way:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'http://kern.humdrum.org/search?s=t&keyword=Haydn'

# Open the connection and grab the page
uClient = uReq(my_url)

# Put all the content in a variable
page_html = uClient.read()

# Close the connection
uClient.close()

# Parse the HTML
page_soup = soup(page_html, "html.parser")

# Grab every anchor tag that has an href attribute
containers = page_soup.find_all('a', href=True)

for link in containers:
    print(link)
    print("---")

Part of my output is:
[image: screenshot of the printed anchor tags]

Notice that it returns many links, but I only want the ones that end with link text, such as ">Something</a>" (for example, ">Allegro", ">Allegro vivace", and so forth).

I am having a hard time getting the following type of output (example from the image):
"Allegro - http://kern.ccarh.org/cgi-bin/ksdata?location=users/craig/classical/beethoven/piano/sonata&file=sonata01-1.krn&format=info"

In other words, at this point I have a bunch of anchor tags (about 1000). Many of them are just "trash", and about 350 are tags that I would like to extract. All the tags look almost the same; the only difference is that the ones I need have "> Somebody's name</a>" at the end. I would like to extract only the links from the anchor tags with this characteristic.

asked Nov 11 at 4:08 by Fabio Soares (edited Nov 11 at 4:38)

Do you need to use BeautifulSoup? If you use an HTML parser that allows XPath expressions, this can be much easier. See here
– bunji
Nov 11 at 4:20

@bunji, I don't need to use it. I just saw that most people online use BeautifulSoup, which is why I followed along. I will check alternative ways, thanks.
– Fabio Soares
Nov 11 at 4:25

python beautifulsoup


4 Answers

3 votes, accepted

From what I can see in the image, the links you want have an href attribute containing format=info, so you could use a CSS attribute selector, [href*="format=info"]. The *= operator means "contains": the selector matches when the attribute value contains the substring given after the operator.

import bs4
import requests

res = requests.get("http://kern.humdrum.org/search?s=t&keyword=Haydn")
soup = bs4.BeautifulSoup(res.text, "html.parser")

# Select every element whose href contains "format=info"
for link in soup.select('[href*="format=info"]'):
    print(link.getText(), link['href'])
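
To try the selector without hitting the network, here is a minimal sketch run against a small hand-written HTML fragment. The fragment, the example.com URLs, and the `extract_info_links` helper name are assumptions made for illustration; only the `format=info` pattern comes from the question.

```python
from bs4 import BeautifulSoup

# Hand-written sample standing in for the real page (an assumption, not the actual markup).
SAMPLE_HTML = """
<table>
  <tr>
    <td><a href="http://example.com/ksdata?file=sonata01-1.krn&amp;format=kern">K</a></td>
    <td><a href="http://example.com/ksdata?file=sonata01-1.krn&amp;format=info">Allegro</a></td>
  </tr>
  <tr>
    <td><a href="http://example.com/ksdata?file=sonata01-2.krn&amp;format=info">Adagio</a></td>
    <td><a>no href here</a></td>
  </tr>
</table>
"""


def extract_info_links(html):
    """Return (text, href) pairs for anchors whose href contains 'format=info'."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.select('a[href*="format=info"]')
    ]


pairs = extract_info_links(SAMPLE_HTML)
for text, href in pairs:
    # Print in the "title - URL" shape asked for in the question
    print(f"{text} - {href}")
```

Restricting the selector to `a` elements is a small design choice: it keeps other tags that might carry an `href` (such as `link`) out of the result.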

answered Nov 11 at 5:16 by QHarr (edited Nov 11 at 13:41)

1 vote

The easiest way is to use the text attribute when printing the link, like this:
print(link.text)
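
As a small illustration of that attribute, the sketch below contrasts printing the tag itself with printing its text and its `href`. The sample anchor and its example.com URL are made up for the example.

```python
from bs4 import BeautifulSoup

# A single made-up anchor tag, used only for illustration.
html = '<a href="http://example.com/ksdata?file=sonata01-1.krn&amp;format=info"> Allegro </a>'
link = BeautifulSoup(html, "html.parser").a

print(link)                       # the whole tag, markup included
print(link.text)                  # only the text between <a> and </a>
print(link.get_text(strip=True))  # the same text with surrounding whitespace removed
print(link["href"])               # only the URL

# Combine text and URL into one line
label = f"{link.get_text(strip=True)} - {link['href']}"
print(label)
```

Note that `.text` alone does not filter anything; it only changes what is printed for each tag.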

answered Nov 11 at 5:22 by Ali Kargar

0 votes

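Assuming you already have a list of the substrings you need to search for, you can do something like:

# containers comes from the question's code; substring_list must hold
# lowercase strings, because the link text is lowercased before comparing
for link in containers:
    text = link.get_text().lower()
    if any(text.endswith(substr) for substr in substring_list):
        print(link)
        print('---')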

answered Nov 11 at 4:21 by eicksl (edited Nov 11 at 4:52)

This would not do what I need. The whole point of scraping the page is to grab all the links automatically. If I filter for "allegro", I would get only one link.
– Fabio Soares
Nov 11 at 4:26

Not sure what you're trying to do then. That loop printed a whole bunch of anchor tags for me. Could you provide more context about your problem?
– eicksl
Nov 11 at 4:30

Sure. At this point, I have a bunch of anchor tags (about 1000). Many of them are just "trash", and about 350 are tags that I would like to extract. All the tags look almost the same; the only difference is that the ones I need have "> Somebody's name</a>" at the end. I would like to extract only the links from the anchor tags with this characteristic.
– Fabio Soares
Nov 11 at 4:37

So in other words, you need to find all tags whose text ends with a certain substring? Or are there multiple different substrings you need to search for?
– eicksl
Nov 11 at 4:40

There are multiple substrings that I need to search for. For example, it could be "Allegro con brio" or "Presto" or... The only way to identify these tags is that each one ends with something like "> Somebody's name</a>" (e.g. "> Presto</a>").
– Fabio Soares
Nov 11 at 4:45
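
One way to meet the requirement described in this thread, picking links by the shape of their URL instead of by a known list of titles, is to parse each `href` and check its query string. This is a sketch using only the standard library; the `is_info_link` name and the sample URLs are hypothetical, and it assumes the wanted links carry `format=info` as in the example URL from the question.

```python
from urllib.parse import urlsplit, parse_qs


def is_info_link(href):
    """Return True when the URL's query string has exactly format=info."""
    query = parse_qs(urlsplit(href).query)
    return query.get("format") == ["info"]


# Made-up hrefs standing in for the values of container["href"].
hrefs = [
    "http://example.com/ksdata?file=sonata01-1.krn&format=kern",
    "http://example.com/ksdata?file=sonata01-1.krn&format=info",
    "http://example.com/help.html",
]

wanted = [h for h in hrefs if is_info_link(h)]
print(wanted)
```

Compared with a plain substring test, parsing the query string avoids accidental matches on values that merely start with "info".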


0 votes

Do you want to extract links with a specific anchor text?

for link in containers:
    # exact match:
    # if 'Allegro di molto' == link.text:
    if 'Allegro' in link.text:  # substring match
        print(link)
        print("---")
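
The same text-based filter can also be pushed into the search itself. The sketch below, run on a made-up fragment, passes a compiled regular expression through the `string` argument of `find_all`, so only anchors whose text matches are returned. Like the loop above, it still requires knowing the tempo words in advance; the pattern and the sample hrefs are only examples.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment for illustration; the real page has many more anchors.
html = """
<a href="/info?id=1">Allegro</a>
<a href="/info?id=2">Allegro di molto</a>
<a href="/info?id=3">Presto</a>
<a href="/about">About</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Substring match: any anchor whose text contains "Allegro"
contains = soup.find_all("a", href=True, string=re.compile("Allegro"))

# Exact match: the anchor text must be exactly "Allegro di molto"
exact = soup.find_all("a", href=True, string="Allegro di molto")

for a in contains:
    print(a.get_text(), a["href"])
```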

answered Nov 11 at 10:12 by ewwink
