Extracting URLs
I am extracting URLs that lead to Test cricketers' profiles from ESPNcricinfo's Test batting stats page (the base_url in the code below).
Now that's about 2682 players, at 50 players a page, which works out to 54 pages!! I'd like to get details like date of birth, place of birth, major teams, batting style, bowling style, etc.
This is what the HTML looks like. We are going to be extracting the links (circled in red).
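Roughly, each player row in that table looks like the snippet below. This is a minimal reconstruction, not the page's exact markup; the real rows have many more stat columns, and the href shown is only illustrative. Parsing it with lxml shows exactly what the XPath we're about to write pulls out:

import lxml.html

# a stripped-down stand-in for one row of the stats table
sample = '''
<table>
  <tr class="data1">
    <td><a href="/ci/content/player/35320.html">Tendulkar, SR</a></td>
    <td>remaining stats columns ...</td>
  </tr>
</table>
'''
row = lxml.html.fromstring(sample)
print(row.xpath('//tr[@class="data1"]/td[1]/a/@href'))
# -> ['/ci/content/player/35320.html']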
Here’s the piece of code which does the job!
import lxml.html

base_url = 'http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;template=results;type=batting'
# parse() fetches the URL directly and hands back an ElementTree of the page
content = lxml.html.parse(base_url)
# every player row in the table carries class="data1"; grab the href of the
# link sitting in the first cell of each row
links = content.xpath('//tr[@class="data1"]/td[1]/a/@href')
player_links = [player for player in links]
print(player_links)
-
First, import the lxml library (if you don't have it yet, pip install lxml will fetch it)
import lxml.html
-
Store the root URL in a variable
base_url = 'http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;template=results;type=batting'
-
Parse the URL. lxml.html.parse accepts a URL directly, fetches the page for us, and returns an ElementTree
content = lxml.html.parse(base_url)
-
Extract the links using XPath. Reading the expression left to right: take every <tr> whose class is "data1", step into its first <td>, find the <a> inside, and pull out its href attribute
links = content.xpath('//tr[@class="data1"]/td[1]/a/@href')
-
Store the extracted links in a list (xpath already returns a plain Python list, so the comprehension is just making a copy)
player_links = [player for player in links]
Wait a minute!!! The above piece of code only gives us the links for the first 50 players, the ones on the first page. We've still got 53 more pages to go!!! Let's do something about that.
import lxml.html

# the pagination URLs differ only by a 'page=N;' segment slotted in after 'class=1;'
url_part1 = 'http://stats.espncricinfo.com/ci/engine/stats/index.html?class=1;'
url_part2 = 'page='
url_part3 = 'template=results;type=batting'
# page 1 carries no 'page=' parameter, so seed the list with the plain URL
url_list = [url_part1 + url_part3]
for i in range(2, 55):  # pages 2 through 54
    url_next = url_part1 + url_part2 + str(i) + ';' + url_part3
    url_list.append(url_next)

players = []
for page in url_list:
    content = lxml.html.parse(page)
    links = content.xpath('//tr[@class="data1"]/td[1]/a/@href')
    players.extend(links)  # xpath returns a plain list, so extend works directly
print(players)
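One thing worth noting before we put these to work: the hrefs in the table are site-relative paths, not full URLs, so they need to be joined onto the domain before we can actually request a profile page. Here's a minimal sketch using the standard library's urljoin (the domain is my assumption about where the profile pages live):

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# turn each relative href into a full, fetchable URL
profile_urls = [urljoin('http://www.espncricinfo.com', link) for link in players]
print(profile_urls[:5])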
Now that we’ve captured the player profile URLs for 2682 players, lets see what else we could do with them in the forthcoming posts.