I have to parse only the positions and points from this link. That link has 21 listings (I don't know actually what to call them) on it and each listing has 40 players on it expect the last one. Now I have written a code which is like this,
but the problem is it always gives me the first listing's points and positions, It ignored the other 20. How could I get the positions and points for the whole 21? Now I heard scrapy can do this type thing pretty easy but I am not fully familiar with scrapy. Is there any other way possible than using scrapy.
解决方案
This is happening because these links are handled by the server, and often the portion of the link followed by the # symbol, called the fragment identifier, is processed by the browser and refers to some link or javascript behavior, i.e. loading a different set of results.
I would suggest two appraoches, either finding a way to use a link which the server can evaluate that you could continue using scrapy with or using a webdriver like selenium.
Scrapy
Your first step is to identify the javascript load call, often ajax, and use those links to pull your information. These are calls to the site's DB. This can be done by opening your web inspector and watching the network traffic as you click the next search result page:
This url returns a json file which can be parsed, and you can even shorten your steps are it looks like you can control more what information is returned to you.
You could either write a method to generate a series of links for you:
I changed the end the pageSize parameter to 200 since it seems this corresponds directly to the number of results returned.
NOTE There is a chance this method would not work as sites sometimes block their data API from outside usage via screening the ip of where the request is coming from.
If this is the case you should go with the following approach.
Selenium (or other webdriver)
Using something like selenium which is a webdriver, you can use what is loaded into a browser to evaluate data that is loaded after the server has returned the webpage.
There is some initial setup that needs to be set up in order for selenium to be usable, but it is a very powerful tool once you have it working.
A simple example of this would be:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")
You will see a python controlled Firefox browser (this can be done with other browsers too) open on your screen and load the url you provide, then follow the commands you give it which can be done from a shell even (useful for debugging) and you can search and parse html in the same way you would with scrapy (code contd from previous code section...)
If you want to perform something like clicking the next page button:
That expression may need some tweaking but it intends to find all the li elements of class='page' that are in the div with class='pagination', the // means shortened path between elements, your other alternative would be like /html/body/div/div/..... until you get to the one in question, which is why //div/... is useful and appealing.
For specific help and reference on locating elements see their page
My usual method is trial and error for this, tweaking the expression until it hits the target elements I want. This is where the console/shell comes in handy. After setting up the driver as above, I usually try and build my expression:
and then one by one click them, scrape the new data, click the next one:
import time
from selenium import webdriver
driver = None
try:
driver = webdriver.Firefox()
driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")
#
# Scrape the first page
#
links = driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")
for link in links:
link.click()
#
# scrape the next page
#
time.sleep(1) # pause for a time period to let the data load
finally:
if driver:
driver.close()
It is best to wrap it all in a try...finally type block to make sure you close the driver instance.
If you decide to delve deeper into the selenium approach, you can refer to their docs which have excellent and very explicit documentation and examples.