Web Scraper Using Scrapy


Problem Description


I have to parse only the positions and points from this link. That link has 21 listings (I don't actually know what to call them) on it, and each listing has 40 players on it except the last one. Now I have written code which is like this:

from bs4 import BeautifulSoup
import urllib2

def overall_standing():
    url_list = ["http://www.afl.com.au/afl/stats/player-ratings/overall-standings#", 
                "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/3",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/4",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/5",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/6",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/7",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/8",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/9",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/10",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/11",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/12",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/13",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/14",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/15",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/16",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/17",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/18",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/19",
#                 "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/20",
                "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/21"]

    gDictPlayerPointsInfo = {}
    for url in url_list:
        print url
        header = {'User-Agent': 'Mozilla/5.0'}
        header = {'User-Agent': 'Mozilla/5.0'}
        req = urllib2.Request(url,headers=header)
        page = urllib2.urlopen(req)
        soup = BeautifulSoup(page)
        table = soup.find("table", { "class" : "ladder zebra player-ratings" })

        lCount = 1
        for row in table.find_all("tr"):
            lPlayerName = ""
            lTeamName = ""
            lPosition = ""
            lPoint = ""
            for cell in row.find_all("td"):
                if lCount == 2:
                    lPlayerName = str(cell.get_text()).strip().upper()
                elif lCount == 3:
                    lTeamName = str(cell.get_text()).strip().split("\n")[-1].strip().upper()
                elif lCount == 4:
                    lPosition = str(cell.get_text().strip())
                elif lCount == 6:
                    lPoint = str(cell.get_text().strip())

                lCount += 1

            if url == "http://www.afl.com.au/afl/stats/player-ratings/overall-standings#page/2":
                print lTeamName, lPlayerName, lPoint
            if lPlayerName <> "" and lTeamName <> "":
                lStr = lPosition + "," + lPoint

#                 if gDictPlayerPointsInfo.has_key(lTeamName):
#                     gDictPlayerPointsInfo[lTeamName].append({lPlayerName:lStr})
#                 else:
                gDictPlayerPointsInfo[lTeamName+","+lPlayerName] = lStr
            lCount = 1


    lfp = open("a.txt","w")
    for key in gDictPlayerPointsInfo:
        if key.find("RICHMOND"):
            lfp.write(str(gDictPlayerPointsInfo[key]))

    lfp.close()
    return gDictPlayerPointsInfo


# overall_standing()

but the problem is it always gives me the first listing's points and positions and ignores the other 20. How could I get the positions and points for all 21? I have heard scrapy can do this type of thing pretty easily, but I am not fully familiar with scrapy. Is there any way to do this other than using scrapy?

Solution

This is happening because these links are not handled by the server: the portion of the link following the # symbol, called the fragment identifier, is processed by the browser and refers to some link or javascript behavior, i.e. loading a different set of results.

I would suggest two approaches: either find a way to use a link that the server can evaluate, so that you can continue using scrapy, or use a webdriver like selenium.

Scrapy

Your first step is to identify the javascript load call, often ajax, and use those links to pull your information. These are calls to the site's database. This can be done by opening your web inspector and watching the network traffic as you click through to the next page of search results.

After the click, we can see that there is a new call to this url:

http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=3&pageSize=40

This url returns a json file which can be parsed, and you can even shorten your steps, as it looks like you can control more of what information is returned to you.

You could either write a method to generate a series of links for you:

def gen_url(page_no):
    return "http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=" + str(page_no) + "&pageSize=40"

and then, for example, use scrapy with the seed list:

seed = [gen_url(i) for i in range(1, 22)]  # pages 1 through 21
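
If it helps, a minimal sketch of what such a spider might look like is below. The spider name, the top-level "playerRatings" key and the per-player field names are assumptions about the payload, not something taken from the API; print the parsed data once and rename them to whatever the response actually contains.

import json
import scrapy

class PlayerRatingsSpider(scrapy.Spider):
    name = "player_ratings"  # hypothetical name, pick your own
    # pages 1 through 21, matching the 21 listings mentioned in the question
    start_urls = [
        "http://www.afl.com.au/api/cfs/afl/playerRatings"
        "?roundId=CD_R201401408&pageNum=%d&pageSize=40" % page
        for page in range(1, 22)
    ]

    def parse(self, response):
        data = json.loads(response.body)
        # "playerRatings", "playerName", "position" and "ratingPoints" are
        # guesses at the structure -- inspect `data` and adjust as needed
        for entry in data.get("playerRatings", []):
            yield {
                "player": entry.get("playerName"),
                "position": entry.get("position"),
                "points": entry.get("ratingPoints"),
            }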

or you can try tweaking the url parameters to see what you get; maybe you can get multiple pages at a time:

http://www.afl.com.au/api/cfs/afl/playerRatings?roundId=CD_R201401408&pageNum=1&pageSize=200

I changed the pageSize parameter at the end to 200, since it seems to correspond directly to the number of results returned.
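
If you want to sanity-check that single request before wiring up a spider, a quick sketch using the same urllib2 approach as your question would be something like this (again, how you drill into the parsed data depends on the structure the API actually returns):

import json
import urllib2

url = ("http://www.afl.com.au/api/cfs/afl/playerRatings"
       "?roundId=CD_R201401408&pageNum=1&pageSize=200")
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(urllib2.urlopen(req).read())
# print the whole payload first to see its structure, then pull out
# the position and points fields you are after
print data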

NOTE: There is a chance this method will not work, as sites sometimes block outside usage of their data API by screening the IP address the requests come from.

If this is the case you should go with the following approach.

Selenium (or other webdriver)

Using something like selenium, which is a webdriver, you can work with what is loaded into a browser and evaluate data that is loaded after the server has returned the webpage.

There is some initial setup required before selenium is usable, but it is a very powerful tool once you have it working.

A simple example of this would be:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")

You will see a python-controlled Firefox browser (this can be done with other browsers too) open on your screen and load the url you provide. It then follows the commands you give it, which can even be done from a shell (useful for debugging), and you can search and parse the html in much the same way you would with scrapy (the code below continues from the previous code section...).

If you want to perform something like clicking the next page button:

driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")

That expression may need some tweaking, but it is intended to find all the li elements with class='page' inside the div with class='pagination'. The // means a shortened path between elements; your other alternative would be something like /html/body/div/div/..... until you get to the element in question, which is why //div/... is useful and appealing.

For specific help and reference on locating elements, see their documentation page.

My usual method is trial and error for this, tweaking the expression until it hits the target elements I want. This is where the console/shell comes in handy. After setting up the driver as above, I usually try and build my expression:

Say you have an html structure like:

<html>
    <head></head>
    <body>
        <div id="container">
             <div id="info-i-want">
                  treasure chest
             </div>
        </div>
    </body>
</html> 

I would start with something like:

>>> body = driver.find_element_by_xpath("//body")
>>> print body.get_attribute("outerHTML")
<body>
    <div id="container">
         <div id="info-i-want">
              treasure chest
         </div>
    </div>
</body>
>>> container = driver.find_element_by_xpath("//div[@id='container']")
>>> print container.get_attribute("outerHTML")
<div id="container">
     <div id="info-i-want">
          treasure chest
     </div>
</div>
>>> target = driver.find_element_by_xpath("//div[@id='info-i-want']")
>>> print target.get_attribute("outerHTML")
<div id="info-i-want">
     treasure chest
</div>
>>> print target.text
treasure chest
>>> # BOOM TREASURE!

Usually it will be more complex, but this is a good and often necessary debugging tactic.

Back to your case, you could then save them out into an array:

links = driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")

and then click them one by one, scraping the new data before clicking the next link:

import time
from selenium import webdriver

driver = None
try:
    driver = webdriver.Firefox()
    driver.get("http://www.afl.com.au/stats/player-ratings/overall-standings")

    #
    # Scrape the first page
    #

    links = driver.find_elements_by_xpath("//div[@class='pagination']//li[@class='page']")

    for link in links:
        link.click()
        #
        # scrape the next page (see the extraction sketch below)
        #
        time.sleep(1) # pause for a time period to let the data load
finally:
    if driver:
        driver.close()

It is best to wrap it all in a try...finally type block to make sure you close the driver instance.
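
Inside that loop you still need to actually pull the rows out of the ratings table on each page. A sketch of what that extraction might look like is below; the table class comes from the BeautifulSoup lookup in your question, and the cell positions (player, team, position, points) follow the column order your lCount logic assumes, so verify both against the live page:

def scrape_current_page(driver):
    # table class taken from the question's soup.find(...); the cell
    # indexes mirror the question's column assumptions and may need adjusting
    rows = driver.find_elements_by_xpath(
        "//table[contains(@class, 'player-ratings')]//tr")
    for row in rows:
        cells = row.find_elements_by_tag_name("td")
        if len(cells) >= 6:
            # player, team, position, points
            print cells[1].text, cells[2].text, cells[3].text, cells[5].text

You would call it once after the initial driver.get(...) for the first page, and again after each link.click() in place of the "scrape the next page" comment.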

If you decide to delve deeper into the selenium approach, you can refer to their docs, which are excellent and very explicit, and include plenty of examples.

Happy scraping!
