How to use Scrapy to crawl data from multiple pages implemented with JavaScript

Problem Description

I want to use Scrapy to crawl data from web pages, but the difference between different pages can't be seen from the URL. For example:

http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky

The URL above is the first page I want to crawl data from, and it's easy to get data from it.

Here is my code:

__author__ = 'Rabbit'
from scrapy.spiders import Spider
from scrapy.selector import Selector

from scrapy_Data.items import EPGD


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    url_base = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky"

    # Start from the first results page for the search term.
    start_urls = [url_base]

    def parse(self, response):
        sel = Selector(response)
        # Each search result is a table row with class "odd" or "even".
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item

But the problem arises when I want to get data from page 2. When I click the next page, the URL of the second page looks like this:

http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=20

As you can see, it doesn't have a keyword in its URL, so I don't know how to get data from the other pages. Maybe I should use cookies, but I don't know how to handle this situation. Can anyone help me?

Thank you very much!

Recommended Answer

When link parsing and Request yielding are added to your parse() function, your example just works for me. Maybe the page uses some server-side cookies, but it fails when using a proxy service like Scrapy's Crawlera (which downloads from multiple IPs).

The solution is to add the 'textquery' parameter manually to the request URL:

import urlparse
from urllib import urlencode

from scrapy import Request
from scrapy.spiders import Spider
from scrapy.selector import Selector


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = 'calb'
    base_url = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=0&textquery=%s"
    start_urls = [base_url % term]

    def update_url(self, url, params):
        # Split the URL, merge `params` into its query string, and rebuild it.
        url_parts = list(urlparse.urlparse(url))
        query = dict(urlparse.parse_qsl(url_parts[4]))
        query.update(params)
        url_parts[4] = urlencode(query)
        url = urlparse.urlunparse(url_parts)
        return url

    def parse(self, response):
        sel = Selector(response)
        genes = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')

        for gene in genes:
            item = {}
            item['genID'] = map(unicode.strip, gene.xpath('td[1]/a/text()').extract())
            # ...
            yield item

        # Follow the pagination links, re-adding the search term that the
        # site drops from them.
        urls = sel.xpath('//div[@id="nviRecords"]/span[@id="quickPage"]/a/@href').extract()
        for url in urls:
            url = response.urljoin(url)
            url = self.update_url(url, params={'textquery': self.term})
            yield Request(url)

The update_url() function comes from Lukasz's solution:
Add params to given URL in Python
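
For illustration, here is roughly what that helper produces for the pagination link from the question. This is a minimal standalone sketch reusing the same urlparse/urlencode logic as the spider above (Python 2, like the rest of the code); the sample output is what the query-string merge yields, though the parameter order may vary:

import urlparse
from urllib import urlencode

def update_url(url, params):
    # Split the URL, merge the extra params into its query string,
    # and reassemble the result.
    url_parts = list(urlparse.urlparse(url))
    query = dict(urlparse.parse_qsl(url_parts[4]))
    query.update(params)
    url_parts[4] = urlencode(query)
    return urlparse.urlunparse(url_parts)

print update_url('http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=20',
                 {'textquery': 'calb'})
# e.g. http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=20&textquery=calb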
