How to use Scrapy to crawl data from multiple pages implemented by JavaScript
Question
I want to use Scrapy to crawl data from webpages, but the difference between pages can't be seen from the URL. For example:
http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky
The URL above is the first page I want to crawl data from, and it's easy to get data from it.
Here is my code:
__author__ = 'Rabbit'

from scrapy.spiders import Spider
from scrapy.selector import Selector

from scrapy_Data.items import EPGD


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    stmp = []
    term = "man"
    url_base = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky"
    start_urls = stmp

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item
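(Editor's note: the map(unicode.strip, ...) idiom above is Python 2 only, since the unicode type was removed in Python 3. A minimal sketch of the same whitespace-stripping step for Python 3, using sample strings in place of the live site.xpath(...).extract() values, which are not available here:)

```python
# In Python 3, extract() returns a list of str, so the Python 2 idiom
# map(unicode.strip, values) becomes a plain list comprehension.
# The sample values below are hypothetical stand-ins for extracted cells.
extracted = ['  ENSG00000139618 \n', '\t9606  ']
cleaned = [value.strip() for value in extracted]
print(cleaned)  # → ['ENSG00000139618', '9606']
```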
But the problem comes when I want to get data from page 2. When I click next page, the URL of the second page looks like this:
http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=20
As you can see, it doesn't have a keyword in its URL, so I don't know how to get data from the other pages. Maybe I should use cookies, but I don't know how to handle this situation, so can anyone help me?
Thanks a lot!
Answer
When link parsing and Request yielding are added to your parse() function, your example just works for me. Maybe the page uses some server-side cookies; it fails, though, when used with a proxy service like Scrapy's Crawlera (which downloads from multiple IPs).
The solution is to enter the 'textquery' parameter manually into the request URL:
import urlparse
from urllib import urlencode

from scrapy import Request
from scrapy.spiders import Spider
from scrapy.selector import Selector


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = 'calb'
    base_url = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=0&textquery=%s"
    start_urls = [base_url % term]

    def update_url(self, url, params):
        url_parts = list(urlparse.urlparse(url))
        query = dict(urlparse.parse_qsl(url_parts[4]))
        query.update(params)
        url_parts[4] = urlencode(query)
        url = urlparse.urlunparse(url_parts)
        return url

    def parse(self, response):
        sel = Selector(response)
        genes = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        for gene in genes:
            item = {}
            item['genID'] = map(unicode.strip, gene.xpath('td[1]/a/text()').extract())
            # ...
            yield item

        urls = sel.xpath('//div[@id="nviRecords"]/span[@id="quickPage"]/a/@href').extract()
        for url in urls:
            url = response.urljoin(url)
            url = self.update_url(url, params={'textquery': self.term})
            yield Request(url)
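(Editor's note: the answer's code targets Python 2, where urlparse and urllib.urlencode are separate modules. On Python 3 those functions live in urllib.parse; a minimal sketch of the same update_url() logic, exercised on the page-2 URL from the question, might look like this:)

```python
# Python 3 version of the answer's update_url() helper: urlparse,
# parse_qsl, urlencode and urlunparse all moved into urllib.parse.
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def update_url(url, params):
    """Return url with its query string updated from the params dict."""
    parts = list(urlparse(url))
    query = dict(parse_qsl(parts[4]))  # parts[4] is the query component
    query.update(params)               # add/overwrite the given parameters
    parts[4] = urlencode(query)
    return urlunparse(parts)

# Re-attach the search keyword to the keyword-less page-2 URL:
page2 = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=20"
print(update_url(page2, {'textquery': 'man'}))
```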
The update_url() function details come from Lukasz's solution: Add params to given URL in Python