Scrapy hxs.select() not selecting all results


Problem description

I am trying to use scrapy to scrape odds from here.

Currently I am just trying to log the results with the following spider:

# Imports the snippet relies on (Scrapy 0.18-era API):
from scrapy import log
from scrapy.selector import HtmlXPathSelector
import urlparse

def parse(self, response):
    log.start("LogFile.txt", log.DEBUG)

    hxs = HtmlXPathSelector(response)
    # Select every day-type wrapper div on the page
    sites = hxs.select('//div[@class="fb_day_type_wrapper"]')

    items = []
    for site in sites:
        # Log the extracted markup joined against the page URL
        siteAddress = urlparse.urljoin(response.url, site.extract())
        self.log('Found category url: %s' % siteAddress)

This only logs the entry "This market is currently unavailable...", not the other elements which contain the odds.

I have tried a few different selectors with no luck. It looks like once I try to get inside the element div[@class="fb_day_type_wrapper"] I get nothing back. I get the same results using the scrapy shell.

Recommended answer

The site uses JavaScript to generate the data table. There are some alternatives, such as scrapyjs or splash, that allow you to get the JS-rendered HTML page. If you only need to scrape one page, you might be better off using Selenium.
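
If you take the Selenium route, a minimal sketch might look like the one below (assuming the selenium package and a Firefox driver are installed; apart from the URL and the div class from the question, the details are illustrative rather than a tested recipe):

# Minimal Selenium sketch (Python 2, matching the question's environment).
# Assumes the selenium package and a Firefox driver are available.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.paddypower.com/football/football-matches/premier-league")

# After the JavaScript has run, the wrapper divs exist in the live DOM,
# so the same XPath that fails in Scrapy returns elements here.
for wrapper in driver.find_elements_by_xpath('//div[@class="fb_day_type_wrapper"]'):
    print wrapper.text

driver.quit()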

Otherwise, you might need to go into hardcore mode and reverse engineer how the site gets its data. I will show you how to do that.

First, start the scrapy shell so we can explore the web page:

scrapy shell http://www.paddypower.com/football/football-matches/premier-league

Note: I'm using Python 2.7.4, IPython 0.13.2 and Scrapy 0.18.0.

If you search the page source in your browser for "Crystal Palace v Fulham", you will see the JavaScript code that contains that reference.
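
You can check the same thing from the scrapy shell session started above; a rough sketch of that check (the match name is just the example used here, and the expected outcomes simply restate what the question observed):

# Inside the scrapy shell:
'Crystal Palace v Fulham' in response.body
# -> expected True if the fixture data is embedded in a <script> block,
#    as the browser page source suggests.

hxs.select('//div[@class="fb_day_type_wrapper"]//text()').extract()
# -> little more than the "This market is currently unavailable" placeholder,
#    because the odds table is built client-side by JavaScript.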
