Scrapy hxs.select() not selecting all results
Question
I am trying to use scrapy to scrape odds from here.
Currently I'm just trying to log the results with the following spider:
def parse(self, response):
    log.start("LogFile.txt", log.DEBUG)
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class="fb_day_type_wrapper"]')
    items = []
    for site in sites:
        siteAddress = urlparse.urljoin(response.url, site.extract())
        self.log('Found category url: %s' % siteAddress)
This only logs the entry "This market is currently unavailable...", not the other elements which contain the odds.
I have tried a few different selectors with no luck. It looks like once I try to get inside the element div[@class="fb_day_type_wrapper"], I get nothing returned. I have the same results using the scrapy shell.
Answer
The site uses javascript to generate the data table. There are some alternatives, such as scrapyjs or splash, that allow you to get the js-rendered html page. If you only need to scrape one page, you might be better off using Selenium.
Otherwise, you might need to go into hardcore mode and reverse engineer what the site is doing with the data. I will show you how to do that.
First, start the scrapy shell so we can explore the web page:
scrapy shell http://www.paddypower.com/football/football-matches/premier-league
Note: I'm using python 2.7.4, ipython 0.13.2 and scrapy 0.18.0.
If you search the page source for "Crystal Palace v Fulham" in your browser, you will see there is javascript code that has that reference. The <script> block looks like:
document.bodyOnLoad.push(function() {
lb_fb_cpn_init(
"",
"html",
"MR_224",
{category: 'SOCCER',
We look up this element in the shell:
In [1]: hxs.select('//script[contains(., "lb_fb_cpn_init")]')
Out[1]: [<HtmlXPathSelector xpath='//script[contains(., "lb_fb_cpn_init")]' data=u'<script type="text/javascript">\n/* $Id: '>]
If you look at the lb_fb_cpn_init arguments, you will see that the data we are looking for is passed as an argument in this form:
[{names: {en: 'Newcastle v Liverpool'}, ...
In fact there are three arguments like that:
In [2]: hxs.select('//script[contains(., "lb_fb_cpn_init")]').re('\[{names:')
Out[2]: [u'[{names:', u'[{names:', u'[{names:']
So we extract all of them; notice that we use a lot of regular expressions:
In [3]: js_args = hxs.select('//script[contains(., "lb_fb_cpn_init")]').re(r'(\[{names:(?:.+?)\]),')
In [4]: len(js_args)
Out[4]: 3
The idea here is that we want to parse the javascript code (which is a literal object) into python code (a dict). We could use json.loads, but to do so the js code must be a valid json object, that is, have its field names and strings enclosed in double quotes.
We proceed to do so. First, I join the arguments into a single string as a javascript list:
In [5]: args_raw = '[{}]'.format(', '.join(js_args))
Then we enclose the field names in double quotes and replace single quotes with double quotes:
In [6]: import re
In [7]: args_json = re.sub(r'(,\s?|{)(\w+):', r'\1"\2":', args_raw).replace("'", '"')
This might not always work, as the javascript code might have patterns that are not so easy to fix with a single re.sub and/or .replace.
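To see concretely what that substitution does, here is the same transformation applied to a tiny hand-written javascript-style literal (the input string is invented for this example):

```python
import json
import re

# A small javascript-style object literal (made up for this example)
args_raw = "[{names: {en: 'Newcastle v Liverpool'}, ev_id: 5889932}]"

# Quote the bare field names, then swap single quotes for double quotes
args_json = re.sub(r'(,\s?|{)(\w+):', r'\1"\2":', args_raw).replace("'", '"')

# The result is now valid json
data = json.loads(args_json)
print(data[0]['names']['en'])  # Newcastle v Liverpool
```

The backreferences `\1"\2":` keep the separator (comma or opening brace) and wrap only the field name in quotes.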
We are ready to parse the javascript code as a json object:
In [8]: import json
In [9]: data = json.loads(args_json)
In [10]: len(data)
Out[10]: 3
Here, I'm just looking for the event names and odds. You can take a look at the data content to see what it looks like.
Luckily, the data seems to have a correlation:
In [11]: map(len, data)
Out[11]: [20, 20, 60]
You could also build a single dict from the three of them by using the ev_id field. I will just assume that data[0] and data[1] have a direct correlation and that data[2] contains 3 items per event. This can be easily verified with:
In [12]: map(lambda v: v['ev_id'], data[2])
Out[12]:
[5889932,
5889932,
5889932,
5889933,
5889933,
5889933,
...
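If you would rather not rely on that ordering assumption, a merge keyed on ev_id is also possible. A sketch on made-up stand-in data (the real entries carry many more fields than shown here):

```python
from itertools import groupby
from operator import itemgetter

# Stand-in for data[2]: three odds entries per event id (invented values)
odds_flat = [
    {'ev_id': 5889932, 'names': {'en': 'Newcastle'}},
    {'ev_id': 5889932, 'names': {'en': 'Draw'}},
    {'ev_id': 5889932, 'names': {'en': 'Liverpool'}},
    {'ev_id': 5889933, 'names': {'en': 'Arsenal'}},
    {'ev_id': 5889933, 'names': {'en': 'Draw'}},
    {'ev_id': 5889933, 'names': {'en': 'Norwich'}},
]

# Group the flat list by event id; sorting first makes groupby safe
odds_by_event = {
    ev_id: list(group)
    for ev_id, group in groupby(sorted(odds_flat, key=itemgetter('ev_id')),
                                key=itemgetter('ev_id'))
}
print(len(odds_by_event[5889932]))  # 3
```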
With some python-fu, we can merge the records:
In [13]: odds = iter(data[2])
In [14]: odds_merged = zip(odds, odds, odds)
In [15]: data_merged = zip(data[0], data[1], odds_merged)
In [16]: len(data_merged)
Out[16]: 20
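The zip(odds, odds, odds) trick works because all three arguments share a single iterator, so each output tuple consumes three consecutive items. A minimal demonstration:

```python
# One iterator shared by all three zip arguments chunks the list into triples
values = [1, 2, 3, 4, 5, 6]
it = iter(values)
triples = list(zip(it, it, it))
print(triples)  # [(1, 2, 3), (4, 5, 6)]
```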
Finally, we collect the data:
In [17]: get_odd = lambda obj: (obj['names']['en'], '/'.join([obj['lp_num'], obj['lp_den']]))
In [18]: event_odds = []
In [19]: for event, _, odds in data_merged:
....: event_odds.append({'name': event['names']['en'], 'odds': dict(map(get_odd, odds)), 'url': event['url']})
....:
In [20]: event_odds
Out[20]:
[{'name': u'Newcastle v Liverpool',
'odds': {u'Draw': u'14/5', u'Liverpool': u'17/20', u'Newcastle': u'3/1'},
'url': u'http://www.paddypower.com/football/football-matches/premier-league-matches/Newcastle%2dv%2dLiverpool-5889932.html'},
{'name': u'Arsenal v Norwich',
'odds': {u'Arsenal': u'3/10', u'Draw': u'9/2', u'Norwich': u'9/1'},
'url': u'http://www.paddypower.com/football/football-matches/premier-league-matches/Arsenal%2dv%2dNorwich-5889933.html'},
{'name': u'Chelsea v Cardiff',
'odds': {u'Cardiff': u'10/1', u'Chelsea': u'1/4', u'Draw': u'5/1'},
'url': u'http://www.paddypower.com/football/football-matches/premier-league-matches/Chelsea%2dv%2dCardiff-5889934.html'},
{'name': u'Everton v Hull',
'odds': {u'Draw': u'10/3', u'Everton': u'4/9', u'Hull': u'13/2'},
'url': u'http://www.paddypower.com/football/football-matches/premier-league-matches/Everton%2dv%2dHull-5889935.html'},
{'name': u'Man Utd v Southampton',
'odds': {u'Draw': u'3/1', u'Man Utd': u'8/15', u'Southampton': u'11/2'},
'url': u'http://www.paddypower.com/football/football-matches/premier-league-matches/Man%2dUtd%2dv%2dSouthampton-5889939.html'},
...
As you can see, web scraping can be very challenging (and fun!). It all depends on how the website displays the data. Here you could save time by just using Selenium, but if you want to scrape a large website, Selenium will be very slow compared to Scrapy.
Also, you have to consider whether the site will get code updates often; in that case you will spend more time reverse engineering the js code, and a solution like scrapyjs or splash may be a better option.
Final remarks:
- Now you have all the code required to extract the data. You need to integrate this into your spider callback and build your item.
- Don't use log.start. Use the setting LOG_FILE (command line argument: --set LOG_FILE=mylog.txt).
- Remember that .extract() always returns a list.
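Putting the shell steps together, the extraction could be wrapped in one helper to call from the spider callback. A sketch (the function name and the tiny HTML stand-in below are invented for illustration; the regexes are the ones used above):

```python
import json
import re


def extract_js_args(html_text):
    """Pull the javascript literal arguments out of the page source and
    parse them into python objects, following the steps shown above."""
    # Grab every "[{names: ...}]," argument passed to lb_fb_cpn_init
    js_args = re.findall(r'(\[{names:(?:.+?)\]),', html_text)
    # Join them into one javascript list
    args_raw = '[{}]'.format(', '.join(js_args))
    # Quote bare field names and swap single quotes for double quotes
    args_json = re.sub(r'(,\s?|{)(\w+):', r'\1"\2":', args_raw).replace("'", '"')
    return json.loads(args_json)


# A tiny stand-in for the real page source (invented for the example)
html_text = ("lb_fb_cpn_init([{names: {en: 'Newcastle v Liverpool'}}], "
             "[{names: {en: 'x'}}],")
data = extract_js_args(html_text)
print(len(data))  # 2
```

In the real spider callback you would pass response.body to this helper and yield one item per merged record.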