Cannot locate displayed data in source code when scraping with Scrapy


Question

I am using Python.org version 2.7 64 bit on Windows Vista 64 bit. I am using a combination of Scrapy and regex to extract information from a JavaScript item called 'DataStore.Prime' on the following page:

http://www.whoscored.com/Regions/252/Tournaments/26/Seasons/4057/Stages/8273

The crawler I am using is this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json


class ExampleSpider(CrawlSpider):
    name = "goal4"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Regions/252/Tournaments/26"]
    download_delay = 1

    #rules = [Rule(SgmlLinkExtractor(allow=('/Seasons',)), follow=True, callback='parse_item')]
    rules = [Rule(SgmlLinkExtractor(allow=('/Tournaments/26'),deny=('/News', '/Fixtures'),), follow=False, callback='parse_item')]

    def parse_item(self, response):

        # Pull the DataStore.prime('ws-stage-stat', ...) call out of the raw page body
        regex = re.compile(r'DataStore\.prime\(\'ws-stage-stat\', { stageId: \d+, type: \d+, teamId: -?\d+, against: \d+, field: \d+ }, \[\[\[.*?\]\]', re.S)

        match2h = re.search(regex, response.body)

        if match2h is not None:
            match3h = match2h.group()

            # Strip quotes, brackets and the DataStore.prime wrapper,
            # leaving a comma-separated string of values
            match3h = str(match3h)
            match3h = match3h \
                .replace('title=', '').replace('"', '').replace("'", '').replace('[', '').replace(']', '') \
                .replace(' ', ',').replace(',,', ',') \
                .replace('[', '') \
                .replace(']', '') \
                .replace("DataStore.prime", '') \
                .replace('(', '').replace('-', '').replace('wsstagestat,', '')

            match3h = re.sub("{.*?},", '', match3h)

I am after the fixtures and scores that are displayed under the title 'FA Cup Fixtures'. You can select the game week you want using the calendar on the page itself. If you look at the source code, though, it only contains the most recent game week (as this is last season now, that is the FA Cup Final).

The data for all previous weeks is not in the source code for this page. The calendar that you use seems to be generating an item within the code called:

stageFixtures.load(calendarParameter)

This (if I have understood correctly) seems to control which game week is selected for display. What I want to know is:

1) Is that assumption correct?
2) Is there somewhere within the source code that is directing to other URLs storing the scores by week? (I'm pretty sure there isn't, but I'm really new to JavaScript.)

Thanks

Answer

There is an XHR request being made to load the fixtures. Simulate it and get the data.

For example, the fixtures for January 2014:

from ast import literal_eval
from datetime import datetime
import requests

date = datetime(year=2014, month=1, day=1)
url = 'http://www.whoscored.com/tournamentsfeed/8273/Fixtures/'

# 'd' selects the month to load, e.g. d=201401 for January 2014
params = {'d': date.strftime('%Y%m'), 'isAggregate': 'false'}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}

response = requests.get(url, params=params, headers=headers)

# the body is a Python-style list of lists, not JSON
fixtures = literal_eval(response.content)
print fixtures

Prints:

[
    [789692, 1, 'Saturday, Jan 4 2014', '12:45', 158, 'Blackburn', 0, 167, 'Manchester City', 1, '1 : 1', '0 : 1', 1, 1, 'FT', '0', 0, 0, 4, 1], 
    [789693, 1, 'Saturday, Jan 4 2014', '15:00', 31, 'Everton', 0, 171, 'Queens Park Rangers', 0, '4 : 0', '2 : 0', 1, 0, 'FT', '1', 0, 0, 1, 0],
    ...
]

Note that the response is not JSON, but basically a dump of a Python list of lists; you can load it with ast.literal_eval():

Safely evaluate an expression node or a Unicode or Latin-1 encoded string containing a Python expression. The string or node provided may only consist of the following Python literal structures: strings, numbers, tuples, lists, dicts, booleans, and None.
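
A quick illustration of that behaviour, using a toy string shaped like the feed response (the values here are made up for demonstration):

from ast import literal_eval

# a toy string shaped like the feed output: a Python-style list of lists
raw = "[[789692, 1, 'Blackburn', 0], [789693, 1, 'Everton', 0]]"

fixtures = literal_eval(raw)

print type(fixtures)  # <type 'list'>
print fixtures[0][2]  # Blackburn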

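If you would rather stay inside Scrapy instead of switching to requests, the same feed can be fetched with a plain spider. This is only a minimal sketch built on the URL and parameters above; the spider name and callback behaviour are illustrative, not tested against the live site:

from ast import literal_eval

from scrapy.spider import BaseSpider


class FixturesSpider(BaseSpider):
    name = "facup_fixtures"  # illustrative name
    allowed_domains = ["whoscored.com"]
    # d=201401 asks the feed for January 2014, as in the requests example
    start_urls = [
        "http://www.whoscored.com/tournamentsfeed/8273/Fixtures/?d=201401&isAggregate=false"
    ]

    def parse(self, response):
        # the body is a Python-style list of lists, not JSON,
        # so parse it with ast.literal_eval() as above
        fixtures = literal_eval(response.body)
        for fixture in fixtures:
            self.log("fixture: %s" % fixture)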
