How can I use scrapy to parse links in JS?


Question


I am trying to get scrapy to parse the links on a page to scrape. Unfortunately, the links on this page are enclosed in a JavaScript onclick function. If possible, I would like to use an SgmlLinkExtractor rule to parse the JavaScript, extract the link, and build the URL to use with callback='parse_item'.

Here is an example of each link with the JS function:

<a onclick="window.open('page.asp?ProductID=3679','productwin','width=700,height=475,scrollbars,resizable,status');" href="#internalpagelink">Link Text</a>

I just need the link extractor to send http://domain.com/page.asp?ProductID=3679 to the callback parse_item.

How would I write CrawlSpider rules to do this?

If this is not possible, what would be the best way to eventually parse all of the pages embedded in this format of JavaScript link on a defined set of start pages?

Thank you all.

Solution

You can use the attrs parameter of SgmlLinkExtractor:

  • attrs (list) – list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',)

and the process_value parameter from BaseSgmlLinkExtractor:

  • process_value (callable) – a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.

So you would write a parsing function for "onclick" attributes' values:

import re

def process_onclick(value):
    # Extract the first single-quoted argument of window.open(...)
    m = re.search(r"window\.open\('(.+?)'", value)
    if m:
        return m.group(1)
    # Falling through returns None, which makes the extractor skip the link

Let's check that regular expression:

>>> re.search(r"window\.open\('(.+?)'",
...           "window.open('page.asp?ProductID=3679','productwin','width=700,height=475,scrollbars,resizable,status');"
...          ).group(1)
'page.asp?ProductID=3679'
>>> 
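
Since process_onclick returns None for values that do not match, onclick handlers without a window.open('...') call are ignored, exactly as the process_value documentation quoted above describes. Continuing the same session as an illustrative check:

>>> process_onclick("window.open('page.asp?ProductID=3679','productwin');")
'page.asp?ProductID=3679'
>>> process_onclick("doSomethingElse(this);") is None
True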

And then use it in a Rule with SgmlLinkExtractor:

rules=(
    Rule(SgmlLinkExtractor(allow=(),
                           attrs=('onclick',),
                           process_value=process_onclick),
         callback='parse_item'),
)
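
Putting it all together, a minimal CrawlSpider might look like the sketch below. The spider name, allowed_domains, and start_urls are placeholders based on the question's http://domain.com example, and the import paths assume an older Scrapy release that still ships SgmlLinkExtractor. Note that the extractor joins the relative value returned by process_onclick ('page.asp?ProductID=3679') with the page URL, so parse_item receives absolute URLs such as http://domain.com/page.asp?ProductID=3679.

import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


def process_onclick(value):
    # Extract the first single-quoted argument of window.open(...)
    m = re.search(r"window\.open\('(.+?)'", value)
    if m:
        return m.group(1)


class ProductSpider(CrawlSpider):
    # Placeholder name/domain/start URL, taken from the question's example
    name = 'products'
    allowed_domains = ['domain.com']
    start_urls = ['http://domain.com/']

    rules = (
        Rule(SgmlLinkExtractor(allow=(),
                               attrs=('onclick',),
                               process_value=process_onclick),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # The extractor resolves relative links against the page URL,
        # so response.url is already absolute here
        self.log('Product page: %s' % response.url)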

