Scrapy: Extract commented (hidden) content

Problem description

How can I extract content from within commented-out tags with Scrapy?

For instance, how can I extract "Yellow" from the following example:

<div class="fruit">
    <div class="infos">
        <h2 class="Name">Banana</h2>
        <span class="edible">Edible: Yes</span>
    </div>
    <!--
    <p class="color">Yellow</p>
    -->
</div>

Solution

You can use an XPath expression like //comment() to select the comment nodes. The extracted string still includes the comment delimiters (<!-- and -->), so strip those first and then parse the remaining markup with a second selector.

Example scrapy shell session:

paul@wheezy:~$ scrapy shell 
...
In [1]: doc = """<div class="fruit">
   ...:     <div class="infos">
   ...:         <h2 class="Name">Banana</h2>
   ...:         <span class="edible">Edible: Yes</span>
   ...:     </div>
   ...:     <!--
   ...:     <p class="color">Yellow</p>
   ...:     -->
   ...: </div>"""

In [2]: from scrapy.selector import Selector

In [4]: selector = Selector(text=doc, type="html")

In [5]: import re

In [6]: regex = re.compile(r'<!--(.*)-->', re.DOTALL)

In [7]: selector.xpath('//comment()').re(regex)
Out[7]: [u'\n    <p class="color">Yellow</p>\n    ']

In [8]: comment = selector.xpath('//comment()').re(regex)[0]

In [9]: commentsel = Selector(text=comment, type="html")

In [10]: commentsel.css('p.color')
Out[10]: [<Selector xpath=u"descendant-or-self::p[@class and contains(concat(' ', normalize-space(@class), ' '), ' color ')]" data=u'<p class="color">Yellow</p>'>]

In [11]: commentsel.css('p.color').extract()
Out[11]: [u'<p class="color">Yellow</p>']

In [12]: commentsel.css('p.color::text').extract()
Out[12]: [u'Yellow']
