Using beautiful soup to clean up scraped HTML from scrapy


Problem Description

I'm using scrapy to try and scrape some data that I need off Google Scholar. Consider, as an example, the following link: http://scholar.google.com/scholar?q=intitle%3Apython+xpath

Now, I'd like to scrape all the titles off this page. The process that I am following is as follows:

scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"

which gives me the scrapy shell, inside which I do:

>>> sel.xpath('//h3[@class="gs_rt"]/a').extract()

[
 u'<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.4438&amp;rep=rep1&amp;type=pdf"><b>Python </b>Paradigms for XML</a>', 
 u'<a href="https://svn.eecs.jacobs-university.de/svn/eecs/archive/bsc-2009/sbhushan.pdf">NCClient: A <b>Python </b>Library for NETCONF Clients</a>', 
 u'<a href="http://hal.archives-ouvertes.fr/hal-00759589/">PALSE: <b>Python </b>Analysis of Large Scale (Computer) Experiments</a>', 
 u'<a href="http://i.iinfo.cz/r2/kd/xmlprague2007.pdf#page=53"><b>Python </b>and XML</a>', 
 u'<a href="http://www.loadaveragezero.com/app/drx/Programming/Languages/Python/">drx: <b>Python </b>Programming Language [Computers: Programming: Languages: <b>Python</b>]-loadaverageZero</a>', 
 u'<a href="http://www.worldcolleges.info/sites/default/files/py10.pdf">XML and <b>Python </b>Tutorial</a>', 
 u'<a href="http://dl.acm.org/citation.cfm?id=2555791">Zato\u2014agile ESB, SOA, REST and cloud integrations in <b>Python</b></a>', 
 u'<a href="ftp://ftp.sybex.com/4021/4021index.pdf">XML Processing with Perl, <b>Python</b>, and PHP</a>', 
 u'<a href="http://books.google.com/books?hl=en&amp;lr=&amp;id=El4TAgAAQBAJ&amp;oi=fnd&amp;pg=PT8&amp;dq=python+xpath&amp;ots=RrFv0f_Y6V&amp;sig=tSXzPJXbDi6KYnuuXEDnZCI7rDA"><b>Python </b>&amp; XML</a>', 
 u'<a href="https://code.grnet.gr/projects/ncclient/repository/revisions/efed7d4cd5ac60cbb7c1c38646a6d6dfb711acc9/raw/docs/proposal.pdf">A <b>Python </b>Module for NETCONF Clients</a>'
]

As you can see, this output is raw HTML that needs cleaning. I now have a good sense of how to clean this HTML up. The simplest way is probably to just use BeautifulSoup and try something like:

t = sel.xpath('//h3[@class="gs_rt"]/a').extract()
soup = BeautifulSoup(t)
text_parts = soup.findAll(text=True)
text = ''.join(text_parts)

This is based on an earlier SO question. A regexp version has been suggested there, but I am guessing that BeautifulSoup will be more robust.
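One wrinkle I'm not sure about: extract() returns a list of unicode strings rather than a single string, so presumably BeautifulSoup would need to be fed the snippets one at a time (or the list joined into one string first). Something along these lines, which I haven't verified:

t = sel.xpath('//h3[@class="gs_rt"]/a').extract()
# extract() gives a list of HTML fragments, so parse each fragment separately
titles = [BeautifulSoup(fragment).get_text() for fragment in t]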

I'm a scrapy n00b and can't figure out how to embed this in my spider. I tried:

from scrapy.spider import Spider
from scrapy.selector import Selector
from bs4 import BeautifulSoup

from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "scholar"
    allowed_domains = ["scholar.google.com"]
    start_urls = [
        "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"
    ]

    def parse(self, response):
        sel = Selector(response)
        item = ScholarscrapeItem()        
        t = sel.xpath('//h3[@class="gs_rt"]/a').extract()
        soup = BeautifulSoup(t)
        text_parts = soup.findAll(text=True)
        text = ''.join(text_parts)
        item['title'] = text
        return(item)

But that didn't quite work. Any suggestions would be helpful.


Edit 3: Based on suggestions, I have modified my spider file to:

from scrapy.spider import Spider
from scrapy.selector import Selector
from bs4 import BeautifulSoup

from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "dmoz"
    allowed_domains = ["sholar.google.com"]
    start_urls = [
        "http://scholar.google.com/scholar?q=intitle%3Anine+facts+about+top+journals+in+economics"
    ]

    def parse(self, response):
        sel = Selector(response)
        item = ScholarscrapeItem()        
        titles = sel.xpath('//h3[@class="gs_rt"]/a')

        for title in titles:
            title = item.xpath('.//text()').extract()
            print "".join(title)

However, I get the following output:

2014-02-17 15:11:12-0800 [scrapy] INFO: Scrapy 0.22.2 started (bot: scholarscrape)
2014-02-17 15:11:12-0800 [scrapy] INFO: Optional features available: ssl, http11
2014-02-17 15:11:12-0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scholarscrape.spiders', 'SPIDER_MODULES': ['scholarscrape.spiders'], 'BOT_NAME': 'scholarscrape'}
2014-02-17 15:11:12-0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled item pipelines:
2014-02-17 15:11:13-0800 [dmoz] INFO: Spider opened
2014-02-17 15:11:13-0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-02-17 15:11:13-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-02-17 15:11:13-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-02-17 15:11:13-0800 [dmoz] DEBUG: Crawled (200) <GET http://scholar.google.com/scholar?q=intitle%3Apython+xml> (referer: None)
2014-02-17 15:11:13-0800 [dmoz] ERROR: Spider error processing <GET http://scholar.google.com/scholar?q=intitle%3Apython+xml>
 Traceback (most recent call last):
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1178, in mainLoop
     self.runUntilCurrent()
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
     call.func(*call.args, **call.kw)
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 368, in callback
     self._startRunCallbacks(result)
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 464, in _startRunCallbacks
     self._runCallbacks()
 --- <exception caught here> ---
   File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 551, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/Users/krishnan/work/research/journals/code/scholarscrape/scholarscrape/spiders/scholar_spider.py", line 20, in parse
     title = item.xpath('.//text()').extract()
   File "/Library/Python/2.7/site-packages/scrapy/item.py", line 65, in __getattr__
     raise AttributeError(name)
 exceptions.AttributeError: xpath

2014-02-17 15:11:13-0800 [dmoz] INFO: Closing spider (finished)
2014-02-17 15:11:13-0800 [dmoz] INFO: Dumping Scrapy stats:
 {'downloader/request_bytes': 247,
  'downloader/request_count': 1,
  'downloader/request_method_count/GET': 1,
  'downloader/response_bytes': 108851,
  'downloader/response_count': 1,
  'downloader/response_status_count/200': 1,
  'finish_reason': 'finished',
  'finish_time': datetime.datetime(2014, 2, 17, 23, 11, 13, 196648),
  'log_count/DEBUG': 3,
  'log_count/ERROR': 1,
  'log_count/INFO': 7,
  'response_received_count': 1,
  'scheduler/dequeued': 1,
  'scheduler/dequeued/memory': 1,
  'scheduler/enqueued': 1,
  'scheduler/enqueued/memory': 1,
  'spider_exceptions/AttributeError': 1,
  'start_time': datetime.datetime(2014, 2, 17, 23, 11, 13, 21701)}
2014-02-17 15:11:13-0800 [dmoz] INFO: Spider closed (finished)


Edit 2: My original question was quite different, but I am now convinced that this is the right way to proceed. Original question (and first edit below):

I'm using scrapy to try and scrape some data that I need off Google Scholar. Consider, as an example, the following link:

http://scholar.google.com/scholar?q=intitle%3Apython+xpath

Now, I'd like to scrape all the titles off this page. The process that I am following is as follows:

scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"

which gives me the scrapy shell, inside which I do:

>>> sel.xpath('string(//h3[@class="gs_rt"]/a)').extract()
[u'Python Paradigms for XML']

As you can see, this only selects the first title, and none of the others on the page. I can't figure out what I should modify my XPath to, so that I select all such elements on the page. Any help is greatly appreciated.


Edit 1: My first approach was to try

>>> sel.xpath('//h3[@class="gs_rt"]/a/text()').extract()
[u'Paradigms for XML', u'NCClient: A ', u'Library for NETCONF Clients', 
 u'PALSE: ', u'Analysis of Large Scale (Computer) Experiments', u'and XML', 
 u'drx: ', u'Programming Language [Computers: Programming: Languages: ',
 u']-loadaverageZero', u'XML and ', u'Tutorial', 
 u'Zato\u2014agile ESB, SOA, REST and cloud integrations in ', 
 u'XML Processing with Perl, ', u', and PHP', u'& XML', u'A ', 
 u'Module for NETCONF Clients']

The problem with this approach is that if you look at the actual Google Scholar page, you will see that the first entry is actually 'Python Paradigms for XML' and not 'Paradigms for XML' as scrapy returns. My guess for this behaviour is that 'Python' is trapped inside <b> tags, which is why text() is not doing what we want it to do.

Solution

This is a really interesting and rather difficult question. The problem you're facing is that "Python" in the title is in bold, so it is treated as a node, while the rest of the title is simply text; text() therefore extracts only the textual content and not the content of the <b> node.

Here's my solution. First get all the links:

titles = sel.xpath('//h3[@class="gs_rt"]/a')

Then iterate over them and select all the textual content of each node; in other words, join the text inside the <b> node with the plain text nodes among the children of each link:

for item in titles:
    title = item.xpath('.//text()').extract()
    print "".join(title)

This works because inside the for loop you are dealing with the textual content of the children of each link separately, so you can join the pieces that belong together. title in the loop will be, for instance, [u'Python ', u'Paradigms for XML'] or [u'NCClient: A ', u'Python ', u'Library for NETCONF Clients'].
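For completeness, the AttributeError in Edit 3 comes from calling xpath() on the ScholarscrapeItem (item.xpath(...)) instead of on the selector produced by the loop; Scrapy items have no xpath method, hence exceptions.AttributeError: xpath. A minimal sketch of how the loop might be embedded in the spider, assuming, as in the question, that ScholarscrapeItem defines a title field, could look like:

from scrapy.spider import Spider
from scrapy.selector import Selector

from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "scholar"
    allowed_domains = ["scholar.google.com"]
    start_urls = [
        "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"
    ]

    def parse(self, response):
        sel = Selector(response)
        # one item per search result, with the <b> and plain text pieces joined
        for link in sel.xpath('//h3[@class="gs_rt"]/a'):
            item = ScholarscrapeItem()
            item['title'] = u"".join(link.xpath('.//text()').extract())
            yield item

Yielding one item per link keeps each title separate, rather than collapsing all of them into a single string as the first spider did.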
