Scrapy: restrict_css with badly formatted HTML


Question

The HTML I am trying to crawl is badly formatted:

<html>
<head>...</head>
<body>
    My items here...
    My items here...
    My items here...

    Pagination here...
</body>
</head>
</html>

The problem is the second </head>. I have to patch the HTML in my spider before I can use XPath expressions:

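from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FooSpider(CrawlSpider):
    name = 'foo'
    allowed_domains = ['foo.bar']
    start_urls = ['http://foo.bar/index.php?page=1']
    # LinkExtractor stands in for SgmlLinkExtractor, which current Scrapy no longer ships
    rules = (
        Rule(LinkExtractor(allow=(r'\?page=\d',)),
             callback='parse_start_url',
             follow=True),
    )

    def parse_start_url(self, response):
        # Remove the second </head> here
        # Build my items
        return []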

Now I want to use the restrict_xpaths argument in my rule, but I can't, because the HTML is malformed: the replacement has not yet been performed when the rule extracts its links.

Do you have any ideas?

Recommended answer

What I would do is write a downloader middleware and use, for instance, the BeautifulSoup package to fix and prettify the HTML contained in response.body. response.replace() might be handy in this case.

Note that if you go with BeautifulSoup, you should choose the parser carefully: each parser has its own way of handling broken HTML, and some are more lenient than others. lxml.html would be the best in terms of speed, though.

Example:

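from bs4 import BeautifulSoup
from scrapy.http import HtmlResponse


class MyMiddleware:
    def process_response(self, request, response, spider):
        # Leave non-HTML responses (images, JSON, ...) untouched
        if not isinstance(response, HtmlResponse):
            return response
        # Re-parse with a lenient parser, then swap in the repaired markup
        soup = BeautifulSoup(response.body, "lxml")
        return response.replace(body=soup.prettify())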
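
The middleware only takes effect once it is enabled in the project settings. A minimal sketch of that step follows. The module path myproject.middlewares is a hypothetical name, and 543 is an illustrative priority, chosen so that MyMiddleware should see the body after Scrapy's built-in HTTP decompression has run.

```python
# settings.py (sketch): "myproject.middlewares" is a hypothetical module path;
# replace it with wherever MyMiddleware actually lives in your project.
DOWNLOADER_MIDDLEWARES = {
    # 543 is an illustrative priority, not a required value.
    "myproject.middlewares.MyMiddleware": 543,
}
```

Because downloader middlewares run before a response reaches the spider, the rules of a CrawlSpider (including restrict_xpaths or restrict_css) would then operate on the repaired HTML.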

For an example of a custom middleware that modifies the downloaded HTML, see the scrapy-splash middleware.
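
If the stray second </head> is the only defect, a narrower alternative to re-parsing the whole page is to strip that tag from the raw bytes. The sketch below is not part of the original answer and uses only the standard library. The helper name drop_stray_head_close and the sample markup are made up for illustration.

```python
import re

# Matches a closing </head> tag, tolerating case and inner whitespace.
_HEAD_CLOSE = re.compile(rb"</head\s*>", re.IGNORECASE)


def drop_stray_head_close(body: bytes) -> bytes:
    """Keep the first </head> and remove any later ones (hypothetical helper)."""
    first = _HEAD_CLOSE.search(body)
    if first is None:
        return body  # nothing to repair
    head, tail = body[:first.end()], body[first.end():]
    return head + _HEAD_CLOSE.sub(b"", tail)


# Sample markup mirroring the structure described in the question.
broken = b"<html><head></head><body>items</body></head></html>"
fixed = drop_stray_head_close(broken)
print(fixed)
```

Inside process_response, calling response.replace(body=drop_stray_head_close(response.body)) would apply the same repair before the rule's link extractor sees the page.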
