Scrapy: restrict_css with badly formatted HTML
Question
The HTML code I am trying to crawl is badly formatted:
<html>
<head>...</head>
<body>
My items here...
My items here...
My items here...
Pagination here...
</body>
</head>
</html>
The problem is the second `</head>`. I have to fix the HTML in my spider before I can use XPath expressions:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class FooSpider(CrawlSpider):
    name = 'foo'
    allowed_domains = ['foo.bar']
    start_urls = ['http://foo.bar/index.php?page=1']

    rules = (Rule(SgmlLinkExtractor(allow=(r'\?page=\d',)),
                  callback="parse_start_url",
                  follow=True),)

    def parse_start_url(self, response):
        # Remove the second </head> here
        # Then build my items
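The in-spider replacement mentioned in the comment above could look like this. This is a minimal stdlib sketch, not part of the original question; the helper name `drop_stray_close_head` is made up for illustration:

```python
def drop_stray_close_head(html: str) -> str:
    """Keep the first </head> and remove any later duplicates."""
    tag = "</head>"
    first = html.find(tag)
    if first == -1:
        return html
    cut = first + len(tag)
    # Everything after the first closing tag loses its stray copies.
    return html[:cut] + html[cut:].replace(tag, "")
```

Inside `parse_start_url` you could then build a selector from the cleaned string, e.g. `Selector(text=drop_stray_close_head(response.text))`, and run your XPath expressions against that instead of `response`.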
Now I want to use the `restrict_xpaths` argument in my rule, but I can't because the HTML is badly formatted: the replacement has not been performed at that point.
Do you have any ideas?
Recommended Answer
What I would do is write a downloader middleware and use, for instance, the BeautifulSoup package to fix and prettify the HTML contained in `response.body`; `response.replace()` might be handy in this case.
Note that, if you go with BeautifulSoup, choose the parser carefully: each parser has its own way of handling broken HTML, some more lenient than others. `lxml.html` would be the best in terms of speed, though.
Example:
from bs4 import BeautifulSoup


class MyMiddleware(object):
    def process_response(self, request, response, spider):
        # Re-parse the broken HTML leniently, then replace the body
        # with the cleaned-up markup before it reaches the spider.
        soup = BeautifulSoup(response.body, "lxml")
        response = response.replace(body=soup.prettify())
        return response
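The middleware still has to be activated in the project's `settings.py` via the `DOWNLOADER_MIDDLEWARES` setting. A sketch, assuming the class lives in a module called `myproject.middlewares` (that path and the priority value 543 are placeholders to adjust for your project):

```python
# settings.py -- module path 'myproject.middlewares' is hypothetical;
# point it at wherever MyMiddleware actually lives.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyMiddleware': 543,
}
```

The priority number controls where the middleware sits in the chain; for `process_response`, lower-priority middlewares run closer to the spider.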
For an example of a custom middleware that modifies the downloaded HTML, see the scrapy-splash middleware.