Scraping text without JavaScript code using Scrapy
Problem description
I'm currently setting up a bunch of spiders using Scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc.) from the target sites.
The problem is that my target node sometimes contains a <script> tag, so the scraped text contains JavaScript code.
Here is a link to a real example of what I'm working with. In this case my target node is //td[@id='contenuStory']. The problem is that there's a <script> tag in the first child div.
I've spent a lot of time searching for a solution on the web and on SO, but I couldn't find anything. I hope I haven't missed something obvious!
Example
HTML response (only the target node):
    <div id="content">
        <div id="part1">Some text</div>
        <script>var s = 'javascript I don't want';</script>
        <div id="part2">Some other text</div>
    </div>
What I want in my item:
    Some text
    Some other text
What I get:
    Some text
    var s = 'javascript I don't want';
    Some other text
My code
Given an XPath selector, I'm using the following function to extract the text:
    def getText(hxs):
        if len(hxs) > 0:
            l = hxs.select('string(.)')
            if len(l) > 0:
                s = l[0].extract().encode('utf-8')
            else:
                s = hxs[0].extract().encode('utf-8')
            return s
        else:
            return 0
I've tried using XPath axes (things like child::script), but to no avail.
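Try the utility functions from w3lib.html:
    from w3lib.html import remove_tags, remove_tags_with_content

    # extract() returns a list of strings, so take the first match
    html = hxs.select('//div[@id="content"]').extract()[0]
    # drop <script> elements together with their content, then strip the remaining tags
    output = remove_tags(remove_tags_with_content(html, ('script',)))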
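The same idea (discard script elements together with their content, keep the remaining text) can also be sketched with only the standard library, which makes it easy to try outside a Scrapy project. This is an illustration under assumptions, not what w3lib does internally: the class name TextWithoutScripts, the helper text_without_scripts, and the sample string are hypothetical, and html.parser is swapped in for w3lib here.

```python
from html.parser import HTMLParser


class TextWithoutScripts(HTMLParser):
    """Collect text nodes, skipping anything inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.script_depth = 0  # greater than 0 while inside a <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_depth += 1

    def handle_endtag(self, tag):
        if tag == "script" and self.script_depth > 0:
            self.script_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside <script>
        if self.script_depth == 0 and data.strip():
            self.parts.append(data.strip())


def text_without_scripts(html):
    """Return the visible text of an HTML fragment, one text node per line."""
    parser = TextWithoutScripts()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical sample mirroring the target node from the question
sample = (
    '<div id="content">'
    '<div id="part1">Some text</div>'
    "<script>var s = 'javascript I don't want';</script>"
    '<div id="part2">Some other text</div>'
    "</div>"
)
print(text_without_scripts(sample))
```

Inside a spider, the string passed to text_without_scripts would be one element of the list returned by extract(). If you would rather stay in XPath, selecting text nodes with a predicate such as text()[not(ancestor::script)] is another option worth trying.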