Parsing HTML with XPath, Python and Scrapy
Question
I am writing a Scrapy program to extract data.
This is the URL, and I want to scrape the 20111028013117 (code) information. I took the XPath from the Firefox add-on XPather. This is the path:
/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]
When I try to execute this:
try:
    temp_list = hxs.select("/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]").extract()
    print "temp_list:" + str(temp_list)
except:
    print "error"
It returns an empty list, and I have been struggling to find an answer for the last 4 hours. I am new to Scrapy; even though I have handled issues well in other projects, this one seems a bit difficult.
Recommended answer
The reason your XPath doesn't work is the tbody element. You have to remove it from the expression and check whether you get the result you want.
You can read about this in the Scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html
"Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions."
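To make the fix concrete, here is a minimal sketch of stripping the tbody steps from a browser-generated XPath. The helper name strip_tbody and the small sample page are hypothetical, and the standard library's ElementTree stands in for Scrapy's selector so the example runs on its own; it only accepts well-formed markup and relative paths, which is why the short demo path starts with "./".

```python
import re
import xml.etree.ElementTree as ET


def strip_tbody(xpath):
    """Remove every tbody step (with an optional [n] index) from an XPath string."""
    return re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)


# Hypothetical page source as the server sends it: the markup has no <tbody>.
RAW_HTML = """<html><body><table>
<tr><td>label</td><td>20111028013117</td></tr>
</table></body></html>"""

root = ET.fromstring(RAW_HTML)  # root is the <html> element

# The kind of path a browser add-on reports, and the same path with tbody removed.
browser_xpath = "./body/table/tbody/tr/td[2]"
fixed_xpath = strip_tbody(browser_xpath)

with_tbody = [td.text for td in root.findall(browser_xpath)]
without_tbody = [td.text for td in root.findall(fixed_xpath)]

print("with tbody:   ", with_tbody)
print("without tbody:", without_tbody)

# The same helper applied to the path from the question.
question_xpath = (
    "/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr"
    "/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td"
    "/table/tbody/tr[2]/td[2]"
)
print(strip_tbody(question_xpath))
```

In the spider from the question, the last printed expression is what would be passed to hxs.select(...) in place of the original path. Whether the remaining steps match the real page still has to be checked against the raw HTML, and a page whose source really does contain tbody tags would need them kept.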