Parsing HTML with XPath, Python and Scrapy


Question

I am writing a Scrapy program to extract data.

This is the URL, and I want to scrape the 20111028013117 (code) information. I took the XPath from the Firefox add-on XPather. This is the path:

/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]

When I try to execute this:

try:
    temp_list = hxs.select("/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]").extract()
    print "temp_list:" + str(temp_list)
except:
    print "error"

It returns an empty list. I have been struggling to find an answer for the last 4 hours. I am a newbie to Scrapy, and even though I have handled issues well in other projects, this one seems a bit difficult.

Recommended answer

Your XPath doesn't work because of the tbody steps. Remove them and check whether you get the result you want.

You can read about this in the Scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html

Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won’t be able to extract any data if you use <tbody> in your XPath expressions.
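
As a rough illustration of the fix, the sketch below strips the tbody steps from a browser-generated XPath and compares the results before and after. It makes several assumptions that are not in the original post: strip_tbody is a hypothetical helper, RAW_HTML is made-up markup standing in for the real page, and the standard library's xml.etree.ElementTree is used in place of Scrapy's selector so the example runs without Scrapy installed. ElementTree supports only a subset of XPath, so the paths here are relative rather than absolute.

```python
import re
import xml.etree.ElementTree as ET


def strip_tbody(xpath):
    """Remove every tbody step (with an optional [n] index) from an XPath string.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    return re.sub(r"/tbody(?:\[\d+\])?(?=/|$)", "", xpath)


# Made-up markup as a server might send it: the table has no <tbody>.
# A browser would insert one when building its DOM.
RAW_HTML = (
    "<html><body><table>"
    "<tr><td>code</td><td>20111028013117</td></tr>"
    "</table></body></html>"
)

root = ET.fromstring(RAW_HTML)  # root is the <html> element

# Path as a browser add-on would report it, including the inserted tbody.
browser_xpath = "./body/table/tbody/tr/td[2]"
fixed_xpath = strip_tbody(browser_xpath)

# The browser path matches nothing in the raw markup.
with_tbody = [td.text for td in root.findall(browser_xpath)]
# The cleaned path finds the cell.
without_tbody = [td.text for td in root.findall(fixed_xpath)]

print("fixed xpath:  ", fixed_xpath)
print("with tbody:   ", with_tbody)
print("without tbody:", without_tbody)
```

The same string passed to hxs.select in the question could be run through such a helper first. Note that this only helps when the raw HTML really lacks tbody; if the page source itself contains tbody elements, removing them from the path would break the match, so it is worth checking the page source rather than the browser's DOM view.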
