How to extract data using JTidy and XPath


Question

I have to extract the company name and face value from http://money.rediff.com/companies/20-microns-ltd/15110088

I noticed that this task can be accomplished with the XPath API. Since this is an HTML page, I am using the JTidy parser.

This is the XPath for the face value that I have to extract:

/html/body/div[4]/div[6]/div[9]/div/table/tbody/tr[4]/td[2]

This is my code:

URL oracle = new URL("http://money.rediff.com/companies/20-microns-ltd/15110088");
URLConnection yc = oracle.openConnection();
InputStream is = yc.getInputStream();
is = oracle.openStream();
Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
Document tidyDOM = tidy.parseDOM(is, null);
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xPath = xPathFactory.newXPath();
String expression = "/html";
XPathExpression xPathExpression = xPath.compile(expression);
Object result = xPathExpression.evaluate(tidyDOM,XPathConstants.NODESET);
System.out.println(result.toString());

Please guide me further, because I cannot find the right solution for the above.

Recommended answer

Try not to use "full" XPaths.

//div[@id='leftcontainer']//div[9]//table//tr[4]/td[2]

is better than

/html/body/.../.../.../.../.../...
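As a rough illustration of the shorter style, the sketch below evaluates an expression anchored at an id and reads the text of the first matching node by casting the NODESET result to a NodeList, instead of printing the result object itself. It makes several assumptions: the SAMPLE markup is a hypothetical, heavily simplified stand-in for the real page, its cell values are placeholders, the class and method names are made up for this example, and the JDK's own XML parser is used on a well-formed string so that the example runs without JTidy. With JTidy, the Document returned by tidy.parseDOM would take the place of the parsed document.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AnchoredXPathDemo {
    // Hypothetical stand-in for the real page; the cell values are
    // placeholders, not data taken from the site.
    static final String SAMPLE =
        "<html><body><div id='leftcontainer'><div><table>"
        + "<tr><td>r1c1</td><td>r1c2</td></tr>"
        + "<tr><td>r2c1</td><td>r2c2</td></tr>"
        + "<tr><td>r3c1</td><td>r3c2</td></tr>"
        + "<tr><td>r4c1</td><td>r4c2</td></tr>"
        + "</table></div></div></body></html>";

    // Returns the text of the first node matched by the expression,
    // or null when nothing matches.
    static String firstText(String xml, String expression) {
        try {
            // Assumption: well-formed input parsed with the JDK parser.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // A NODESET result is a NodeList; cast it and read the nodes.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Anchored at the id, so the path above the container does not matter.
        System.out.println(
            firstText(SAMPLE, "//div[@id='leftcontainer']//table//tr[4]/td[2]"));
    }
}
```

The expression here omits the div[9] step of the answer's XPath only because the sample markup has a single inner div; the real page layout is not reproduced.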

Most HTML pages are not valid, or even well-formed, so the DOM structure may change when the page is processed by "real-world HTML parsers". For example, a <tbody> may be inserted under <table> if there isn't one. Things get worse when different HTML parsers generate different DOM trees, because an XPath may then be valid for one parser but not for another. I would rather use "wildcards" such as table//tr[4] instead of table/tbody/tr[4] or table/tr[4], so that I can forget about <tbody>. Such expressions are more robust when used against messy real-world HTML pages.
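The <tbody> point can be checked with a small experiment. The sketch below builds two hypothetical table fragments, one with a <tbody> and one without, and counts how many nodes each style of expression matches. The markup, the class name and the helper name are all made up for illustration, and the JDK's XML parser stands in for an HTML parser, so this only mimics the two DOM shapes a parser might produce.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TbodyDemo {
    // Hypothetical DOM shape produced by a parser that inserts <tbody>.
    static final String WITH_TBODY =
        "<table><tbody>"
        + "<tr><td>a</td><td>b</td></tr>"
        + "<tr><td>c</td><td>d</td></tr>"
        + "</tbody></table>";

    // Hypothetical DOM shape produced by a parser that leaves the table as written.
    static final String WITHOUT_TBODY =
        "<table>"
        + "<tr><td>a</td><td>b</td></tr>"
        + "<tr><td>c</td><td>d</td></tr>"
        + "</table>";

    // Counts the nodes selected by the expression in the given markup.
    static int matches(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "//table/tbody/tr[2]/td[2]", // requires <tbody>
            "//table/tr[2]/td[2]",       // requires no <tbody>
            "//table//tr[2]/td[2]"       // tolerates both shapes
        };
        for (String e : expressions) {
            System.out.println(e + " with tbody: " + matches(WITH_TBODY, e)
                + ", without tbody: " + matches(WITHOUT_TBODY, e));
        }
    }
}
```

Only the expression with table//tr selects the cell in both fragments; each of the two strict forms selects it in one fragment and nothing in the other.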

To debug XPath expressions, you can use Firepath, an extension for Firebug, which is itself a Firefox add-on.

P.S. You can try my JHQL project (http://github.com/wks/jhql) for exactly this task. You will like it if you have more pages to extract data from.
