Using XPath Contains against HTML in Java


Question

I'm scraping values from HTML pages in a Java program, using XPath to reach a specific tag and occasionally using regular expressions to clean up the data I receive.

After some research, I landed on HTML Cleaner ( http://htmlcleaner.sourceforge.net/ ) as the most reliable way to parse raw HTML into a good XML format. HTML Cleaner, however, only supports XPath 1.0, and I find myself needing functions like 'contains'. For instance, in this piece of XML:

<div>
  <td id='1234 foo 5678'>Hello</td>
</div>

I would like to be able to get the text 'Hello' with the following XPath:

//div/td[contains(@id, 'foo')]/text()

Is there any way to get this functionality? I have several ideas, but would prefer not to reinvent the wheel if I don't need to:


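  • If there is a way to call HTML Cleaner's evaluateXPath and have it return a TagNode (which I have not found), I could use an XML serializer on the returned TagNode and chain XPaths together to achieve the desired functionality.

  • I could use HTML Cleaner to clean to XML, serialize it back to a string, and use that with another XPath library, but I can't find a good Java XPath evaluator that works on a string.

  • Using TagNode functions like getElementsByAttValue, I could essentially recreate XPath evaluation and add the contains functionality using String.contains.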

Short question: Is there any way to use XPath contains on HTML inside an existing Java Library?

Recommended answer

Regarding this:

I could use HTML Cleaner to clean to XML, serialize it back to a string, and use that with another XPath library, but I can't find a good java XPath evaluator that works on a string.

This is exactly what I would do, except that you don't need to operate on a string (see below).

A lot of HTML parsers try to do too much. HTMLCleaner, for example, does not properly or completely implement the XPath 1.0 spec, even though contains is itself an XPath 1.0 function. The good news is that you don't need it to. All you need from HTMLCleaner is for it to parse the malformed input. Once it has done that, it's better to use the standard XML interfaces to deal with the resulting (now well-formed) document.

First convert the document into a standard org.w3c.dom.Document like this:

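import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
TagNode tagNode = new HtmlCleaner().clean(
        "<div><table><td id='1234 foo 5678'>Hello</td>");
org.w3c.dom.Document doc = new DomSerializer(
        new CleanerProperties()).createDOM(tagNode);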

And then use the standard JAXP interfaces to query it:

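import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

XPath xpath = XPathFactory.newInstance().newXPath();
String str = (String) xpath.evaluate("//div//td[contains(@id, 'foo')]/text()",
                       doc, XPathConstants.STRING);
System.out.println(str);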

Output:

Hello
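
On the question's concern about finding an XPath evaluator that works on a string: the JDK's own JAXP XPath API can evaluate an expression directly against an InputSource wrapping a string, as long as the markup is already well-formed. The following is a minimal sketch using only JDK classes. The class and method names (XPathContainsDemo, evaluate) are made up for illustration, and the input is the well-formed snippet from the question rather than raw HTML, which would still need a cleaner such as HTMLCleaner first.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of HTMLCleaner or the original answer.
class XPathContainsDemo {

    // Evaluates an XPath 1.0 expression against well-formed XML held in a String
    // and returns the string value of the result ("" when nothing matches).
    public static String evaluate(String xml, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        InputSource source = new InputSource(new StringReader(xml));
        return xpath.evaluate(expression, source);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the input is already well-formed XML.
        String xml = "<div><td id='1234 foo 5678'>Hello</td></div>";
        System.out.println(evaluate(xml, "//div/td[contains(@id, 'foo')]/text()"));
    }
}
```

Converting to an org.w3c.dom.Document, as in the answer, avoids the serialize-and-reparse round trip, so the string-based form is mainly useful when the XML already arrives as text.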
