XPath application using Tika parser
Question
I want to clean irregular web content (it may be HTML, PDF, images, etc., but it is mostly HTML). I am using the Tika parser for that, but I don't know how to apply XPath the way I do with HtmlCleaner.
The code I am using is:
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
URL u = new URL("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
new HtmlParser().parse(u.openStream(), handler, metadata, context);
System.out.println(handler.toString());
But in this case I get no output, although for the URL google.com I do get output.
In either case, I don't know how to apply the XPath.
Any ideas, please?
I tried building my own XPath matcher the same way BodyContentHandler does:
HttpClient client = new HttpClient();
GetMethod method = new GetMethod("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
int status = client.executeMethod(method);
HtmlParser parse = new HtmlParser();
XPathParser parser = new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
//Matcher matcher = parser.parse("/xhtml:html/xhtml:body/descendant:node()");
Matcher matcher = parser.parse("/html/body//h1");
ContentHandler textHandler = new MatchingContentHandler(new WriteOutContentHandler(), matcher);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parse.parse(method.getResponseBodyAsStream(), textHandler, metadata, context);
System.out.println("content: " + textHandler.toString());
But I am not getting the content at the given XPath.
Recommended answer
I'd suggest you take a look at the source code for BodyContentHandler, which comes with Tika. BodyContentHandler returns only the XML within the body tag, selected by an XPath expression.
In general, though, you should use a MatchingContentHandler to wrap your chosen ContentHandler with an XPath matcher, which is what BodyContentHandler does internally.
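As a rough illustration of that wrapping, here is a minimal sketch, assuming the Tika 1.x/2.x API used in the question (HtmlParser, XPathParser, MatchingContentHandler, WriteOutContentHandler). The class name TikaXPathSketch, the helper extract, and the inline HTML string are hypothetical and only there to keep the example self-contained. Two assumptions worth checking against your Tika version: the HTML parser emits elements in the XHTML namespace, so each element step in the expression carries the xhtml: prefix registered with XPathParser, and Tika's XPath support is a small subset, so the expression ends in /descendant::node() to let the text inside the matched element through.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.WriteOutContentHandler;
import org.apache.tika.sax.xpath.Matcher;
import org.apache.tika.sax.xpath.MatchingContentHandler;
import org.apache.tika.sax.xpath.XPathParser;
import org.xml.sax.ContentHandler;

public class TikaXPathSketch {

    // Hypothetical helper (not part of Tika): parses HTML and returns only the
    // text that the given Tika XPath expression lets through.
    static String extract(InputStream html, String xpath) throws Exception {
        // Register the "xhtml" prefix for the XHTML namespace, as in the question.
        XPathParser xpathParser = new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
        Matcher matcher = xpathParser.parse(xpath);

        // Collects character events into an internal string buffer.
        WriteOutContentHandler out = new WriteOutContentHandler();

        // Forwards only the SAX events accepted by the matcher to 'out'.
        ContentHandler handler = new MatchingContentHandler(out, matcher);

        new HtmlParser().parse(html, handler, new Metadata(), new ParseContext());
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Inline sample page (an assumption for illustration; no network access needed).
        String page = "<html><body><h1>Title</h1><p>paragraph</p></body></html>";
        InputStream in = new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));

        // Every element step is namespace-prefixed; the trailing
        // /descendant::node() selects the text below each h1.
        String xpath = "/xhtml:html/xhtml:body//xhtml:h1/descendant::node()";
        System.out.println("content: " + extract(in, xpath));
    }
}
```

To run it against a live page, pass the response stream (for example method.getResponseBodyAsStream() from the question) to extract instead of the in-memory stream. Keeping a reference to the WriteOutContentHandler and calling toString() on it avoids relying on how the wrapping handler implements toString().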