XPath application using Tika parser

Question

I want to clean irregular web content (it may be HTML, PDF, images, etc., but it is mostly HTML). I am using the Tika parser for that, but I don't know how to apply XPath the way I do in HtmlCleaner.

The code I am using is:

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
URL u = new URL("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
new HtmlParser().parse(u.openStream(), handler, metadata, context);
System.out.println(handler.toString());

In this case, however, I get no output, whereas for the URL google.com I do get output.

In either case, I don't know how to apply the XPath.

Any ideas, please?

I also tried supplying my own XPath, in the same way that BodyContentHandler does:

HttpClient client = new HttpClient();
GetMethod method = new GetMethod("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
int status = client.executeMethod(method);
HtmlParser parse = new HtmlParser();
XPathParser parser = new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
//Matcher matcher = parser.parse("/xhtml:html/xhtml:body/descendant:node()");
Matcher matcher = parser.parse("/html/body//h1");
ContentHandler textHandler = new MatchingContentHandler(new WriteOutContentHandler(), matcher);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parse.parse(method.getResponseBodyAsStream(), textHandler, metadata, context);
System.out.println("content: " + textHandler.toString());

But I am not getting the content at the given XPath.

Recommended answer

I'd suggest you take a look at the source code for BodyContentHandler, which comes with Tika. BodyContentHandler returns only the XML within the body tag, which it selects with an XPath expression.

In general, though, you should use a MatchingContentHandler to wrap your chosen ContentHandler with an XPath matcher, which is what BodyContentHandler does internally.
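One likely reason the unprefixed expression /html/body//h1 in the question matches nothing is namespaces: Tika reports HTML as XHTML SAX events in the http://www.w3.org/1999/xhtml namespace, which is why the expression used by BodyContentHandler carries the xhtml: prefix. The sketch below illustrates that effect using only the JDK's javax.xml.xpath API rather than Tika's own XPathParser, which is a separate and much more limited XPath subset. The class name XhtmlNamespaceDemo, the helper firstMatch, and the sample document are all hypothetical and exist only for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: shows how XPath names interact with the XHTML namespace.
final class XhtmlNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Assumed sample input: a tiny document whose elements live in the XHTML namespace.
    static final String SAMPLE =
            "<html xmlns=\"" + XHTML_NS + "\">"
            + "<head><title>Sample</title></head>"
            + "<body><div><h1>Drop moment</h1></div><p>Other text</p></body>"
            + "</html>";

    // Returns the string value of the first node matched by the expression,
    // or an empty string when nothing matches.
    static String firstMatch(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace-based matching
            Document dom = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the "xhtml" prefix to the XHTML namespace, mirroring
            // new XPathParser("xhtml", "http://www.w3.org/1999/xhtml") in the question.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "xhtml".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "xhtml" : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    String prefix = getPrefix(uri);
                    return prefix == null
                            ? Collections.<String>emptyIterator()
                            : Collections.singleton(prefix).iterator();
                }
            });
            return xpath.evaluate(expression, dom);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed names refer to no namespace, so nothing matches: prints "unprefixed: []"
        System.out.println("unprefixed: [" + firstMatch("/html/body//h1") + "]");
        // Prefixed names match the XHTML elements: prints "prefixed:   [Drop moment]"
        System.out.println("prefixed:   [" + firstMatch("/xhtml:html/xhtml:body//xhtml:h1") + "]");
    }
}
```

With Tika itself, the analogous change would be to pass a prefixed expression to parser.parse(...) before wrapping the handler in MatchingContentHandler. Modelled on the expression BodyContentHandler uses, something like /xhtml:html/xhtml:body//xhtml:h1/descendant::node() should select the heading text, but that expression is an untested assumption here and should be checked against the XPathParser documentation for your Tika version.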
