XPath application using the Tika parser


Question

I want to clean irregular web content (it may be HTML, PDF, images, etc., but it is mostly HTML). I am using the Tika parser for that, but I don't know how to apply XPath the way I do with HtmlCleaner.

The code I am using is:

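import java.net.URL;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
URL u = new URL("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
new HtmlParser().parse(u.openStream(), handler, metadata, context);
System.out.println(handler.toString());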

In this case I get no output, although for the URL google.com I do get output.

In either case, I don't know how to apply the XPath.

Any ideas, please?

I tried building my own XPath matcher, the same way BodyContentHandler does:

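import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.WriteOutContentHandler;
import org.apache.tika.sax.xpath.Matcher;
import org.apache.tika.sax.xpath.MatchingContentHandler;
import org.apache.tika.sax.xpath.XPathParser;
import org.xml.sax.ContentHandler;

HttpClient client = new HttpClient();
GetMethod method = new GetMethod("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
int status = client.executeMethod(method);
HtmlParser parse = new HtmlParser();
XPathParser parser = new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
//Matcher matcher = parser.parse("/xhtml:html/xhtml:body/descendant:node()");
Matcher matcher = parser.parse("/html/body//h1");
ContentHandler textHandler = new MatchingContentHandler(new WriteOutContentHandler(), matcher);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parse.parse(method.getResponseBodyAsStream(), textHandler, metadata, context);
System.out.println("content: " + textHandler.toString());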

But I am not getting the content for the given XPath.

Recommended answer

I'd suggest you take a look at the source code for BodyContentHandler, which comes with Tika. BodyContentHandler returns only the XML within the body tag, and it selects that content with an XPath.
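The XPath inside BodyContentHandler is namespace-qualified (/xhtml:html/xhtml:body/descendant::node()), because Tika reports elements in the XHTML namespace. As a rough illustration of why the prefix matters, here is a sketch using the JDK's own javax.xml.xpath rather than Tika's XPathParser, which only understands a small XPath subset. The class name, the sample markup, and the evaluate helper are made up for this example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: JDK XPath only, no Tika involved.
class NamespacedXPathDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Evaluates an XPath expression against a tiny namespaced XHTML document
    // and returns the string value of the first match ("" if nothing matches).
    static String evaluate(String expression) throws Exception {
        String xml = "<html xmlns=\"" + XHTML + "\"><body>"
                + "<h1>Title</h1><p>Text</p></body></html>";
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the "xhtml" prefix to the XHTML namespace URI.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "xhtml".equals(prefix) ? XHTML : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML.equals(uri) ? "xhtml" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return XHTML.equals(uri)
                        ? Collections.singletonList("xhtml").iterator()
                        : Collections.<String>emptyList().iterator();
            }
        });
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Unprefixed names look for elements in no namespace, so nothing matches.
        System.out.println("[" + evaluate("/html/body//h1") + "]");
        // Prefixed names match the namespaced elements.
        System.out.println("[" + evaluate("/xhtml:html/xhtml:body//xhtml:h1") + "]");
    }
}
```

The unprefixed path finds nothing, while the prefixed one finds the heading. The same reasoning appears to apply to the /html/body//h1 path in the question.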

In general, though, you should use a MatchingContentHandler to wrap your chosen ContentHandler with an XPath, which is what BodyContentHandler does internally.
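A minimal sketch of that wrapping, assuming Tika's HTML parser is on the classpath. The class name, the extractHeadings helper, and the inline sample page are invented for illustration. The path is prefixed with xhtml: and ends in descendant::node() in the same style as BodyContentHandler's own expression. As far as I can tell, the unprefixed /html/body//h1 from the question does not resolve against the XHTML namespace, which would explain the empty result.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.WriteOutContentHandler;
import org.apache.tika.sax.xpath.Matcher;
import org.apache.tika.sax.xpath.MatchingContentHandler;
import org.apache.tika.sax.xpath.XPathParser;
import org.xml.sax.ContentHandler;

// Hypothetical sketch class, not part of Tika.
class TikaXPathSketch {

    // Parses an HTML stream and returns the text found inside <h1> elements.
    static String extractHeadings(InputStream html) throws Exception {
        // Bind the "xhtml" prefix to the namespace Tika uses for its output.
        XPathParser xpathParser =
                new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
        // Select everything below any h1 that sits somewhere under body.
        Matcher matcher = xpathParser.parse(
                "/xhtml:html/xhtml:body//xhtml:h1/descendant::node()");

        // Keep a reference to the inner handler and read the text from it.
        ContentHandler text = new WriteOutContentHandler();
        ContentHandler handler = new MatchingContentHandler(text, matcher);

        new HtmlParser().parse(html, handler, new Metadata(), new ParseContext());
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div><h1>Title</h1>"
                + "<p>Body text</p></div></body></html>";
        InputStream in =
                new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));
        System.out.println("content: " + extractHeadings(in));
    }
}
```

Only events that match the path reach the wrapped handler, so any other ContentHandler can be swapped in for WriteOutContentHandler if XML output or a different sink is needed.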
