Microdata extraction from HTML in Java


Question

I really need help extracting microdata embedded in HTML5. My goal is to get structured data from a web page, just like this Google tool does: http://www.google.com/webmasters/tools/richsnippets. I have searched a lot but have not found a workable solution.

Currently I use the any23 library, but I can't find any documentation for it other than the Javadocs, which don't give me enough information.

I use any23's Microdata Extractor but I am stuck at the third parameter: "org.w3c.dom.Document in". I can't parse HTML content into a W3C DOM. I have tried JTidy as well as JSoup, but the DOM objects in these libraries don't fit the Extractor constructor. In addition, I am also unsure about the 2nd parameter of the Microdata Extractor.

I hope someone can help me get this working with any23, or suggest another library that can solve this extraction problem.

Edit: I found a solution myself by doing it the same way the any23 command-line tool does. Here is the code snippet:

// Fetch the page; `value` holds the URL of the page to analyse.
HTTPDocumentSource doc = new HTTPDocumentSource(DefaultHTTPClient.createInitializedHTTPClient(), value);
InputStream documentInputStream = doc.openInputStream();
// TagSoup tolerates malformed HTML and yields an org.w3c.dom.Document.
TagSoupParser tagSoupParser = new TagSoupParser(documentInputStream, doc.getDocumentURI());
Document document = tagSoupParser.getDOM();
// Write the extracted microdata as JSON into an in-memory buffer,
// encoding and decoding with the same charset.
ByteArrayOutputStream byteArrayOutput = new ByteArrayOutputStream();
MicrodataParser.getMicrodataAsJSON(document, new PrintStream(byteArrayOutput, true, "UTF-8"));
String result = byteArrayOutput.toString("UTF-8");

These lines of code only extract microdata from HTML and write it out in JSON format. I tried to use MicrodataExtractor, which can change the output format to others (RDF, Turtle, ...), but its input document seems to accept only XML. It throws "Document didn't start" when I pass in an HTML document.

If anyone finds a way to use MicrodataExtractor, please leave an answer here. Thank you.

Recommended answer

XPath is generally the way to consume HTML or XML.

See: How to read XML using XPath in Java
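
To make the XPath suggestion concrete, here is a minimal sketch using only the JDK's built-in XML and XPath APIs. It assumes the input is already well-formed XML/XHTML; real-world HTML would first need a tolerant parser, such as the TagSoupParser used in the question. The class name MicrodataXPathSketch and the helper extractItemProps are made up for illustration, and the sketch only collects the text of elements carrying an itemprop attribute. It does not implement the full microdata algorithm (nested itemscope items, or values taken from content, href, or src attributes).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class, not part of any23 or any other library.
final class MicrodataXPathSketch {

    // Maps each itemprop name to the trimmed text content of its element.
    // Assumes well-formed input; a later duplicate name overwrites an earlier one.
    static Map<String, String> extractItemProps(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(new InputSource(new StringReader(xhtml)));

            // Select every element that has an itemprop attribute, in document order.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate("//*[@itemprop]", dom, XPathConstants.NODESET);

            Map<String, String> props = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                props.put(element.getAttribute("itemprop"), element.getTextContent().trim());
            }
            return props;
        } catch (Exception ex) {
            // Wrap checked parser/XPath exceptions so callers stay simple.
            throw new IllegalStateException("Could not parse or query the document", ex);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div itemscope=\"itemscope\" itemtype=\"http://schema.org/Person\">"
                + "<span itemprop=\"name\">Jane Doe</span>"
                + "<span itemprop=\"jobTitle\">Professor</span></div>";
        System.out.println(extractItemProps(xhtml)); // prints {name=Jane Doe, jobTitle=Professor}
    }
}
```

The same XPath expression can be evaluated against the org.w3c.dom.Document returned by a tolerant HTML parser, though element names may then be namespaced depending on how that parser is configured.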
