Microdata extraction from HTML in Java

Views: 128 Posted: 2018/12/29 20:41:37 java extraction microdata

Problem description

I really need help extracting Microdata embedded in HTML5. My goal is to get structured data from a webpage, just like this Google tool does: http://www.google.com/webmasters/tools/richsnippets. I have searched a lot but have not found a workable solution.

Currently I use the any23 library, but I can't find any documentation apart from the Javadocs, which don't give me enough information.

I use any23's MicrodataExtractor but I am stuck at the third parameter: "org.w3c.dom.Document in". I can't parse HTML content into a W3C DOM. I have tried JTidy as well as JSoup, but the DOM objects in these libraries do not fit the Extractor constructor. I am also unsure about the second parameter of the MicrodataExtractor.
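For reference, org.w3c.dom.Document is the standard JDK DOM type. The sketch below is a JDK-only illustration, not any23 code: it shows how a strict XML parser produces such a Document from well-formed markup and rejects typical "tag soup" HTML, which is why a tolerant HTML parser is usually needed first. The class name StrictDomSketch and the method tryParse are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictDomSketch {

    // Returns a DOM if the markup is well-formed XML,
    // or null if the strict JDK parser rejects it.
    static Document tryParse(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (SAXException e) {
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed: every tag is closed, so a Document is produced.
        String wellFormed = "<html><body><p>first<br/></p></body></html>";
        // Typical HTML: <br> and <p> are left open, which is not valid XML.
        String tagSoup = "<html><body><p>first<br><p>second</body></html>";

        System.out.println("well-formed parsed: " + (tryParse(wellFormed) != null));
        System.out.println("tag soup parsed:    " + (tryParse(tagSoup) != null));
    }
}
```

Real-world pages are rarely well-formed XML, so in practice the Document has to come from a parser that repairs the markup before building the DOM.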

I hope someone can help me with any23 or suggest another library that can solve this extraction problem.

Edit: I found a solution myself by doing it the same way the any23 command line tool does. Here is the code snippet:

// "value" is the URL of the page to fetch
HTTPDocumentSource doc = new HTTPDocumentSource(DefaultHTTPClient.createInitializedHTTPClient(), value);
InputStream documentInputStream = doc.openInputStream();
// TagSoupParser turns the HTML stream into an org.w3c.dom.Document
TagSoupParser tagSoupParser = new TagSoupParser(documentInputStream, doc.getDocumentURI());
Document document = tagSoupParser.getDOM();
ByteArrayOutputStream byteArrayOutput = new ByteArrayOutputStream();
// Write the extracted microdata as JSON into the in-memory buffer
MicrodataParser.getMicrodataAsJSON(document, new PrintStream(byteArrayOutput));
String result = byteArrayOutput.toString("UTF-8");

These lines of code only extract microdata from HTML and write it in JSON format. I tried to use MicrodataExtractor, which can change the output to other formats (RDF, Turtle, ...), but its input document seems to accept only XML. It throws "Document didn't start" when I pass in an HTML document.
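To make concrete what "extracting microdata" from a DOM involves, here is a simplified JDK-only sketch. It is not how any23's MicrodataParser is implemented; it only shows that, once an org.w3c.dom.Document exists, the microdata are ordinary itemprop attributes that can be walked. The class ItempropSketch and its methods are hypothetical, the input is assumed to be well-formed XHTML, and nested items and attribute-valued properties (such as content on meta or href on a) are deliberately ignored.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ItempropSketch {

    // Parses well-formed XHTML into an org.w3c.dom.Document using only the JDK.
    static Document parseXhtml(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    // Collects itemprop name -> trimmed text content, in document order.
    // Simplification: a flat map; later duplicates overwrite earlier ones.
    static Map<String, String> itemprops(Document dom) {
        Map<String, String> props = new LinkedHashMap<>();
        NodeList all = dom.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute("itemprop")) {
                props.put(e.getAttribute("itemprop"), e.getTextContent().trim());
            }
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
            + "<div itemscope=\"\" itemtype=\"http://schema.org/Person\">"
            + "<span itemprop=\"name\">Jane Doe</span>"
            + "<span itemprop=\"jobTitle\">Engineer</span>"
            + "</div></body></html>";
        // prints {name=Jane Doe, jobTitle=Engineer}
        System.out.println(itemprops(parseXhtml(xhtml)));
    }
}
```

A full extractor also has to group properties by their enclosing itemscope and resolve itemtype and itemref, which is the part a library such as any23 takes care of.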

If anyone finds a way to use MicrodataExtractor, please leave an answer here. Thank you.
