How to extract information from a Wikipedia infobox?

Question

There is this fancy infobox in <some Wikipedia article>. How do I get the value of <this field and that>?

Solution

The wrong way: trying to parse HTML

Use (cURL/jQuery/file_get_contents/requests/wget/more jQuery) to fetch the article's HTML, then use a DOM parser to extract table.infobox tr[3] td / use a regex.

This is actually a really bad idea most of the time. Wikipedia's HTML code is not particularly parsing-friendly (especially infoboxes which are a system of hand-written templates), the exact structure changes from infobox to infobox, and the structure of an infobox might change over time. You might also miss out on some features that would be otherwise available, such as internationalization.

The other wrong way: trying to parse wikitext

At a glance, the wikitext of some articles looks like it's a pretty straightforward representation of the infobox:

{{ Infobox Foo
| param1 = bar
| param2 = 123
...

In reality, that's not the case. Templates are "recursive" so you might run into stuff like param1 = {{convert|10|km|mi}}; template parameters might contain complex wikitext or HTML markup; some parameters might be missing from the article wikitext and fetched by the template from a subpage or other data repository. Just finding out where a parameter starts and ends might not be a simple business if it contains other templates which have their own parameters.
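To make the nesting problem concrete, here is a small sketch using only the Python standard library. It is an illustration of why naive splitting breaks, not a recommended parser: the function names are made up for this example, and the depth-tracking version still ignores comments, nowiki sections, and many other wikitext features.

```python
def split_naive(body: str) -> list[str]:
    """Split a template body on every pipe, ignoring nesting (tempting but wrong)."""
    return [part.strip() for part in body.split("|")]


def split_top_level(body: str) -> list[str]:
    """Split on pipes only when not inside a nested {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]") and depth > 0:
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            # A pipe at depth 0 separates two parameters of the outer template.
            parts.append("".join(current).strip())
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current).strip())
    return parts


# Body of a hypothetical {{Infobox Foo ...}} call, outer braces already removed.
body = "Infobox Foo | param1 = {{convert|10|km|mi}} | param2 = 123"

print(split_naive(body))      # the nested {{convert}} call is torn into pieces
print(split_top_level(body))  # the nested call stays inside param1
```

Even the second version only finds parameter boundaries; it does nothing about values that the template pulls in from elsewhere.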

The ideal way: using a structured data source

There are various projects to provide the information contained in Wikipedia infoboxes in a structured form; the two large ones are Wikidata and DBpedia.

Wikidata is a project to build a knowledge base containing structured data; it is maintained by the same global movement that built Wikipedia, so information is in the process of being moved over. This is a manual process, so not all information in Wikipedia is available via Wikidata; on the other hand, there is a lot of information that's in Wikidata but not in Wikipedia. You can find the Wikidata page of an article and see what information it contains by following the Wikidata item link in the left-hand toolbar on the article page. Programmatically, you can access Wikidata information using the wbgetentities API module (sandbox, explanation of concepts), e.g. wikidata.org/w/api.php?action=wbgetentities&sites=enwiki&titles=Albert_Einstein. There is also a SPARQL endpoint, database dumps, and clients in PHP, Java and Python.
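As a rough sketch of the wbgetentities route, the following standard-library Python builds the request URL and reads one statement out of the returned entity. The helper names, the `props=claims` and `format=json` parameters, the User-Agent string, and the trimmed `sample_entity` are choices made for this example; P569 is the Wikidata property for date of birth.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://www.wikidata.org/w/api.php"


def build_entity_url(title: str, site: str = "enwiki") -> str:
    """Build a wbgetentities URL that looks an item up by its Wikipedia title."""
    params = {
        "action": "wbgetentities",
        "sites": site,
        "titles": title,
        "props": "claims",  # assumption: only the statements are needed
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def first_claim_value(entity: dict, property_id: str):
    """Return the raw value of the first statement for property_id, or None."""
    for claim in entity.get("claims", {}).get(property_id, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            return snak["datavalue"]["value"]
    return None


def fetch_entity(title: str) -> dict:
    """Network call: download the entity linked to an English Wikipedia title."""
    # Hypothetical User-Agent; replace it with one that identifies your tool.
    request = Request(build_entity_url(title),
                      headers={"User-Agent": "infobox-example/0.1"})
    with urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # The response maps entity IDs to entities; a single title yields one entry.
    return next(iter(payload["entities"].values()))


# Hand-written, heavily trimmed sample in the shape of an API response,
# so the extraction step can be tried without network access.
sample_entity = {
    "id": "Q937",
    "claims": {
        "P569": [{
            "mainsnak": {
                "snaktype": "value",
                "property": "P569",
                "datavalue": {"type": "time",
                              "value": {"time": "+1879-03-14T00:00:00Z"}},
            }
        }]
    },
}

print(build_entity_url("Albert_Einstein"))
print(first_claim_value(sample_entity, "P569"))
```

With network access, `first_claim_value(fetch_entity("Albert_Einstein"), "P569")` would run the same extraction against live data. A real client should also handle missing pages, multiple statements per property, and statement ranks.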

DBpedia is a project to harvest Wikipedia infobox information by automated means and publish it in a structured form. You can find the DBpedia page for a Wikipedia article by going to http://dbpedia.org/page/<Wikipedia article name>, e.g. http://dbpedia.org/page/Albert_Einstein. It has many data formats, dumps, a SPARQL endpoint and various other things.
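A query against the SPARQL endpoint might look roughly like the sketch below. It assumes the public endpoint at https://dbpedia.org/sparql accepts `query` and `format` as GET parameters and returns standard SPARQL JSON results; the helper names and `sample_results` are invented for illustration, and `http://dbpedia.org/ontology/birthDate` is used as an example property.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"  # assumed public endpoint


def build_query(resource_name: str, property_iri: str) -> str:
    """Build a SPARQL query asking for one property of one DBpedia resource."""
    return (
        "SELECT ?value WHERE { "
        f"<http://dbpedia.org/resource/{resource_name}> <{property_iri}> ?value "
        "}"
    )


def build_request_url(query: str) -> str:
    """Encode the query as GET parameters (assumed to be accepted by the endpoint)."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{SPARQL_ENDPOINT}?{urlencode(params)}"


def extract_values(results: dict) -> list[str]:
    """Pull the ?value column out of a SPARQL JSON results document."""
    return [row["value"]["value"] for row in results["results"]["bindings"]]


def run_query(query: str) -> list[str]:
    """Network call: send the query and return the values it yields."""
    with urlopen(build_request_url(query), timeout=30) as response:
        return extract_values(json.load(response))


# Hand-written sample in the SPARQL JSON results format, for offline use.
sample_results = {
    "head": {"vars": ["value"]},
    "results": {"bindings": [
        {"value": {"type": "typed-literal", "value": "1879-03-14"}}
    ]},
}

query = build_query("Albert_Einstein", "http://dbpedia.org/ontology/birthDate")
print(query)
print(extract_values(sample_results))
```

The resource name is interpolated directly into the query, so this sketch is only suitable for trusted, already-normalized article names.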

The wrong ways done right

If the information you need is not available via Wikidata or DBpedia, there are still semi-structured ways of extracting data from infoboxes. For HTML-based extraction you can use Wikipedia's REST content API (e.g. https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein) which returns a richer, more semantic HTML than the one used on normal article pages, and preserves in it some information about template structure.
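The template information preserved in that HTML can be read back with a plain HTML parser. The sketch below assumes the Parsoid convention of marking template output with `typeof="mw:Transclusion"` and storing the template call as JSON in a `data-mw` attribute; the class name and the sample markup are made up for this example, so check the output of the real API before relying on the exact shape.

```python
import json
from html import escape
from html.parser import HTMLParser


class TransclusionCollector(HTMLParser):
    """Collect (template name, parameters) pairs from Parsoid-style HTML."""

    def __init__(self):
        super().__init__()
        self.templates = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "mw:Transclusion" not in (attributes.get("typeof") or ""):
            return
        data_mw = attributes.get("data-mw")
        if not data_mw:
            return
        # "parts" mixes plain strings with template descriptions; keep the latter.
        for part in json.loads(data_mw).get("parts", []):
            if isinstance(part, dict) and "template" in part:
                template = part["template"]
                name = template["target"]["wt"].strip()
                params = {key: value.get("wt", "")
                          for key, value in template.get("params", {}).items()}
                self.templates.append((name, params))


# Hypothetical markup imitating what the REST API returns for an infobox.
data_mw = {"parts": [{"template": {
    "target": {"wt": "Infobox Foo\n"},
    "params": {"param1": {"wt": "bar"}, "param2": {"wt": "123"}},
    "i": 0,
}}]}
sample_html = (
    '<table class="infobox" typeof="mw:Transclusion" '
    f'data-mw="{escape(json.dumps(data_mw))}"></table>'
)

collector = TransclusionCollector()
collector.feed(sample_html)
print(collector.templates)
```

The parameter values recovered this way are still raw wikitext, so nested templates such as {{convert|10|km|mi}} arrive unexpanded; the rendered value has to be taken from the element's contents instead.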

Alternatively, you might start from wikitext and parse it into a syntax tree using the simpler, client-side mwparserfromhell Python module (docs) or the more powerful Parsoid JS API which interacts with the Wikipedia REST content service.

A higher-level Python library which tries to extract infobox contents from wikitext is wptools.
