Read Wikipedia piped links

Problem description

I'm using Java and I want to read piped links from Wikipedia that have a specific surface form. For example, in the form [America|US] the surface form is "US" and the internal link is "America".

The straightforward solution is to read the XML dump of Wikipedia and find the strings that match the regular expression for a piped link. However, I am afraid that I wouldn't cover all the possible forms of a piped link with a regular expression. I searched and couldn't find any library that specifically gives me the piped links.

Any suggestions?

Recommended answer

Edit

Now that I understand the question: I don't think there is a way to get all internal links with their printout value. This is simply not stored in the database (only the links are), because the actual output is only created when the page is rendered.

You would have to parse the pages yourself to be sure to get all links. Of course, if you can accept getting only the subset of links available in the wikitext of each page, parsing the XML dump as you suggest would work. Note that a single regex will most likely not distinguish between piped internal links and piped interwiki links. Also beware of image links, which use pipes for parameter separation (e.g. [[Image:MyImage.jpeg|thumb|left|A caption!]]).
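To make that caveat concrete, a rough Java sketch could filter on the link target before accepting it; the namespace and interwiki prefixes below are placeholders, since the real sets depend on the configuration of the wiki being parsed:

import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PipedLinkFilter {
    private static final Pattern PIPED_LINK =
            Pattern.compile("\\[\\[([^\\[\\]|]+)\\|([^\\[\\]]*?)\\]\\]");

    // Hypothetical prefix lists; a real run would need the full namespace and
    // interwiki tables of the target wiki.
    private static final List<String> SKIP_NAMESPACES = List.of("image:", "file:", "category:");
    private static final List<String> SKIP_INTERWIKI = List.of("de:", "fr:", "wikt:");

    public static void main(String[] args) {
        String wikitext = "[[America|US]] [[Image:MyImage.jpeg|thumb|left|A caption!]] [[de:Amerika|Amerika]]";
        Matcher m = PIPED_LINK.matcher(wikitext);
        while (m.find()) {
            String target = m.group(1).trim();
            String lower = target.toLowerCase(Locale.ROOT);
            boolean skip = SKIP_NAMESPACES.stream().anyMatch(lower::startsWith)
                    || SKIP_INTERWIKI.stream().anyMatch(lower::startsWith);
            if (!skip) {
                // For plain piped links group(2) is the surface form; for image
                // links it would hold the remaining pipe-separated parameters.
                System.out.println(target + " -> " + m.group(2));
            }
        }
    }
}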

Here is the regex used by the MediaWiki parser:

$tc = Title::legalChars() . '#%';
# Match a link having the form [[namespace:link|alternate]]trail
$e1 = "/^([{$tc}]+)(?:\\|(.+?))?]](.*)\$/sD";
# Match cases where there is no "]]", which might still be images
$e1_img = "/^([{$tc}]+)\\|(.*)\$/sD";

However, this code is applied after a lot of preprocessing has happened.

Old answer

Using an XML dump will not give you all links, as many links are produced by templates, or in some cases even parser functions. A simpler way would be to use the API:

https://en.wikipedia.org/w/api.php?action=query&titles=Stack_Overflow&prop=links&redirects
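A minimal Java sketch of such an API call, using the JDK's built-in HttpClient (Java 11+), might look like this; adding format=json is my own addition so the response is easier to consume programmatically:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LinksQuery {
    public static void main(String[] args) throws Exception {
        // Same query as above, with format=json for a machine-readable response.
        String url = "https://en.wikipedia.org/w/api.php"
                + "?action=query&titles=Stack_Overflow&prop=links&redirects&format=json";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // A real client would parse the JSON (e.g. with Jackson or Gson) and walk
        // the links list under query.pages; printing the raw body keeps the sketch
        // dependency-free.
        System.out.println(response.body());
    }
}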

I am assuming English Wikipedia here, but it will work anywhere; just substitute the en. in the URL with your language code. The redirects directive will, quite obviously, make sure redirects are followed. In the same way, use prop=extlinks to get external links:

https://en.wikipedia.org/w/api.php?action=query&titles=Stack_Overflow&prop=extlinks&redirects

You can grab links for multiple pages at once, either by separating their names with a pipe character, like this: Stack_Overflow|Chicago, or by using a generator, e.g. allpages (to run the query against every single page in the wiki), like this:

https://en.wikipedia.org/w/api.php?action=query&generator=allpages&prop=links

The number of results returned by the allpages generator can be raised by setting the gaplimit parameter, e.g. &gaplimit=50 to get the links for the first 50 pages. If you request bot status on the Wikipedia edition you are looking at, you can get as many as 5000 results per request; otherwise the maximum is 500 for most (probably all) Wikipedias.
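As a hedged sketch, both variants could be issued from Java like this (gaplimit=50 mirrors the example above, and format=json is again my own addition):

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class AllPagesLinks {
    public static void main(String[] args) throws Exception {
        // Either list explicit titles separated by a pipe character...
        String titles = URLEncoder.encode("Stack_Overflow|Chicago", StandardCharsets.UTF_8);
        String byTitles = "https://en.wikipedia.org/w/api.php"
                + "?action=query&prop=links&format=json&titles=" + titles;

        // ...or walk the wiki with the allpages generator, 50 pages per request.
        String byGenerator = "https://en.wikipedia.org/w/api.php"
                + "?action=query&generator=allpages&gaplimit=50&prop=links&format=json";

        HttpClient client = HttpClient.newHttpClient();
        for (String url : new String[] {byTitles, byGenerator}) {
            HttpResponse<String> response = client.send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
            // Large result sets are paged: the response carries a continuation block
            // (e.g. "continue") whose values should be passed back on the next request.
            System.out.println(response.body()
                    .substring(0, Math.min(300, response.body().length())));
        }
    }
}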
