How can I get the principal image from the MediaWiki API?


Question


Hello, I'm using curl to get information from Wikipedia, and I only want information about the principal image of an article, not all of its images.

For example, to get info about all the images in the English language article (http://en.wikipedia.org/wiki/English_language), I go to this URL:

http://en.wikipedia.org/w/api.php?action=query&titles=English_language&prop=images

but I receive the flags of countries where people speak English, in XML:

<?xml version="1.0"?>
<api>
  <query>
    <normalized>
      <n from="English_language" to="English language" />
    </normalized>
    <pages>
      <page pageid="8569916" ns="0" title="English language">
        <images>
          <im ns="6" title="File:Anglospeak(800px)Countries.png" />
          <im ns="6" title="File:Anglospeak.svg" />
          <im ns="6" title="File:Circle frame.svg" />
          <im ns="6" title="File:Commons-logo.svg" />
          <im ns="6" title="File:Flag of Argentina.svg" />
          <im ns="6" title="File:Flag of Aruba.svg" />
          <im ns="6" title="File:Flag of Australia.svg" />
          <im ns="6" title="File:Flag of Bolivia.svg" />
          <im ns="6" title="File:Flag of Brazil.svg" />
          <im ns="6" title="File:Flag of Canada.svg" />

I only want the information about the principal image.

Solution

As others have noted, Wikipedia articles don't really have any such thing as a "principal image", so your first problem will be deciding how to choose between the different images used on a given page. Some possible selection criteria might be:

  • Biggest image in the article.
  • First image exceeding some specific minimum dimensions, e.g. 60 × 60 pixels.
  • First image referenced directly in the article's source text, rather than through a template.

For the first two options, you'll want to fetch the rendered HTML code of the page via action=parse and use an HTML parser to find the img tags in the code, like this:

http://en.wikipedia.org/w/api.php?action=parse&page=English_language&prop=text|images

(The reason you can't just get the sizes of the images, as used on the page, directly from the API is that that information isn't actually stored anywhere in the MediaWiki database.)
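As a rough sketch of the second criterion, assuming the rendered img tags carry width and height attributes (Wikipedia's normally do, but treat that as an assumption), something like the following could pick the first image of at least a minimum size. The function name firstLargeImage and the sample HTML are made up for illustration; in practice the HTML would come from the action=parse response.

```php
<?php
// Hypothetical helper: returns the src of the first <img> whose width and
// height attributes both reach the given minimums, or null if none does.
function firstLargeImage(string $html, int $minWidth = 60, int $minHeight = 60): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML can trigger libxml warnings; collect them silently.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog is a common trick to make loadHTML() treat input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName() yields nodes in document order.
    foreach ($doc->getElementsByTagName('img') as $img) {
        $width  = (int) $img->getAttribute('width');   // missing attribute becomes 0
        $height = (int) $img->getAttribute('height');
        if ($width >= $minWidth && $height >= $minHeight) {
            return $img->getAttribute('src');
        }
    }
    return null;
}

// Made-up sample standing in for the rendered HTML of a page.
$sampleHtml = '<p><img src="//example.org/icon.png" width="20" height="20">'
            . '<img src="//example.org/map.png" width="300" height="150"></p>';
echo firstLargeImage($sampleHtml), "\n";
```

For the "biggest image" criterion, the same loop could instead track the largest width × height product seen so far and return that image at the end.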


For the last option, what you want is the source wikitext of the article, available via prop=revisions with rvprop=content:

http://en.wikipedia.org/w/api.php?action=query&titles=English_language&prop=revisions|images&rvprop=content
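If you are building that URL in code rather than by hand, a minimal sketch might look like this. The function name buildWikitextQueryUrl is hypothetical, and the format=json parameter is my own addition (the question shows XML output); drop it if you prefer XML.

```php
<?php
// Hypothetical helper: builds the api.php URL that fetches a page's wikitext
// and its image list in a single request.
function buildWikitextQueryUrl(
    string $title,
    string $endpoint = 'http://en.wikipedia.org/w/api.php'
): string {
    $params = [
        'action' => 'query',
        'titles' => $title,
        'prop'   => 'revisions|images',
        'rvprop' => 'content',
        'format' => 'json', // assumption: JSON instead of the XML shown in the question
    ];
    // http_build_query() takes care of URL-encoding the "|" and any spaces.
    return $endpoint . '?' . http_build_query($params);
}

echo buildWikitextQueryUrl('English_language'), "\n";
```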

Note that many images in infoboxes and such are specified as parameters to a template, so just parsing for [[Image:...]] syntax will miss some of them. A better solution is probably to just get the list of all images used on the page via prop=images (which you can do in the same query, as I showed above) and look for their names (with or without Image: / File: prefix) in the wikitext.

Keep in mind the various ways in which MediaWiki automatically normalizes page (and image) names: most notably, underscores are mapped to spaces, consecutive whitespace is collapsed to a single space and the first letter of the name is capitalized. If you decide to go this way, here's some sample PHP code that will convert a list of file names into a regexp that should match any of them in wikitext:

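foreach ($names as &$name) {
    // Collapse runs of underscores and whitespace into one space, as MediaWiki does.
    $name = trim( preg_replace( '/[_\s]+/u', ' ', $name ) );
    // Escape regexp metacharacters and the "/" delimiter.
    $name = preg_quote( $name, '/' );
    // Make the first character (plus its escaping backslash, if any) case-insensitive.
    $name = preg_replace( '/^(\\\\?.)/us', '(?i:$1)', $name );
    // Let each space match any run of underscores and whitespace.
    $name = preg_replace( '/\\\\? /u', '[_\s]+', $name );
}
unset( $name ); // break the reference left over from the loop
$regexp = '/' . implode( '|', $names ) . '/u';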

For example, when given the list:

Anglospeak(800px)Countries.png
Anglospeak.svg
Circle frame.svg
Commons-logo.svg
Flag of Argentina.svg
Flag of Aruba.svg

the generated regexp will be:

/(?i:A)nglospeak\(800px\)Countries\.png|(?i:A)nglospeak\.svg|(?i:C)ircle[_\s]+frame\.svg|(?i:C)ommons\-logo\.svg|(?i:F)lag[_\s]+of[_\s]+Argentina\.svg|(?i:F)lag[_\s]+of[_\s]+Aruba\.svg/u
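To show how such a regexp might be used for the third criterion, here is a small sketch that wraps the loop in a function and runs it against some wikitext. The function name buildImageRegexp and the sample wikitext are made up for illustration; real wikitext would come from the prop=revisions query. Since preg_match() returns the leftmost match, the result is whichever listed image is named first in the text, including ones passed as template parameters.

```php
<?php
// Hypothetical wrapper around the name-to-regexp loop.
function buildImageRegexp(array $names): string
{
    foreach ($names as &$name) {
        $name = trim(preg_replace('/[_\s]+/u', ' ', $name));
        $name = preg_quote($name, '/');
        $name = preg_replace('/^(\\\\?.)/us', '(?i:$1)', $name);
        $name = preg_replace('/\\\\? /u', '[_\s]+', $name);
    }
    unset($name); // break the reference left over from the loop
    return '/' . implode('|', $names) . '/u';
}

// Made-up wikitext: the infobox passes its image as a template parameter,
// with a lowercase first letter, before a regular [[File:...]] link.
$wikitext = "{{Infobox language\n| image = anglospeak.svg\n}}\n"
          . "[[File:Circle_frame.svg|thumb|A chart]]\n";

$regexp = buildImageRegexp(['Circle frame.svg', 'Anglospeak.svg']);

// The leftmost match is the image that appears first in the wikitext.
if (preg_match($regexp, $wikitext, $m)) {
    echo $m[0], "\n";
}
```

Note that the match is the name as written in the wikitext, so you may want to normalize it again before comparing it with the titles returned by prop=images.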
