如何从某些文章中获取完整的Wikipedia修订历史列表? [英] How to get full Wikipedia revision-history list from some article?

查看:115
本文介绍了如何从某些文章中获取完整的Wikipedia修订历史列表?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何获取完整的Wikipedia修订历史列表? (不想刮擦)

How can I get the full Wikipedia revision-history list? (Don't want to scrape)

import wapiti
import pdb
import pylab as plt  
client = wapiti.WapitiClient('mahmoudrhashemi@gmail.com')
get_revs = client.get_page_revision_infos( 'Coffee', 1000000)
print len(gen_revs)

500

打包链接: https://github.com/mahmoud/wapiti

推荐答案

如果您需要500多个修订条目,则必须使用

If you need more than 500 revision entries you will have to use MediaWiki API with action query, property revisions and parameter rvcontinue, which is taken from the previous request, so you can't get the whole list only with one request:

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Coffee&rvcontinue=...

要获取您选择的更多具体信息,您还必须使用 rvprop 参数:

To get more specific information of your choice you'll have to use also rvprop parameter:

&rvprop=ids|flags|timestamp|user|userid|size|sha1|contentmodel|comment|parsedcomment|content|tags|parsetree|flagged

所有可用参数的摘要,您可以在此处找到.

Summary of all available parameters you can find here.

这是如何在C#中获取完整的Wikipedia页面修订历史记录:

This is how to get the full Wikipedia's page revision history in C#:

private static List<XElement> GetRevisions(string pageTitle)
{
    var url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + pageTitle;
    var revisions = new List<XElement>();
    var next = string.Empty;
    while (true)
    {
        using (var webResponse = (HttpWebResponse)WebRequest.Create(url + next).GetResponse())
        {
            using (var reader = new StreamReader(webResponse.GetResponseStream()))
            {
                var xElement = XElement.Parse(reader.ReadToEnd());
                revisions.AddRange(xElement.Descendants("rev"));

                var cont = xElement.Element("continue");
                if (cont == null) break;

                next = "&rvcontinue=" + cont.Attribute("rvcontinue").Value;
            }
        }
    }

    return revisions;
}

当前对于咖啡" ,这将返回 10 414 个修订版本.

Currently for "Coffee" this returns 10 414 revisions.

这是Python版本:

Here is a Python version:

import urllib2
import re

def GetRevisions(pageTitle):
    url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + pageTitle
    revisions = []                                        #list of all accumulated revisions
    next = ''                                             #information for the next request
    while True:
        response = urllib2.urlopen(url + next).read()     #web request
        revisions += re.findall('<rev [^>]*>', response)  #adds all revisions from the current request to the list

        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont:                                      #break the loop if 'continue' element missing
            break

        next = "&rvcontinue=" + cont.group(1)             #gets the revision Id from which to start the next request

    return revisions;

您如何看待逻辑是完全相同的.与C#的区别在于,在C#中,我解析了XML响应,在这里我使用正则表达式来匹配其中的所有revcontinue元素.

How you see the logic is absolutely the same. The difference with C# is that in C# I parsed the XML response and here I use regex to match the all rev and continue elements from it.

因此,我的想法是制作一个

So, the idea is that I make a main request from which I get all revisions (the maximum is 500) into revisions array. Also I check the continue xml element to know are there more revisions, get the value of the rvcontinue property and use it in next variable (for this example from the first request it is 20150127211200|644458070) to make another request to take the next 500 revisions. I repeat all this until the continue element is available. If it missing, this means that no more revisions after the last one in the response's revision list are left, so I exit from the loop.

revisions = GetRevisions("Coffee")

print(len(revisions))
#10418

以下是"Coffee" 文章的最新10个修订版(它们是从API反向返回的),并且不要忘记,如果您需要更具体的修订版信息,可以使用rvprop您的请求中的参数.

Here are the last 10 revisions for "Coffee" article (they are returned from the API in reversed order), and don't forget that if you need more specific revision information you can use rvprop parameter in your request.

for i in revisions[0:10]:
    print(i)

#<rev revid="698019402" parentid="698018324" user="Termininja" timestamp="2016-01-03T13:51:27Z" comment="short link" />
#<rev revid="698018324" parentid="697691358" user="AXRL" timestamp="2016-01-03T13:39:14Z" comment="/* See also */" />
#<rev revid="697691358" parentid="697690475" user="Zekenyan" timestamp="2016-01-01T05:31:33Z" comment="first coffee trade" />
#<rev revid="697690475" parentid="697272803" user="Zekenyan" timestamp="2016-01-01T05:18:11Z" comment="since country of origin is not first sighting of someone drinking coffee I have removed the origin section completely" />
#<rev revid="697272803" parentid="697272470" minor="" user="Materialscientist" timestamp="2015-12-29T11:13:18Z" comment="Reverted edits by [[Special:Contribs/Media3dd|Media3dd]] ([[User talk:Media3dd|talk]]) to last version by Materialscientist" />
#<rev revid="697272470" parentid="697270507" user="Media3dd" timestamp="2015-12-29T11:09:14Z" comment="/* External links */" />
#<rev revid="697270507" parentid="697270388" minor="" user="Materialscientist" timestamp="2015-12-29T10:45:46Z" comment="Reverted edits by [[Special:Contribs/89.197.43.130|89.197.43.130]] ([[User talk:89.197.43.130|talk]]) to last version by Mahdijiba" />
#<rev revid="697270388" parentid="697265765" user="89.197.43.130" anon="" timestamp="2015-12-29T10:44:02Z" comment="/* See also */" />
#<rev revid="697265765" parentid="697175433" user="Mahdijiba" timestamp="2015-12-29T09:45:03Z" comment="" />
#<rev revid="697175433" parentid="697167005" user="EvergreenFir" timestamp="2015-12-28T19:51:25Z" comment="Reverted 1 pending edit by [[Special:Contributions/2.24.63.78|2.24.63.78]] to revision 696892548 by Zefr: [[WP:CENTURY]]" />

这篇关于如何从某些文章中获取完整的Wikipedia修订历史列表?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆