How to parse Wikipedia XML with PHP?

Question

How can I parse Wikipedia XML with PHP? I tried SimplePie, but got nothing back. Here is the link whose data I want to fetch:

http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml

Edited code:

<?php
    define("EMAIL_ADDRESS", "youlichika@hotmail.com"); 
    $ch = curl_init(); 
    $cv = curl_version(); 
    $user_agent = "curl ${cv['version']} (${cv['host']}) libcurl/${cv['version']} ${cv['ssl_version']} zlib/${cv['libz_version']} <" . EMAIL_ADDRESS . ">"; 
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent); 
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt"); 
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt"); 
    curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity"); 
    curl_setopt($ch, CURLOPT_HEADER, FALSE); 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); 
    curl_setopt($ch, CURLOPT_HTTPGET, TRUE); 
    curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml"); 
    $xml = curl_exec($ch); 
    $xml_reader = new XMLReader(); 
    $xml_reader->xml($xml, "UTF-8"); 
    echo $xml->api->query->pages->page->rev;
?>

Recommended answer

I generally use a combination of cURL and XMLReader to parse XML generated by the MediaWiki API.

Note that you must include your e-mail address in the User-Agent header, or else the API script will respond with HTTP 403 Forbidden.
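
As a minimal sketch of that requirement, the helper below builds a User-Agent value that carries a contact address. The function name `build_user_agent` and the tool name `MyWikiTool/1.0` are hypothetical and used only for illustration; the exact string format is an assumption, not something the API prescribes in this answer.

```php
<?php
// Hypothetical helper: builds a User-Agent value that identifies the client
// and carries a contact e-mail address, as recommended above.
function build_user_agent(string $tool, string $email): string
{
    // Reject obviously malformed addresses before sending any request.
    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        throw new InvalidArgumentException("Invalid contact address: $email");
    }
    // Assumed layout: "<tool> (<contact>) PHP/<version>".
    return sprintf("%s (%s) PHP/%s", $tool, $email, PHP_VERSION);
}

$user_agent = build_user_agent("MyWikiTool/1.0", "my@email.com");
echo $user_agent, "\n";
```

The result can be passed to `curl_setopt($ch, CURLOPT_USERAGENT, $user_agent)`. After `curl_exec`, `curl_getinfo($ch, CURLINFO_HTTP_CODE)` reports the status code, which makes a 403 response easy to detect.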

Here is how I initialize the cURL handle:

define("EMAIL_ADDRESS", "my@email.com");
$ch = curl_init();
$cv = curl_version();
$user_agent = "curl {$cv['version']} ({$cv['host']}) libcurl/{$cv['version']} {$cv['ssl_version']} zlib/{$cv['libz_version']} <" . EMAIL_ADDRESS . ">";
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

You can then use this code, which fetches the XML and constructs a new XMLReader object in $xml_reader:

curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
curl_setopt($ch, CURLOPT_URL, "https://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml");
$xml = curl_exec($ch);
$xml_reader = new XMLReader();
$xml_reader->xml($xml, "UTF-8");
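
The snippet above does not check whether the request actually returned anything. A hedged sketch of such a guard follows; `open_reader` is a hypothetical helper name, and the inline XML string merely stands in for a real `curl_exec` result.

```php
<?php
// Hypothetical helper: turns a curl_exec() result into an XMLReader,
// failing loudly when the transfer returned false or an empty body.
function open_reader($body): XMLReader
{
    if ($body === false || $body === "") {
        throw new RuntimeException("No XML received from the API");
    }
    $reader = new XMLReader();
    if (!$reader->XML($body, "UTF-8")) {
        throw new RuntimeException("Could not load the XML source");
    }
    return $reader;
}

// Stand-in for a response body; a real one would come from curl_exec($ch).
$reader = open_reader('<api><query><pages/></query></api>');
$reader->read();            // advance to the first node, the root element
echo $reader->name, "\n";   // name of the root element
```

With a live cURL handle, `curl_error($ch)` supplies a message that can be included in the exception when `curl_exec` returns `false`.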

Edit: Here is a working example:

<?php
define("EMAIL_ADDRESS", "youlichika@hotmail.com");
$ch = curl_init();
$cv = curl_version();
$user_agent = "curl {$cv['version']} ({$cv['host']}) libcurl/{$cv['version']} {$cv['ssl_version']} zlib/{$cv['libz_version']} <" . EMAIL_ADDRESS . ">";
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
curl_setopt($ch, CURLOPT_URL, "https://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml");
$xml = curl_exec($ch);
$xml_reader = new XMLReader();
$xml_reader->xml($xml, "UTF-8");

function extract_first_rev(XMLReader $xml_reader)
{
    while ($xml_reader->read()) {
        if ($xml_reader->nodeType == XMLReader::ELEMENT) {
            if ($xml_reader->name == "rev") {
                $content = htmlspecialchars_decode($xml_reader->readInnerXML(), ENT_QUOTES);
                return $content;
            }
        } else if ($xml_reader->nodeType == XMLReader::END_ELEMENT) {
            if ($xml_reader->name == "page") {
                throw new Exception("Unexpectedly found `</page>`");
            }
        }
    }

    throw new Exception("Reached the end of the XML document without finding revision content");
}

$latest_rev = array();
while ($xml_reader->read()) {
    if ($xml_reader->nodeType == XMLReader::ELEMENT) {
        if ($xml_reader->name == "page") {
            $latest_rev[$xml_reader->getAttribute("title")] = extract_first_rev($xml_reader);
        }
    }
}

function parse($rev)
{
    global $ch;

    curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
    curl_setopt($ch, CURLOPT_URL, "https://en.wikipedia.org/w/api.php?action=parse&text=" . rawurlencode($rev) . "&prop=text&format=xml");
    sleep(3);
    $xml = curl_exec($ch);
    $xml_reader = new XMLReader();
    $xml_reader->xml($xml, "UTF-8");

    while ($xml_reader->read()) {
        if ($xml_reader->nodeType == XMLReader::ELEMENT) {
            if ($xml_reader->name == "text") {
                $html = htmlspecialchars_decode($xml_reader->readInnerXML(), ENT_QUOTES);
                return $html;
            }
        }
    }

    throw new Exception("Failed to parse");
}

// Use a separate loop variable so the $latest_rev array is not overwritten.
foreach ($latest_rev as $title => $rev) {
    echo parse($rev) . "\n";
}
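
For small responses, SimpleXML is an alternative to the XMLReader approach used in this answer, and it supports the property-style access the question attempted. The sketch below is not the answer's method. It runs on a hand-written sample that is only assumed to match the shape of the API response (`api` > `query` > `pages` > `page` > `revisions` > `rev`), so no network access is needed.

```php
<?php
// Hand-written sample shaped like the API response (an assumption, not real output).
$xml = <<<XML
<api>
  <query>
    <pages>
      <page pageid="1" ns="0" title="Re">
        <revisions><rev xml:space="preserve">First &amp; text</rev></revisions>
      </page>
      <page pageid="2" ns="0" title="Re-entry">
        <revisions><rev xml:space="preserve">Second text</rev></revisions>
      </page>
    </pages>
  </query>
</api>
XML;

$api = simplexml_load_string($xml);
if ($api === false) {
    throw new RuntimeException("Could not parse the XML");
}

$latest_rev = array();
// $api already is the <api> root element, so the path starts at query.
foreach ($api->query->pages->page as $page) {
    // Casting to string yields the attribute value and the decoded text content.
    $latest_rev[(string) $page["title"]] = (string) $page->revisions->rev;
}

print_r($latest_rev);
```

SimpleXML loads the whole document into memory, so XMLReader remains the better fit for large result sets.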
