Parse birth and death dates from Wikipedia?


Question

I'm trying to write a Python program that can search Wikipedia for people's birth and death dates.

For example, Albert Einstein was born on 14 March 1879 and died on 18 April 1955.

I started from the code in "Fetch a Wikipedia article with Python":

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml')
page2 = infile.read()

This works as far as it goes: page2 is the XML representation of the first section (section 0) of Albert Einstein's Wikipedia page.

Now that I have the page in XML format, I looked at this tutorial: http://www.travisglines.com/web-coding/python-xml-parser-tutorial. But I don't understand how to get the information I want (the birth and death dates) out of the XML. I feel like I must be close, and yet I have no idea how to proceed from here.

Edit

After a few responses, I've installed BeautifulSoup. I'm now at the stage where I can print the page text:

import BeautifulSoup as BS
soup = BS.BeautifulSoup(page2)
print soup.getText()
{{Infobox scientist
| name        = Albert Einstein
| image       = Einstein 1921 portrait2.jpg
| caption     = Albert Einstein in 1921
| birth_date  = {{Birth date|df=yes|1879|3|14}}
| birth_place = [[Ulm]], [[Kingdom of Württemberg]], [[German Empire]]
| death_date  = {{Death date and age|df=yes|1955|4|18|1879|3|14}}
| death_place = [[Princeton, New Jersey|Princeton]], New Jersey, United States
| spouse      = [[Mileva Marić]] (1903–1919)<br>{{nowrap|[[Elsa Löwenthal]] (1919–1936)}}
| residence   = Germany, Italy, Switzerland, Austria, Belgium, United Kingdom, United States
| citizenship = {{Plainlist|
* [[Kingdom of Württemberg|Württemberg/Germany]] (1879–1896)
* [[Statelessness|Stateless]] (1896–1901)
* [[Switzerland]] (1901–1955)
* [[Austria–Hungary|Austria]] (1911–1912)
* [[German Empire|Germany]] (1914–1933)
* United States (1940–1955)
}}

So, much closer, but I still don't know how to pull the death_date out of this format. Unless I start parsing things with re? I can do that, but I feel like I'd be using the wrong tool for the job.

Recommended answer

You can consider using a library such as BeautifulSoup or lxml to parse the HTML/XML response.

You may also want to take a look at Requests, which has a much cleaner API for making requests.

Here is working code using Requests, BeautifulSoup and re. It is arguably not the best solution here, but it is quite flexible and can be extended to similar problems:

import re
import requests
from bs4 import BeautifulSoup

url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml'

res = requests.get(url)
soup = BeautifulSoup(res.text, "xml")  # the "xml" parser requires lxml
wikitext = soup.revisions.getText()

# Capture only the text between the template name and the closing braces,
# so that the last field does not end in "}}".
# After split('|'): [rest of name, 'df=yes', year, month, day, ...];
# the indexes below assume the df=yes argument is present.
birth_re = re.search(r'Birth date(.*?)}}', wikitext)
birth_data = birth_re.group(1).split('|')
birth_year = birth_data[2]
birth_month = birth_data[3]
birth_day = birth_data[4]

death_re = re.search(r'Death date(.*?)}}', wikitext)
death_data = death_re.group(1).split('|')
death_year = death_data[2]
death_month = death_data[3]
death_day = death_data[4]
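
The fixed list indexes above only work when the template carries exactly one named argument (`df=yes`) before the date. As a minimal sketch of a slightly more tolerant variant, the helper below skips named arguments and keeps the first three numeric ones. It assumes the infobox uses `{{Birth date|...}}` and `{{Death date and age|...}}` style templates with year, month and day as the first numeric positional arguments; `extract_date` is a hypothetical name, and the sample text is copied from the infobox shown in the question so the example runs offline.

```python
import re

# Sample wikitext copied from the infobox printed in the question.
SAMPLE = """\
| birth_date  = {{Birth date|df=yes|1879|3|14}}
| death_date  = {{Death date and age|df=yes|1955|4|18|1879|3|14}}
"""


def extract_date(wikitext, template_prefix):
    """Return (year, month, day) from the first template whose name starts
    with template_prefix, or None if no usable template is found.

    Hypothetical helper: it assumes the first three purely numeric
    positional arguments are year, month and day.
    """
    pattern = r'\{\{\s*' + re.escape(template_prefix) + r'([^{}]*)\}\}'
    match = re.search(pattern, wikitext, flags=re.IGNORECASE)
    if match is None:
        return None
    # The first piece is the rest of the template name (e.g. " and age").
    args = [arg.strip() for arg in match.group(1).split('|')[1:]]
    # Named arguments such as df=yes are not digits, so they are skipped.
    numbers = [int(arg) for arg in args if arg.isdigit()]
    if len(numbers) < 3:
        return None
    return tuple(numbers[:3])


birth = extract_date(SAMPLE, 'Birth date')
death = extract_date(SAMPLE, 'Death date')
print(birth, death)
```

This is still a regular expression over wikitext, so nested templates inside the date template would defeat it; it only removes the dependence on the exact argument positions.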


Following @JBernardo's suggestion to use JSON data and mwparserfromhell, here is a better answer for this particular use case:

import requests
import mwparserfromhell

url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=json'

res = requests.get(url)
# json() is a method in current versions of Requests.
pages = res.json()["query"]["pages"]
# The pages dict is keyed by page ID; take its only entry.
# list(...) keeps this working on Python 3, where values() is a view.
text = list(pages.values())[0]["revisions"][0]["*"]
wiki = mwparserfromhell.parse(text)

# Positional parameters 1-3 are year, month, day; df=yes is a named parameter.
birth_data = wiki.filter_templates(matches="Birth date")[0]
birth_year = birth_data.get(1).value
birth_month = birth_data.get(2).value
birth_day = birth_data.get(3).value

death_data = wiki.filter_templates(matches="Death date")[0]
death_year = death_data.get(1).value
death_month = death_data.get(2).value
death_day = death_data.get(3).value
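
The values extracted above are pieces of wikitext rather than numbers, so a natural next step is to turn them into real dates. The following is a minimal sketch using only the standard library; `to_date` and `age_at` are hypothetical helper names, and the literal strings stand in for the `.value` results so the example runs without a network call.

```python
from datetime import date


def to_date(year, month, day):
    """Build a datetime.date from template values.

    str() is applied first so that both plain strings and wikicode
    objects (such as the .value results above) are accepted.
    """
    return date(int(str(year).strip()),
                int(str(month).strip()),
                int(str(day).strip()))


def age_at(birth, death):
    """Return the number of completed years between two dates."""
    # True counts as 1, so one year is subtracted if the final
    # birthday had not yet been reached.
    before_birthday = (death.month, death.day) < (birth.month, birth.day)
    return death.year - birth.year - before_birthday


# Stand-ins for birth_year, birth_month, ... from the code above.
born = to_date('1879', '3', '14')
died = to_date('1955', '4', '18')
print(born.isoformat(), died.isoformat(), age_at(born, died))
```

With the real template values you would call `to_date(birth_year, birth_month, birth_day)` in the same way.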
