Get the first paragraph from a Wikipedia article
Question
I'm using the following code to get the first paragraph from a Wikipedia article. Here is the result of my code. I need only this paragraph. Is that possible? Or is there a better alternative?
'''Papori''' ({{lang-as|'''পাপৰী'''}}) is an [[Assamese language]] feature film directed by [[Jahnu Barua]]. The film stars Gopi Desai, [[Biju Phukan]], Sushil Goswami, Chetana Das and Dulal Roy. The film was released in 1986.
Here is my code:
#!/usr/bin/python
from lxml import etree
import urllib
from BeautifulSoup import BeautifulSoup
class AppURLopener(urllib.FancyURLopener):
    version = "WikiDownloader"
urllib._urlopener = AppURLopener()
query = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles=papori&rvsection=0'
#data = { 'catname':'', 'wpDownload':1, 'pages':"\n".join(pages)}
#data = urllib.urlencode(data)
f = urllib.urlopen(query)
s = f.read()
#doc = etree.parse(f)
#print(s)
soup = BeautifulSoup(s)
secondPTag = soup.findAll('rev')
print secondPTag
Code updated: can anyone help me remove the text between {{ }}? It is not needed. Thanks.
Recommended answer
To remove everything from {{ to '''Papori''':
import re
regex = re.compile(r"""{{.*?}}\s*('''Papori''')""", re.DOTALL)
print regex.sub(r"\1", rev_data)
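The pattern can be checked without calling the API by running it on a sample string. The sketch below assumes Python 3 and a made-up `rev_data` value with an infobox-style template before the title; the real revision text will differ.

```python
import re

# Hypothetical sample: a template followed by the article's opening sentence.
rev_data = "{{Infobox film\n| name = Papori\n}}\n'''Papori''' is an Assamese film."

# Non-greedy match from "{{" up to the "}}" right before '''Papori''';
# re.DOTALL lets "." cross the line breaks inside the template.
regex = re.compile(r"""{{.*?}}\s*('''Papori''')""", re.DOTALL)
cleaned = regex.sub(r"\1", rev_data)
print(cleaned)
```

Because the pattern hard-codes the title, it only suits this one article.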
To remove everything from the first "{{" to the matching "}}":
prefix, sep, rest = rev_data.partition("{{")
if sep: # found the first "{{"
    rest = sep + rest # put it back
    while rest.startswith("{{"):
        # remove nested "{{expr}}" one by one until there is none
        rest, n = re.subn(r"{{(?:[^{]|(?<!{){)*?}}", "", rest, 1)
        if n == 0:
            break # the first "{{" is unmatched; can't remove it
    else: # deletion is successful
        rev_data = prefix + rest

print(rev_data)
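The same loop can be wrapped in a function to make the nested case easier to try out. This is a minimal Python 3 sketch; the name `strip_leading_templates` and the sample strings are illustrative, not part of the original answer. Note that the loop also removes any further templates that immediately follow the first one.

```python
import re

# Matches one innermost "{{...}}": the body may not contain another "{{".
INNERMOST = re.compile(r"{{(?:[^{]|(?<!{){)*?}}")


def strip_leading_templates(text):
    """Remove the first "{{...}}" block (and any directly after it)."""
    prefix, sep, rest = text.partition("{{")
    if not sep:  # no "{{" in the text
        return text
    rest = sep + rest  # put the "{{" back
    while rest.startswith("{{"):
        # Peel off one innermost template per pass.
        rest, n = INNERMOST.subn("", rest, count=1)
        if n == 0:
            return text  # the first "{{" is unmatched; leave text alone
    return prefix + rest


print(strip_leading_templates("{{a|{{b}}}}text"))
```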
To remove everything from the first "{{" to the matching "}}" without a regex:
prefix, sep, rest = rev_data.partition("{{")
if sep: # found the first "{{"
    depth = 1
    prevc = None
    for i, c in enumerate(rest):
        if c == "{" and prevc == c: # found "{{"
            depth += 1
            prevc = None # match "{{{ " only once
        elif c == "}" and prevc == c: # found "}}"
            depth -= 1
            if depth == 0: # found matching "}}"
                rev_data = prefix + rest[i+1:] # after matching "}}"
                break
            prevc = None # match "}}} " only once
        else:
            prevc = c

print(rev_data)
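The depth counter above can be packaged as a reusable function. The following is a sketch for Python 3; the function name `strip_first_template` and the sample input are assumptions made for illustration.

```python
def strip_first_template(text):
    """Return text without the first "{{...}}" block, honoring nesting.

    If the first "{{" has no matching "}}", the text is returned unchanged.
    """
    prefix, sep, rest = text.partition("{{")
    if not sep:  # no "{{" at all
        return text
    depth = 1
    prevc = None
    for i, c in enumerate(rest):
        if c == "{" and prevc == c:  # found a nested "{{"
            depth += 1
            prevc = None  # do not reuse this brace for another pair
        elif c == "}" and prevc == c:  # found "}}"
            depth -= 1
            if depth == 0:  # this "}}" closes the first "{{"
                return prefix + rest[i + 1:]
            prevc = None
        else:
            prevc = c
    return text  # the first "{{" was never closed


# Hypothetical sample with one template nested inside another.
sample = "{{Infobox film|director={{lang|as|X}}}}\n'''Papori''' is a film."
print(strip_first_template(sample))
```

Unlike the regex loop, this version removes only the first template and does a single pass over the text.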
Complete example
#!/usr/bin/env python
import urllib, urllib2
import xml.etree.cElementTree as etree

# download & parse xml, find rev data
params = dict(action="query", prop="revisions", rvprop="content",
              format="xml", titles="papori", rvsection=0)
request = urllib2.Request(
    "http://en.wikipedia.org/w/api.php?" + urllib.urlencode(params),
    headers={"User-Agent": "WikiDownloader/1.0",
             "Referer": "http://stackoverflow.com/q/7937855"})
tree = etree.parse(urllib2.urlopen(request))
rev_data = tree.findtext('.//rev')

# remove everything from the first "{{" to matching "}}"
prefix, sep, rest = rev_data.partition("{{")
if sep: # found the first "{{"
    depth = 1
    prevc = None
    for i, c in enumerate(rest):
        if c == "{" and prevc == c: # found "{{"
            depth += 1
            prevc = None # match "{{{ " only once
        elif c == "}" and prevc == c: # found "}}"
            depth -= 1
            if depth == 0: # found matching "}}"
                rev_data = prefix + rest[i+1:] # after matching "}}"
                break
            prevc = None # match "}}} " only once
        else:
            prevc = c

print rev_data
Output
'''Papori''' ({{lang-as|'''পাপৰী'''}}) is an [[Assamese
language]] feature film directed by [[Jahnu Barua]]. The film
stars Gopi Desai, [[Biju Phukan]], Sushil Goswami, Chetana Das
and Dulal Roy. The film was released in 1986.<ref name="ab">{{cite
web|url=http://www.chaosmag.in/barua.html|title=Papori – 1986 –
Assamese film|publisher=Chaosmag|accessdate=4 February
2010}}</ref>
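The complete example is written for Python 2 (`urllib2`, `print` statement). A rough Python 3 equivalent of the download-and-parse step might look like the sketch below. The HTTPS form of the URL, the helper names (`build_request`, `extract_rev`, `fetch_rev`), and the hand-written sample response are assumptions for illustration, and the live API may return a different XML shape than the one shown in the answer.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as etree

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed HTTPS form of the URL


def build_request(title):
    """Build the same revisions query as the complete example."""
    params = dict(action="query", prop="revisions", rvprop="content",
                  format="xml", titles=title, rvsection=0)
    return urllib.request.Request(
        API_URL + "?" + urllib.parse.urlencode(params),
        headers={"User-Agent": "WikiDownloader/1.0"})


def extract_rev(xml_bytes):
    """Return the text of the first <rev> element, or None if absent."""
    return etree.fromstring(xml_bytes).findtext(".//rev")


def fetch_rev(title):
    """Download section 0 of the page (needs network access)."""
    with urllib.request.urlopen(build_request(title)) as response:
        return extract_rev(response.read())


# Offline check with a hand-written response of the expected shape.
sample_xml = (b"<api><query><pages><page><revisions>"
              b"<rev>'''Papori''' is a film.</rev>"
              b"</revisions></page></pages></query></api>")
print(extract_rev(sample_xml))
```

Calling `fetch_rev("papori")` would perform the actual download; the text it returns can then be passed through the template-removal code shown earlier.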