Scraping part of a Wikipedia Infobox


Question

I'm using Python 2.7, requests & BeautifulSoup to scrape approximately 50 Wikipedia pages. I've created a column in my dataframe that holds partial URLs relating to the name of each song (these have been verified previously, and I get response code 200 when testing against all of them).

My code loops through and appends these individual URLs to the main Wikipedia URL. I've been able to get the heading of the page or other data, but what I really want is just the length of the song (I don't need everything else). The song length is contained within an infobox (example here: http://en.wikipedia.org/wiki/No_One_Knows).

My code either drags through everything on the page or returns nothing at all. I think the main problem is the bit I have underlined below (i.e. mt = ...) - I have put different HTML tags in here, but I either get nothing back or most of the page.

xyz = df.lengthlink  
#column in a dataframe containing partial strings to append to the main Wikipedia url

def songlength():
    url = ('http://en.wikipedia.org/wiki/' + xyz)
    resp = requests.get(url)
    page = resp.content
    take = BeautifulSoup(page)
    mt = take.find_all(____________)
    sign = mt
    return xyz, sign

for xyz in df.lengthlink:
    print songlength()
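
For reference, one possible way to fill that blank is to target the rendered infobox table directly with BeautifulSoup. This is an untested sketch, assuming the page keeps the standard <table class="infobox"> markup and that the row header cell contains exactly the text "Length" (df.lengthlink is the column from the question):

import requests
from bs4 import BeautifulSoup

def songlength(partial):
    # Build the URL for one song at a time (partial comes from df.lengthlink).
    url = 'http://en.wikipedia.org/wiki/' + partial
    resp = requests.get(url)
    soup = BeautifulSoup(resp.content)
    infobox = soup.find('table', class_='infobox')    # first infobox table on the page
    if infobox is None:
        return partial, None
    header = infobox.find('th', text='Length')        # row header cell reading "Length"
    if header is None:
        return partial, None
    cell = header.find_next_sibling('td')              # the value cell next to the header
    return partial, cell.get_text(' ', strip=True) if cell else None

for partial in df.lengthlink:
    print songlength(partial)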

Edited to Add: Using Martijn's suggestion below worked for a single URL (i.e. No_One_Knows) but not for my multiple links. It threw up this error:

InvalidSchema                             Traceback (most recent call last)
<ipython-input-166-b5a10522aa27> in <module>()
      2 xyz = df.lengthlink 
      3 url = 'http://en.wikipedia.org/wiki/' + xyz
----> 4 resp = requests.get(url, params={'action': 'raw'})
      5 page = resp.text
      6 

C:\Python27\lib\site-packages\requests\api.pyc in get(url, **kwargs)
     63 
     64     kwargs.setdefault('allow_redirects', True)
---> 65     return request('get', url, **kwargs)
     66 
     67 

C:\Python27\lib\site-packages\requests\api.pyc in request(method, url,    **kwargs)
     47 
     48     session = sessions.Session()
---> 49     response = session.request(method=method, url=url, **kwargs)
     50     # By explicitly closing the session, we avoid leaving sockets open which
     51     # can trigger a ResourceWarning in some cases, and look like a memory leak

C:\Python27\lib\site-packages\requests\sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    459         }
    460         send_kwargs.update(settings)
--> 461         resp = self.send(prep, **send_kwargs)
    462 
    463         return resp

C:\Python27\lib\site-packages\requests\sessions.pyc in send(self, request, **kwargs)
    565 
    566         # Get the appropriate adapter to use
--> 567         adapter = self.get_adapter(url=request.url)
    568 
    569         # Start time (approximately) of the request

C:\Python27\lib\site-packages\requests\sessions.pyc in get_adapter(self, url)
    644 
    645         # Nothing matches :-/
--> 646         raise InvalidSchema("No connection adapters were found for '%s'" % url)
    647 
    648     def close(self):

InvalidSchema: No connection adapters were found for '1     http://en.wikipedia.org/wiki/Locked_Out_of_Heaven
 2     http://en.wikipedia.org/wiki/No_One_Knows
 3     http://en.wikipedia.org/wiki/Given_to_Fly
 4     http://en.wikipedia.org/wiki/Nothing_as_It_Seems  

Name: lengthlink, Length: 50, dtype: object'
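
Judging from the URL quoted in that traceback, the whole df.lengthlink Series (rather than a single string) is being concatenated onto the base URL, so requests receives the text of the entire column as one "URL". Building the URL one row at a time avoids this; a minimal sketch of that loop:

# Sketch only: iterate over the column so each request gets a single URL string.
for partial in df.lengthlink:
    url = 'http://en.wikipedia.org/wiki/' + partial
    resp = requests.get(url, params={'action': 'raw'})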

Answer

Rather than try and parse the HTML output, try and parse the raw MediaWiki source for the page; the first line that starts with | Length contains the information you are looking for:

url = 'http://en.wikipedia.org/wiki/' + xyz
resp = requests.get(url, params={'action': 'raw'})
page = resp.text
for line in page.splitlines():
    if line.startswith('| Length'):
        length = line.partition('=')[-1].strip()
        break

Demo:

>>> import requests
>>> xyz = 'No_One_Knows'
>>> url = 'http://en.wikipedia.org/wiki/' + xyz
>>> resp = requests.get(url, params={'action': 'raw'})
>>> page = resp.text
>>> for line in page.splitlines():
...     if line.startswith('| Length'):
...         length = line.partition('=')[-1].strip()
...         break
... 
>>> print length
4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>

You can further process this to extract the richer data here (the Radio edit vs. the Album version) as required.
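
To cover all 50 links rather than a single title, the same raw-source lookup can be wrapped in a function and applied row by row. A minimal, untested sketch, assuming the column is named lengthlink as in the question:

import requests

def get_length(partial):
    # Fetch the raw MediaWiki source for a single title and pull the "| Length" line.
    url = 'http://en.wikipedia.org/wiki/' + partial
    resp = requests.get(url, params={'action': 'raw'})
    for line in resp.text.splitlines():
        if line.startswith('| Length'):
            return line.partition('=')[-1].strip()
    return None

# One request per row, so requests always sees a single URL string.
df['length'] = df.lengthlink.apply(get_length)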
