BeautifulSoup: Scraping different data sets having the same set of attributes in the source code
Question
I'm using the BeautifulSoup module to scrape the total number of followers and the total number of tweets from a Twitter account. However, when I inspected the elements of the respective fields on the web page, I found that both fields are enclosed inside the same set of HTML attributes:
Followers
<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav u-textUserColor" data-nav="followers" href="/IAmJericho/followers" data-original-title="2,469,681 Followers">
<span class="ProfileNav-label">Followers</span>
<span class="ProfileNav-value" data-is-compact="true">2.47M</span>
</a>
Tweet count
<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav" data-nav="tweets" tabindex="0" data-original-title="21,769 Tweets">
<span class="ProfileNav-label">Tweets</span>
<span class="ProfileNav-value" data-is-compact="true">21.8K</span>
</a>
This is the script I wrote for the extraction:
import requests
import urllib2
from bs4 import BeautifulSoup
link = "https://twitter.com/iamjericho"
r = urllib2.urlopen(link)
src = r.read()
res = BeautifulSoup(src)
followers = ''
# Each iteration overwrites followers, so only the last match is printed.
for e in res.findAll('span', {'data-is-compact':'true'}):
    followers = e.text
print followers
However, since the total tweet count and the total number of followers are both enclosed inside the same set of HTML attributes, i.e. inside a span tag with class="ProfileNav-value" and data-is-compact="true", running the above script returns only the total number of followers.
How can I extract two sets of information enclosed inside similar HTML attributes with BeautifulSoup?
Recommended answer
In this case, one way to do it is to check that data-is-compact="true" appears only twice, once for each piece of data you want to extract. You also know that tweets comes first and followers second, so you can keep a list with those titles in the same order and use zip to pair each title with its value in a tuple and print both together, like:
import urllib2
from bs4 import BeautifulSoup
# Labels listed in the order the values appear on the page.
profile = ['Tweets', 'Followers']
link = "https://twitter.com/iamjericho"
r = urllib2.urlopen(link)
src = r.read()
res = BeautifulSoup(src)
# Pair each label with the matching span, in document order.
for p, d in zip(profile, res.find_all('span', {'data-is-compact': "true"})):
    print p, d.text
It yields:
Tweets 21,8K
Followers 2,47M
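Relying on the order of the matches works only as long as the page keeps that order. As a hedged alternative, the parent a tags in the question carry distinct data-nav values ("followers" and "tweets"), so each value can be looked up by that key instead. The sketch below is written for Python 3 and runs against markup copied from the question, with the class lists trimmed for brevity. The function name extract_stats is made up for illustration, and the live page markup may differ from this sample.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (class lists trimmed);
# the real page may differ, so treat this as an assumption.
HTML = """
<a class="ProfileNav-stat" data-nav="followers" href="/IAmJericho/followers" data-original-title="2,469,681 Followers">
  <span class="ProfileNav-label">Followers</span>
  <span class="ProfileNav-value" data-is-compact="true">2.47M</span>
</a>
<a class="ProfileNav-stat" data-nav="tweets" tabindex="0" data-original-title="21,769 Tweets">
  <span class="ProfileNav-label">Tweets</span>
  <span class="ProfileNav-value" data-is-compact="true">21.8K</span>
</a>
"""


def extract_stats(html):
    """Map each data-nav key to a (compact value, full title) pair.

    extract_stats is a hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    stats = {}
    # attrs={"data-nav": True} matches any <a> that has the attribute.
    for anchor in soup.find_all("a", attrs={"data-nav": True}):
        value = anchor.find("span", class_="ProfileNav-value")
        if value is None:
            continue  # skip navigation links without a counter
        stats[anchor["data-nav"]] = (
            value.get_text(strip=True),
            anchor.get("data-original-title"),
        )
    return stats


stats = extract_stats(HTML)
for name, (compact, full) in stats.items():
    print(name, compact, full)
```

Because the result is keyed by data-nav, stats["followers"] and stats["tweets"] stay correct even if the two links swap places, and data-original-title gives the unabbreviated count when the compact form is too coarse.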