BeautifulSoup: Scraping different data sets having same set of attributes in the source code

Problem description

I'm using the BeautifulSoup module to scrape the total number of followers and the total number of tweets from a Twitter account. However, when I inspected the elements for these fields on the web page, I found that both fields are enclosed in the same set of HTML attributes:

Followers

<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav u-textUserColor" data-nav="followers" href="/IAmJericho/followers" data-original-title="2,469,681 Followers">
  <span class="ProfileNav-label">Followers</span>
  <span class="ProfileNav-value" data-is-compact="true">2.47M</span>
</a>

Tweet count

<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav" data-nav="tweets" tabindex="0" data-original-title="21,769 Tweets">
  <span class="ProfileNav-label">Tweets</span>
  <span class="ProfileNav-value" data-is-compact="true">21.8K</span>
</a>

This is the script I wrote for scraping:

import requests
import urllib2
from bs4 import BeautifulSoup

link = "https://twitter.com/iamjericho"
r = urllib2.urlopen(link)
src = r.read()
res = BeautifulSoup(src)
followers = ''
for e in res.findAll('span', {'data-is-compact':'true'}):
    followers = e.text

print followers 

However, since both values, the total tweet count and the total number of followers, are enclosed in the same set of HTML attributes, i.e. inside a span tag with class="ProfileNav-value" and data-is-compact="true", running the above script returns only the total number of followers.

How can I extract two sets of information enclosed in similar HTML attributes using BeautifulSoup?
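
The behaviour described above can be reproduced offline. The sketch below is written for Python 3 and runs against a trimmed copy of the two snippets quoted in the question rather than the live page; the name SAMPLE_HTML, the shortened class lists, and the tweets-before-followers ordering are assumptions made for illustration. It suggests that find_all does return both spans, and that only the last one survives because the loop reassigns a single variable:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted in the question (class lists shortened).
# Assumption: the tweets block precedes the followers block on the page.
SAMPLE_HTML = """
<a class="ProfileNav-stat" data-nav="tweets" data-original-title="21,769 Tweets">
  <span class="ProfileNav-label">Tweets</span>
  <span class="ProfileNav-value" data-is-compact="true">21.8K</span>
</a>
<a class="ProfileNav-stat" data-nav="followers" data-original-title="2,469,681 Followers">
  <span class="ProfileNav-label">Followers</span>
  <span class="ProfileNav-value" data-is-compact="true">2.47M</span>
</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every matching span, in document order.
matches = soup.find_all("span", {"data-is-compact": "true"})

# Collecting into a list keeps both values.
values = [span.text for span in matches]

# Reassigning one variable inside the loop keeps only the last match.
last_only = ""
for span in matches:
    last_only = span.text

print(values)
print(last_only)
```

Under these assumptions, both values are present in the result of find_all, so the task reduces to telling them apart.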

Recommended answer

In this case, one way to do it is to check that data-is-compact="true" appears only twice, once for each piece of data you want to extract. Since you also know that tweets come first and followers second, you can keep a list of those titles in the same order and use zip to pair each title with its value, so both are printed together:

import urllib2
from bs4 import BeautifulSoup

# Labels in the order the matching spans appear on the page
profile = ['Tweets', 'Followers']

link = "https://twitter.com/iamjericho"
r = urllib2.urlopen(link)
src = r.read()
res = BeautifulSoup(src)
# Pair each label with the span at the same position
for p, d in zip(profile, res.find_all('span', {'data-is-compact': "true"})):
    print p, d.text

This produces:

Tweets 21,8K
Followers 2,47M
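
The zip approach depends on the order of the spans. As a possible alternative, the markup quoted in the question also carries a data-nav attribute on each enclosing a tag ("tweets" and "followers"), which could be used to label the values without relying on position. The following is a minimal Python 3 sketch under that assumption; profile_stats and SAMPLE_HTML are hypothetical names, the sample is a trimmed copy of the quoted snippets, and it has not been checked against the live page, whose markup may differ:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted in the question (class lists shortened).
SAMPLE_HTML = """
<a class="ProfileNav-stat" data-nav="tweets" data-original-title="21,769 Tweets">
  <span class="ProfileNav-label">Tweets</span>
  <span class="ProfileNav-value" data-is-compact="true">21.8K</span>
</a>
<a class="ProfileNav-stat" data-nav="followers" data-original-title="2,469,681 Followers">
  <span class="ProfileNav-label">Followers</span>
  <span class="ProfileNav-value" data-is-compact="true">2.47M</span>
</a>
"""


def profile_stats(html):
    """Map each data-nav name to its compact and exact values (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    stats = {}
    # attrs={"data-nav": True} matches any <a> tag that has the attribute.
    for anchor in soup.find_all("a", attrs={"data-nav": True}):
        value = anchor.find("span", class_="ProfileNav-value")
        if value is None:
            continue
        # data-original-title looks like "21,769 Tweets"; keep the number part.
        title = anchor.get("data-original-title", "")
        exact = title.split(" ", 1)[0] if title else None
        stats[anchor["data-nav"]] = {"compact": value.text.strip(), "exact": exact}
    return stats


stats = profile_stats(SAMPLE_HTML)
print(stats["tweets"]["compact"], stats["followers"]["compact"])
```

Keying on data-nav means a change in the order of the blocks would not swap the labels, and data-original-title gives the unabbreviated count where it is present.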
