Focusing in on specific results while scraping Twitter with Python and Beautiful Soup 4?

Problem description

This is a follow up to my post Using Python to Scrape Nested Divs and Spans in Twitter?.

I'm not using the Twitter API because it doesn't look at the tweets by hashtag this far back. Complete code and output are below after examples.

I want to scrape specific data from each tweet. name and handle are retrieving exactly what I'm looking for, but I'm having trouble narrowing down the rest of the elements.

As an example:

 link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
 url = link[0]

Retrieves this:

 <a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015">
 <span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>

For url, I only need the href value from the first line.

Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.

How can I narrow down the results to the required data for the url, retweetcount and favcount outputs?

I am planning to have this cycle through all the tweets once I get it working, in case that has an influence on your suggestions.

Complete Code:

 from bs4 import BeautifulSoup
 import requests
 import sys

 url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
 headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
 r = requests.get(url, headers=headers)
 data = r.text.encode('utf-8')
 soup = BeautifulSoup(data, "html.parser")

 name = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
 username = name[0].contents[0]

 handle = soup('span', {'class': 'username js-action-profile-name'})
 userhandle = handle[0].contents[1].contents[0]

 link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
 url = link[0]

 messagetext = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})
 message = messagetext[0]

 retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
 retweetcount = retweets[0]

 favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
 favcount = favorites[0]

 print(username, "\n", "@", userhandle, "\n", "\n", url, "\n", "\n", message, "\n", "\n", retweetcount, "\n", "\n", favcount)  # extra linebreaks for ease of reading

Complete Output:

Michael Peel

@Mikepeeljourno

<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015"><span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>

<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en"><a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/FT?src=hash"><s>#</s><b>FT</b></a> Case closed: <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Thailand?src=hash"><s>#</s><b>Thailand</b></a> police chief proclaims <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Bangkokbombing?src=hash"><s>#</s><b><strong>Bangkokbombing</strong></b></a> solved ahead of his retirement this week -even as questions over case grow</p>

<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>

<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
<div class="IconContainer js-tooltip" title="Undo like">
<div class="HeartAnimationContainer">
<div class="HeartAnimation"></div>
</div>
<span class="u-hiddenVisually">Liked</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">2</span>
</span>
</div>
</button>

It was suggested that the question "BeautifulSoup - extracting attribute values" may already answer this one. However, I think that question and its answers do not have sufficient context or explanation to be helpful in more complex situations. The link to the relevant part of the Beautiful Soup documentation is helpful though: http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20attributes%20of%20Tags

Solution

Use the dictionary-like access to the Tag's attributes.

For example, to get the href attribute value:

links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = links[0]["href"]

Or, if you need to get the href values for every link found:

links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
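The retweet and favorite counts from the question can be narrowed down in a similar way: locate the button, then the inner span that holds the number, and read its text. The following is only a minimal sketch, assuming the page markup matches the sample output in the question (the HTML string is a trimmed copy of it, and the live page may differ). `count_from_button` is a hypothetical helper written for this example, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup shown in the question's output (assumption:
# the real page has the same structure).
html = """
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015">
<span class="_timestamp js-short-timestamp " data-time="1443518016">29 Sep 2015</span></a>
<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" type="button">
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>
<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">2</span>
</span>
</div>
</button>
"""

soup = BeautifulSoup(html, "html.parser")


def count_from_button(button):
    """Hypothetical helper: return the number displayed inside an action button."""
    span = button.find("span", {"class": "ProfileTweet-actionCountForPresentation"})
    # Treat a missing or empty span as a count of 0.
    text = span.get_text(strip=True) if span else ""
    return int(text) if text else 0


# One class name is enough for Beautiful Soup to match a multi-class element.
link = soup.find("a", {"class": "tweet-timestamp"})
url = link["href"]                          # dictionary-like attribute access
timestamp = link.find("span")["data-time"]  # works for any attribute

retweetcount = count_from_button(soup.find("button", {"class": "js-actionRetweet"}))
favcount = count_from_button(soup.find("button", {"class": "js-actionFavorite"}))

print(url, timestamp, retweetcount, favcount)
```

Converting the text with `int` is a design choice for this sketch; keep the raw string instead if the counts may be abbreviated or formatted.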


As a side note, you don't need to specify the complete class value to locate elements. class is a special multi-valued attribute and you can just use one of the classes (if this is enough to narrow down the search for the desired elements). For example, instead of:

soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})

You may use:

soup('a', {'class': 'tweet-timestamp'})

Or, a CSS selector:

soup.select("a.tweet-timestamp")
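As a rough check that these lookups are interchangeable, the sketch below runs them against a small HTML string. The first link is taken from the question's output; the second link and its `/example_user/status/1` href are made-up placeholders so that there is more than one match. The `class_` keyword form is added here for comparison and is not from the answer above.

```python
from bs4 import BeautifulSoup

# First link: from the question's output. Second link: invented placeholder.
# Third link: a hashtag link that should NOT match.
html = """
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320">29 Sep 2015</a>
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/example_user/status/1">28 Sep 2015</a>
<a class="twitter-hashtag pretty-link js-nav" href="/hashtag/FT?src=hash">#FT</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_dict = soup("a", {"class": "tweet-timestamp"})           # attribute dictionary
by_keyword = soup.find_all("a", class_="tweet-timestamp")   # class_ keyword
by_css = soup.select("a.tweet-timestamp")                   # CSS selector


def hrefs(tags):
    """Collect the href attribute of every tag in the result list."""
    return [tag["href"] for tag in tags]


# Each lookup finds the two timestamp links and skips the hashtag link.
same = hrefs(by_dict) == hrefs(by_keyword) == hrefs(by_css)
urls = hrefs(by_css)
print(same, urls)
```

Since every form returns a list of tags, the same list comprehension works for cycling through all tweets on a page.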
