Focusing in on specific results while scraping Twitter with Python and Beautiful Soup 4?

Views: 29. Published: 2021/12/23 20:54:22. Tags: python, twitter, web-scraping, beautifulsoup, html-parsing


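Problem description

This is a follow-up to my post "Using Python to Scrape Nested Divs and Spans in Twitter?".

I'm not using the Twitter API because it doesn't search tweets by hashtag this far back. The complete code and output are below, after the examples.

I want to scrape specific data from each tweet. name and handle are retrieving exactly what I'm looking for, but I'm having trouble narrowing down the rest of the elements.

As an example: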

 link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
 url = link[0]

Retrieves:

 <a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015"><span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>

For the url, I only need the href value from the first line.
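A minimal sketch of one way to pull out just the href, assuming the anchor markup looks like the sample above. A Beautiful Soup tag supports dictionary-style attribute access, so the attribute can be read straight from the matched tag. The `html` string and the `https://twitter.com` base used to build an absolute URL are illustrative assumptions, not part of the original script.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Sample markup copied from the retrieved result above (assumption: the
# live page uses the same structure).
html = (
    '<a class="tweet-timestamp js-permalink js-nav js-tooltip" '
    'href="/Mikepeeljourno/status/648787700980408320" '
    'title="2:13 AM - 29 Sep 2015">'
    '<span class="_timestamp js-short-timestamp ">29 Sep 2015</span></a>'
)

soup = BeautifulSoup(html, "html.parser")

# Matching on a single class is enough here and is less brittle than
# matching the full class string.
link = soup.find('a', class_='tweet-timestamp')

# .get() returns None instead of raising if the attribute is missing.
href = link.get('href') if link is not None else None

# Optional: turn the relative path into an absolute URL.
full_url = urljoin('https://twitter.com', href) if href else None

print(href)
print(full_url)
```

In the original script, the equivalent change would be reading `link[0]['href']` instead of assigning the whole tag `link[0]` to `url`.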

Similarly, the retweets and favorites commands return large chunks of HTML, when all I really need is the numerical value displayed for each one.
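One possible way to get only the number is sketched below, assuming each button contains a `span` with the class `ProfileTweet-actionCountForPresentation`, as in the output further down. The helper name `action_count` and the trimmed sample `html` are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the button markup in the complete output
# (assumption: the live page uses the same structure).
html = '''
<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" type="button">
  <div class="IconTextContainer">
    <span class="ProfileTweet-actionCount">
      <span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
    </span>
  </div>
</button>
<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
  <div class="IconTextContainer">
    <span class="ProfileTweet-actionCount">
      <span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">2</span>
    </span>
  </div>
</button>
'''

soup = BeautifulSoup(html, "html.parser")


def action_count(soup, button_class):
    """Return the count shown inside the first button carrying button_class.

    Returns None if the button or the count span is missing, or if the
    text is not a plain integer.
    """
    button = soup.find('button', class_=button_class)
    if button is None:
        return None
    span = button.find('span', class_='ProfileTweet-actionCountForPresentation')
    if span is None:
        return None
    text = span.get_text(strip=True)
    return int(text) if text.isdigit() else None


retweetcount = action_count(soup, 'js-actionRetweet')
favcount = action_count(soup, 'js-actionFavorite')
print(retweetcount, favcount)
```

Searching inside the button first, and only then for the count span, keeps the retweet and favorite numbers from being mixed up even though both spans share the same class.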

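How can I narrow the results down to just the data needed for the url, retweetcount and favcount outputs?

I am planning to have this loop through all the tweets once I get it working, in case that has an influence on your suggestions.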
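For the planned loop, a rough sketch follows. It assumes every tweet sits inside its own container element; the `div` with class `tweet` used here is a hypothetical container name that does not appear in the output below, so it would need to be checked against the real page. The second tweet in the sample is a made-up placeholder that shows how missing elements are handled.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the 'tweet' container class is an assumption and the
# second entry is placeholder data, not a real tweet.
html = '''
<div class="tweet">
  <a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320">29 Sep 2015</a>
  <button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet">
    <span class="ProfileTweet-actionCountForPresentation">4</span>
  </button>
  <button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite">
    <span class="ProfileTweet-actionCountForPresentation">2</span>
  </button>
</div>
<div class="tweet">
  <a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/example_user/status/1">placeholder</a>
</div>
'''

soup = BeautifulSoup(html, "html.parser")


def count_in(tweet, button_class):
    """Return the count text inside the given button of one tweet, or None."""
    button = tweet.find('button', class_=button_class)
    if button is None:
        return None
    span = button.find('span', class_='ProfileTweet-actionCountForPresentation')
    return span.get_text(strip=True) if span is not None else None


rows = []
# Searching within each container keeps the fields of one tweet together,
# instead of indexing page-wide result lists with [0].
for tweet in soup.find_all('div', class_='tweet'):
    link = tweet.find('a', class_='tweet-timestamp')
    rows.append({
        'url': link.get('href') if link is not None else None,
        'retweets': count_in(tweet, 'js-actionRetweet'),
        'favorites': count_in(tweet, 'js-actionFavorite'),
    })

for row in rows:
    print(row)
```

The counts are kept as strings here so that non-numeric display values pass through unchanged; converting them to integers is a separate decision.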

Complete Code:

 from bs4 import BeautifulSoup
 import requests
 import sys

 url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
 headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
 r = requests.get(url, headers=headers)
 data = r.text.encode('utf-8')
 soup = BeautifulSoup(data, "html.parser")

 name = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
 username = name[0].contents[0]

 handle = soup('span', {'class': 'username js-action-profile-name'})
 userhandle = handle[0].contents[1].contents[0]

 link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
 url = link[0]

 messagetext = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
 message = messagetext[0]

 retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
 retweetcount = retweets[0]

 favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
 favcount = favorites[0]

 print(username, "\n", "@", userhandle, "\n", "\n", url, "\n", "\n", message, "\n", "\n", retweetcount, "\n", "\n", favcount)  # extra linebreaks for ease of reading

Complete Output:

 Michael Peel
 @Mikepeeljourno

 <a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015"><span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>

 <p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en"><a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/FT?src=hash"><s>#</s><b>FT</b></a> Case closed: <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Thailand?src=hash"><s>#</s><b>Thailand</b></a> police chief proclaims <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Bangkokbombing?src=hash"><s>#</s><b><strong>Bangkokbombing</strong></b></a> solved ahead of his retirement this week -even as questions over case grow</p>

 <button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
 <div class="IconContainer js-tooltip" title="Undo retweet">
 <span class="Icon Icon--retweet"></span>
 <span class="u-hiddenVisually">Retweeted</span>
 </div>
 <div class="IconTextContainer">
 <span class="ProfileTweet-actionCount">
 <span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
 </span>
 </div>
 </button>

 <button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
 <div class="IconContainer js-tooltip" title="Undo like">
 <div class="HeartAnimationContainer">
 <div class="HeartAnimation"></div>
 </div>
 <span class="u-hiddenVisually">Liked</span>
 </div>
 <div class="IconTextContainer">
 <span class="ProfileTweet-actionCount">
 <span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">2</span>
 </span>
 </div>
 </button>
