Using Python to Scrape Nested Divs and Spans in Twitter?

Problem description

I'm trying to scrape the like and retweet counts from the results of a Twitter search.

After running the Python code below, I get an empty list, []. I'm not using the Twitter API because it doesn't return tweets by hashtag from this far back.

The code I'm using is:

from bs4 import BeautifulSoup
import requests

url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
all_likes = soup.find_all('span', class_='ProfileTweet-actionCountForPresentation')
print(all_likes)

I can successfully save the HTML to a file using the code below. When I search the saved text, large amounts of information are missing, such as the class names I am looking for.

So (part of) the problem is apparently in accurately accessing the source code.

# Save the raw response text so it can be inspected by hand.
filename = 'newfile2.txt'
with open(filename, 'w', encoding='utf-8') as handle:
    handle.write(data)
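Rather than searching the saved file by hand, one way to check whether the class is present in the response at all is to count the tags that carry it. The following is a minimal sketch using only the standard library; the names `ClassCounter` and `count_class` and the sample HTML strings are hypothetical and stand in for the real response text.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute includes a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many tags in `html` carry `class_name`."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical samples: one with the target span, one with an empty timeline.
target = "ProfileTweet-actionCountForPresentation"
with_span = '<div id="timeline"><span class="' + target + '">3</span></div>'
without_span = '<div id="timeline"></div>'
print(count_class(with_span, target))
print(count_class(without_span, target))
```

Calling `count_class(data, target)` on the real response text would show whether the spans were ever delivered; a count of zero suggests the problem lies in the request, not in the `find_all` call.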

This screenshot shows the span that I'm trying to scrape.

I've looked at this question and others like it, but I'm not quite getting there:
How can I use BeautifulSoup to get deeply nested div values?

Recommended answer

It seems that your GET request returns valid HTML, but with no tweet elements inside the #timeline element. However, adding a User-Agent to the request headers seems to remedy this.

from bs4 import BeautifulSoup
import requests

url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text
soup = BeautifulSoup(data, "lxml")
all_likes = soup.find_all('span', class_='ProfileTweet-actionCountForPresentation')
print(all_likes)
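Once `all_likes` is non-empty, the span texts still have to be turned into numbers. The sketch below shows one possible way to do that; `parse_count` is a hypothetical helper, and the formats it accepts (plain digits, thousands separators, and abbreviated values with a `K` or `M` suffix) are assumptions about how the counts are displayed, not something confirmed by the question.

```python
import re


def parse_count(text):
    """Convert a displayed count such as '12', '1,234' or '1.2K' to an int."""
    cleaned = text.strip().replace(",", "")
    if not cleaned:
        # Assumption: an empty span means the tweet has no likes/retweets.
        return 0
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KM]?)", cleaned, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized count: {text!r}")
    number, suffix = match.groups()
    multiplier = {"": 1, "K": 1_000, "M": 1_000_000}[suffix.upper()]
    return int(round(float(number) * multiplier))


# Hypothetical span texts standing in for span.get_text() results.
print([parse_count(t) for t in ["12", "1,234", "1.2K", ""]])
```

With the BeautifulSoup result from the answer, this could be applied as `[parse_count(span.get_text()) for span in all_likes]`.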
