Infinite Web Scraping Twitter


Problem Description

I'm trying to web scrape Twitter using Python 3.X, but I only collect the last 20 tweets of my request. I would like to collect all the data for a request between 2006 and now. For this I think I need to create two more functions: one that will collect the older tweets and one that will collect the current tweets. How can I collect the data from this scrolling page? I think I have to use the tweet ids, but no matter what request I make, it's always the same last 20 tweets that I get.

from pprint import pprint
import requests
import datetime as dt
from bs4 import BeautifulSoup  # Python 3: BeautifulSoup lives in the bs4 package

def search_twitter(search):
    # Fetch the first page of search results for the query.
    url = "https://twitter.com/search?f=tweets&vertical=default&q="+search+"&src=typd&lang=fr"
    request = requests.get(url)
    sourceCode = BeautifulSoup(request.content, "lxml")
    tweets = sourceCode.find_all('li', 'js-stream-item')
    return tweets

def filter_tweets(tweets):
    # Keep only stream items that actually contain tweet text and
    # extract the id, username, timestamp and text of each one.
    data = []
    for tweet in tweets:
        if tweet.find('p', 'tweet-text'):
            dtwee = [
                ['id', tweet['data-item-id']],
                ['username', tweet.find('span', 'username').text],
                ['time', tweet.find('a', 'tweet-timestamp')['title']],
                ['tweet', tweet.find('p', 'tweet-text').text.encode('utf-8')]]
            data.append(dtwee)
            #tweet_time = dt.datetime.strptime(tweet_time, '%H:%M - %d %B %Y')
    return data

def firstlastId_tweets(tweets):
    # Return the ids of the first and last tweets of a filtered batch.
    firstID = ""
    lastID = ""
    for i, tweet in enumerate(tweets):
        if i == 0:
            firstID = tweet[0][1]
        if i == len(tweets) - 1:
            lastID = tweet[0][1]
    return firstID, lastID

def last_tweets(search, lastID):
    # Request the next batch of results, positioned after the last tweet seen.
    url = "https://twitter.com/search?f=tweets&vertical=default&q="+search+"&src=typd&lang=fr&max_position=TWEET-"+lastID
    request = requests.get(url)
    sourceCode = BeautifulSoup(request.content, "lxml")
    tweets = sourceCode.find_all('li', 'js-stream-item')
    return tweets

tweets = search_twitter("lol")
tweets = filter_tweets(tweets)
pprint(tweets)
firstID, lastID = firstlastId_tweets(tweets)
print(firstID, lastID)
while True:
    # filter_tweets must run first: firstlastId_tweets expects filtered batches.
    lastTweets = filter_tweets(last_tweets("lol", lastID))
    pprint(lastTweets)
    firstID, lastID = firstlastId_tweets(lastTweets)
    print(firstID, lastID)

Recommended Answer

I found a good solution based on this webpage:

http://ataspinar.com/2015/11/09/collecting-data-from-twitter/

What I did was create a variable called max_pos where I stored this string:

'&max_position=TWEET-'+last_id+'-'+first_id

I stored first_id (the id of the tweet at position 1) and last_id (the id of the tweet at position 20).
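For example, with a filtered batch in the list-of-pairs shape that filter_tweets above returns, the ids and the position string could be pulled out like this (a sketch, not from the original answer):

first_id = tweets[0][0][1]    # the ['id', ...] pair of the first tweet in the batch
last_id = tweets[-1][0][1]    # the ['id', ...] pair of the last (20th) tweet
max_pos = '&max_position=TWEET-' + last_id + '-' + first_id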

So for the request, I used something like this:

request = requests.get(url + max_pos)

starting with max_pos empty.

I see this can be a common issue, so it's worth posting a working solution. I still don't have it showing the results the way I need, but I can simulate the "scroll down till the end" behavior by following the guide from the link.
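Putting the pieces together, here is a minimal sketch of the whole pagination loop. It assumes the legacy twitter.com/search HTML endpoint and the js-stream-item markup from the question still behave as described; the scrape_all helper name and the stop conditions are mine:

import requests
from bs4 import BeautifulSoup

def scrape_all(search):
    # Hypothetical helper: pages through the legacy Twitter search HTML by
    # rebuilding max_pos from the first and last tweet ids of each batch.
    url = "https://twitter.com/search?f=tweets&vertical=default&q="+search+"&src=typd&lang=fr"
    max_pos = ""          # start with max_pos empty
    prev_last_id = None
    while True:
        request = requests.get(url + max_pos)
        sourceCode = BeautifulSoup(request.content, "lxml")
        tweets = sourceCode.find_all('li', 'js-stream-item')
        if not tweets:
            break                              # nothing left: we reached the end
        first_id = tweets[0]['data-item-id']   # id of the tweet at position 1
        last_id = tweets[-1]['data-item-id']   # id of the tweet at position 20
        if last_id == prev_last_id:
            break                              # same batch twice: stop looping
        prev_last_id = last_id
        for tweet in tweets:
            if tweet.find('p', 'tweet-text'):
                print(tweet.find('p', 'tweet-text').text)
        max_pos = '&max_position=TWEET-' + last_id + '-' + first_id

scrape_all("lol")

Each pass rebuilds max_pos from the newest batch, which is what simulates scrolling the infinite page down to the end.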
