Extracting external links from tweets in Python

Problem description

I wrote this simple program to extract links from a given user's tweets. It does extract the links inside the tweets, but every link I get back is shortened with t.co as the domain.

The problem is that these links sometimes lead to other tweets. How do I get the links from tweets and make sure they point to an external site rather than to Twitter itself?

I hope my question is clear; this is the best way I can describe it.

Thanks.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import re

# http://www.tweepy.org/
import tweepy

# Get your Twitter API credentials and enter them here
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

# print every URL found in the text of a user's last 200 tweets
def get_tweets(username):

    # http://tweepy.readthedocs.org/en/v3.1.0/getting_started.html#api
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    # set count to however many tweets you want; twitter only allows 200 at once
    number_of_tweets = 200

    # get tweets
    tweets = api.user_timeline(screen_name=username, count=number_of_tweets)

    for tweet in tweets:
        # find every http/https URL in the tweet text
        urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.text)
        for url in urls:
            print(url)


# if we're running this as a script
if __name__ == '__main__':

    # get tweets for username passed at command line
    if len(sys.argv) == 2:
        get_tweets(sys.argv[1])
    else:
        print("Error: enter one username")

    # alternative method: loop through multiple users
    # users = ['user1', 'user2']

    # for user in users:
    #     get_tweets(user)

Here is an output sample: (I could not post it because it contains shortened links; the editor would not allow me to.)
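
For illustration only (this is not the asker's actual output), here is a minimal sketch that runs the same pattern over a made-up tweet text. The sample text and the `https://t.co/AbC123xyz` link are hypothetical; the point is that, as the question describes, the tweet text only carries the t.co wrapper, so that is all the regex can return.

```python
import re

# Same pattern as in the program above, written as a raw string.
URL_PATTERN = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Hypothetical tweet text; real tweets come from api.user_timeline().
sample_text = "New blog post is up https://t.co/AbC123xyz #python"

# All groups in the pattern are non-capturing, so findall returns full matches.
found = re.findall(URL_PATTERN, sample_text)
print(found)
```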

Recommended answer

You need to get the redirected URL. First, add import urllib.request (this module was named urllib2 in Python 2), then try the following code:

for tweet in tweets:
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.text)
    for url in urls:
        try:
            # follow the t.co redirect and read back the final URL
            res = urllib.request.urlopen(url)
            actual_url = res.geturl()
            print(actual_url)
        except Exception:
            # the request failed (e.g. an invalid URL), so print the original URL
            print(url)

I have the try..except block because some of the tweets I tested yielded invalid URLs.
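
The code above resolves each shortened link but still prints links that end up on Twitter. One possible way to keep only external links is to check the host name of the resolved URL. This is a minimal sketch and not part of the original answer: the helper name `is_external` and the set `TWITTER_HOSTS` are made up for illustration, and the list of hosts treated as Twitter's own is an assumption you may need to extend.

```python
from urllib.parse import urlparse

# Assumption: hosts considered "Twitter itself"; extend as needed.
TWITTER_HOSTS = {"twitter.com", "t.co"}


def is_external(url):
    """Return True if url points to a host outside TWITTER_HOSTS."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return False  # no host could be parsed, so treat it as not external
    # Match the host itself or any subdomain such as mobile.twitter.com
    return not any(host == h or host.endswith("." + h) for h in TWITTER_HOSTS)


# Hypothetical resolved URLs, standing in for actual_url values
resolved = [
    "https://twitter.com/someuser/status/1234567890",
    "https://mobile.twitter.com/someuser",
    "https://example.com/article",
]
external = [u for u in resolved if is_external(u)]
print(external)
```

In the loop from the answer, this would mean printing actual_url only when is_external(actual_url) is true.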
