Web Scraping with Selenium Python [Twitter + Instagram]


Question

I am trying to scrape both Instagram and Twitter by geolocation. I can run a search query, but I am having trouble reloading the web page to load more results and storing the fields in a DataFrame.

I did find a couple of examples of scraping Twitter and Instagram without API keys, but they search by #tag keywords.

I am trying to scrape by geolocation and between past dates. So far I have written the code below in Python 3.x, with the latest versions of all packages in Anaconda.

'''
    Instagram - Components
    "id": "1478232643287060472", 
     "dimensions": {"height": 1080, "width": 1080}, 
     "owner": {"id": "351633262"}, 
     "thumbnail_src": "https://instagram.fdel1-1.fna.fbcdn.net/t51.2885-15/s640x640/sh0.08/e35/17439262_973184322815940_668652714938335232_n.jpg", 
     "is_video": false, 
     "code": "BSDvMHOgw_4", 
     "date": 1490439084, 
     "taken-at=213385402"
     "display_src": "https://instagram.fdel1-1.fna.fbcdn.net/t51.2885-15/e35/17439262_973184322815940_668652714938335232_n.jpg", 
     "caption": "Hakuna jambo zuri kama kumpa Mungu shukrani kwa kila jambo.. \ud83d\ude4f\ud83c\udffe\nIts weekend\n#lifeistooshorttobeunhappy\n#Godisgood \n#happysoul \ud83d\ude00", 
     "comments": {"count": 42}, 
     "likes": {"count": 3813}}, 
'''


import selenium
from selenium import webdriver
#from selenium import selenium
from bs4 import BeautifulSoup
import pandas

#geotags = pd.read_csv("geocodes.csv")
#parmalink = 
query = geocode%3A35.68501%2C139.7514%2C30km%20since:2016-03-01%20until:2016-03-02&f=tweets

twitterURL = 'https://twitter.com/search?q=' + query
#instaURL = "https://www.instagram.com/explore/locations/213385402/"


browser = webdriver.Firefox()
browser.get(twitterURL)
content = browser.page_source

soup = BeautifulSoup(content)
print (soup)

For the Twitter search query, I am getting a syntax error.

For Instagram I am not getting any error, but I am not able to load more posts and write them back to a CSV file or DataFrame.

I am also trying to search by latitude and longitude in both Twitter and Instagram.

I have a list of geo coordinates in a CSV file that I can use as input, or I can write a search query by hand.

Any way to complete the scraping by location would be appreciated.

Thanks for the help!

Recommended answer

I managed to make it work using requests. The syntax error in your code comes from the query not being wrapped in quotes, so Python cannot parse it as a string. Your code would look something like this:

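from bs4 import BeautifulSoup
import requests

# The query must be a quoted string; it is already URL-encoded
query = "geocode%3A35.68501%2C139.7514%2C30km%20since:2016-03-01%20until:2016-03-02&f=tweets"

twitter = 'https://twitter.com/search?q=' + query

# Fetch the page and parse the returned HTML with an explicit parser
content = requests.get(twitter)
soup = BeautifulSoup(content.text, "html.parser")

print(soup)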
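The query above hard-codes a single coordinate. Since the question mentions a list of geo coordinates, here is a minimal sketch of how the same search URL could be built for each one. The function name `build_search_url`, its parameters, and the sample coordinate list are assumptions made for illustration; only the URL pattern comes from the code above.

```python
from urllib.parse import quote

def build_search_url(lat, lon, radius_km, since, until):
    """Build a geocode search URL following the pattern used above.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    query = f"geocode:{lat},{lon},{radius_km}km since:{since} until:{until}"
    # safe="" percent-encodes ':' ',' and spaces so the query survives in a URL
    return "https://twitter.com/search?q=" + quote(query, safe="") + "&f=tweets"

# Assumed sample input; in practice these pairs would be read from the CSV file
coordinates = [(35.68501, 139.7514), (51.5074, -0.1278)]

urls = [
    build_search_url(lat, lon, 30, "2016-03-01", "2016-03-02")
    for lat, lon in coordinates
]

for url in urls:
    print(url)
```

Each URL can then be passed to `requests.get` in place of the hard-coded `twitter` string.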

You can then use the soup object to parse what you need. The same approach should work for Instagram, provided your query is correct.
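As a rough illustration of parsing with the soup object, the sketch below pulls a few fields out of some sample HTML into a list of dictionaries. The markup, the `div.tweet` and `p.tweet-text` selectors, the `data-tweet-id` attribute, and the `extract_rows` helper are all assumptions; the real page structure must be inspected in the browser and will likely differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these class and attribute names are assumptions,
# not the actual structure of any Twitter or Instagram page.
SAMPLE_HTML = """
<div class="tweet" data-tweet-id="1"><p class="tweet-text">Hello from Tokyo</p></div>
<div class="tweet" data-tweet-id="2"><p class="tweet-text">Second post</p></div>
"""

def extract_rows(html):
    """Return one dict per matching element (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for node in soup.select("div.tweet"):
        text_node = node.select_one("p.tweet-text")
        rows.append({
            "id": node.get("data-tweet-id"),
            # Fall back to an empty string when the text element is missing
            "text": text_node.get_text(strip=True) if text_node else "",
        })
    return rows

rows = extract_rows(SAMPLE_HTML)
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)` and written out with `to_csv`, which covers the storage step asked about in the question.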
