How do I scrape a full Instagram page in Python?


Question

Long story short, I'm trying to create an Instagram scraper in Python that loads the entire page and grabs all the links to the images. I have it working; the only problem is that it only loads the first 12 photos that Instagram shows. Is there any way I can tell requests to load the entire page?

Working code:

import json

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.instagram.com/accountName/')
soup = BeautifulSoup(r.text, 'lxml')

# Find the inline script that assigns window._sharedData. The "t and" guard
# skips script tags that have no string content (t is None for those).
script = soup.find('script', string=lambda t: t and t.startswith('window._sharedData'))
# Drop the "window._sharedData = " prefix and the trailing semicolon.
page_json = script.string.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)

# Only the first 12 posts are embedded in the page itself.
for post in data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']:
    image_src = post['node']['display_url']
    print(image_src)

Recommended answer

As Scratch already mentioned, Instagram uses "infinite scrolling", so a single request cannot load the entire page. You can, however, read the total number of posts at the top of the page (in the first span with the _fd86t class) and compare it with the number of posts the page already contains. If some are missing, you'll have to send a GET request to fetch a new JSON response. Conveniently, this request contains a first field, which seems to let you change how many posts are returned. You can raise it from its default of 12 to (hopefully) get all of the remaining posts.
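The completeness check described above could be sketched like this. It is only a minimal sketch: it assumes the edge_owner_to_timeline_media object from the question's JSON also carries a count key holding the total number of posts, which is not confirmed here, and has_all_posts is a hypothetical helper name.

```python
def has_all_posts(timeline):
    """Return True if the loaded edges already cover the reported total.

    `timeline` is assumed to be the 'edge_owner_to_timeline_media' dict
    from the page JSON; the 'count' key is an assumption about its layout.
    """
    total = timeline.get('count')
    loaded = len(timeline.get('edges', []))
    if total is None:
        # Without a reported total we cannot tell, so treat it as incomplete.
        return False
    return loaded >= total
```

If this returns False, the remaining posts have to be fetched with further requests.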

The necessary request looks similar to the following (I've anonymised the actual values, and had some help from the comments):

https://www.instagram.com/graphql/query/?query_hash=472f257a40c653c64c666ce877d59d2b&variables={"id":"XXX","first":12,"after":"XXX"}
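A rough sketch of how such requests might be chained to walk through all posts follows. The query_hash and the id, first and after variables come from the URL above. Everything else is an assumption: the response layout (data, user, edge_owner_to_timeline_media, page_info with has_next_page and end_cursor) and the helper names build_query_url, iter_image_urls and fetch_json are hypothetical and may not match what Instagram actually returns.

```python
import json
from urllib.parse import urlencode

QUERY_URL = 'https://www.instagram.com/graphql/query/'
QUERY_HASH = '472f257a40c653c64c666ce877d59d2b'  # taken from the URL above


def build_query_url(user_id, first=12, after=None):
    """Build the query URL with the variables sent as URL-encoded JSON."""
    variables = {'id': user_id, 'first': first}
    if after is not None:
        variables['after'] = after
    params = {
        'query_hash': QUERY_HASH,
        'variables': json.dumps(variables, separators=(',', ':')),
    }
    return QUERY_URL + '?' + urlencode(params)


def iter_image_urls(fetch_json, user_id, page_size=12):
    """Yield display_url for each post, following the 'after' cursor.

    fetch_json is any callable that takes a URL and returns the decoded
    JSON body. The response keys used below are ASSUMPTIONS.
    """
    after = None
    while True:
        payload = fetch_json(build_query_url(user_id, page_size, after))
        timeline = payload['data']['user']['edge_owner_to_timeline_media']
        for edge in timeline['edges']:
            yield edge['node']['display_url']
        page_info = timeline['page_info']
        if not page_info['has_next_page']:
            break
        after = page_info['end_cursor']
```

With requests, fetch_json could be something like lambda url: requests.get(url).json(). Passing the fetcher in as a parameter keeps the pagination logic testable without network access.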
