How do I scrape a full Instagram page in Python?


Question

Long story short, I'm trying to create an Instagram scraper in Python that loads the entire page and grabs all the links to the images. I have it working; the only problem is that it loads just the first 12 photos that Instagram shows. Is there any way I can tell requests to load the entire page?

Working code:

import json
import requests
from bs4 import BeautifulSoup

# Fetch the public profile page (replace accountName with the target account).
r = requests.get('https://www.instagram.com/accountName/')
soup = BeautifulSoup(r.text, 'lxml')

# The initial page data is embedded in a <script> tag as "window._sharedData = {...};".
# Some <script> tags have no text, so guard against t being None.
script = soup.find('script', string=lambda t: t and t.startswith('window._sharedData'))
page_json = script.string.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)

# Only the first 12 posts are embedded in the page itself.
for post in data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']:
    image_src = post['node']['display_url']
    print(image_src)

Recommended answer

As Scratch already mentioned, Instagram uses "infinite scrolling", so a single request cannot load the entire page. You can, however, check the total number of posts at the top of the page (in the first span with the _fd86t class) and compare it with the number of posts the page already contains. If some are missing, you'll have to issue a GET request to obtain a new JSON response. The useful part is that this request contains a "first" field, which seems to let you change how many posts are returned. You can try raising it from its default of 12 to get all of the remaining posts (hopefully).

The necessary request looks similar to the following (I've anonymised the actual values, and had some help from the comments):

https://www.instagram.com/graphql/query/?query_hash=472f257a40c653c64c666ce877d59d2b&variables={"id":"XXX","first":12,"after":"XXX"}
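A minimal sketch of how that request could be built and paged through in Python. Several things here are assumptions rather than anything confirmed above: the helper names build_query_url and iter_image_urls are made up for illustration, the response is assumed to mirror the profile-page JSON (data -> user -> edge_owner_to_timeline_media -> edges) and to carry a page_info object with has_next_page and end_cursor, and the query_hash is simply the one quoted in the URL and may stop working at any time. The fetch step is passed in as a function so the paging logic can be tried without touching the network.

```python
import json
from urllib.parse import urlencode

GRAPHQL_URL = "https://www.instagram.com/graphql/query/"
# Value quoted in the answer above; assumed to be subject to change.
QUERY_HASH = "472f257a40c653c64c666ce877d59d2b"


def build_query_url(user_id, first=12, after=None):
    """Build the query URL with the variables JSON-encoded and URL-escaped."""
    variables = {"id": user_id, "first": first}
    if after:
        variables["after"] = after
    query = {
        "query_hash": QUERY_HASH,
        "variables": json.dumps(variables, separators=(",", ":")),
    }
    return GRAPHQL_URL + "?" + urlencode(query)


def iter_image_urls(user_id, fetch_json, page_size=12):
    """Yield display_url for each post, following the cursor page by page.

    fetch_json is any callable that takes a URL and returns the decoded JSON.
    """
    after = None
    while True:
        payload = fetch_json(build_query_url(user_id, page_size, after))
        # Assumed response layout, modelled on the profile-page JSON.
        media = payload["data"]["user"]["edge_owner_to_timeline_media"]
        for edge in media["edges"]:
            yield edge["node"]["display_url"]
        page_info = media.get("page_info", {})
        after = page_info.get("end_cursor")
        # Stop when the server reports no further page or gives no cursor.
        if not page_info.get("has_next_page") or not after:
            break
```

With requests, fetch_json could be something like lambda url: requests.get(url).json(). Paging with the cursor is used here instead of one very large "first" value because the answer only hopes that raising "first" returns everything; the server may cap it.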
