Scraping Instagram with BeautifulSoup


Problem Description

I'm trying to get a particular string from the "search by tag" page on Instagram. I'd like to get the image URL from this element:

<img alt="#yeşil  #manzara #doğa  
#yayla #nature #naturelovers #adventuretime #adventures #mountainstaries 
#picture #şehirdenuzak  #tatil #holiday #cow  #potography #view #kütükev 
#naturelife #animal #amazing  #kar #winter #winteriscomming #mapavr1 #artvin 
#tulumile #insaatr #tulumci #rize"
class="_2di5p" sizes="171px" srcset="https://scontent-mxp11.cdninstagram.com/vp/c883e0c4267c003843fafeda255f1329/5A9D3C97/t51.2885-15/s150x150/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 150w,
https://scontent-mxp1-1.cdninstagram.com/vp/6a3480f8658b50c691bcc100a96cc6f0/5A9CC9DC/t51.2885-15/s240x240/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 240w,
https://scontent-mxp1-1.cdninstagram.com/vp/461c138e15f52420c3fbc075fab027eb/5A9DD808/t51.2885-15/s320x320/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 320w,
https://scontent-mxp1-1.cdninstagram.com/vp/ad5d67f1c9ea77d78d145501e73c2ea0/5A9CAF9D/t51.2885-15/s480x480/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 480w,
https://scontent-mxp1-1.cdninstagram.com/vp/e0636f79adc1ae53f7321d10fe60f275/5A9CD134/t51.2885-15/s640x640/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg 640w" 
src="https://scontent-mxp1-1.cdninstagram.com/vp/e0636f79adc1ae53f7321d10fe60f275/5A9CD134/t51.2885-15/s640x640/e15/c0.90.720.720/28154674_2016914221854461_991623208941649920_n.jpg" style="">

So basically I would like to get this string (the one with 240w at the end):

https://scontent-mxp1-1.cdninstagram.com/vp/6a3480f8658b50c691bcc100a96cc6f0/../n.jpg

I tried writing this code in Python, but it doesn't work:

import requests
from bs4 import BeautifulSoup

request = requests.get("https://www.instagram.com/explore/tags/nature/")
content = request.content
soup = BeautifulSoup(content,"html.parser")
element = soup.find("srcset")
print(element.text.strip())

Maybe the real problem is that there are 21 elements like this one on the page, but to start I'd like to understand how to get that string.

(And, if any of you know a good tutorial or book for bs4, could you tell me?)

Recommended Answer

The reason you can't see any output is that the images are added to the page dynamically with JavaScript, so the HTML you've provided isn't available in the page source. The easiest way to overcome this is to use Selenium.
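For reference, here is a minimal sketch of the Selenium route. It is not part of the original answer and makes a few assumptions: a Chrome driver is installed, the tag page renders its img tags without requiring login, and each srcset contains a 240w candidate.

from selenium import webdriver
from bs4 import BeautifulSoup

# Let the browser execute the JavaScript, then hand the rendered DOM to bs4.
driver = webdriver.Chrome()
driver.get('https://www.instagram.com/explore/tags/nature/')
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for img in soup.find_all('img'):
    srcset = img.get('srcset')
    if not srcset:
        continue
    # srcset looks like "url1 150w, url2 240w, ..."; keep the 240w candidate.
    for candidate in srcset.split(','):
        url, _, width = candidate.strip().rpartition(' ')
        if width == '240w':
            print(url)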

But there's another way to scrape it. Looking at the page source, the data you're after is available inside a <script> tag in JSON form. The relevant data looks like this:

"thumbnail_resources": [
    {
        "src": "https://instagram.fpnq3-1.fna.fbcdn.net/vp/a3ed0ee1af581f1c1fe6170b8c080e7c/5B2CA660/t51.2885-15/s150x150/e35/28433503_571483933190064_5347634166450094080_n.jpg",
         "config_width": 150,
         "config_height": 150
     },
     {
         "src": "https://instagram.fpnq3-1.fna.fbcdn.net/vp/7a0bb4fb1b5d5e3b179c58a2b9472b9f/5B2C535F/t51.2885-15/s240x240/e35/28433503_571483933190064_5347634166450094080_n.jpg",
         "config_width": 240,
         "config_height": 240
     },

To get the JSON, you can use this (code taken from this answer):

# "t and" guards against <script src=...> tags, whose string is None.
script = soup.find('script', text=lambda t: t and t.startswith('window._sharedData'))
# Drop the "window._sharedData = " prefix and the trailing ";" to get raw JSON.
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)

Code to get the image link for all the images:

import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.instagram.com/explore/tags/nature/')
soup = BeautifulSoup(r.text, 'lxml')

# Find the script that assigns window._sharedData and parse its JSON payload
# ("t and" guards against <script src=...> tags, whose string is None).
script = soup.find('script', text=lambda t: t and t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)

# Each edge is one post; thumbnail_resources[1] is the 240x240 thumbnail.
for post in data['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']['edges']:
    image_src = post['node']['thumbnail_resources'][1]['src']
    print(image_src)

Partial output:

https://instagram.fpnq3-1.fna.fbcdn.net/vp/e8a78407fb61de834cad7f10eca830fc/5A9DC375/t51.2885-15/s240x240/e15/c0.80.640.640/28766397_174603559842180_1092148752455565312_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/3a20f36647c86c2196f259b5d14ebf82/5A9D5BC9/t51.2885-15/s240x240/e15/28433802_283862648812409_3322859933120069632_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/82216be4596dd9da862ba267cdeab517/5B144226/t51.2885-15/s240x240/e35/c0.135.1080.1080/28157436_941679549319762_5605299824451649536_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/e50eab90b2e0951d67922e49b495e1fc/5B3EC9B8/t51.2885-15/s240x240/e35/c135.0.810.810/28754107_179533402825352_1137703808411893760_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/d3a13e7b81a65421b4318b57fb8ee24e/5B4D9EFF/t51.2885-15/s240x240/e35/28433583_375555202918683_1951892035636035584_n.jpg
https://instagram.fpnq3-1.fna.fbcdn.net/vp/1b0aeea1b9be983498192d350e039aa0/5B43C583/t51.2885-15/s240x240/e35/28156427_154249191953160_9219472301039288320_n.jpg
...

Note: the [1] in the line image_src = post['node']['thumbnail_resources'][1]['src'] selects the 240w version. You can use 0, 1, 2, 3, or 4 for 150w, 240w, 320w, 480w, or 640w respectively. Also, if you want any other data about an image, such as the number of likes, the comments, or the caption, everything is available in this JSON (the data variable).
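In case it helps, a hedged sketch of pulling a few of those extra fields, continuing from the data variable built above. The key names edge_liked_by, edge_media_to_caption, and shortcode are assumptions based on the JSON structure Instagram served at the time; print one node dict yourself to confirm them.

for post in data['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']['edges']:
    node = post['node']
    # .get() keeps the loop alive if Instagram renames a key (names assumed).
    likes = node.get('edge_liked_by', {}).get('count')
    caption_edges = node.get('edge_media_to_caption', {}).get('edges', [])
    caption = caption_edges[0]['node']['text'] if caption_edges else ''
    print(node.get('shortcode'), likes, caption[:60])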
