How to scrape Instagram with BeautifulSoup


Question

I want to scrape pictures from a public Instagram account. I'm pretty familiar with bs4, so I started with that. Using the element inspector in Chrome, I noticed that the pictures are in an unordered list and each li has the class 'photo', so I figured, what the hell, it can't be that hard to scrape with findAll, right?

Wrong: it doesn't return anything (code below), and I soon noticed that the code shown in the element inspector and the code I pulled with requests were not the same, i.e. there is no unordered list in the code I pulled with requests.

Any idea how I can get the code that shows up in the element inspector?

Just for the record, this was my starting code, which didn't work because the unordered list was not there:

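from bs4 import BeautifulSoup
import requests

# Fetch the raw HTML exactly as the server sends it (no JavaScript is run)
r = requests.get('http://instagram.com/umnpics/')
soup = BeautifulSoup(r.text, 'html.parser')

# Prints nothing: the <li class="photo"> items are not in the raw HTML
for x in soup.find_all('li', {'class': 'photo'}):
    print(x)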

Thanks for your help.

Recommended answer

If you look at the source code for the page, you'll see that some JavaScript generates the webpage. What you see in the element inspector is the page after the scripts have run, whereas requests only fetches the HTML file as served, and BeautifulSoup only parses what it is given. To parse the rendered page, you'll need something like Selenium to render it for you.
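To make the difference concrete, here is a minimal sketch that needs no network access or third-party packages. Both HTML strings are made-up stand-ins (they are not Instagram's actual markup), `renderPhotos` is a hypothetical script function, and the tag counting uses the standard library's `html.parser` so the snippet runs anywhere. The same check with bs4 would be `soup.find_all('li', {'class': 'photo'})`.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for what requests receives: the photo list does not
# exist yet, only a script that would build it inside a browser.
RAW_HTML = """
<html><body>
<div id="root"></div>
<script>
  // a browser would run this and create the photo list
  renderPhotos(document.getElementById("root"));
</script>
</body></html>
"""

# Hypothetical stand-in for the page after the script has run in a browser.
RENDERED_HTML = """
<html><body>
<div id="root">
<ul>
<li class="photo"><img src="a.jpg"></li>
<li class="photo"><img src="b.jpg"></li>
</ul>
</div>
</body></html>
"""


class PhotoCounter(HTMLParser):
    """Counts <li> start tags whose class attribute includes 'photo'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "photo" in classes:
            self.count += 1


def count_photo_items(html):
    """Return how many li.photo elements appear in the given HTML text."""
    parser = PhotoCounter()
    parser.feed(html)
    parser.close()
    return parser.count


print(count_photo_items(RAW_HTML))       # 0: nothing to find before scripts run
print(count_photo_items(RENDERED_HTML))  # 2: the list exists once rendered
```

An HTML parser never executes the script, so the only way to find the list items is to hand it markup that a browser has already rendered.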

For example, this is how it would look with Selenium:

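from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://instagram.com/umnpics/'
driver = webdriver.Firefox()  # opens a real browser, which runs the page's JavaScript
driver.get(url)

# page_source is the HTML of the page as the browser currently has it
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()  # close the browser once the HTML has been captured

for x in soup.find_all('li', {'class': 'photo'}):
    print(x)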

Now the soup should be what you are expecting.
