How to web scrape followers from Instagram web browser?


Question

Can anyone tell me how to access the underlying URL to view a given user's Instagram followers? I am able to do this with Instagram API, but given the pending changes to the approval process, I have decided to switch to scraping.

The Instagram web browser allows you to view the follower list for any given public user - for example, to view Instagram's followers, visit "https://www.instagram.com/instagram", and then click on the followers URL to open a window that paginates through viewers (note: you must be logged in to your account to view this).

I note that the URL changes to "https://www.instagram.com/instagram/followers" when this window pops up, but I can't seem to view the underlying page source for this URL.

Since it appears on my browser window, I assume that I will be able to scrape. But do I have to use a package like Selenium? Does anyone know what the underlying URL is, so I don't have to use Selenium?

As an example, I am able to directly access the underlying feed data by visiting "instagram.com/instagram/media/", from which I can scrape and paginate through all iterations. I would like to do something similar with the list of followers, and access this data directly (rather than using Selenium).
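As a rough illustration of that pagination pattern: the media endpoint described above returns a JSON feed that can be paged through by passing the id of the last item back as a query parameter. This is a minimal sketch of the URL construction only; the endpoint and the `max_id` parameter name are as described at the time of writing and may have changed or been removed since.

```python
def media_url(user, max_id=None):
    # Base media-feed URL as described in the question; this endpoint
    # may no longer behave this way (or exist at all).
    url = "https://www.instagram.com/{0}/media/".format(user)
    if max_id is not None:
        # Hypothetical pagination parameter: request items older than
        # the given id, then repeat with the new last id to paginate.
        url += "?max_id={0}".format(max_id)
    return url
```

Each request would then yield a new "last id" to feed into the next `media_url` call until the feed is exhausted.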

Answer

EDIT: Dec 2018 Update:

Things have changed in Insta land since this was posted. Here is an updated script that is a bit more pythonic and better utilizes XPATH/CSS paths.

Note that to use this updated script, you must install the explicit package (pip install explicit), or convert each line with waiter to a pure selenium explicit wait.

import itertools

from explicit import waiter, XPATH
from selenium import webdriver


def login(driver):
    username = ""  # <username here>
    password = ""  # <password here>

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")

    # Login
    waiter.find_write(driver, "//div/input[@name='username']", username, by=XPATH)
    waiter.find_write(driver, "//div/input[@name='password']", password, by=XPATH)
    waiter.find_element(driver, "//div/button[@type='submit']", by=XPATH).click()

    # Wait for the user dashboard page to load
    waiter.find_element(driver, "//a/span[@aria-label='Find People']", by=XPATH)


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    # driver.find_element_by_partial_link_text("follower").click()
    waiter.find_element(driver, "//a[@href='/{}/followers/']".format(account), by=XPATH).click()

    # Wait for the followers modal to load
    waiter.find_element(driver, "//div[@role='dialog']", by=XPATH)

    # At this point a Followers modal pops open. If you immediately scroll to the bottom,
    # you hit a stopping point and a "See All Suggestions" link. If you fiddle with the
    # modal by scrolling up and down, you can force it to load additional followers for
    # that person.

    # Now the modal will begin loading followers every time you scroll to the bottom.
    # Keep scrolling in a loop until you've hit the desired number of followers.
    # In this instance, I'm using a generator to return followers one-by-one
    follower_css = "ul div li:nth-child({}) a.notranslate"  # Taking advantage of CSS's nth-child functionality
    for group in itertools.count(start=1, step=12):
        for follower_index in range(group, group + 12):
            yield waiter.find_element(driver, follower_css.format(follower_index)).text

        # Instagram loads followers 12 at a time. Find the last follower element
        # and scroll it into view, forcing instagram to load another 12
        # Even though we just found this elem in the previous for loop, there can
        # potentially be a large amount of time between that call and this one,
        # and the element might have gone stale. Let's just re-acquire it to avoid
        # that
        last_follower = waiter.find_element(driver, follower_css.format(follower_index))
        driver.execute_script("arguments[0].scrollIntoView();", last_follower)


if __name__ == "__main__":
    account = 'instagram'
    driver = webdriver.Chrome()
    try:
        login(driver)
        # Print the first 75 followers for the "instagram" account
        print('Followers of the "{}" account'.format(account))
        for count, follower in enumerate(scrape_followers(driver, account=account), 1):
            print("\t{:>3}: {}".format(count, follower))
            if count >= 75:
                break
    finally:
        driver.quit()
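
The index arithmetic in scrape_followers can be sanity-checked on its own: itertools.count(start=1, step=12) yields the first index of each 12-follower batch, and the inner range fills in the rest (nth-child is 1-based). A standalone check of the selectors that loop generates:

```python
import itertools

FOLLOWER_CSS = "ul div li:nth-child({}) a.notranslate"


def selector_batches(n_batches, batch_size=12):
    # Reproduce the selector sequence from scrape_followers:
    # batches start at nth-child indices 1, 13, 25, ...
    starts = itertools.islice(
        itertools.count(start=1, step=batch_size), n_batches)
    return [[FOLLOWER_CSS.format(i) for i in range(start, start + batch_size)]
            for start in starts]
```

The first batch covers nth-child(1) through nth-child(12), the second picks up at nth-child(13), and so on, matching how Instagram loads followers twelve at a time.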

I did a quick benchmark to show how performance decreases exponentially the more followers you attempt to scrape this way:

$ python example.py
Followers of the "instagram" account
Found    100 followers in 11 seconds
Found    200 followers in 19 seconds
Found    300 followers in 29 seconds
Found    400 followers in 47 seconds
Found    500 followers in 71 seconds
Found    600 followers in 106 seconds
Found    700 followers in 157 seconds
Found    800 followers in 213 seconds
Found    900 followers in 284 seconds
Found   1000 followers in 375 seconds
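
The timings above came from a simple wall-clock loop around the generator; a sketch of that kind of harness, usable with any follower generator (checkpoint and limit values here are illustrative):

```python
import time


def benchmark(follower_gen, checkpoint=100, limit=1000):
    # Record elapsed wall-clock time each time another `checkpoint`
    # followers have been pulled from the generator, up to `limit`.
    start = time.time()
    timings = []
    for count, _ in enumerate(follower_gen, 1):
        if count % checkpoint == 0:
            timings.append((count, time.time() - start))
        if count >= limit:
            break
    return timings
```

Plugging scrape_followers(driver, account) in as follower_gen reproduces the kind of numbers shown above.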

Original post: Your question is a little confusing. For instance, I'm not really sure what "from which I can scrape and paginate through all iterations" actually means. What are you currently using to scrape and paginate?

Regardless, instagram.com/instagram/media/ is not the same type of endpoint as instagram.com/instagram/followers. The media endpoint appears to be a REST API, configured to return an easily parseable JSON object.

The followers endpoint isn't really a RESTful endpoint from what I can tell. Rather, Instagram AJAXs in the information to the page source (using React?) after you click the Followers button. I don't think you will be able to get that information without using something like Selenium, which can load/render the javascript that displays the followers to the user.

This example code will work:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def login(driver):
    username = ""  # <username here>
    password = ""  # <password here>

    # Load page
    driver.get("https://www.instagram.com/accounts/login/")

    # Login
    driver.find_element_by_xpath("//div/input[@name='username']").send_keys(username)
    driver.find_element_by_xpath("//div/input[@name='password']").send_keys(password)
    driver.find_element_by_xpath("//span/button").click()

    # Wait for the login page to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.LINK_TEXT, "See All")))


def scrape_followers(driver, account):
    # Load account page
    driver.get("https://www.instagram.com/{0}/".format(account))

    # Click the 'Follower(s)' link
    driver.find_element_by_partial_link_text("follower").click()

    # Wait for the followers modal to load
    xpath = "//div[@style='position: relative; z-index: 1;']/div/div[2]/div/div[1]"
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xpath)))

    # You'll need to figure out some scrolling magic here. Something that can
    # scroll to the bottom of the followers modal, and know when it's reached
    # the bottom. This is pretty impractical for people with a lot of followers

    # Finally, scrape the followers
    xpath = "//div[@style='position: relative; z-index: 1;']//ul/li/div/div/div/div/a"
    followers_elems = driver.find_elements_by_xpath(xpath)

    return [e.text for e in followers_elems]


if __name__ == "__main__":
    driver = webdriver.Chrome()
    try:
        login(driver)
        followers = scrape_followers(driver, "instagram")
        print(followers)
    finally:
        driver.quit()
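
One way to attempt the "scrolling magic" mentioned in the comment above is to repeatedly scroll the last loaded follower into view and stop once the list stops growing. This is a sketch only, assuming the modal markup still matches the XPath used in scrape_followers; the pause length and round limit are arbitrary choices.

```python
import time


def scroll_modal(driver, list_xpath, pause=1.0, max_rounds=50):
    # Keep scrolling the last <li> in the followers modal into view;
    # stop when a scroll no longer loads any new entries.
    prev_count = -1
    for _ in range(max_rounds):
        items = driver.find_elements_by_xpath(list_xpath + "//ul/li")
        if not items or len(items) == prev_count:
            break
        prev_count = len(items)
        driver.execute_script("arguments[0].scrollIntoView();", items[-1])
        time.sleep(pause)
    return prev_count
```

Called between the wait and the final scrape in scrape_followers, this would force more followers to load before the list items are read, though as noted it remains impractical for accounts with very large follower counts.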

This approach is problematic for a number of reasons, chief among them being how slow it is relative to the API.
