Scraping Bandcamp fan collections via POST


Problem description


I've been trying to scrape Bandcamp fan pages to get a list of the albums they have purchased, but I'm having trouble doing it efficiently. I wrote something with Selenium, but it's fairly slow, so I'd like to learn a solution that sends a POST request to the site and parses the JSON from there.

Here's a sample collection page: https://bandcamp.com/nhoward

Here's the Selenium code:

import time

from bs4 import BeautifulSoup, SoupStrainer

def scrapeFanCollection(url):
    browser = getBrowser()
    setattr(threadLocal, 'browser', browser)
    # Go to the url
    browser.get(url)

    try:
        # Click the "show more" button
        browser.find_element_by_class_name('show-more').click()

        # Wait two seconds for the items to load
        time.sleep(2)
        # Scroll to the bottom, loading the full collection
        scroll(browser, 2)
    except Exception:
        pass

    # Parse only the album links out of the full page source
    soup_a = BeautifulSoup(browser.page_source, 'lxml', parse_only=SoupStrainer('a', {"class": "item-link"}))

    urls = []

    # Loop through all the matching a elements in the page source
    for item in soup_a.find_all('a', {"class": "item-link"}):
        url = item.get('href')
        if url is not None:
            urls.append(url)

    return urls
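
The code above relies on getBrowser, threadLocal, and scroll helpers that aren't shown in the question. A minimal sketch of what they might look like, assuming a headless Chrome driver; the bodies below are reconstructions from how the helpers are used above (including the guess that scroll's second argument is a pause in seconds), not the asker's actual code:

import threading
import time

from selenium import webdriver

threadLocal = threading.local()

def getBrowser():
    # Hypothetical helper: reuse one browser per thread if one already exists
    browser = getattr(threadLocal, 'browser', None)
    if browser is None:
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        browser = webdriver.Chrome(options=options)
    return browser

def scroll(browser, seconds):
    # Hypothetical helper: scroll to the bottom repeatedly, pausing between
    # scrolls so lazily loaded items have time to render
    last_height = browser.execute_script("return document.body.scrollHeight")
    while True:
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(seconds)
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height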

Solution

The API can be accessed as follows:

$ curl -X POST -H "Content-Type: Application/JSON" -d \
'{"fan_id":82985,"older_than_token":"1586531374:1498564527:a::","count":10000}' \
https://bandcamp.com/api/fancollection/1/collection_items

I didn't encounter a scenario where an "older_than_token" was stale, so the problem boils down to getting the "fan_id" for a given URL.

This information is located in a blob in the id="pagedata" element.

>>> import json
>>> import requests
>>> from bs4 import BeautifulSoup
>>> res = requests.get("https://www.bandcamp.com/ggorlen")
>>> soup = BeautifulSoup(res.text, "lxml")
>>> user = json.loads(soup.find(id="pagedata")["data-blob"])
>>> user["fan_data"]["fan_id"]
82985

Putting it all together (building upon this answer):

import json
import requests
from bs4 import BeautifulSoup

fan_page_url = "https://www.bandcamp.com/ggorlen"
collection_items_url = "https://bandcamp.com/api/fancollection/1/collection_items"
res = requests.get(fan_page_url)
soup = BeautifulSoup(res.text, "lxml")
user = json.loads(soup.find(id="pagedata")["data-blob"])

data = {
    "fan_id": user["fan_data"]["fan_id"],
    "older_than_token": user["wishlist_data"]["last_token"],
    "count": 10000,
}
res = requests.post(collection_items_url, json=data)
collection = res.json()

for item in collection["items"][:10]:
    print(item["album_title"], item["item_url"])

I'm using user["wishlist_data"]["last_token"], which has the same format as the "older_than_token", just in case this matters.
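
A count of 10000 fetches the whole collection in one request for most fans. If you'd rather page through it in smaller batches, a loop like the sketch below would work, assuming the response exposes "more_available" and "last_token" keys; those field names are an assumption, not something confirmed by the answer above. It reuses requests and collection_items_url from the script:

def fetch_collection(fan_id, first_token, batch_size=100):
    # Page through the collection instead of requesting it all at once.
    # Assumes the response carries "more_available" and "last_token" keys.
    items = []
    token = first_token
    while True:
        res = requests.post(collection_items_url, json={
            "fan_id": fan_id,
            "older_than_token": token,
            "count": batch_size,
        })
        res.raise_for_status()
        page = res.json()
        items.extend(page["items"])
        if not page.get("more_available"):
            break
        token = page["last_token"]
    return items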
