Scraping Bandcamp fan collections via POST
Question
I've been trying to scrape Bandcamp fan pages to get a list of the albums they have purchased and I'm having trouble efficiently doing it. I wrote something with Selenium but it's mildly slow so I'd like to learn a solution that'd maybe send a POST request to the site and parse the JSON from there.
Here's a sample collection page: https://bandcamp.com/nhoward
Here's the Selenium code:
import time

from bs4 import BeautifulSoup, SoupStrainer

def scrapeFanCollection(url):
    # getBrowser(), threadLocal, and scroll() are helpers defined elsewhere
    browser = getBrowser()
    setattr(threadLocal, 'browser', browser)
    # Go to url
    browser.get(url)
    try:
        # Click the "show more" button
        browser.find_element_by_class_name('show-more').click()
        # Wait two seconds
        time.sleep(2)
        # Scroll to the bottom, loading the full collection
        scroll(browser, 2)
    except Exception:
        pass
    # Parse only the album links out of the page source
    soup_a = BeautifulSoup(browser.page_source, 'lxml',
                           parse_only=SoupStrainer('a', {"class": "item-link"}))
    urls = []
    # Loop through all the matching a elements
    for item in soup_a.find_all('a', {"class": "item-link"}):
        url = item.get('href')
        if url is not None:
            urls.append(url)
    return urls
The API can be accessed as follows:
$ curl -X POST -H "Content-Type: application/json" -d \
'{"fan_id":82985,"older_than_token":"1586531374:1498564527:a::","count":10000}' \
https://bandcamp.com/api/fancollection/1/collection_items
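For reference, here's the same call built with requests but not sent, so the outgoing method, headers, and JSON body can be inspected offline (the fan_id and token values are copied from the curl example above):

```python
import json

import requests

# Build the same POST request that the curl command sends, without
# sending it, so the outgoing request can be inspected.
data = {
    "fan_id": 82985,
    "older_than_token": "1586531374:1498564527:a::",
    "count": 10000,
}
req = requests.Request(
    "POST",
    "https://bandcamp.com/api/fancollection/1/collection_items",
    json=data,
)
prepared = req.prepare()
print(prepared.method)                      # POST
print(prepared.headers["Content-Type"])     # application/json
print(json.loads(prepared.body)["fan_id"])  # 82985
```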
I didn't encounter a scenario where an "older_than_token" was stale, so the problem boils down to getting the "fan_id" given a URL. This information is located in a blob in the id="pagedata" element.
>>> import json
>>> import requests
>>> from bs4 import BeautifulSoup
>>> res = requests.get("https://www.bandcamp.com/ggorlen")
>>> soup = BeautifulSoup(res.text, "lxml")
>>> user = json.loads(soup.find(id="pagedata")["data-blob"])
>>> user["fan_data"]["fan_id"]
82985
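The same extraction can be exercised without a network round trip by parsing a minimal stand-in snippet; the blob below is illustrative, not real page data:

```python
import json

from bs4 import BeautifulSoup

# Minimal stand-in for a fan page: one element with id="pagedata" whose
# data-blob attribute holds a JSON payload, mirroring the real page's shape.
html = '<div id="pagedata" data-blob=\'{"fan_data": {"fan_id": 82985}}\'></div>'
soup = BeautifulSoup(html, "html.parser")
user = json.loads(soup.find(id="pagedata")["data-blob"])
print(user["fan_data"]["fan_id"])  # 82985
```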
Putting it all together (building upon this answer):
import json
import requests
from bs4 import BeautifulSoup
fan_page_url = "https://www.bandcamp.com/ggorlen"
collection_items_url = "https://bandcamp.com/api/fancollection/1/collection_items"
res = requests.get(fan_page_url)
soup = BeautifulSoup(res.text, "lxml")
user = json.loads(soup.find(id="pagedata")["data-blob"])
data = {
"fan_id": user["fan_data"]["fan_id"],
"older_than_token": user["wishlist_data"]["last_token"],
"count": 10000,
}
res = requests.post(collection_items_url, json=data)
collection = res.json()
for item in collection["items"][:10]:
print(item["album_title"], item["item_url"])
I'm using user["wishlist_data"]["last_token"], which has the same format as the "older_than_token", just in case this matters.
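If a collection is ever larger than one response can hold, paginating is a natural extension. This is a hedged sketch, not part of the original answer: it assumes the response also carries "more_available" and "last_token" fields (observed behavior rather than a documented contract), and it takes the post function as a parameter so the loop can be exercised without network access (pass requests.post for real use):

```python
COLLECTION_ITEMS_URL = "https://bandcamp.com/api/fancollection/1/collection_items"

def fetch_all_items(fan_id, first_token, post, count=100):
    # Page through a fan's collection. Assumes each response contains
    # "items", "more_available", and "last_token" keys (an assumption based
    # on observed responses, not documented API behavior).
    items = []
    token = first_token
    while True:
        res = post(COLLECTION_ITEMS_URL, json={
            "fan_id": fan_id,
            "older_than_token": token,
            "count": count,
        })
        page = res.json()
        items.extend(page["items"])
        if not page.get("more_available"):
            break
        token = page["last_token"]  # resume from where this page ended
    return items
```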