Scraping text from Kickstarter projects returns nothing
Question
I am trying to scrape the main text of a project from the Kickstarter project webpage. I have the following code which works for the first URL but does not work for the second and third URL. I was wondering if there is an easy fix to my code without the need to use other packages?
```python
import requests
from bs4 import BeautifulSoup

url = "https://www.kickstarter.com/projects/1365297844/kuhkubus-3d-escher-figures?ref=discovery_staff_picks_category_newest"
#url = "https://www.kickstarter.com/projects/clarissaredwine/swingby-a-voyager-gravity-puzzle?ref=discovery_staff_picks_category_newest"
#url = "https://www.kickstarter.com/projects/100389301/us-army-navy-marines-air-force-special-challenge-c?ref=category"

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
body_text = soup.find(class_='rte__content')
all_text = body_text.find_all('p')
for i in all_text:
    print(i.get_text())
```
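For reference, the likely failure mode here is that `soup.find` returns `None` when the page's HTML does not contain an element with class `rte__content` (on some projects the story is rendered client-side), so the next line raises `AttributeError`. A minimal offline sketch of the problem and a guard, using made-up HTML snippets for illustration:

```python
from bs4 import BeautifulSoup

# Simulated responses: one page contains the story markup, one does not.
page_with_story = '<div class="rte__content"><p>Hello backers</p></div>'
page_without_story = '<div id="react-project-page"></div>'

for html in (page_with_story, page_without_story):
    soup = BeautifulSoup(html, 'html.parser')
    body_text = soup.find(class_='rte__content')
    if body_text is None:
        # soup.find() returned None: calling .find_all() on it would raise AttributeError
        print("no 'rte__content' element in this page")
    else:
        for p in body_text.find_all('p'):
            print(p.get_text())
```

A guard like this avoids the crash, but it does not recover the missing text; the answer below fetches it from the API instead.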
Answer
There is a GraphQL API on the site at:

POST https://www.kickstarter.com/graph

We can use it to get the site data for any URL (any project) instead of scraping the HTML. There are two fields we will extract: `story` and `risks`.
This GraphQL API needs a CSRF token, which is embedded in a `meta` tag on the page (any page will do). We also need to store the cookies using a requests session, otherwise the call will fail.
Here it is using Python:
```python
import requests
from bs4 import BeautifulSoup

s = requests.Session()
r = s.get("https://www.kickstarter.com")
soup = BeautifulSoup(r.text, 'html.parser')
xcsrf = soup.find("meta", {"name": "csrf-token"})["content"]

query = """
query GetEndedToLive($slug: String!) {
  project(slug: $slug) {
    id
    deadlineAt
    showCtaToLiveProjects
    state
    description
    url
    __typename
  }
}"""

r = s.post("https://www.kickstarter.com/graph",
    headers={
        "x-csrf-token": xcsrf
    },
    json={
        "query": query,
        "variables": {
            "slug": "kuhkubus-3d-escher-figures"
        }
    })

print(r.json())
```
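A successful reply follows the standard GraphQL envelope, so the requested fields land under `data.project`. A sketch of reading values out of such a reply (the response dict below is abridged and its values are made up for illustration):

```python
# Shape of a typical GraphQL success response (abridged, illustrative values)
response_json = {
    "data": {
        "project": {
            "state": "SUCCESSFUL",
            "description": "3D printed Escher figures",
        }
    }
}

project = response_json["data"]["project"]
print(project["state"], "-", project["description"])
```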
Your second link shows the interesting fields in the query. The complete query is the following:
```graphql
query Campaign($slug: String!) {
  project(slug: $slug) {
    id
    isSharingProjectBudget
    risks
    story(assetWidth: 680)
    currency
    spreadsheet {
      displayMode
      public
      url
      data {
        name
        value
        phase
        rowNum
        __typename
      }
      dataLastUpdatedAt
      __typename
    }
    environmentalCommitments {
      id
      commitmentCategory
      description
      __typename
    }
    __typename
  }
}
```
We are only interested in `story` and `risks`, so we will have:
```graphql
query Campaign($slug: String!) {
  project(slug: $slug) {
    risks
    story(assetWidth: 680)
  }
}
```
Note that we need the project slug, which is part of the URL; for instance, `clarissaredwine/swingby-a-voyager-gravity-puzzle` is the slug for your second URL.
Here is a sample implementation that extracts the slugs, loops through them, and calls the GraphQL endpoint for each one, printing the story and the risks for each project:
```python
import requests
from bs4 import BeautifulSoup
import re

urls = [
    "https://www.kickstarter.com/projects/1365297844/kuhkubus-3d-escher-figures?ref=discovery_staff_picks_category_newest",
    "https://www.kickstarter.com/projects/clarissaredwine/swingby-a-voyager-gravity-puzzle?ref=discovery_staff_picks_category_newest",
    "https://www.kickstarter.com/projects/100389301/us-army-navy-marines-air-force-special-challenge-c?ref=category"
]

# extract the slugs from the urls
slugs = []
for url in urls:
    slugs.append(re.search(r'/projects/(.*)\?', url).group(1))

s = requests.Session()
r = s.get("https://www.kickstarter.com")
soup = BeautifulSoup(r.text, 'html.parser')
xcsrf = soup.find("meta", {"name": "csrf-token"})["content"]

query = """
query Campaign($slug: String!) {
  project(slug: $slug) {
    risks
    story(assetWidth: 680)
  }
}"""

for slug in slugs:
    print(f"--------{slug}------")
    r = s.post("https://www.kickstarter.com/graph",
        headers={
            "x-csrf-token": xcsrf
        },
        json={
            "operationName": "Campaign",
            "variables": {
                "slug": slug
            },
            "query": query
        })
    result = r.json()
    print("-------STORY--------")
    story_html = result["data"]["project"]["story"]
    soup = BeautifulSoup(story_html, 'html.parser')
    for i in soup.find_all('p'):
        print(i.get_text())
    print("-------RISKS--------")
    print(result["data"]["project"]["risks"])
```
If you are scraping other content on this site, you can probably use the GraphQL endpoint for many other things. Note, however, that introspection has been disabled on this API, so you can only look for existing schema usage on the site (you can't retrieve the whole schema).
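As a quick way to check for yourself whether introspection is available, you can send a standard GraphQL introspection query and look for an `errors` key in the reply. The snippet below only shows how such a probe is assembled and how the reply would be interpreted; the sample response is made up, since the real endpoint rejects introspection:

```python
# Standard GraphQL introspection query (defined by the GraphQL spec,
# not specific to Kickstarter).
introspection_query = "{ __schema { types { name } } }"
payload = {"query": introspection_query}

# A server with introspection disabled typically answers with an "errors"
# array instead of "data". Simulated reply, for illustration only:
response_json = {"errors": [{"message": "Introspection is disabled"}]}

if "errors" in response_json:
    print("introspection disabled:", response_json["errors"][0]["message"])
else:
    print("schema types:", [t["name"] for t in response_json["data"]["__schema"]["types"]])
```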