Scraping text from Kickstarter projects returns nothing


Problem description

I am trying to scrape the main text of a project from the Kickstarter project webpage. I have the following code, which works for the first URL but not for the second and third URLs. I was wondering if there is an easy fix to my code without the need to use other packages?

url = "https://www.kickstarter.com/projects/1365297844/kuhkubus-3d-escher-figures?ref=discovery_staff_picks_category_newest"
#url = "https://www.kickstarter.com/projects/clarissaredwine/swingby-a-voyager-gravity-puzzle?ref=discovery_staff_picks_category_newest"
#url = "https://www.kickstarter.com/projects/100389301/us-army-navy-marines-air-force-special-challenge-c?ref=category"

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
body_text = soup.find(class_='rte__content')
all_text = body_text.find_all('p')
for i in all_text:
    print(i.get_text())
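
A quick way to see where this breaks (a diagnostic sketch; it assumes, consistent with the accepted answer below, that the story markup for those projects is not present in the statically served HTML):

import requests
from bs4 import BeautifulSoup

url = "https://www.kickstarter.com/projects/clarissaredwine/swingby-a-voyager-gravity-puzzle?ref=discovery_staff_picks_category_newest"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# when the story is injected client-side, no element in the fetched HTML
# carries the rte__content class, so find() returns None and the
# subsequent find_all('p') has nothing to work with
print(soup.find(class_='rte__content'))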

Answer

The site has a GraphQL API at:

POST https://www.kickstarter.com/graph

We can use it to get the site data for any URL (any project) instead of scraping the HTML. There are two fields, story and risks, that we will extract.

This GraphQL API needs a CSRF token, which is embedded in a meta tag on the page (any page will do). We also need to store the cookies using a requests session, otherwise the call will fail.

Here is an example of how to use it:

import requests
from bs4 import BeautifulSoup

s = requests.Session()
r = s.get("https://www.kickstarter.com")
soup = BeautifulSoup(r.text, 'html.parser')
xcsrf = soup.find("meta", {"name": "csrf-token"})["content"]

query = """
query GetEndedToLive($slug: String!) {
  project(slug: $slug) {
      id
      deadlineAt
      showCtaToLiveProjects
      state
      description
      url
      __typename
  }
}"""

r = s.post("https://www.kickstarter.com/graph",
    headers= {
        "x-csrf-token": xcsrf
    },
    json = {
        "query": query,
        "variables": {
            "slug":"kuhkubus-3d-escher-figures"
        }
    })

print(r.json())
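
The response follows the usual GraphQL envelope, {"data": {"project": {...}}} (the final script below reads it the same way), so the fields requested above can be pulled out directly:

result = r.json()
# each field requested in the query sits under data -> project
print(result["data"]["project"]["description"])
print(result["data"]["project"]["state"])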

Your second link shows the interesting fields in the query. The complete query is the following:

query Campaign($slug: String!) {
  project(slug: $slug) {
    id
    isSharingProjectBudget
    risks
    story(assetWidth: 680)
    currency
    spreadsheet {
      displayMode
      public
      url
      data {
        name
        value
        phase
        rowNum
        __typename
      }
      dataLastUpdatedAt
      __typename
    }
    environmentalCommitments {
      id
      commitmentCategory
      description
      __typename
    }
    __typename
  }
}

We are only interested in story and risks, so we will have:

query Campaign($slug: String!) {
  project(slug: $slug) {
    risks
    story(assetWidth: 680)
  }
}

Note that we need the project slug, which is part of the URL; for instance, clarissaredwine/swingby-a-voyager-gravity-puzzle is the slug for your 2nd URL.

Here is a sample implementation that extracts the slugs, loops through them, and calls the GraphQL endpoint for each one, printing the story and the risks for each:

import requests
from bs4 import BeautifulSoup
import re

urls = [ 
    "https://www.kickstarter.com/projects/1365297844/kuhkubus-3d-escher-figures?ref=discovery_staff_picks_category_newest",
    "https://www.kickstarter.com/projects/clarissaredwine/swingby-a-voyager-gravity-puzzle?ref=discovery_staff_picks_category_newest",
    "https://www.kickstarter.com/projects/100389301/us-army-navy-marines-air-force-special-challenge-c?ref=category"
]
slugs = []

# extract the slugs from the urls (raw string avoids the invalid \? escape warning)
for url in urls:
    slugs.append(re.search(r'/projects/(.*)\?', url).group(1))

s = requests.Session()
r = s.get("https://www.kickstarter.com")
soup = BeautifulSoup(r.text, 'html.parser')
xcsrf = soup.find("meta", {"name": "csrf-token"})["content"]

query = """
query Campaign($slug: String!) {
  project(slug: $slug) {
    risks
    story(assetWidth: 680)
  }
}"""

for slug in slugs:
    print(f"--------{slug}------")
    r = s.post("https://www.kickstarter.com/graph",
        headers= {
            "x-csrf-token": xcsrf
        },
        json = {
            "operationName":"Campaign",
            "variables":{
                "slug": slug
            },
            "query": query
        })

    result = r.json()

    print("-------STORY--------")
    story_html = result["data"]["project"]["story"]
    soup = BeautifulSoup(story_html, 'html.parser')
    for i in soup.find_all('p'):
        print(i.get_text())

    print("-------RISKS--------")
    print(result["data"]["project"]["risks"])

I guess you can use the GraphQL endpoint for many other things if you are scraping other content on this site. However, note that introspection has been disabled on this API, so you can only look for existing schema usage on the site (you can't get the whole schema).
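
If you want to check this yourself, the standard __schema introspection query should be rejected (a sketch under that assumption; the exact shape of the refusal is not documented in the original answer):

# reusing the session `s` and token `xcsrf` from the script above
q = "{ __schema { types { name } } }"
r = s.post("https://www.kickstarter.com/graph",
    headers={"x-csrf-token": xcsrf},
    json={"query": q})
# with introspection disabled, expect an "errors" payload here
# rather than the list of schema types
print(r.json())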
