Problem trying to scrape a JS web page with requests-html (Python 3.6)


Question

I've spent the last week trying to scrape information from the Epic Games Store web page (https://www.epicgames.com/store/en-US/). I first tried the Requests module, but I soon realized I needed a module that supports JavaScript-rendered pages. That's what I'm trying now, but there is a problem: when I use "inspect element" on the page, everything looks fine, but when I execute this:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.epicgames.com/store/en-US/")
r.html.render()

print(r.html.html)

The result is an unreadable HTML file in which most of the elements have not been loaded. Result: https://pastebin.com/zQ9m1gr2

You can test this: pick a game from the site and Ctrl+F its name in the result file. You will find no matches. What can I do?

Thank you in advance! :)

Exactly the same thing happens when I download the HTML from the browser manually.

Recommended answer

The main page not containing the data you are looking for means that the store data is received after the initial page load. So we can use requests to get the data by simulating what the browser did.

If you look at the Network tab in the developer tools, you will see that when the page loads, it receives the store data from a GraphQL endpoint. That means that if you simulate the request, you can get the store data:

import requests

endpoint = "https://graphql.epicgames.com/graphql"


# This query is what the browser sent to the server
# when loading the page. I couldn't figure out how
# to write it by hand, so I basically copy-pasted
# the binary data from the request payload.
query = b'{"query":"\\n            query storefrontDiscoverQuery(\\n              $locale:String,\\n              $country:String!\\n            )  {\\n              Storefront {\\n                storefrontModules(locale: $locale) {\\n                  ... on StorefrontBreaker {\\n                    type\\n                    title\\n                    titleGroup\\n                    description\\n                    backgroundColors\\n                    layout\\n                    link {\\n                      src\\n                      linkText\\n                    }\\n                    image {\\n                      src\\n                      alt\\n                    }\\n                  }\\n                  ... on StorefrontFreeGames {\\n                    type\\n                    title\\n                  }\\n                  ... on StorefrontCardGroup {\\n                    type\\n                    title\\n                    link {\\n                        src\\n                        linkText\\n                    }\\n                    offers {\\n                      namespace\\n                      id\\n                      offer {\\n                        \\n          title\\n          id\\n          namespace\\n          description\\n          keyImages {\\n            type\\n            url\\n          }\\n          seller {\\n              id\\n              name\\n          }\\n          urlSlug\\n          items {\\n            id\\n            namespace\\n          }\\n          customAttributes {\\n            key\\n            value\\n          }\\n          categories {\\n            path\\n          }\\n          price(country: $country) {\\n            totalPrice {\\n              discountPrice\\n              originalPrice\\n              voucherDiscount\\n              discount\\n              fmtPrice(locale: $locale) {\\n                originalPrice\\n                discountPrice\\n                intermediatePrice\\n              }\\n            }\\n            lineOffers {\\n              appliedRules {\\n                id\\n                endDate\\n              }\\n            }\\n          }\\n          linkedOfferId\\n          linkedOffer {\\n            effectiveDate\\n            customAttributes {\\n              key\\n              value\\n            }\\n          }\\n        \\n                      }\\n                    }\\n                  }\\n                  ... on StorefrontFeaturedCarousel {\\n                    type\\n                    title\\n                    slides {\\n                      title\\n                      eyebrow\\n                      description\\n                      backgroundColor\\n                      image {\\n                        src\\n                        alt\\n                      }\\n                      mobileImage {\\n                        src\\n                        alt\\n                      }\\n                      link {\\n                        src\\n                        linkText\\n                      }\\n                    }\\n                  }\\n                  ... on StorefrontTiles {\\n                    type\\n                    title\\n                    tiles {\\n                      label\\n                      genre\\n                      link {\\n                        src\\n                        linkText\\n                      }\\n                    }\\n                  }\\n                }\\n              }\\n            }\\n            ","variables":{"locale":"en-US","country":"US"}}'

data = requests.post(endpoint, headers={"Content-type": "application/json;charset=UTF-8"
                                       }, data=query)
print(data.json())

This gives us the store data. (Be careful, it's pretty big.)
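To pull the offer ids out of that response, something like the following might work. It is a minimal sketch that assumes the response mirrors the query above (`data.Storefront.storefrontModules[].offers[].offer`); the helper name `iter_offers`, the sample payload, its `type` values and the title "Example Game" are made up for illustration.

```python
def iter_offers(payload):
    """Yield (namespace, id, title) for every offer found in a storefront response."""
    storefront = (payload.get("data") or {}).get("Storefront") or {}
    for module in storefront.get("storefrontModules") or []:
        # Only card-group modules carry an "offers" list; other module types are skipped.
        for entry in (module or {}).get("offers") or []:
            offer = entry.get("offer") or {}
            yield entry.get("namespace"), entry.get("id"), offer.get("title")


# Hypothetical, heavily trimmed payload in the shape the query asks for.
sample = {"data": {"Storefront": {"storefrontModules": [
    {"type": "carousel", "title": "Featured", "slides": []},
    {"type": "group", "title": "New releases", "offers": [
        {"namespace": "cosmos",
         "id": "1c55202badfc4212b4f82553d5d22c3e",
         "offer": {"title": "Example Game"}},
    ]},
]}}}

offers = list(iter_offers(sample))
for namespace, offer_id, title in offers:
    print(namespace, offer_id, title)
```

With the real request, this would be called as `iter_offers(data.json())`.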

You can also get per-product information by using this:

import requests, json

endpoint = "https://graphql.epicgames.com/graphql"

query = {
    "query": "\n        query catalogQuery(\n            $productNamespace:String!,\n            $offerId:String!,\n            $locale:String,\n            $country:String!,\n            $lineOffers: [LineOfferReq]!) {\n                Catalog {\n                    catalogOffer(namespace: $productNamespace,\n                        id: $offerId,\n                        locale: $locale) {\n                            namespace\n                            effectiveDate\n                            id\n                            customAttributes {\n                                key\n                                value\n                            }\n                            items {\n                                id\n                                status\n                                customAttributes {\n                                    key\n                                    value\n                                }\n                            }\n                    }\n                }\n                PriceEngine {\n                    price(country: $country, lineOffers: $lineOffers) {\n                        totalPrice {\n                            discountPrice\n                            originalPrice\n                            voucherDiscount\n                            discount\n                            currencyCode\n                            currencyInfo {\n                                decimals\n                            }\n                            fmtPrice(locale: $locale) {\n                                originalPrice\n                                discountPrice\n                                intermediatePrice\n                            }\n                        }\n                        lineOffers {\n                            appliedRules {\n                                endDate\n                                discountSetting {\n                                    discountType\n                                }\n                            }\n                        }\n                    }\n                }\n            }\n        ",
    "variables": {
        "productNamespace": "cosmos",
        "offerId": "1c55202badfc4212b4f82553d5d22c3e", # This id is found in the first request we made:
        "locale": "en-US",                             # data.Storefront.storefrontModules[1].offers["0"].id, to be more precise.
        "country": "US",
        "lineOffers": [{
            "offerId": "1c55202badfc4212b4f82553d5d22c3e", # The same id goes here too.
            "quantity": 1
        }],
        "calculateTax": False
    }
}

data = requests.post(endpoint, headers={"Content-type": "application/json;charset=UTF-8"
                                       }, data=json.dumps(query)) # We added json.dumps because it basically turns dictionary
                                                                  # into JSON string.
print(data.json())

Which gives us:

{
  "data": {
    "Catalog": {
      "catalogOffer": {
        "namespace": "cosmos",
        "effectiveDate": "2019-07-12T00:00:00.000Z",
        "id": "1c55202badfc4212b4f82553d5d22c3e",
        "customAttributes": [
          {
            "key": "com.epicgames.app.blacklist",
            "value": "KR"
          },
          {
            "key": "isPrepurchase",
            "value": "true"
          },
          {
            "key": "availableDate",
            "value": "1573570800"
          },
          {
            "key": "developerName",
            "value": "Human Head Studios, Inc."
          }
        ],
        "items": [
          {
            "id": "70c30983cf0948e4bffc23505f232b11",
            "status": "ACTIVE",
            "customAttributes": [
              {
                "key": "SupportedPlatforms",
                "value": "Windows"
              }
            ]
          },
          {
            "id": "974e25b4bce6425d9af79cd5ffd64152",
            "status": "ACTIVE",
            "customAttributes": [
              {
                "key": "SupportedPlatforms",
                "value": "Windows"
              }
            ]
          },
          {
            "id": "159d92ebec254ecf8373709a99388a62",
            "status": "ACTIVE",
            "customAttributes": [
              {
                "key": "SupportedPlatforms",
                "value": "Windows"
              }
            ]
          },
          {
            "id": "cc67628ab455419cb3d4ecc907febbb7",
            "status": "ACTIVE",
            "customAttributes": [
              {
                "key": "SupportedPlatforms",
                "value": "Windows"
              }
            ]
          },
          {
            "id": "2f742aa604a441d1a145f70411e9d8d2",
            "status": "ACTIVE",
            "customAttributes": [
              {
                "key": "SupportedPlatforms",
                "value": "Windows"
              }
            ]
          }
        ]
      }
    },
    "PriceEngine": {
      "price": {
        "totalPrice": {
          "discountPrice": 2999,
          "originalPrice": 2999,
          "voucherDiscount": 0,
          "discount": 0,
          "currencyCode": "USD",
          "currencyInfo": {
            "decimals": 2
          },
          "fmtPrice": {
            "originalPrice": "$29.99",
            "discountPrice": "$29.99",
            "intermediatePrice": "$29.99"
          }
        },
        "lineOffers": [
          {
            "appliedRules": []
          }
        ]
      }
    }
  },
  "extensions": {
    "cacheControl": {
      "version": 1,
      "hints": [
        {
          "path": [
            "Catalog"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "Catalog",
            "catalogOffer"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "PriceEngine"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "PriceEngine",
            "price"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "Catalog",
            "catalogOffer",
            "customAttributes"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "Catalog",
            "catalogOffer",
            "items"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "Catalog",
            "catalogOffer",
            "items",
            0,
            "customAttributes"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "Catalog",
            "catalogOffer",
            "items",
            1,
            "customAttributes"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "Catalog",
            "catalogOffer",
            "items",
            2,
            "customAttributes"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "Catalog",
            "catalogOffer",
            "items",
            3,
            "customAttributes"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "Catalog",
            "catalogOffer",
            "items",
            4,
            "customAttributes"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "PriceEngine",
            "price",
            "totalPrice"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "PriceEngine",
            "price",
            "totalPrice",
            "currencyInfo"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "PriceEngine",
            "price",
            "totalPrice",
            "fmtPrice"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "PriceEngine",
            "price",
            "lineOffers"
          ],
          "maxAge": 0
        },
        {
          "path": [
            "PriceEngine",
            "price",
            "lineOffers",
            0,
            "appliedRules"
          ],
          "maxAge": 0
        }
      ]
    }
  }
}
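A response like this can be flattened into something easier to work with. The sketch below is only an illustration: `summarize_offer` is a made-up helper, the sample is a trimmed copy of the output above, and it assumes that prices are integers in minor units (2999 with `decimals` 2 matching "$29.99") and that `availableDate` is a Unix timestamp in seconds.

```python
from datetime import datetime, timezone
from decimal import Decimal


def summarize_offer(payload):
    """Flatten the catalog/price response into a small dict."""
    data = payload["data"]
    offer = data["Catalog"]["catalogOffer"]
    total = data["PriceEngine"]["price"]["totalPrice"]

    # customAttributes is a list of {"key": ..., "value": ...} pairs.
    attrs = {a["key"]: a["value"] for a in offer["customAttributes"]}

    # Assumption: prices are integers in minor units; shift by "decimals".
    decimals = total["currencyInfo"]["decimals"]
    price = Decimal(total["discountPrice"]).scaleb(-decimals)

    # Assumption: availableDate is a Unix timestamp in seconds.
    available = None
    if "availableDate" in attrs:
        available = datetime.fromtimestamp(int(attrs["availableDate"]), tz=timezone.utc)

    return {
        "id": offer["id"],
        "price": price,
        "currency": total["currencyCode"],
        "developer": attrs.get("developerName"),
        "available": available,
    }


# Trimmed copy of the response shown above.
sample = {"data": {
    "Catalog": {"catalogOffer": {
        "id": "1c55202badfc4212b4f82553d5d22c3e",
        "customAttributes": [
            {"key": "availableDate", "value": "1573570800"},
            {"key": "developerName", "value": "Human Head Studios, Inc."},
        ],
    }},
    "PriceEngine": {"price": {"totalPrice": {
        "discountPrice": 2999,
        "currencyCode": "USD",
        "currencyInfo": {"decimals": 2},
    }}},
}}

summary = summarize_offer(sample)
print(summary["price"], summary["currency"], summary["available"])
```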

Apparently, you can get free-games-collection ids from this URL. You can then query with such an id to get the game list:

import requests, json

endpoint = "https://graphql.epicgames.com/graphql"

gamesCollectionQuery = {
    "query":"\n            query catalogQuery($productNamespace:String!, $offerId:String!, $locale:String, $country:String!) {\n                Catalog {\n                    catalogOffer(namespace: $productNamespace, id: $offerId, locale: $locale) {\n                        title\n                        collectionOffers {\n                            \n          title\n          id\n          namespace\n          description\n          keyImages {\n            type\n            url\n          }\n          seller {\n              id\n              name\n          }\n          urlSlug\n          items {\n            id\n            namespace\n          }\n          customAttributes {\n            key\n            value\n          }\n          categories {\n            path\n          }\n          price(country: $country) {\n            totalPrice {\n              discountPrice\n              originalPrice\n              voucherDiscount\n              discount\n              fmtPrice(locale: $locale) {\n                originalPrice\n                discountPrice\n                intermediatePrice\n              }\n            }\n            lineOffers {\n              appliedRules {\n                id\n                endDate\n              }\n            }\n          }\n          linkedOfferId\n          linkedOffer {\n            effectiveDate\n            customAttributes {\n              key\n              value\n            }\n          }\n        \n                        }\n                        customAttributes {\n                            key\n                            value\n                        }\n                    }\n                }\n            }\n        ",
    "variables":{
        "productNamespace":"epic",
        "offerId":"7f22b3b15abc4821bba634340e2dd1ef",
        "locale":"es-ES",
        "country":"EN"
    }
}

data = requests.post(endpoint, headers={"Content-type": "application/json;charset=UTF-8"
                                       }, data=json.dumps(gamesCollectionQuery))

print(data.content)
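`data.content` is the raw response body as bytes. A sketch of turning the decoded JSON into a list of titles follows; it assumes the response mirrors the query (`data.Catalog.catalogOffer.collectionOffers`) and relies on the GraphQL convention that failures are reported under a top-level `errors` key. The helper name `collection_titles` and the sample payload are hypothetical.

```python
def collection_titles(payload):
    """Return the titles of the offers in a collection response."""
    # GraphQL servers report problems in a top-level "errors" list.
    if payload.get("errors"):
        messages = [e.get("message", "unknown error") for e in payload["errors"]]
        raise RuntimeError("GraphQL errors: " + "; ".join(messages))
    offer = payload["data"]["Catalog"]["catalogOffer"]
    return [o["title"] for o in offer.get("collectionOffers") or []]


# Hypothetical payload in the shape the query asks for.
sample = {"data": {"Catalog": {"catalogOffer": {
    "title": "Example Collection",
    "collectionOffers": [{"title": "Game A"}, {"title": "Game B"}],
}}}}

titles = collection_titles(sample)
print(titles)
```

With the request above, this would be called as `collection_titles(data.json())`.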
