BeautifulSoup: Scraping steam wishlist games - .findAll not returning nested divs visible in inspector


Problem description


So I am trying to scrape games off my steam wish-list using beautifulsoup. Ideally, I would like the name of the game, the link to the steam store page of the game and the currently listed price. The issue is that when I call soup.find_all("div", {"class": "wishlist_row"}) it returns an empty list despite me being able to see that there should be several of these divs for each game on my wish-list in the inspector. Here is a condensed version of my current code:

from bs4 import BeautifulSoup
import requests

profile_id = "id/Zorro4"
url_base = "https://store.steampowered.com/wishlist/"

# spoofed browser headers
header = {"User-Agent": "Mozilla/5.0"}

r = requests.get(url_base + profile_id + "#sort=order", headers=header)
data = r.text
soup = BeautifulSoup(data, features="lxml")

# find divs containing information about game and steam price
divs = soup.findAll("div", {"class": "wishlist_row"})
print(divs)
>>> []

I can clearly see these divs in the inspector if I go to https://store.steampowered.com/wishlist/id/zorro4/#sort=order. I have tried:

  • Using html.parser instead of lxml
  • Spoofing the user-agent/headers
  • Trying .find("div", {"class": "wishlist_row"}) instead
  • Looking through these threads


    I have noticed something odd that might help solve the problem but I am not sure what to make of it.

    soup.find(id="wishlist_ctn") # The div which should contain all the wishlist_row divs
    >>> <div id="wishlist_ctn">\n</div> 
    


    This, as far as I know, should return <div id="wishlist_ctn">...</div>, since the div contains more nested divs (the ones I'm looking for). I am not sure why it returns only a newline character; it's almost as though the contents of the wishlist_ctn div get lost during scraping. Any help would be super appreciated, I've been trying to solve this for the last couple of days with no success.
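    The empty container can be reproduced offline. Feeding BeautifulSoup only the markup the server actually returns (a minimal stand-in, not the real response) shows why both lookups come back empty:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the raw HTML requests receives: the wishlist
# container exists, but the wishlist_row divs are only added later by
# JavaScript running in the browser, so they never appear here.
static_html = '<div id="wishlist_ctn">\n</div>'

soup = BeautifulSoup(static_html, "html.parser")
print(soup.find(id="wishlist_ctn"))                     # the bare container
print(soup.find_all("div", {"class": "wishlist_row"}))  # []
```

    The inspector shows the fully rendered DOM after JavaScript has run, which is why the two views disagree.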

    Answer


    The data you see on the webpage is loaded dynamically via JavaScript/JSON. The URL from which the data is loaded is embedded in the HTML page, and we can use the re module to extract it.


    This example prints the JSON data of the wishlist:

    import re
    import json
    import requests

    url = 'https://store.steampowered.com/wishlist/id/zorro4/#sort=order'

    # the page embeds the wishlist data URL in a script tag as
    # g_strWishlistBaseURL = "https://...";  json.loads un-escapes the quoted string
    wishlist_url = json.loads(re.findall(r'g_strWishlistBaseURL = (".*?");', requests.get(url).text)[0])

    data = requests.get(wishlist_url + 'wishlistdata/?p=0').json()
    print(json.dumps(data, indent=4))


    Prints:

    {
        "50": {
            "name": "Half-Life: Opposing Force",
            "capsule": "https://steamcdn-a.akamaihd.net/steam/apps/50/header_292x136.jpg?t=1571756577",
            "review_score": 8,
            "review_desc": "Very Positive",
            "reviews_total": "5,383",
            "reviews_percent": 95,
            "release_date": "941443200",
            "release_string": "1 Nov, 1999",
            "platform_icons": "<span class=\"platform_img win\"></span><span class=\"platform_img mac\"></span><span class=\"platform_img linux\"></span>",
            "subs": [
                {
                    "id": 32,
    
    ...and so on.
    
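    The fields the question asks for can then be pulled straight from that JSON. The app id is the dictionary key, so the store-page link can be rebuilt from it; the price sits somewhere inside each entry's subs list, which is truncated above, so the "price" field used below is an assumed name rather than a confirmed schema:

```python
# Sketch: pull name, store link and price out of the wishlist JSON.
# The sample dict mirrors the truncated output above; the "price" field
# inside "subs" is an assumption, not confirmed by the shown output.
data = {
    "50": {
        "name": "Half-Life: Opposing Force",
        "subs": [{"id": 32, "price": "499"}],  # price field assumed
    },
}

for app_id, game in data.items():
    # the dictionary key is the Steam app id, so the store page is:
    store_url = f"https://store.steampowered.com/app/{app_id}/"
    subs = game.get("subs") or [{}]
    price = subs[0].get("price")  # None if the field is absent
    print(game["name"], store_url, price)
```

    Note that the answer requests ?p=0, which suggests the wishlistdata endpoint paginates; a long wishlist would likely need successive pages requested until one comes back empty.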

