BeautifulSoup: Scraping Steam wishlist games - .findAll not returning nested divs visible in inspector
Question
So I am trying to scrape games off my Steam wishlist using BeautifulSoup. Ideally, I would like the name of each game, the link to its Steam store page, and the currently listed price. The issue is that when I call soup.find_all("div", {"class": "wishlist_row"})
it returns an empty list, despite my being able to see in the inspector that there should be several of these divs - one for each game on my wishlist. Here is a condensed version of my current code:
from bs4 import BeautifulSoup
import requests

# browser-like header (condensed; I also tried spoofing a full user-agent)
header = {"User-Agent": "Mozilla/5.0"}

profile_id = "id/Zorro4"
url_base = "https://store.steampowered.com/wishlist/"
r = requests.get(url_base + profile_id + "#sort=order", headers=header)
data = r.text
soup = BeautifulSoup(data, features="lxml")
# find divs containing information about each game and its Steam price
divs = soup.findAll("div", {"class": "wishlist_row"})
print(divs)
>>> []
These divs are clearly visible in the inspector if I go to https://store.steampowered.com/wishlist/id/zorro4/#sort=order. I have tried:
- using html.parser instead of lxml
- spoofing the user-agent/headers
- using .find("div", {"class": "wishlist_row"}) instead of .find_all
- looking through related threads
I have noticed something odd that might help solve the problem but I am not sure what to make of it.
soup.find(id="wishlist_ctn")  # the div which should contain all the wishlist_row divs
>>> <div id="wishlist_ctn">\n</div>
This, as far as I know, should return
<div id="wishlist_ctn">...</div>
since the div contains more nested divs (the ones I'm looking for). I am not sure why it just returns a newline character. It's almost as though the contents of the wishlist_ctn div get lost during scraping. Any help would be super appreciated; I've been trying to solve this for the last couple of days with no success.

Answer
The data you see on the webpage is loaded dynamically via JavaScript/JSON. The URL from which the data is loaded is embedded inside the HTML page - we can use the re module to extract it.
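Why the extraction step below wraps the regex match in json.loads: the value assigned to g_strWishlistBaseURL in the page source appears to be a JSON-encoded string (with escaped slashes like \/), so capturing the quoted literal and decoding it yields a clean URL. A minimal offline sketch of just that step, using a made-up page fragment (the profile ID 123456 is hypothetical):

```python
import re
import json

# A hypothetical fragment of the wishlist page's inline JavaScript
html = r'var g_strWishlistBaseURL = "https:\/\/store.steampowered.com\/wishlist\/profiles\/123456\/";'

# The capture group keeps the surrounding quotes, so the match is a valid
# JSON string literal that json.loads can decode (unescaping each \/)
match = re.findall(r'g_strWishlistBaseURL = (".*?");', html)[0]
wishlist_url = json.loads(match)
print(wishlist_url)  # https://store.steampowered.com/wishlist/profiles/123456/
```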
This example prints the JSON data of the wishlist:
import re
import json
import requests

url = 'https://store.steampowered.com/wishlist/id/zorro4/#sort=order'

wishlist_url = json.loads(
    re.findall(r'g_strWishlistBaseURL = (".*?");', requests.get(url).text)[0]
)

data = requests.get(wishlist_url + 'wishlistdata/?p=0').json()
print(json.dumps(data, indent=4))
Prints:
{
    "50": {
        "name": "Half-Life: Opposing Force",
        "capsule": "https://steamcdn-a.akamaihd.net/steam/apps/50/header_292x136.jpg?t=1571756577",
        "review_score": 8,
        "review_desc": "Very Positive",
        "reviews_total": "5,383",
        "reviews_percent": 95,
        "release_date": "941443200",
        "release_string": "1 Nov, 1999",
        "platform_icons": "<span class=\"platform_img win\"></span><span class=\"platform_img mac\"></span><span class=\"platform_img linux\"></span>",
        "subs": [
            {
                "id": 32,

...and so on.
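From this JSON you can pull the three things the question asks for. The top-level keys are the app IDs, so the store page link can be built as https://store.steampowered.com/app/<app_id>. A sketch on a literal sample shaped like the output above - note the "price" field inside "subs" is an assumption, since the printed output is truncated before any price data:

```python
# Literal sample shaped like the wishlist JSON; the "price" value (in cents)
# inside "subs" is a hypothetical field name, not shown in the truncated output
data = {
    "50": {
        "name": "Half-Life: Opposing Force",
        "subs": [{"id": 32, "price": "499"}],
    }
}

for app_id, game in data.items():
    name = game["name"]
    link = "https://store.steampowered.com/app/" + app_id  # store page from the app ID
    subs = game.get("subs") or []
    # guard against free or unreleased games that may have no subs entries
    price = int(subs[0]["price"]) / 100 if subs else None
    print(name, link, price)  # Half-Life: Opposing Force https://store.steampowered.com/app/50 4.99
```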