BeautifulSoup html missing

Question

I'm trying to get the URL behind the link that downloads historical data from Yahoo Finance for an asset over a specific timeframe: January 1, 1999 to the present day.

For example, if I go here: https://finance.yahoo.com/quote/XLB/history?period1=915177600&period2=1498633200&interval=1d&filter=history&frequency=1d

I want to get this (from the "Download Data" link above the table of data):

"https://query1.finance.yahoo.com/v7/finance/download/XLB?period1=915177600&period2=1498633200&interval=1d&events=history&crumb=iX6bJ6LfGxc"

I'm using BeautifulSoup, and the tag that holds the href does not show up in the HTML. At first I thought BeautifulSoup was simply not working, since find_all('a') and iterating through children/descendants turned up nothing. But when I dumped the HTML to a text file, the element (along with everything else inside its parent element) was not there. Can someone please explain what is going on? What I'm currently working with is listed below.

from bs4 import BeautifulSoup
import datetime as dTime
import requests

"""
asset = "Materials"
assetSignal = "XLB"
today = dTime.datetime.now()
startTime = str(int(dTime.datetime(1999, 1, 1, 0, 0, 0).timestamp()))
endTime = str(int(dTime.datetime(today.year, today.month, today.day, 0, 0, 0).timestamp()))
url = "https://finance.yahoo.com/quote/" + assetSignal + "/history?period1=" + startTime + "&period2=" + endTime + "&interval=1d&filter=history&frequency=1d"
"""

url = "https://finance.yahoo.com/quote/XLB/history?period1=915177600&period2=1498633200&interval=1d&filter=history&frequency=1d"
page = requests.get(url)
data = page.content
#soup = BeautifulSoup(data, "html.parser")
soup = BeautifulSoup(data, "lxml")
#soup = BeautifulSoup(data, "xml")
#soup = BeautifulSoup(data, "html5lib")

#Link not found
for link in soup.find_all("a"):
    print(link.get("href"))

#Span is empty?
span = soup.find(class_="Fl(end) Pos(r) T(-6px)")
print(span)
print(span.string)
print(span.contents)
for child in span.children:
    print(child)

#Other span has children.  Target span doesn't
div = soup.find(class_="C($finDarkGray) Mt(20px) Mb(15px)")
print(div)
for child in div.descendants:
    print(child)

#Is the tag even there?
with open("soup.txt", "w") as file:
    file.write(page.text)

Recommended answer

This website relies heavily on JavaScript. Much of what you see in the browser is not in the response to the first request you make; it is added afterwards by JavaScript that makes additional requests. requests.get only returns that first response, so BeautifulSoup never sees the elements that the scripts inject later.
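To illustrate the point, here is a minimal sketch of a static parser ignoring script-generated markup. It uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without third-party packages; the page content, the example.com URL, and the names AnchorCollector and collect_hrefs are made up for illustration and are not taken from Yahoo Finance.

```python
from html.parser import HTMLParser

# Hypothetical page: the link would only exist after a browser runs the script.
STATIC_HTML = """
<html><body>
  <span id="download-slot"></span>
  <script>
    var a = document.createElement("a");
    a.href = "https://example.com/download.csv";
    document.getElementById("download-slot").appendChild(a);
  </script>
</body></html>
"""


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def collect_hrefs(html):
    """Return the hrefs of all anchors found in the HTML text as delivered."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# The parser never executes the script, so no anchor is found.
print(collect_hrefs(STATIC_HTML))  # prints []
```

The span is present but empty, which matches the symptom in the question: the container arrives in the first response, and its children are filled in later by the browser.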

Try using their API instead, or use something like Selenium, which drives a real web browser and therefore runs the page's JavaScript.
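A rough sketch of the Selenium route follows, under several assumptions: Selenium 4 and a Chrome driver are installed, the rendered page still exposes an anchor whose href contains "/finance/download/" (as in the URL quoted in the question), and the page layout has not changed. The helper names build_history_url, pick_download_links, and fetch_download_links are hypothetical. Timestamps are computed in UTC here; the question's period1 of 915177600 is eight hours later than UTC midnight, which is consistent with a naive local-time datetime in a UTC-8 zone.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_history_url(symbol, start, end):
    """Build the history page URL using the query parameters from the question."""
    params = {
        "period1": int(start.timestamp()),
        "period2": int(end.timestamp()),
        "interval": "1d",
        "filter": "history",
        "frequency": "1d",
    }
    return f"https://finance.yahoo.com/quote/{symbol}/history?{urlencode(params)}"


def pick_download_links(hrefs):
    """Keep only hrefs that look like the download endpoint (an assumption)."""
    return [h for h in hrefs if h and "/finance/download/" in h]


def fetch_download_links(url, timeout=15):
    """Load the page in a real browser and wait for a download link to appear."""
    # Imported here so the pure helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    def download_links(driver):
        anchors = driver.find_elements(By.TAG_NAME, "a")
        return pick_download_links([a.get_attribute("href") for a in anchors])

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # until() retries until the callable returns a non-empty list,
        # and raises TimeoutException if that never happens.
        return WebDriverWait(driver, timeout).until(download_links)
    finally:
        driver.quit()


history_url = build_history_url(
    "XLB",
    datetime(1999, 1, 1, tzinfo=timezone.utc),
    datetime(2017, 6, 28, tzinfo=timezone.utc),
)
print(history_url)
```

Calling fetch_download_links(history_url) would then open a browser window and return the matching links, if the assumptions above hold for the current site.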
