Python - ETFs Daily Data Web Scraping

Question

I'm trying to scrape some daily information for several ETFs. I found that https://www.marketwatch.com/ has accurate data. The most relevant fields are the ETF's open price, shares outstanding, NAV, and total assets. Here is the link for IVV US Equity: https://www.marketwatch.com/investing/fund/ivv

I'm just starting out with Python and would like some tips and guidelines on how to begin writing a web scraping program. I have been told that BeautifulSoup is the package to use for web scraping.

I have scraped web pages with VBA before, but the HTML of those pages was different. I don't know whether that is because some of the ETF values (such as price and traded volume) change constantly.

I am open to any suggestion, including other websites that could be useful (I have tried Yahoo Finance and Morningstar and ran into the same problem with the HTML).

Recommended answer

Yes, I agree that Beautiful Soup is a good approach. Here is some Python code that uses the Beautiful Soup library to extract the intraday price from the IVV fund page:

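import requests
from bs4 import BeautifulSoup

# Download the IVV fund page; the timeout keeps the script from hanging
r = requests.get("https://www.marketwatch.com/investing/fund/ivv", timeout=10)
html = r.text

soup = BeautifulSoup(html, "html.parser")

# The site serves this heading instead of the fund page when it suspects a bot
if soup.h1 is not None and soup.h1.string == "Pardon Our Interruption...":
    print("They detected we are a bot. We hit a captcha.")
else:
    # The price sits in a <bg-quote> tag inside <h3 class="intraday__price">
    price_heading = soup.find("h3", class_="intraday__price")
    quote = price_heading.find("bg-quote") if price_heading else None
    print(quote.string if quote else "Price element not found.")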

The fact that the price changes frequently is not a problem. The names and classes of the HTML tags stay the same from one page load to the next, even though the values inside them change. That is all Beautiful Soup needs in order to work.
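To illustrate that point without touching the live site, here is a minimal sketch that runs the same tag-and-class lookup against two copies of a small inline page that differ only in the price. The markup in `PAGE_TEMPLATE` and the helper `extract_price` are assumptions made for this example, modeled on the selectors used above; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the selectors used in the answer;
# the live MarketWatch page may be structured differently.
PAGE_TEMPLATE = """
<html><body>
  <h1>IVV</h1>
  <h3 class="intraday__price"><bg-quote>{price}</bg-quote></h3>
</body></html>
"""


def extract_price(html):
    """Return the intraday price as a float, or None if the tag is missing."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h3", class_="intraday__price")
    quote = heading.find("bg-quote") if heading else None
    if quote is None:
        return None
    # Remove thousands separators before converting to a number
    return float(quote.get_text(strip=True).replace(",", ""))


# Same lookup, two different prices: only the value changes, not the tags
morning = extract_price(PAGE_TEMPLATE.format(price="412.35"))
afternoon = extract_price(PAGE_TEMPLATE.format(price="1,234.50"))

# A page without the price tag (for example a captcha page) yields None
missing = extract_price("<html><body><h1>Pardon Our Interruption...</h1></body></html>")

print(morning, afternoon, missing)
```

Returning `None` instead of raising keeps a daily job from crashing when the page layout changes or a captcha page comes back, so the caller can decide whether to retry or skip that day.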

Your main challenge is that the website can detect that you are not using a web browser, and it will serve a captcha to your Python script. So you will need to find a way around this. Also, I recommend checking the legality of scraping and whether it violates their terms of service.
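One small, automatable part of that check is a site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` module on an inline sample. The robots.txt content, the `my-etf-scraper` user agent, and the example.com URLs are all hypothetical. Note that robots.txt is only one signal: it does not replace reading the terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Build a RobotFileParser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-etf-scraper" is a made-up user agent name for this sketch
allowed = parser.can_fetch("my-etf-scraper", "https://example.com/investing/fund/ivv")
blocked = parser.can_fetch("my-etf-scraper", "https://example.com/private/data")

print(allowed, blocked)
```

Against a real site you would instead call `set_url()` with the address of its robots.txt and then `read()` to download it before calling `can_fetch()`.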

You can learn more about Beautiful Soup here:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
