How to scrape data off Morningstar


Problem description

So I'm new to the world of web scraping, and so far I've only really used BeautifulSoup to scrape text and images off websites. I thought I'd try to scrape some data points off a graph to test my understanding, but I got a bit confused by this graph.

After inspecting the element containing the data I wanted to extract, I saw this: <span id="TSMAIN">: 100.7490637</span>. The problem is that my original plan was to iterate through some sort of id list containing all the different data points (if that makes sense?).

Instead, it seems that all the data points are contained within this same element, and its value depends on where your cursor is on the graph.

My problem is that if I use BeautifulSoup's find function with that element and the attribute id="TSMAIN", I get a None return. I'm guessing that nothing shows up there unless my cursor is on the actual graph.

Code:

from bs4 import BeautifulSoup
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"}
url = "https://www.morningstar.co.uk/uk/funds/snapshot/snapshot.aspx?id=F0GBR050AQ&tab=13"
source = requests.get(url, headers=headers)
soup = BeautifulSoup(source.content, 'lxml')
data = soup.find("span", attrs={"id": "TSMAIN"})
print(data)

Output:

None
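The None return is expected: the span's value is injected by JavaScript after the page loads, so it simply isn't present in the HTML that requests downloads. A minimal illustration with a stand-in static snippet (the container markup here is hypothetical):

```python
from bs4 import BeautifulSoup

# Stand-in for the server-rendered HTML: the chart container exists,
# but the JavaScript-injected <span id="TSMAIN"> does not.
static_html = '<div id="chartContainer"><svg></svg></div>'
soup = BeautifulSoup(static_html, 'html.parser')
print(soup.find('span', attrs={'id': 'TSMAIN'}))  # None
```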

How can I extract all the data points of this graph?

Answer

It seems the data can be pulled from an API. The only catch is that the values it returns are relative to the start date entered in the payload: the output at the start date is set to 0, and the numbers after that are relative to that date.
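With outputType=COMPACTJSON the endpoint returns a bare list of [unix-timestamp-in-milliseconds, value] pairs rather than labelled JSON, which is why the code below builds the DataFrame from positional columns 0 and 1. A small sketch with made-up sample values:

```python
import pandas as pd

# Hypothetical sample of a COMPACTJSON response:
# [timestamp in ms, cumulative return since the start date]
sample = [[1576972800000, 0.0], [1577059200000, 0.357143]]
df = pd.DataFrame(sample)
df['date'] = pd.to_datetime(df[0], unit='ms').dt.date
print(df[['date', 1]])  # first date is 2019-12-22
```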

import requests
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta

userInput = input('Choose:\n\t1. 3 Month\n\t2. 6 Month\n\t3. 1 Year\n\t4. 3 Year\n\t5. 5 Year\n\t6. 10 Year\n\n -->: ')
userDict = {'1': 3, '2': 6, '3': 12, '4': 36, '5': 60, '6': 120}

# Work out the start date for the chosen look-back window
n = datetime.now()
n = n - relativedelta(days=1)
n = n - relativedelta(months=userDict[userInput])
dateStr = n.strftime('%Y-%m-%d')

url = 'https://tools.morningstar.co.uk/api/rest.svc/timeseries_cumulativereturn/t92wz0sj7c'

data = []
idDict = {
    'Schroder Managed Balanced Instl Acc': 'F0GBR050AQ]2]0]FOGBR$$ALL',
    'GBP Moderately Adventurous Allocation': 'EUCA000916]8]0]CAALL$$ALL',
    'Mixed Investment 40-85% Shares': 'LC00000012]8]0]CAALL$$ALL',
    '': 'F00000ZOR1]7]0]IXALL$$ALL'}

for k, v in idDict.items():
    payload = {
        'encyId': 'GBP',
        'idtype': 'Morningstar',
        'frequency': 'daily',
        'startDate': dateStr,
        'performanceType': '',
        'outputType': 'COMPACTJSON',
        'id': v,
        'decPlaces': '8',
        'applyTrackRecordExtension': 'false'}

    # Each response is a list of [timestamp_ms, cumulative_return] pairs
    temp_data = requests.get(url, params=payload).json()
    df = pd.DataFrame(temp_data)
    df['timestamp'] = pd.to_datetime(df[0], unit='ms')
    df['date'] = df['timestamp'].dt.date
    df = df[['date', 1]]
    df.columns = ['date', k]
    data.append(df)

# Align all series on the dates they have in common
final_df = pd.concat(
    (iDF.set_index('date') for iDF in data),
    axis=1, join='inner'
).reset_index()

final_df.plot(x="date", y=list(idDict.keys()), kind="line")

Output:

print (final_df.head(5).to_string())
         date  Schroder Managed Balanced Instl Acc  GBP Moderately Adventurous Allocation  Mixed Investment 40-85% Shares          
0  2019-12-22                             0.000000                               0.000000                        0.000000  0.000000
1  2019-12-23                             0.357143                               0.406784                        0.431372  0.694508
2  2019-12-24                             0.714286                               0.616217                        0.632422  0.667586
3  2019-12-25                             0.714286                               0.616217                        0.632422  0.655917
4  2019-12-26                             0.714286                               0.612474                        0.629152  0.664124
....
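Since each series is a cumulative percentage return pinned to 0 at the start date, it can be converted into a growth-of-100 index if an absolute level is easier to read. A minimal sketch using the kind of numbers shown in the table above:

```python
# Cumulative % returns relative to the start date (first point is 0)
returns = [0.000000, 0.357143, 0.714286]

# Convert to a growth-of-100 index: 0% -> 100.0, 0.357143% -> ~100.357
index = [100 * (1 + r / 100) for r in returns]
print(index)
```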

Getting those ids took a little investigating of the requests in the browser's developer tools. Searching through them, I was able to find the corresponding id values, and with a little trial and error worked out what values meant what.

Those are the "alternate" ids used, and where those line graphs get their data from (in those 4 requests, look at the Preview pane and you'll see the data in there).

Here's the final output/graph:
