Scraping data from investing.com for BTC/ETH using BeautifulSoup


Problem description



    I have written some code to scrape BTC/ETH time series from investing.com and it works fine. However, I need to alter the requests call so that the downloaded data comes from Kraken rather than the default exchange (Bitfinex), and starts from 01/06/2016 instead of the default start time. These options can be set manually on the web page, but I have no idea how to send them via the requests call, except that it may involve the "data" parameter. Grateful for any advice.

    Thanks,

    KM

    The code below is already written in Python and works fine with the defaults:

    import requests
    from bs4 import BeautifulSoup
    import os
    
    # BTC scrape https://www.investing.com/crypto/bitcoin/btc-usd-historical-data
    # ETH scrape https://www.investing.com/crypto/ethereum/eth-usd-historical-data
    
    # Raw strings so backslashes in Windows paths are not treated as escapes
    ticker_list = [x.strip() for x in open(r"F:\System\PVWAVE\Crypto\tickers.txt", "r").readlines()]
    urlheader = {
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
      "X-Requested-With": "XMLHttpRequest"
    }
    
    print("Number of tickers: ", len(ticker_list))
    
    for ticker in ticker_list:
        print(ticker)
        url = "https://www.investing.com/crypto/" + ticker + "-historical-data"
        req = requests.get(url, headers=urlheader)
        soup = BeautifulSoup(req.content, "lxml")
    
        table = soup.find('table', id="curr_table")
        split_rows = table.find_all("tr")
    
        newticker = ticker.replace('/', '\\')
    
        output_filename = r"F:\System\PVWAVE\Crypto\{0}.csv".format(newticker)
        os.makedirs(os.path.dirname(output_filename), exist_ok=True)
        output_file = open(output_filename, 'w')
        header_list = split_rows[0:1]        # header row
        split_rows_rev = split_rows[:0:-1]   # data rows, oldest first
    
        for row in header_list:
            columns = list(row.stripped_strings)
            columns = [column.replace(',', '') for column in columns]
            if len(columns) == 7:
                output_file.write("{0}, {1}, {2}, {3}, {4}, {5}, {6}\n".format(
                    columns[0], columns[2], columns[3], columns[4], columns[1], columns[5], columns[6]))
    
        for row in split_rows_rev:
            columns = list(row.stripped_strings)
            columns = [column.replace(',', '') for column in columns]
            if len(columns) == 7:
                output_file.write("{0}, {1}, {2}, {3}, {4}, {5}, {6}\n".format(
                    columns[0], columns[2], columns[3], columns[4], columns[1], columns[5], columns[6]))
    
        output_file.close()
    

    Data is downloaded for the default exchange and default date range, but I want to specify Kraken and my own start and end dates (01/06/16 through the last full day, i.e. always yesterday).

    Solution

    A little background

    There are lots of websites out there that use something called forms to send data to the server based on user activity (for example, a log-in page where you fill in your username and password) or when you click a button. Something like that is going on here.
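A quick way to see what a form submission looks like on the wire, without touching the network: when you pass a dict to `requests` via `data=`, it is URL-encoded into the body of a POST request, just as a browser encodes a submitted form. (The field names below are made up for illustration.)

```python
import requests

# A hypothetical log-in form: two named fields, as on the pages described above.
form = {"username": "alice", "password": "s3cret"}

# Prepare (but do not send) a POST request carrying the form as its body.
prepared = requests.Request("POST", "https://example.com/login", data=form).prepare()

print(prepared.body)                     # username=alice&password=s3cret
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
```

This is exactly the shape of request the investing.com page sends when you click Apply, which is why the fix is to move the captured form fields into the `data=` argument of a POST.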

    How did I know?

    • Change the default page and go over to the Kraken historical data page. You will see that the url has changed to https://www.investing.com/crypto/bitcoin/btc-usd-historical-data?cid=49799.
    • Now, right-click on the page and click Inspect. Look closely at the split screen that just opened and click on the Network tab, which shows the request/response cycle of any web page you visit in the browser.
    • Find the Clear button just beside the red record button and click it. Now you have a clean slate, and you will be able to see the request sent to the server when you change the date on the page.
    • Change the dates according to your need and then click Apply. You will see that a request named HistoricalDataAjax is sent to the server. Click on it and scroll down in the Headers tab to a section called Form Data. This is the extra hidden (yet not-so-hidden) information being sent to the server. It is sent as a POST request, since you do not see any change in the url.
    • You can also see in the same Headers section that the Request URL is https://www.investing.com/instruments/HistoricalDataAjax

    What to do now?

    You need to be smart and make three changes in your Python code.

    • Change the request from GET to POST.
    • Send the Form Data as payload for that request.
    • Change the url to the one you just saw in the Headers tab.

      url = "https://www.investing.com/instruments/HistoricalDataAjax"

      payload = {'header': 'BTC/USD Kraken Historical Data', 'st_date': '12/01/2018', 'end_date': '12/01/2018', 'sort_col': 'date', 'action': 'historical_data', 'smlID': '145284', 'sort_ord': 'DESC', 'interval_sec': 'Daily', 'curr_id': '49799'}

      requests.post(url, data=payload, headers=urlheader)

    Make the above-mentioned changes and leave the other parts of your code unchanged. You will get the results you want. You can also modify the dates according to your needs.
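Putting the three changes together with the dates from the question (start 01/06/2016, end always the last full day): a sketch of how the request could be rewritten. The `smlID` and `curr_id` values are the ones captured above for BTC/USD on Kraken and will differ for other instruments; the `st_date` string is taken verbatim from the question, so check whether the site expects MM/DD/YYYY or DD/MM/YYYY. The actual POST is commented out here since it needs network access.

```python
import requests  # used once the POST below is uncommented
from datetime import date, timedelta

url = "https://www.investing.com/instruments/HistoricalDataAjax"
urlheader = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

# The last full day is always yesterday; the form takes dates as strings.
end_date = (date.today() - timedelta(days=1)).strftime("%m/%d/%Y")

payload = {
    "header": "BTC/USD Kraken Historical Data",
    "st_date": "01/06/2016",   # start date from the question
    "end_date": end_date,
    "sort_col": "date",
    "action": "historical_data",
    "smlID": "145284",         # captured for BTC/USD on Kraken; differs per instrument
    "sort_ord": "DESC",
    "interval_sec": "Daily",
    "curr_id": "49799",        # instrument id captured for BTC/USD
}

# req = requests.post(url, data=payload, headers=urlheader)
# soup = BeautifulSoup(req.content, "lxml")   # the table parsing stays the same
```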
