Multithreading to Scrape Yahoo Finance


Question

I'm running a program to pull some info from Yahoo! Finance. It runs fine as a for loop, but it takes a long time (about 10 minutes for 7,000 inputs) because it has to process each requests.get(url) call individually (or am I mistaken about the main bottleneck?).

Anyway, I came across multithreading as a potential solution. This is what I have tried:

import requests
import pprint
import threading

with open('MFTop30MinusAFew.txt', 'r') as ins: #input file for tickers
    for line in ins:
        ticker_array = ins.read().splitlines()

ticker = ticker_array
url_array = []
url_data = []
data_array =[]

for i in ticker:
    url = 'https://query2.finance.yahoo.com/v10/finance/quoteSummary/'+i+'?formatted=true&crumb=8ldhetOu7RJ&lang=en-US&region=US&modules=defaultKeyStatistics%2CfinancialData%2CcalendarEvents&corsDomain=finance.yahoo.com'
    url_array.append(url) #loading each complete url at one time 

def fetch_data(url):
    urlHandler = requests.get(url)
    data = urlHandler.json()
    data_array.append(data)

pprint.pprint(data_array)

threads = [threading.Thread(target=fetch_data, args=(url,)) for url in url_array]

for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

fetch_data(url_array)

The error I get is InvalidSchema: No connection adapters were found for '['https://query2.finance.... [url continue].

PS. I've also read that using a multithreaded approach to scrape websites is bad and can get you blocked. Would Yahoo! Finance mind if I pull data for a couple thousand tickers at once? Nothing happened when I requested them sequentially.

Recommended answer

If you look carefully at the error, you will notice that it doesn't show one URL but all the URLs you appended, enclosed in brackets. The last line of your code calls your function fetch_data with the full array as its parameter, which doesn't make sense: requests.get expects a single URL string. If you remove this last line, the code runs just fine and your threads are called as expected.
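
As a rough illustration of the fix, here is a minimal sketch that hands each URL to the fetch function one at a time and collects the results only after every worker has finished. It swaps the hand-built list of threading.Thread objects for the standard library's ThreadPoolExecutor, which is not what the question used. The names fetch_json and fetch_all, the 10-second timeout, and the default of 8 workers are assumptions made for this example. Capping the number of workers also keeps the number of simultaneous requests modest, though whether that is enough to avoid being blocked is not something this sketch can promise.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_json(url):
    """Download one URL and return its decoded JSON body."""
    # requests is third-party; it is imported here so that fetch_all
    # can be exercised with a stand-in fetch function and no network.
    import requests

    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    return response.json()


def fetch_all(urls, fetch=fetch_json, max_workers=8):
    """Call fetch(url) once per URL on a bounded pool of worker threads.

    Results are returned in the same order as urls. If any call raises,
    the exception propagates when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

With the question's variables this would be called as data_array = fetch_all(url_array), followed by pprint.pprint(data_array). Note that in the original code pprint.pprint(data_array) runs before any thread has started, so it prints an empty list; printing has to happen after the threads are joined (or, here, after fetch_all returns).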
