Fastest way to loop over Pandas DataFrame for API calls


Question


My objective is to make a call to an API for each row in a Pandas DataFrame, which returns a list of strings in the response JSON, and to create a new DataFrame with one row per response item. My code basically looks like this:

import json

import pandas
import requests

i = 0
new_df = pandas.DataFrame(columns=['a', 'b', 'c', 'd'])
for index, row in df.iterrows():
    url = 'http://myAPI/'
    d = '{"SomeJSONData": ' + row['data'] + '}'
    j = json.loads(d)
    response = requests.post(url, json=j)

    data = response.json()
    for new_data in data['c']:
        new_df.loc[i] = [row['a'], row['b'], row['c'], new_data]
        i += 1


This works fine, but I'm making about 5500 API calls and writing about 6500 rows to the new DataFrame so it takes a while, maybe 10 minutes. I was wondering if anyone knew of a way to speed this up? I'm not too familiar with running parallel for loops in Python, could this be done while maintaining thread safety?

Answer


Something along these lines, perhaps? This way you aren't building a whole new DataFrame row by row, you declare the URL only once, and you take advantage of the fact that pandas column operations are faster than row-by-row work.

url = 'http://myAPI/'

def request_function(j):
    # .json() parses the response body; the original answer indexed the
    # Response object directly, which raises a TypeError.
    return requests.post(url, json=json.loads(j)).json()['c']

df['j'] = '{"SomeJSONData": ' + df['data'] + '}'
df['new_data'] = df['j'].apply(request_function)
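Note that since the API returns a list of strings per row, the `new_data` column now holds lists rather than one row per item. To recover the original loop's one-row-per-response-item shape, `DataFrame.explode` can expand those lists. A minimal sketch with toy data standing in for the API results (the column names match the question's frame; the values are made up):

```python
import pandas as pd

# Toy frame standing in for the post-apply result: each row's
# 'new_data' holds the list returned for that row's request.
df = pd.DataFrame({
    'a': [1, 2],
    'b': ['x', 'y'],
    'new_data': [['r1', 'r2'], ['r3']],
})

# explode() repeats the other columns once per list element,
# giving one row per response item as in the original loop.
new_df = df.explode('new_data', ignore_index=True)
print(new_df)
```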


Now, to prove that using apply in this case (string data) is indeed much faster, here's a simple test:

import numpy as np
import pandas as pd
import time

def func(text):
    return text + ' is processed'


def test_one():
    data = pd.DataFrame(columns=['text'], index=np.arange(0, 100000))
    data['text'] = 'text'

    start = time.time()
    data['text'] = data['text'].apply(func)
    print(time.time() - start)


def test_two():
    data = pd.DataFrame(columns=['text'], index=np.arange(0, 100000))
    data['text'] = 'text'

    start = time.time()

    for index, row in data.iterrows():
        data.loc[index, 'text'] = row['text'] + ' is processed'

    print(time.time() - start)


Results of string operations on dataframes.


test_one(using apply) : 0.023002147674560547


test_two(using iterrows): 18.912891149520874


Basically, by using the built-in pandas operations of adding the two columns and apply, you should get somewhat faster results, though your runtime is ultimately limited by the API response time. If the results are still too slow, you might want to consider writing an async function that saves the results to a list, and then applying that async function.
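Since the work here is I/O-bound (waiting on HTTP responses), a thread pool is one straightforward way to run the requests concurrently; each thread handles its own request and response, so there is no shared mutable state to protect. A minimal sketch using `concurrent.futures`, with a stub `fetch` standing in for the real `requests.post(...).json()['c']` call:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(payload):
    # Stand-in for the real API call; in practice this would be
    # something like requests.post(url, json=payload).json()['c']
    return [f'{payload}-result']

payloads = ['row0', 'row1', 'row2']

# pool.map preserves input order, so results line up with the rows
# they came from even though the calls run concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, payloads))

print(results)
```

With ~5500 calls, the wall-clock time should drop roughly in proportion to `max_workers`, subject to whatever rate limits the API imposes; the ordered results can then be assigned back as a DataFrame column.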
