How to increase the performance of a Python loop?


Problem Description


I have a DataFrame with almost 14 million rows. I am working with financial options data, and ideally I need an interest rate (called the risk-free rate) for each option according to its time to maturity. According to the literature I'm following, one way to do this is to get US Treasury Bond interest rates and, for each option, check which Treasury Bond rate has the maturity closest to the option's time to maturity (in absolute terms). To achieve this I created a loop that fills a DataFrame with those differences. My code is far from elegant and it is a bit messy, because there are combinations of dates and maturities for which there are no rates; hence the conditionals inside the loop. After the loop is done, I can look for the maturity with the lowest absolute difference and choose the rate for that maturity. The script was taking so long to run that I added tqdm to get some kind of feedback on what is happening.


I tried running the code. It will take days to complete, and it slows down as the iterations increase (I know this from tqdm). At first I was adding rows to the differences DataFrame using DataFrame.loc, but since I thought that was the reason the code was slowing down over time, I switched to DataFrame.append. The code is still slow, and it still slows down over time.


I searched for a way to increase performance and found this question: How to speed up python loop. Someone suggests using Cython, but honestly I still consider myself a beginner to Python, so from looking at the examples it doesn't seem like a trivial thing to do. Is that my best option? If it takes a lot of time to learn, then I can also do what others do in the literature and just use the 3-month interest rate for all options. But I would prefer not to go there. Maybe there are other (easy) answers to my problem; please let me know. I include a reproducible code example (although with only 2 rows of data):

from tqdm import tqdm
import pandas as pd


# Treasury maturities, in years
treasury_maturities = [1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30]

# Useful lists
treasury_maturities1 = [3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities2 = [1/12]
treasury_maturities3 = [6/12, 1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities4 = [1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities5 = [1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20]

# DataFrame that will contain the difference between the time to maturity of each option and the different maturities
differences = pd.DataFrame(columns = treasury_maturities)


# Options Dataframe sample
options_list = [
    [pd.to_datetime("2004-01-02"), pd.to_datetime("2004-01-17"), 800.0, "c",
     309.1, 311.1, 1108.49, 1108.49, 0.0410958904109589, 310.1],
    [pd.to_datetime("2004-01-02"), pd.to_datetime("2004-01-17"), 800.0, "p",
     0.0, 0.05, 1108.49, 1108.49, 0.0410958904109589, 0.025],
]

options = pd.DataFrame(options_list, columns = ['QuoteDate', 'expiration', 'strike', 'OptionType', 'bid_eod', 'ask_eod', 'underlying_bid_eod', 'underlying_ask_eod', 'Time_to_Maturity', 'Option_Average_Price'])


# Loop
for index, row in tqdm(options.iterrows()):
    if pd.to_datetime("2004-01-02") <= row.QuoteDate <= pd.to_datetime("2018-10-15"):
        if pd.to_datetime("2004-01-02") <= row.QuoteDate <= pd.to_datetime("2006-02-08") and row.Time_to_Maturity > 25:
            list_s = [abs(maturity - row.Time_to_Maturity)
                      for maturity in treasury_maturities5]
            list_s = [list_s + [40]]  # 40 is an arbitrary number bigger than 30
            differences = differences.append(pd.DataFrame(list_s,
                              columns=treasury_maturities), ignore_index=True)
        elif row.QuoteDate in (pd.to_datetime("2008-12-10"), pd.to_datetime("2008-12-18"),
                               pd.to_datetime("2008-12-24")) and 1.5/12 <= row.Time_to_Maturity <= 3.5/12:
            list_s = [0, 40, 40]
            list_s = [list_s + [abs(maturity - row.Time_to_Maturity)
                                for maturity in treasury_maturities3]]
            differences = differences.append(pd.DataFrame(list_s,
                              columns=treasury_maturities), ignore_index=True)
        elif row.QuoteDate in (pd.to_datetime("2008-12-10"), pd.to_datetime("2008-12-18"),
                               pd.to_datetime("2008-12-24")) and 3.5/12 < row.Time_to_Maturity <= 4.5/12:
            list_s = [abs(maturity - row.Time_to_Maturity)
                      for maturity in treasury_maturities2]
            list_s = list_s + [40, 40, 0]
            list_s = [list_s + [abs(maturity - row.Time_to_Maturity)
                                for maturity in treasury_maturities4]]
            differences = differences.append(pd.DataFrame(list_s,
                              columns=treasury_maturities), ignore_index=True)
        else:
            if 1.5/12 <= row.Time_to_Maturity <= 2/12:
                list_s = [0, 40]
                list_s = [list_s + [abs(maturity - row.Time_to_Maturity)
                                    for maturity in treasury_maturities1]]
                differences = differences.append(pd.DataFrame(list_s,
                                  columns=treasury_maturities), ignore_index=True)
            elif 2/12 < row.Time_to_Maturity <= 2.5/12:
                list_s = [abs(maturity - row.Time_to_Maturity)
                          for maturity in treasury_maturities2]
                list_s = list_s + [40, 0]
                list_s = [list_s + [abs(maturity - row.Time_to_Maturity)
                                    for maturity in treasury_maturities3]]
                differences = differences.append(pd.DataFrame(list_s,
                                  columns=treasury_maturities), ignore_index=True)
            else:
                list_s = [[abs(maturity - row.Time_to_Maturity)
                           for maturity in treasury_maturities]]
                differences = differences.append(pd.DataFrame(list_s,
                                  columns=treasury_maturities), ignore_index=True)
    else:
        list_s = [[abs(maturity - row.Time_to_Maturity)
                   for maturity in treasury_maturities]]
        differences = differences.append(pd.DataFrame(list_s,
                          columns=treasury_maturities), ignore_index=True)

Answer

Short answer

Loops and if statements are both computationally expensive operations, so look for ways to reduce how many of them you use.


Loop Optimization: The best way to speed up a programming loop is to move as much computation as possible out of the loop.
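
In the loop from the question, for example, the same pd.to_datetime calls are re-evaluated on every single iteration; hoisting them out is a cheap win. A minimal sketch of the idea (the variable names start, end and cutoff are made up for illustration):

import pandas as pd

# Parse the boundary dates once, before the loop, instead of on every row.
start, end = pd.to_datetime("2004-01-02"), pd.to_datetime("2018-10-15")
cutoff = pd.to_datetime("2006-02-08")

for index, row in options.iterrows():
    if start <= row.QuoteDate <= end:  # no repeated pd.to_datetime() calls here
        if row.QuoteDate <= cutoff and row.Time_to_Maturity > 25:
            pass  # branch-specific work goes here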


DRY: Don't Repeat Yourself. You have several redundant if conditions; look into the nested if conditions and follow the DRY principle.
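
Applied to the loop above: every branch ends with an identical differences.append(...) call, so the branches only need to build list_s and the collection step can be written once. A sketch of that shape (it deliberately keeps only two of the branches, not the full date logic), which as a side benefit also avoids the slow repeated DataFrame.append:

rows = []  # accumulate plain lists; build the DataFrame once at the end
for index, row in options.iterrows():
    # The branches only decide *what* goes into list_s ...
    if row.Time_to_Maturity > 25:
        list_s = [abs(m - row.Time_to_Maturity) for m in treasury_maturities5] + [40]
    else:
        list_s = [abs(m - row.Time_to_Maturity) for m in treasury_maturities]
    # ... while the collection step is written exactly once.
    rows.append(list_s)

differences = pd.DataFrame(rows, columns=treasury_maturities)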


One of the main benefits of libraries such as pandas and numpy is that they are designed for efficiency in mathematical operations on arrays (see Why are numpy arrays so fast?). This means you usually do not have to use loops at all. Instead of creating a new DataFrame inside your loop, create a new column for each value you are computing.
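
For instance, the core of this problem (find the Treasury maturity closest to each option's time to maturity) can be done with no loop at all. A hedged sketch, reusing options and treasury_maturities from the question; the closest_maturity output column name is made up:

import numpy as np

# All pairwise absolute differences in one broadcasted operation:
# the result has shape (number of options, number of maturities).
mats = np.array(treasury_maturities)
diff = np.abs(options['Time_to_Maturity'].to_numpy()[:, None] - mats[None, :])

# For every option at once: the maturity with the smallest difference.
options['closest_maturity'] = mats[np.argmin(diff, axis=1)]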


To overcome the issue of applying different logic for different dates etc., use a mask/filter to select only the rows you need to operate on, and apply the logic to just those rows, instead of using if statements (see the pandas filtering tutorial).
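
A minimal sketch of that pattern (the rate_bucket column and the cutoff values are made up for illustration):

import pandas as pd

# One boolean mask replaces the per-row if statement; .loc assigns to
# exactly the rows where the condition holds.
mask = ((options['QuoteDate'] >= pd.to_datetime("2004-01-02"))
        & (options['Time_to_Maturity'] > 25))
options.loc[mask, 'rate_bucket'] = 40   # rows matching the condition
options.loc[~mask, 'rate_bucket'] = 0   # all other rows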


This code is not a replication of your logic, but an example of how it could be implemented. It's not perfect, but should provide some major efficiency improvements.

import pandas as pd
import numpy as np

# Maturity periods, months and years
month_periods = np.array([1, 2, 3, 6, ], dtype=np.float64)
year_periods = np.array([1, 2, 3, 4, 5, 7, 10, 20, 30, ], dtype=np.float64)

# Create column names for maturities
maturity_cols = [f"month_{m:02.0f}" for m in month_periods] + [f"year_{y:02.0f}" for y in year_periods]

# Normalise months  & concatenate into single array
month_periods = month_periods / 12
maturities = np.concatenate((month_periods, year_periods))

# Create some dummy data
n_records = 100_000  # number of dummy rows to generate
np.random.seed(seed=42)  # Seed PRNG for reproducibility
date_range = pd.date_range(start="2004-01-01", end="2021-01-30", freq='D')  # Dates to sample from
dates = np.random.choice(date_range, size=n_records, replace=True)
maturity_times = np.random.random(size=n_records)
options = pd.DataFrame(list(zip(dates, maturity_times)), columns=['QuoteDate', 'Time_to_Maturity', ])

# Create date masks
after = options['QuoteDate'] >= pd.to_datetime("2008-01-01")
before = options['QuoteDate'] <= pd.to_datetime("2015-01-01")

# Combine date masks / create flipped version
between = after & before
outside = np.logical_not(between)

# Select data with masks
df_outside = options[outside].copy()
df_between = options[between].copy()

# Smaller dataframes
df_a = df_between[df_between['Time_to_Maturity'] > 25].copy()
df_b = df_between[df_between['Time_to_Maturity'] <= 3.5 / 12].copy()
df_c = df_between[df_between['Time_to_Maturity'] <= 4.5 / 12].copy()
df_d = df_between[
    (df_between['Time_to_Maturity'] >= 2 / 12) & (df_between['Time_to_Maturity'] <= 4.5 / 12)].copy()

# For each maturity period, add a difference column using a different formula
for i, col in enumerate(maturity_cols):
    # Add a line here for each subset / chunk of data which requires a different formula;
    # each subset computes against its own Time_to_Maturity column
    df_a[col] = ((maturities[i] - df_a['Time_to_Maturity']) + 40).abs()
    df_b[col] = ((maturities[i] - df_b['Time_to_Maturity']) / 2).abs()
    df_c[col] = (maturities[i] - df_c['Time_to_Maturity'] + 1).abs()
    df_d[col] = (maturities[i] - df_d['Time_to_Maturity'] * 0.8).abs()
    df_outside[col] = (maturities[i] - df_outside['Time_to_Maturity']).abs()

# Concatenate dataframes back to one dataset
frames = [df_outside, df_a, df_b, df_c, df_d, ]
output = pd.concat(frames).dropna(how='any')

output.head()


Average execution time for number of records

Even millions of records are processed quickly (memory allowing):

| Records | Old Time (secs) | New Time (secs) | Improvement |
|---|---|---|---|
| 10 | 0.0105 | 0.0244 | -132.38% |
| 100 | 0.1078 | 0.0249 | 76.90% |
| 1,000 (1k) | 1.03 | 0.0249 | 97.58% |
| 10,000 (10k) | 15.629 | 0.0322 | 99.79% |
| 100,000 (100k) | 182.014 | 0.065 | 99.96% |
| 1,000,000 (1m) | ? | 0.4014 | ? |
| 10,000,000 (10m) | ? | 4.7488 | ? |
| 14,000,000 (14m) | ? | 6.0172 | ? |
| 100,000,000 (100m) | ? | 83.286 | ? |


Once you have optimized and profiled your basic code, you can also look into multithreading, parallelising your code, or using a different language. Also, 14 million records will eat up a lot of RAM, much more than most workstations can handle. To get around this limitation, you can read the file in chunks and perform your calculations on one chunk at a time:

result_frames = []
for chunk in pd.read_csv("voters.csv", chunksize=10000):
    # Do things here: apply the vectorised calculations to this chunk
    result = chunk
    result_frames.append(result)

# Stitch the processed chunks back into one DataFrame
output = pd.concat(result_frames, ignore_index=True)


Google search terms: multiprocessing / threading / Dask / PySpark
