Weird exponential increase in running time when using dataframe.mean() (Pandas performance, non-numeric column)
Question
I am playing around with a dataset of weather data (to reproduce: the data can be found here; unzip it and run the code below), and I wanted to normalize the data. To do this, I tried the second answer to this question, which boils down to normalized_df = (df - df.mean(axis=0)) / df.std(axis=0)
However, it takes a very long time to execute this code. Therefore, I started investigating, and it seems that the time the df.mean() call takes is increasing exponentially.
I've used the following code to test run-times:
import pandas as pd
import time
jena_climate_df = pd.read_csv("jena_climate_2009_2016.csv")
start = time.time()
print(jena_climate_df[:200000].mean(axis=0)) #Modify the number of rows here to observe the increase in time
stop = time.time()
print(f"{stop-start} Seconds for mean calc")
I ran some tests, gradually increasing the number of rows used for the mean calculation. See the results below:
0.004987955093383789 Seconds for mean calc ~ 10 observations
0.009006738662719727 Seconds for mean calc ~ 1000 observations
0.0837397575378418 Seconds for mean calc ~ 10000 observations
1.789750337600708 Seconds for mean calc ~ 50000 observations
7.518809795379639 Seconds for mean calc ~ 60000 observations
19.989460706710815 Seconds for mean calc ~ 70000 observations
71.97900629043579 Seconds for mean calc ~ 100000 observations
375.04513001441956 Seconds for mean calc ~ 200000 observations
It seems to me that the time is increasing exponentially. I don't know why this is happening. AFAIK, adding all values and dividing the sum by the number of observations shouldn't be too computationally intensive, but maybe I am wrong here. Some explanation would be greatly appreciated!
Recommended answer
I did some tests, and it seems that the culprit, in this case, is "Date Time" - the non-numeric column.
First, when calculating the mean for each column on its own, there's clearly no exponential behavior (see chart below: the x-axis is the number of rows, the y-axis is time).
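A per-column check of this kind could be sketched roughly as follows. This is not the code behind the chart: it uses a small synthetic frame in place of the Jena CSV, the numeric column names are illustrative, and `time_column_means` is a hypothetical helper introduced only for this example.

```python
import time

import numpy as np
import pandas as pd


def time_column_means(df, row_counts):
    """Time Series.mean() for each numeric column at each row count (hypothetical helper)."""
    numeric_cols = df.select_dtypes(include="number").columns
    records = []
    for n in row_counts:
        subset = df.iloc[:n]
        for col in numeric_cols:
            start = time.perf_counter()
            subset[col].mean()
            elapsed = time.perf_counter() - start
            records.append({"rows": n, "column": col, "seconds": elapsed})
    return pd.DataFrame(records)


# Synthetic stand-in for the weather data (assumption: NOT the real Jena dataset).
rng = np.random.default_rng(0)
n_rows = 1000
demo_df = pd.DataFrame({
    # Timestamps stored as strings, so this column is non-numeric.
    "Date Time": pd.date_range("2009-01-01", periods=n_rows, freq="10min")
                   .strftime("%d.%m.%Y %H:%M:%S"),
    "T (degC)": rng.normal(10, 5, n_rows),
    "p (mbar)": rng.normal(990, 8, n_rows),
})

print(demo_df.dtypes)  # shows which columns are non-numeric
timings = time_column_means(demo_df, [10, 100, 1000])
print(timings)
```

Printing `dtypes` first makes it easy to spot a string column such as "Date Time" before timing anything.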
Second, I then tried to calculate means for the entire data frame in the following three scenarios (each with 80K rows), and timed them with %%timeit:
- jena_climate_df[0:80000].mean(axis=0): 10.2 seconds.
- Setting the date/time column as the index: jena_climate_df.set_index("Date Time")[0:80000].mean(axis=0): 40 ms (about 0.4% of the previous test).
- And finally, dropping the date/time column: jena_climate_df.drop("Date Time", axis=1)[0:80000].mean(axis=0): 19.8 ms (0.2% of the original time).
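Building on the timings above, one way to apply the normalization from the question is to restrict it to the numeric columns first. The sketch below is only an illustration under stated assumptions: the tiny frame and its `temperature` and `pressure` columns are made up, and only "Date Time" matches the real data.

```python
import numpy as np
import pandas as pd

# Toy data (assumption: illustrative values, not taken from the Jena dataset).
demo_df = pd.DataFrame({
    "Date Time": [
        "01.01.2009 00:10:00",
        "01.01.2009 00:20:00",
        "01.01.2009 00:30:00",
        "01.01.2009 00:40:00",
    ],
    "temperature": [1.0, 2.0, 3.0, 4.0],
    "pressure": [990.0, 992.0, 994.0, 996.0],
})

# Keep only numeric columns so mean() and std() never touch the string column.
numeric_df = demo_df.select_dtypes(include="number")

# Same formula as in the question, applied to the numeric subset.
normalized_df = (numeric_df - numeric_df.mean(axis=0)) / numeric_df.std(axis=0)
print(normalized_df)
```

Passing `numeric_only=True` to `mean()` and `std()` is another option for skipping non-numeric columns, while `set_index("Date Time")`, as in the second scenario, keeps the timestamps available as the index.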
Hope this helps.