Weird exponential increase in running time when using dataframe.mean() (Pandas performance, non-numeric column)


Question

I am playing around with a dataset of weather data (to reproduce: the data can be found here; unzip it and run the code below), and I wanted to normalize the data. To do this, I tried the second answer to this question:

Normalize columns of pandas data frame

It boils down to normalized_df = (df - df.mean(axis=0)) / df.std(axis=0)

However, this code takes a very long time to execute. I started investigating, and it seems that the time taken by the df.mean() call increases exponentially with the number of rows.

I used the following code to measure run times:

import pandas as pd
import time

jena_climate_df = pd.read_csv("jena_climate_2009_2016.csv")
start = time.time()
print(jena_climate_df[:200000].mean(axis=0)) #Modify the number of rows here to observe the increase in time
stop = time.time()
print(f"{stop-start} Seconds for mean calc")

I ran some tests, gradually increasing the number of rows used for the mean calculation. See the results below:

0.004987955093383789 Seconds for mean calc ~ 10 observations
0.009006738662719727 Seconds for mean calc ~ 1000 observations
0.0837397575378418 Seconds for mean calc ~ 10000 observations
1.789750337600708 Seconds for mean calc ~ 50000 observations
7.518809795379639 Seconds for mean calc ~ 60000 observations
19.989460706710815 Seconds for mean calc ~ 70000 observations
71.97900629043579 Seconds for mean calc ~ 100000 observations
375.04513001441956 Seconds for mean calc ~ 200000 observations

It seems to me that the time is increasing exponentially. I don't know why this is happening. AFAIK, adding up all the values and dividing by the number of observations shouldn't be too computationally intensive, but maybe I am wrong here. Some explanation would be greatly appreciated!

Recommended answer

I did some tests, and it seems that the culprit in this case is "Date Time", the non-numeric column.

First, when calculating the mean of individual columns on their own, there is clearly no exponential behavior (in the chart, the x-axis is the number of rows and the y-axis is time).
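The per-column measurement described above could be reproduced roughly as follows. This is only a sketch: it uses synthetic data instead of the Jena CSV, the column names are borrowed for illustration, and time_column_mean is a hypothetical helper, not something from the original answer.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the weather data (assumption: two numeric columns only).
rng = np.random.default_rng(0)
n_rows = 200_000
df = pd.DataFrame({
    "T (degC)": rng.normal(9.0, 8.0, n_rows),
    "p (mbar)": rng.normal(989.0, 8.0, n_rows),
})


def time_column_mean(frame, column, n):
    """Return the seconds taken to average the first n rows of one column."""
    start = time.perf_counter()
    frame[column].iloc[:n].mean()
    return time.perf_counter() - start


# Time each numeric column on its own at increasing row counts.
sizes = [10_000, 50_000, 100_000, 200_000]
timings = {
    col: [time_column_mean(df, col, n) for n in sizes]
    for col in df.columns
}

for col, secs in timings.items():
    print(col, [f"{s:.6f}" for s in secs])
```

For purely numeric columns the timings should stay small across all sizes, which is what points the finger at the non-numeric column.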

Second, I calculated the means for the entire data frame in the following three scenarios (each with 80K rows) and timed them with %%timeit:

  • jena_climate_df[0:80000].mean(axis=0): 10.2 seconds.
  • Setting the date/time column as the index: jena_climate_df.set_index("Date Time")[0:80000].mean(axis=0): 40 ms (about 0.4% of the first test).
  • And finally, dropping the date/time column: jena_climate_df.drop("Date Time", axis=1)[0:80000].mean(axis=0): 19.8 ms (about 0.2% of the original time).
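Applied to the normalization from the question, the workaround might look like the sketch below. It runs on a small synthetic frame rather than the real CSV (the column names and values are assumptions), and it adds two alternatives not mentioned in the answer: numeric_only=True and select_dtypes. Depending on the pandas version, calling mean() on a frame that still contains the string column may be slow or may raise a TypeError, so the sketch avoids that call entirely.

```python
import numpy as np
import pandas as pd

# Small synthetic frame with a string "Date Time" column (illustrative values).
rng = np.random.default_rng(0)
n_rows = 1_000
timestamps = pd.date_range("2009-01-01 00:10", periods=n_rows, freq="10min")
df = pd.DataFrame({
    "Date Time": timestamps.strftime("%d.%m.%Y %H:%M:%S"),
    "T (degC)": rng.normal(9.0, 8.0, n_rows),
    "p (mbar)": rng.normal(989.0, 8.0, n_rows),
})

# The two workarounds from the answer: drop the column, or move it to the index.
means_dropped = df.drop("Date Time", axis=1).mean(axis=0)
means_indexed = df.set_index("Date Time").mean(axis=0)

# Alternative: ask pandas to skip non-numeric columns explicitly.
means_numeric = df.mean(axis=0, numeric_only=True)

# Normalize only the numeric columns, as in the question's formula.
numeric = df.select_dtypes(include="number")
normalized_df = (numeric - numeric.mean(axis=0)) / numeric.std(axis=0)

print(means_dropped)
print(normalized_df.describe().loc[["mean", "std"]])
```

All three ways of computing the means give the same numbers; they differ only in how the string column is kept out of the calculation.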

Hope this helps.
