pandas: DataFrame.mean() very slow. How can I calculate means of columns faster?


Question



I have a rather large CSV file: it contains 9917530 rows (without the header) and 54 columns. The columns are real or integer; only one contains dates. There are a few NULL values in the file, which are translated to nan after I load it into a pandas DataFrame, which I do like this:

import pandas as pd
data = pd.read_csv('data.csv')

After loading, which I think was very fast because it took around 30 seconds (pretty much the same time as counting lines with the Unix tool wc), the process was taking around 4 Gb of RAM (the size of the file on disk: 2.2 Gb). So far so good.

Then I tried to do the following:

column_means = data.mean()

The process's memory usage grew to ~22 Gb very quickly. I could also see that the processor (one core) was very busy - for about three hours, after which I killed the process because I needed the machine for other things. I have a pretty fast PC running Linux - it has 2 processors, each with 4 cores, so 8 cores altogether, and 32 Gb of RAM. I cannot believe calculating column means should take so long.

Can anybody explain why DataFrame.mean() is so slow? And more importantly, what is a better way of calculating the means of the columns of a file like that? Did I not load the file the best way possible? Should I use a different function instead of DataFrame.mean(), or perhaps a completely different tool?

Many thanks in advance.

EDIT. Here is what df.info() shows:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9917530 entries, 0 to 9917529
Data columns (total 54 columns):
srch_id                        9917530  non-null values
date_time                      9917530  non-null values
site_id                        9917530  non-null values
visitor_location_country_id    9917530  non-null values
visitor_hist_starrating        505297  non-null values
visitor_hist_adr_usd           507612  non-null values
prop_country_id                9917530  non-null values
prop_id                        9917530  non-null values
prop_starrating                9917530  non-null values
prop_review_score              9902900  non-null values
prop_brand_bool                9917530  non-null values
prop_location_score1           9917530  non-null values
prop_location_score2           7739150  non-null values
prop_log_historical_price      9917530  non-null values
position                       9917530  non-null values
price_usd                      9917530  non-null values
promotion_flag                 9917530  non-null values
srch_destination_id            9917530  non-null values
srch_length_of_stay            9917530  non-null values
srch_booking_window            9917530  non-null values
srch_adults_count              9917530  non-null values
srch_children_count            9917530  non-null values
srch_room_count                9917530  non-null values
srch_saturday_night_bool       9917530  non-null values
srch_query_affinity_score      635564  non-null values
orig_destination_distance      6701069  non-null values
random_bool                    9917530  non-null values
comp1_rate                     235806  non-null values
comp1_inv                      254433  non-null values
comp1_rate_percent_diff        184907  non-null values
comp2_rate                     4040633  non-null values
comp2_inv                      4251538  non-null values
comp2_rate_percent_diff        1109847  non-null values
comp3_rate                     3059273  non-null values
comp3_inv                      3292221  non-null values
comp3_rate_percent_diff        944007  non-null values
comp4_rate                     620099  non-null values
comp4_inv                      692471  non-null values
comp4_rate_percent_diff        264213  non-null values
comp5_rate                     4444294  non-null values
comp5_inv                      4720833  non-null values
comp5_rate_percent_diff        1681006  non-null values
comp6_rate                     482487  non-null values
comp6_inv                      524145  non-null values
comp6_rate_percent_diff        193312  non-null values
comp7_rate                     631077  non-null values
comp7_inv                      713175  non-null values
comp7_rate_percent_diff        277838  non-null values
comp8_rate                     3819043  non-null values
comp8_inv                      3960388  non-null values
comp8_rate_percent_diff        1225707  non-null values
click_bool                     9917530  non-null values
gross_bookings_usd             276592  non-null values
booking_bool                   9917530  non-null values
dtypes: float64(34), int64(19), object(1)

Solution

Here's a similarly sized frame, but without an object column:

In [9]: import numpy as np; import pandas as pd; from pandas import DataFrame; from numpy.random import randn, randint

In [10]: nrows = 10000000

In [11]: df = pd.concat([DataFrame(randn(int(nrows),34),columns=[ 'f%s' % i for i in range(34) ]),DataFrame(randint(0,10,size=int(nrows*19)).reshape(int(nrows),19),columns=[ 'i%s' % i for i in range(19) ])],axis=1)

In [12]: df.iloc[1000:10000,0:20] = np.nan

In [13]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Data columns (total 53 columns):
f0     9991000  non-null values
f1     9991000  non-null values
f2     9991000  non-null values
f3     9991000  non-null values
f4     9991000  non-null values
f5     9991000  non-null values
f6     9991000  non-null values
f7     9991000  non-null values
f8     9991000  non-null values
f9     9991000  non-null values
f10    9991000  non-null values
f11    9991000  non-null values
f12    9991000  non-null values
f13    9991000  non-null values
f14    9991000  non-null values
f15    9991000  non-null values
f16    9991000  non-null values
f17    9991000  non-null values
f18    9991000  non-null values
f19    9991000  non-null values
f20    10000000  non-null values
f21    10000000  non-null values
f22    10000000  non-null values
f23    10000000  non-null values
f24    10000000  non-null values
f25    10000000  non-null values
f26    10000000  non-null values
f27    10000000  non-null values
f28    10000000  non-null values
f29    10000000  non-null values
f30    10000000  non-null values
f31    10000000  non-null values
f32    10000000  non-null values
f33    10000000  non-null values
i0     10000000  non-null values
i1     10000000  non-null values
i2     10000000  non-null values
i3     10000000  non-null values
i4     10000000  non-null values
i5     10000000  non-null values
i6     10000000  non-null values
i7     10000000  non-null values
i8     10000000  non-null values
i9     10000000  non-null values
i10    10000000  non-null values
i11    10000000  non-null values
i12    10000000  non-null values
i13    10000000  non-null values
i14    10000000  non-null values
i15    10000000  non-null values
i16    10000000  non-null values
i17    10000000  non-null values
i18    10000000  non-null values
dtypes: float64(34), int64(19)

Timings (on a machine with specs similar to yours):

In [14]: %timeit df.mean()
1 loops, best of 3: 21.5 s per loop

You can get a 2x speedup by pre-converting to floats (mean does this internally, but in a more general way, so it is slower):

In [15]: %timeit df.astype('float64').mean()
1 loops, best of 3: 9.45 s per loop
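
To see why the pre-conversion helps, here is a minimal NumPy-level sketch of the same reduction, assuming the all-numeric df built above (np.nanmean requires NumPy 1.8+): one upfront conversion to a contiguous float64 array, then a single NaN-aware pass.

import numpy as np

arr = df.values.astype('float64')               # one upfront conversion; NaNs preserved
col_means = pd.Series(np.nanmean(arr, axis=0),  # NaN-aware column means
                      index=df.columns)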

Your problem is the object column. mean() will try to compute over all of the columns, but because of the object column everything is upcast to object dtype, which is not efficient for calculating.
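
For the original file, the object column is almost certainly date_time. One way to keep it out of the object path - a sketch, assuming that column parses cleanly as dates - is to have read_csv parse it into datetime64[ns] at load time, so it sits in its own block and the numeric columns are never upcast:

import pandas as pd

# Hypothetical re-load of the same file with the date column parsed up front.
data = pd.read_csv('data.csv', parse_dates=['date_time'])
column_means = data.mean()

(How mean() treats the remaining datetime column has varied across versions, so drop or exclude it explicitly if in doubt.)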

Best bet is to do

 df._get_numeric_data().mean()

There is a numeric_only option to do this at the lower level, but for some reason we don't directly support it via the top-level functions (e.g. mean). I think I will create an issue to add this parameter. However, it will probably be False by default (so nothing is excluded).
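
For readers on later pandas versions: numeric_only did eventually become a supported argument of the top-level reductions, and select_dtypes is a public alternative to the private _get_numeric_data(). A sketch of both:

import pandas as pd

# Both restrict the mean to numeric columns without touching private API.
means = df.mean(numeric_only=True)
means = df.select_dtypes(include='number').mean()

And if the ~22 Gb memory spike is the real blocker, the means can be accumulated over read_csv chunks so the full frame is never upcast or even held in memory at once - a sketch, assuming the file layout described in the question:

import pandas as pd

total, count = None, None
for chunk in pd.read_csv('data.csv', chunksize=1000000):
    num = chunk.select_dtypes(include='number')
    s, c = num.sum(), num.count()   # both skip NaN by default
    total = s if total is None else total.add(s, fill_value=0)
    count = c if count is None else count.add(c, fill_value=0)
column_means = total / count        # per-column mean over non-null values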
