Expanding pandas string column of floats memory-efficiently

Question

I have a DataFrame such as this:

df = pd.DataFrame([['Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]']],
                  columns=['col1', 'col2','values'])

The only differences are I have a few million rows and column values is a string of exactly 200 floats in each row, instead of 4 in my example.

The csv file containing this data is ~5 GB. However, this reduces when I load into pandas after converting the first 2 string columns into categories. Hence I am able to perform most manipulations (filtering, slicing, indexing) with no performance issues.

I need to expand the values column of strings into separate columns of floats. So there will be 200 columns each containing a float. I made an attempt at performing this, but I consistently run out of memory. Theoretically, I think this should be possible line by line in a memory efficient way, since columns of floats should take less memory than many numbers in a string. What's a good algorithm for this?

My existing code is below for splitting values column.

# regex=False: '[' and ']' are literal characters, not regex metacharacters
df['values'] = df['values'].str.replace('[', '', regex=False).str.replace(']', '', regex=False)

# code runs out of memory in next line!
df_values = pd.DataFrame([x.split(',') for x in df['values'].values.tolist()])

df_values = df_values.apply(pd.to_numeric, errors='coerce').fillna(0.0)

df = df.drop(columns='values').join(df_values)

Expected result for my sample, which the above code generates correctly for a small number of rows:

df = pd.DataFrame([['Col1Val', 'Col2Val', 3.0, 31.1, -341.4, 54.13]],
                  columns=['col1', 'col2', 0, 1, 2, 3])

To labour my reasoning for why I'm hoping (wishing!) for a "memory decreasing" solution: floats should normally take less space than strings:

from sys import getsizeof

getsizeof('334.34')      #55
getsizeof(334.34)        #24
getsizeof('-452.35614')  #59
getsizeof(-452.35614)    #24
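
The same point can be made at the column level: a NumPy `float64` array stores every value in a fixed 8 bytes, so a row of 200 floats should cost only 1600 bytes of payload, regardless of how many digits each number has in its string form. A quick check of the target representation:

```python
import numpy as np

# 200 floats stored as a float64 array: 8 bytes per element
arr = np.zeros(200, dtype=np.float64)
print(arr.nbytes)  # 1600
```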

Answer

For smaller datasets: (see below if this method fails due to memory issues.)

You can also try this:

df['values'].str[1:-1].str.split(",", expand=True).astype(float)

The first str[1:-1] operation removes the brackets.

str.split will split the rest of the values by , and expand it into a dataframe (with the expand=True)

    0       1       2       3
0   3.0     31.1    -341.4  54.13
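
Put together as a minimal, self-contained sketch (the sample frame and column names are taken from the question):

```python
import pandas as pd

df = pd.DataFrame([['Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]']],
                  columns=['col1', 'col2', 'values'])

# Strip the brackets, split on commas, and expand into float columns.
expanded = df['values'].str[1:-1].str.split(',', expand=True).astype(float)

# Join the float columns back in place of the string column.
result = df.drop(columns='values').join(expanded)
print(result)
```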

You can also split on the brackets and commas in a single regex pass:

df['values'].str.split(r"[\[,\]]", expand=True).astype(float)

But this will result in two extra empty columns at the edges:

    0   1   2       3       4       5
0       3   31.1    -341.4  54.13   
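
If you go this route, the empty edge columns can be cleaned up afterwards. A sketch, reusing the sample frame from the question (the cleanup step is one possible approach, not part of the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]']],
                  columns=['col1', 'col2', 'values'])

# Splitting on '[', ',' and ']' leaves empty strings at both edges.
parts = df['values'].str.split(r"[\[,\]]", expand=True)

# Turn empty strings into NaN, drop the all-NaN edge columns, then cast.
cleaned = parts.replace('', np.nan).dropna(axis=1, how='all').astype(float)
print(cleaned)
```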

(For large dataset.)

One might instead fix this at the data-reading stage.

df = pd.read_csv('test.csv', delimiter=',', quotechar='"')

Here, we change the quote char to " so that the original quote char ' is ignored and the file is simply split on ,. We then need some preprocessing to fix the misparsed columns.

Given that my test.csv is:

 c1,c2,v1,v2,v3,v4
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'

the result of read_csv is:

    c1          c2          v1      v2      v3      v4
0   'Col1Val'   'Col2Val'   '[3     31.1    -341.4  54.13]'
1   'Col1Val'   'Col2Val'   '[3     31.1    -341.4  54.13]'
2   'Col1Val'   'Col2Val'   '[3     31.1    -341.4  54.13]'

Now, we can use some str methods to fix each column. Note: if there is a comma in c1/c2, the results will be wrong.
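
One possible cleanup sketch, using an in-memory stand-in for test.csv so the example is self-contained (the `strip` calls are an assumption about what "some str methods" would look like, not the answer's exact code):

```python
import io
import pandas as pd

# In-memory stand-in for the test.csv shown above.
csv_text = """c1,c2,v1,v2,v3,v4
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'
"""

df = pd.read_csv(io.StringIO(csv_text), delimiter=',', quotechar='"')

# Strip the stray quotes from the string columns, and the stray
# quote/bracket from the first and last value columns, then cast.
df['c1'] = df['c1'].str.strip(" '")
df['c2'] = df['c2'].str.strip(" '")
df['v1'] = df['v1'].str.strip(" '[").astype(float)
df['v4'] = df['v4'].str.strip(" ]'").astype(float)
df[['v2', 'v3']] = df[['v2', 'v3']].astype(float)
print(df)
```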
