Expanding pandas string column of floats memory-efficiently
Question
I have a DataFrame such as this:
df = pd.DataFrame([['Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]']],
                  columns=['col1', 'col2', 'values'])
The only differences are that I have a few million rows, and the values column is a string of exactly 200 floats in each row, instead of the 4 in my example.
The csv file containing this data is ~5 GB. However, this reduces when I load it into pandas after converting the first two string columns into categories. Hence I am able to perform most manipulations (filtering, slicing, indexing) with no performance issues.
I need to expand the values column of strings into separate columns of floats, so there will be 200 columns, each containing a float. I made an attempt at this, but I consistently run out of memory. Theoretically, I think this should be possible line by line in a memory-efficient way, since columns of floats should take less memory than many numbers in a string. What's a good algorithm for this?
My existing code for splitting the values column is below.
# strip the brackets (regex=False: treat '[' and ']' as literal characters)
df['values'] = df['values'].str.replace('[', '', regex=False).str.replace(']', '', regex=False)
# code runs out of memory in the next line!
df_values = pd.DataFrame([x.split(',') for x in df['values'].values.tolist()])
df_values = df_values.apply(pd.to_numeric, errors='coerce').fillna(0.0)
df = df.drop(columns='values').join(df_values)
Expected result for my sample, which the above code generates correctly for a small number of rows:
df = pd.DataFrame([['Col1Val', 'Col2Val', 3.0, 31.1, -341.4, 54.13]],
                  columns=['col1', 'col2', 0, 1, 2, 3])
To labour my reasoning for why I'm hoping (wishing!) for a "memory decreasing" solution: floats should normally take less space than strings:
from sys import getsizeof
getsizeof('334.34') #55
getsizeof(334.34) #24
getsizeof('-452.35614') #59
getsizeof(-452.35614) #24
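The same comparison holds at the column level: an object column holds one full Python str per cell, while a float64 column stores a flat 8 bytes per value. A minimal sketch of the difference (the series contents are illustrative):

```python
import pandas as pd

# One column of number-strings vs. the same data as float64
s_str = pd.Series(['334.34', '-452.35614'] * 1000)  # object dtype
s_flt = s_str.astype(float)                         # float64 dtype

# deep=True counts the actual string payloads, not just the object pointers
bytes_str = s_str.memory_usage(deep=True)
bytes_flt = s_flt.memory_usage(deep=True)
print(bytes_str, bytes_flt)  # the object column is several times larger
```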
Answer
For smaller datasets (see below if this method fails due to a memory issue), you can also try this:
df['values'].str[1:-1].str.split(",", expand=True).astype(float)
The first str[1:-1] operation removes the brackets.
str.split will then split the rest of the values by , and expand them into a dataframe (with expand=True):
     0     1      2      3
0  3.0  31.1 -341.4  54.13
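Putting it together on the sample frame, joined back onto the original columns as in the question's expected result (a sketch, not the only way to reassemble the frame):

```python
import pandas as pd

df = pd.DataFrame([['Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]']],
                  columns=['col1', 'col2', 'values'])

# strip brackets, split on commas, convert to float columns 0..3
expanded = df['values'].str[1:-1].str.split(',', expand=True).astype(float)
df = df.drop(columns='values').join(expanded)
print(df)
```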
You can also split on all of [, ,, and ] at once:
df['values'].str.split(r"[\[,\]]", expand=True).astype(float)
but this will result in two extra columns (empty, from the text before [ and after ]):

  0  1     2       3      4  5
0     3  31.1  -341.4  54.13
(For large datasets.)
One might try to fix it at the data-reading step instead:
df = pd.read_csv('test.csv', delimiter=',', quotechar='"')
Here, we change the quote char to ", so that the original quote char ' is ignored and the line is simply split by ,. We will then need some data preprocessing to fix the misparsed parts.
Given that my test.csv is
c1,c2,v1,v2,v3,v4
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'
'Col1Val', 'Col2Val', '[3, 31.1, -341.4, 54.13]'
the result of read_csv is
c1 c2 v1 v2 v3 v4
0 'Col1Val' 'Col2Val' '[3 31.1 -341.4 54.13]'
1 'Col1Val' 'Col2Val' '[3 31.1 -341.4 54.13]'
2 'Col1Val' 'Col2Val' '[3 31.1 -341.4 54.13]'
Now, we can use some str methods to fix each column. Note: if there is a comma inside c1/c2, the results will be wrong.
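For the misparsed frame above, the cleanup could look roughly like this; the exact str.strip calls are an assumption about which str methods would be used (the frame is reproduced directly rather than re-read from test.csv):

```python
import pandas as pd

# The misparsed result of read_csv, reproduced for a self-contained example
df = pd.DataFrame({
    'c1': ["'Col1Val'"] * 3,
    'c2': [" 'Col2Val'"] * 3,
    'v1': [" '[3"] * 3,
    'v2': [' 31.1'] * 3,
    'v3': [' -341.4'] * 3,
    'v4': [" 54.13]'"] * 3,
})

# Strip whitespace and stray quotes from the text columns
df['c1'] = df['c1'].str.strip(" '")
df['c2'] = df['c2'].str.strip(" '")

# Strip the leftover quote/bracket characters, then convert to float
df['v1'] = df['v1'].str.strip(" '[").astype(float)
df['v4'] = df['v4'].str.strip(" ]'").astype(float)
df[['v2', 'v3']] = df[['v2', 'v3']].astype(float)
print(df)
```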