pandas and category replacement

Problem Description

I'm trying to reduce the size of ~300 csv files (about a billion rows) by replacing lengthy fields with shorter, categorical values.

I'm making use of pandas, and I've iterated through each of the files to build an array that includes all of the unique values I'm trying to replace. I can't just use pandas.factorize on each file individually, because I need (for example) '3001958145' to map to the same value in file1.csv as well as file244.csv. I've created an array of the values I'd like to replace these with, simply by building another array of incrementing integers.

In [1]: toreplace = data['col1'].unique()
Out[1]: array([1000339602, 1000339606, 1000339626, ..., 3001958145, 3001958397,
   3001958547], dtype=int64)

In [2]: replacewith = range(0,len(data['col1'].unique()))
Out[2]: [0, 1, 2,...]

Now, how do I go about efficiently swapping in my 'replacewith' variable for each corresponding 'toreplace' value for each of the files I need to iterate through?
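
In other words, the goal is one shared lookup applied identically to every file, so the same original value always maps to the same small integer. As a rough sketch of that intent (the file paths are placeholders, and it assumes the toreplace/replacewith arrays from above):

import pandas as pd

# one global mapping: original value -> small integer
mapping = dict(zip(toreplace, replacewith))

# apply the same mapping to each file (paths are illustrative)
for path in ['file1.csv', 'file244.csv']:
    df = pd.read_csv(path)
    df['col1'] = df['col1'].map(mapping)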

Given how capable pandas is at dealing with categories, I assume there has to be a method out there that can accomplish this that I simply can't find. The function I wrote to do this works (it relies on a pandas.factorize input rather than the arrangement I described above), but it relies on the replace function and iterates through the series, so it's quite slow.

import numpy as np

def powerreplace(pdseries, factorized):
  # slow: one Series.replace call per unique value, looking up each
  # value's position in the factorized categories
  i = 0
  for unique in pdseries.unique():
    print('%i/%i' % (i, len(pdseries.unique())))
    i = i + 1
    pdseries.replace(to_replace=unique,
                     value=np.where(factorized[1] == unique)[0][0],
                     inplace=True)

Can anyone recommend a better way to go about doing this?

Solution

This requires at least pandas 0.15.0 (however, the .astype syntax is a bit friendlier in 0.16.0, so it's better to use that). See the pandas docs on categoricals.

Imports

In [101]: import pandas as pd
In [102]: import string
In [103]: import numpy as np    
In [104]: np.random.seed(1234)
In [105]: pd.set_option('max_rows',10)

Create a sample set of unique values to generate some data from

In [106]: uniques = np.array(list(string.ascii_letters))
In [107]: len(uniques)
Out[107]: 52

Create some data

In [109]: df1 = pd.DataFrame({'A' : uniques.take(np.random.randint(0,len(uniques)/2+5,size=1000000))})

In [110]: df1.head()
Out[110]: 
   A
0  p
1  t
2  g
3  v
4  m

In [111]: df1.A.nunique()
Out[111]: 31

In [112]: df2 = pd.DataFrame({'A' : uniques.take(np.random.randint(0,len(uniques),size=1000000))})

In [113]: df2.head()
Out[113]: 
   A
0  I
1  j
2  b
3  A
4  m

In [114]: df2.A.nunique()
Out[114]: 52

So we now have 2 frames that we want to categorize; the first frame happens to have fewer than the full set of categories. This is on purpose; you don't have to know the complete set upfront.

Convert the A columns to B columns that are Categoricals

In [116]: df1['B'] = df1['A'].astype('category')

In [118]: i = df1['B'].cat.categories

In [124]: i
Out[124]: Index([u'A', u'B', u'C', u'D', u'E', u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'm', u'n', u'o', u'p', u'q', u'r', u's', u't', u'u', u'v', u'w', u'x', u'y', u'z'], dtype='object')

If we are iteratively processing these frames, we use the first one to start. To get each successive one, we add the symmetric difference with the existing set. This keeps the categories in the same order, so when we factorize we get the same numbering scheme.

In [119]: cats = i.tolist() + i.sym_diff(df2['A'].astype('category').cat.categories).tolist()

We have now gained back the original set

In [120]: (np.array(sorted(cats)) == sorted(uniques)).all()
Out[120]: True
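
For the ~300 files in the question this step just repeats: keep a running, ordered index of categories and append only the values that haven't been seen yet. A rough sketch of that loop (the paths and the 'col1' column name are placeholders taken from the question, not part of this answer):

cats = pd.Index([])                            # running, ordered set of categories
for path in ['file1.csv', 'file244.csv']:      # placeholder paths
    seen = pd.Index(pd.read_csv(path)['col1'].unique())
    cats = cats.append(seen.difference(cats))  # append only unseen values, keep order

Because existing entries are never reordered, the codes already assigned to earlier values stay stable as later files contribute new categories.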

Set the next frame's B column to be a Categorical, BUT we specify the categories, so when it is factorized the same values are used

In [121]: df2['B'] = df2['A'].astype('category',categories=cats)
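
(A side note: on newer pandas versions the categories keyword to .astype has been removed, and Index.sym_diff is now called symmetric_difference. The equivalent modern spelling, as a sketch:)

cat_dtype = pd.CategoricalDtype(categories=cats)   # same categories, same order
df2['B'] = df2['A'].astype(cat_dtype)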

To prove it, we select the codes (the factorized map) from each. These codes match; df2 has an additional code (as Z is in the 2nd frame but not the first).

In [122]: df1[df1['B'].isin(['A','a','z','Z'])].B.cat.codes.unique()
Out[122]: array([30,  0,  5])

In [123]: df2[df2['B'].isin(['A','a','z','Z'])].B.cat.codes.unique()
Out[123]: array([ 0, 30,  5, 51])

You can then simply store the codes in lieu of the object-dtyped data.
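
For example, something along these lines (a sketch; it assumes the shared cats list is kept around so the codes can be turned back into values later):

# store the small integer codes instead of the original object values
codes = df2['B'].cat.codes

# later, reconstruct the categorical from the codes plus the shared categories
restored = pd.Categorical.from_codes(codes, categories=cats)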

Note that it is actually quite efficient to serialize these to HDF5 as Categoricals are natively stored, see here
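
For example (a sketch assuming PyTables is installed; the file name and key are placeholders):

# 'table' format stores the Categorical natively and round-trips it
df2.to_hdf('data.h5', key='df2', format='table', mode='w')
back = pd.read_hdf('data.h5', key='df2')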

Note that this is a pretty memory-efficient way of storing the data. The object dtype's real memory usage is actually MUCH higher than the figure shown in [154], and it grows with the length of the strings, because that figure only counts the pointers; the actual values are stored on the heap. [155], in contrast, is ALL of the memory used.

In [153]: df2.dtypes
Out[153]: 
A      object
B    category
dtype: object

In [154]: df2.A.to_frame().memory_usage()
Out[154]: 
A    8000000
dtype: int64

In [155]: df2.B.to_frame().memory_usage()
Out[155]: 
B    1000416
dtype: int64
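
As a footnote to the numbers above: memory_usage(deep=True), available on later pandas versions, also counts the string contents on the heap for the object column, which makes the contrast even starker. A sketch (outputs omitted, as they vary by platform and pandas version):

df2.A.to_frame().memory_usage(deep=True)   # counts the Python string objects too
df2.B.to_frame().memory_usage(deep=True)   # codes plus one copy of each category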
