Replacing Category Data (pandas)


Problem description

I have some large files with several category columns. "Category" is a generous word here, because the values are basically descriptions or partial sentences.

Here is the number of unique values in each column:

Category 1 = 15 
Category 2 = 94
Category 3 = 294
Category 4 = 401

Location 1 = 30
Location 2 = 60 

Then there are also user fields with recurring data (first name, last name, IDs, etc.).

I was thinking of the following solutions to make the file size smaller:

1) Create a file which matches each category with a unique integer

2) Create a map (is there a way to do this by reading another file? For example, could I create a .csv, load it as another dataframe, and then match against it? Or do I literally have to type it out initially?)

OR

3) Basically do a join (VLOOKUP) and then del the old column with the long object names

df1 = pd.merge(df1, categories, on='Category1', how='left')  # merge returns a new DataFrame
del df1['Category1']

What do people normally do in this case? These files are pretty huge: 60 columns, and most of the data are long, repeating categories and timestamps. There is literally no numerical data at all. It's fine for me, but sharing the files for more than a few months is almost impossible due to shared drive space allocations.

Recommended answer

To benefit from the Categorical dtype when saving to csv, you might want to follow this process:

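  1. Extract the category definitions into separate DataFrames/files
  2. Convert the categorical data to int codes
  3. Save the converted DataFrame to csv along with the definition DataFrames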

When you need to use them again:

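  1. Restore the DataFrames from the csv files
  2. Map the DataFrame with int codes to the category definitions
  3. Convert the mapped columns to Categorical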
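Before the full walkthrough, the two lists above can be sketched on a single column held in memory. This is only a minimal illustration: the colour labels are made up, and `pd.Categorical.from_codes` is used here as one possible way to do the restore step (the walkthrough below uses `map` instead).

```python
import pandas as pd

# Hypothetical sample column; the labels are invented for illustration.
s = pd.Series(['red', 'green', 'red', 'blue'], dtype='category')

# Saving side: keep each label once, and the data as small integer codes.
definitions = list(s.cat.categories)   # position i holds the label for code i
codes = s.cat.codes                    # one small integer per row

# Restoring side: rebuild the Categorical from the codes and the definitions.
restored = pd.Series(pd.Categorical.from_codes(codes.to_numpy(), categories=definitions))

print(definitions)
print(codes.tolist())
print(restored.tolist())
```

Only `codes` and `definitions` need to be written to disk; the long labels are stored once instead of once per row.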

To illustrate the process, first make a sample DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame(index=np.arange(0, 100000))
df.index.name = 'index'
df['categories'] = 'Category'
df['locations'] = 'Location'
# Repeat 1..4 and 1..2 to fill all rows; // keeps the repeat count an integer
n1 = np.tile(np.arange(1, 5), df.shape[0] // 4)
n2 = np.tile(np.arange(1, 3), df.shape[0] // 2)
df['categories'] = df['categories'] + pd.Series(n1).astype(str)
df['locations'] = df['locations'] + pd.Series(n2).astype(str)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories    100000 non-null object
locations     100000 non-null object
dtypes: object(2)
memory usage: 2.3+ MB
None

Note the size: 2.3+ MB. This would be roughly the size of your csv file. Now convert these data to Categorical:

df['categories'] = df['categories'].astype('category')
df['locations'] = df['locations'].astype('category')
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories    100000 non-null category
locations     100000 non-null category
dtypes: category(2)
memory usage: 976.6 KB
None
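The "+" in "2.3+ MB" above appears because `info()` only counts the pointers of an object column by default, not the strings they point to. As a rough sketch of how to see the full difference, `memory_usage(deep=True)` can be compared for the two dtypes. The column below is a smaller, rebuilt stand-in for the sample data (10,000 rows chosen arbitrarily), not the original frame.

```python
import numpy as np
import pandas as pd

# Rebuild a small version of the sample column (size chosen for illustration).
labels = 'Category' + pd.Series(np.tile(np.arange(1, 5), 2500)).astype(str)

as_object = labels.astype(object)
as_category = labels.astype('category')

# deep=True also counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)

print(object_bytes, category_bytes)
```

The categorical version stores each label once plus one small integer per row, so the gap grows with the length of the labels.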

Note the drop in memory usage down to 976.6 KB. But if you save it to csv now:

df.to_csv('test1.csv')

...you would see this inside the file:

index,categories,locations
0,Category1,Location1
1,Category2,Location2
2,Category3,Location1
3,Category4,Location2

This means the Categorical columns have been converted back to strings for saving in csv. So, after saving the definitions, let's get rid of the labels in the Categorical data:

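# Keyed by the row index of each label's first occurrence. This matches the
# .cat.codes values only because the sample labels first appear in sorted order.
categories_details = pd.DataFrame(df.categories.drop_duplicates(), columns=['categories'])
print(categories_details)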

      categories
index           
0      Category1
1      Category2
2      Category3
3      Category4

locations_details = pd.DataFrame(df.locations.drop_duplicates(), columns=['locations'])
print(locations_details)

       locations
index           
0      Location1
1      Location2
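One caveat worth spelling out: the integer codes produced by `.cat.codes` are positions in `.cat.categories`, not the row index where a label first appears. In the sample data the two happen to coincide. A sketch of a definitions table that does not rely on that coincidence is shown below; the city names are invented so that the labels do not first appear in sorted order.

```python
import pandas as pd

# Hypothetical data where labels do not first appear in sorted order.
df = pd.DataFrame({'locations': ['Paris', 'Berlin', 'Paris', 'Athens']})
df.index.name = 'index'
df['locations'] = df['locations'].astype('category')

# Position i in .cat.categories is the label for code i.
locations_details = pd.DataFrame({'locations': df['locations'].cat.categories})
locations_details.index.name = 'index'

codes = df['locations'].cat.codes
decoded = codes.map(locations_details['locations'])

print(locations_details)
print(codes.tolist())
print(decoded.tolist())
```

With `drop_duplicates()` the table for this data would be keyed 0, 1, 3 (first-occurrence rows), which would not line up with the codes.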

Now convert Categorical to int dtype:

for col in df.select_dtypes(include=['category']).columns:
    df[col] = df[col].cat.codes
print(df.head())

       categories  locations
index                       
0               0          0
1               1          1
2               2          0
3               3          1
4               0          0

print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories    100000 non-null int8
locations     100000 non-null int8
dtypes: int8(2)
memory usage: 976.6 KB
None

Save the converted data to csv and note that the file now contains only numbers, without labels. The file size will also reflect this change.

df.to_csv('test2.csv')

index,categories,locations
0,0,0
1,1,1
2,2,0
3,3,1
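To get a feel for the size difference without writing anything to disk, the csv text can be rendered into memory and measured. The sketch below rebuilds a smaller version of the sample frame (10,000 rows, an arbitrary choice), and `csv_size` is a helper name introduced here for illustration only.

```python
import io

import numpy as np
import pandas as pd

# Rebuild a smaller version of the sample frame (10,000 rows for illustration).
n = 10_000
df = pd.DataFrame({
    'categories': 'Category' + pd.Series(np.tile(np.arange(1, 5), n // 4)).astype(str),
    'locations': 'Location' + pd.Series(np.tile(np.arange(1, 3), n // 2)).astype(str),
}).astype('category')
df.index.name = 'index'


def csv_size(frame):
    """Return the number of characters to_csv would write for this frame."""
    buffer = io.StringIO()
    frame.to_csv(buffer)
    return len(buffer.getvalue())


with_labels = csv_size(df)
with_codes = csv_size(df.apply(lambda col: col.cat.codes))

print(with_labels, with_codes)
```

The saving depends on how long the labels are compared with the integer codes, so descriptive, sentence-like categories should benefit more than the short labels used here.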

Also save the definitions:

categories_details.to_csv('categories_details.csv')
locations_details.to_csv('locations_details.csv')

When you need to restore the data, load it from the csv files:

df2 = pd.read_csv('test2.csv', index_col='index')
print(df2.head())

       categories  locations
index                       
0               0          0
1               1          1
2               2          0
3               3          1
4               0          0

print(df2.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories    100000 non-null int64
locations     100000 non-null int64
dtypes: int64(2)
memory usage: 2.3 MB
None
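The codes come back as int64 because `read_csv` picks the default integer type, which is why memory usage is back at 2.3 MB at this point. If that intermediate size matters, one option is to pass `dtype` when reading. The sketch below uses an in-memory stand-in for test2.csv and assumes every code fits in int8; columns with more distinct values (such as the 294 and 401 value columns in the question) would need a wider type such as int16.

```python
import io

import pandas as pd

# Stand-in for test2.csv, held in memory so the sketch is self-contained.
csv_text = "index,categories,locations\n0,0,0\n1,1,1\n2,2,0\n3,3,1\n"

df2 = pd.read_csv(
    io.StringIO(csv_text),
    index_col='index',
    dtype={'categories': 'int8', 'locations': 'int8'},  # assumes codes fit in int8
)

print(df2.dtypes)
```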

categories_details2 = pd.read_csv('categories_details.csv', index_col='index')
print(categories_details2.head())

      categories
index           
0      Category1
1      Category2
2      Category3
3      Category4

print(categories_details2.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 1 columns):
categories    4 non-null object
dtypes: object(1)
memory usage: 64.0+ bytes
None

locations_details2 = pd.read_csv('locations_details.csv', index_col='index')
print(locations_details2.head())

       locations
index           
0      Location1
1      Location2

print(locations_details2.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 1 columns):
locations    2 non-null object
dtypes: object(1)
memory usage: 32.0+ bytes
None

Now use map to replace the int-coded data with the category descriptions and convert the columns to Categorical:

df2['categories'] = df2.categories.map(categories_details2.to_dict()['categories']).astype('category')
df2['locations'] = df2.locations.map(locations_details2.to_dict()['locations']).astype('category')
print(df2.head())

      categories  locations
index                      
0      Category1  Location1
1      Category2  Location2
2      Category3  Location1
3      Category4  Location2
4      Category1  Location1

print(df2.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories    100000 non-null category
locations     100000 non-null category
dtypes: category(2)
memory usage: 976.6 KB
None

Note that the memory usage is back to what it was when we first converted the data to Categorical. It should not be hard to automate this process if you need to repeat it many times.
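As one possible shape for that automation, the sketch below wraps the save and restore steps in two helpers. The names `save_encoded` and `load_encoded`, the file naming scheme, and the sample data are all assumptions made for illustration. It takes the definitions from `.cat.categories` and restores with `pd.Categorical.from_codes` rather than `map`, and it does not handle labels that `read_csv` would interpret as missing values (for example the literal string "NA").

```python
import os
import tempfile

import pandas as pd


def save_encoded(df, directory, name):
    """Write int codes to <name>.csv and one <name>_<col>_details.csv per category column."""
    encoded = df.copy()
    for col in df.select_dtypes(include=['category']).columns:
        # Position i in .cat.categories is the label for code i.
        details = pd.DataFrame({col: df[col].cat.categories})
        details.to_csv(os.path.join(directory, f'{name}_{col}_details.csv'), index_label='index')
        encoded[col] = df[col].cat.codes
    encoded.to_csv(os.path.join(directory, f'{name}.csv'), index_label='index')


def load_encoded(directory, name, category_columns):
    """Read the files written by save_encoded and rebuild the Categorical columns."""
    df = pd.read_csv(os.path.join(directory, f'{name}.csv'), index_col='index')
    for col in category_columns:
        details = pd.read_csv(os.path.join(directory, f'{name}_{col}_details.csv'), index_col='index')
        df[col] = pd.Categorical.from_codes(df[col].to_numpy(), categories=details[col].tolist())
    return df


# Usage with made-up data, written to a temporary directory.
original = pd.DataFrame({
    'categories': ['b', 'a', 'b'],
    'locations': ['x', 'y', 'x'],
}).astype('category')

with tempfile.TemporaryDirectory() as tmp:
    save_encoded(original, tmp, 'test')
    restored = load_encoded(tmp, 'test', ['categories', 'locations'])

print(restored)
```

The caller passes the list of category columns to `load_encoded` explicitly; deriving it from the details file names would be another reasonable design.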
