How do you import a numerically encoded column in pandas?


Question

I'm importing a dataset that encodes a number of variables numerically, e.g.:

SEX
1 - Male
2 - Female

My best guess at how to convert these (so they appear in my dataframe as Male and Female instead of numbers) is to do something like this:

df.SEX.replace({1: 'Male', 2: 'Female'}, inplace=True)

Is there a better or more standard way to do this, ideally as part of my call to pd.read_fwf or as a single function for many columns? It's a fairly large file, and I have a lot of columns to re-encode this way.

Recommended answer

You can use categories for this:

df = pd.DataFrame({"Sex": [1, 2, 1, 1, 2, 1, 2]})

Change the dtype:

df["Sex"] = df["Sex"].astype("category")
print(df["Sex"])
Out[33]: 
0    1
1    2
2    1
3    1
4    2
5    1
6    2
Name: Sex, dtype: category
Categories (2, int64): [1, 2]

Rename the categories:

df["Sex"] = df["Sex"].cat.rename_categories(["Male", "Female"])
print(df)
Out[36]: 
      Sex
0    Male
1  Female
2    Male
3    Male
4  Female
5    Male
6  Female
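
The question also asks for a single function that handles many columns. One possible sketch follows, built on the same astype("category") and rename_categories calls. The helper name decode_columns, the codebooks dictionary, and the SMOKER and AGE columns are made up for illustration; they are not part of the original answer.

```python
import pandas as pd

# Hypothetical codebooks: column name -> {numeric code: label}.
codebooks = {
    "SEX": {1: "Male", 2: "Female"},
    "SMOKER": {0: "No", 1: "Yes"},
}


def decode_columns(frame, codebooks):
    """Return a copy of frame with each coded column turned into labelled categories."""
    out = frame.copy()
    for column, mapping in codebooks.items():
        # rename_categories accepts a dict mapping old categories to new ones.
        out[column] = out[column].astype("category").cat.rename_categories(mapping)
    return out


# Small made-up frame; AGE has no codebook entry and is left untouched.
df = pd.DataFrame({"SEX": [1, 2, 1], "SMOKER": [0, 0, 1], "AGE": [34, 51, 27]})
decoded = decode_columns(df, codebooks)
print(decoded)
```

Passing a dict to rename_categories ties each label to its code explicitly, so the result does not depend on the order in which pandas sorted the categories.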

I tried it on a ~75k dataset (the 30 most-reviewed beers from a beer reviews dataset):

# Build a dictionary that assigns each beer name a number from 0 to 29.
rep_dict = dict(zip(df.beer_name.unique(), range(len(df.beer_name.unique()))))

replace is quite slow:

%timeit df["beer_name"].replace(rep_dict)
10 loops, best of 3: 139 ms per loop

map is faster, as expected (because it looks only for exact matches):

%timeit df["beer_name"].map(rep_dict)
100 loops, best of 3: 2.78 ms per loop

Converting a column to the category dtype takes about as long as map:

%timeit df["beer_name"].astype("category")
100 loops, best of 3: 2.57 ms per loop

However, once the column is categorical, renaming the categories is far faster:

df["beer_name"] = df["beer_name"].astype("category")
%timeit df["beer_name"].cat.rename_categories(range(30))
10000 loops, best of 3: 149 µs per loop
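
The beer reviews data is not included here, so the following is only a rough way to repeat the comparison above on synthetic data of a similar shape. The beer names, the random seed, and the variable names are assumptions, and the timings it prints will differ from the figures quoted in this answer.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the beer data: 75,000 rows drawn from 30 made-up names.
rng = np.random.default_rng(0)
names = [f"beer_{i}" for i in range(30)]
s = pd.Series(rng.choice(names, size=75_000), name="beer_name")

# Same idea as rep_dict above: each name gets a number from 0 to 29.
rep_dict = {name: code for code, name in enumerate(names)}

via_map = s.map(rep_dict)
cat = s.astype("category")
via_category = cat.cat.rename_categories(rep_dict)

# Both routes yield the same codes; only the resulting dtype differs.
same = bool((via_map == via_category.astype(int)).all())
print(same)

# Rough timings, 10 runs each; the numbers depend on machine and pandas version.
t_map = timeit.timeit(lambda: s.map(rep_dict), number=10)
t_rename = timeit.timeit(lambda: cat.cat.rename_categories(rep_dict), number=10)
print(f"map: {t_map:.4f}s  rename_categories: {t_rename:.4f}s")
```

rename_categories only has to touch the 30 category labels rather than all 75,000 rows, which is the likely reason for the gap.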

So a second map costs as much as the first one, whereas once the column has the category dtype, each rename_categories call is much cheaper. In pandas versions before 0.19.0, the category dtype cannot be assigned while reading the file; you need to change the type afterwards.

As of version 0.19.0, you can pass dtype='category' to read_csv (or use a dictionary to specify which columns should be parsed as categories). (docs)
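
A minimal sketch of that approach, assuming a small made-up CSV in place of the real file (the AGE column and the sample values are invented). When a column is read with dtype 'category', its categories are parsed as strings, so the rename mapping below uses the string keys "1" and "2".

```python
import io

import pandas as pd

# Made-up CSV content standing in for the real file.
raw = io.StringIO("SEX,AGE\n1,34\n2,51\n1,27\n")

# Parse only the SEX column as a category; AGE keeps its inferred integer dtype.
df = pd.read_csv(raw, dtype={"SEX": "category"})

# The categories were read as the strings "1" and "2", so map those to labels.
df["SEX"] = df["SEX"].cat.rename_categories({"1": "Male", "2": "Female"})
print(df)
```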
