How do you import a numerically encoded column in pandas?
Question
I'm importing a dataset that encodes a number of variables numerically, e.g.:
SEX
1 - Male
2 - Female
My best guess at how to convert these (so they appear in my dataframe as Male and Female instead of numbers) is to do something like this:
df.SEX.replace({1: 'Male', 2: 'Female'}, inplace=True)
Is there a better or more standard way to do this (ideally as part of my call to pd.read_fwf, or as a single function for many columns)? It's a fairly large file, and I have a lot of columns to re-encode this way.
Recommended answer
You can use categories for this:
import pandas as pd

df = pd.DataFrame({"Sex": [1, 2, 1, 1, 2, 1, 2]})
Change the dtype:
df["Sex"] = df["Sex"].astype("category")
print(df["Sex"])
Out[33]:
0 1
1 2
2 1
3 1
4 2
5 1
6 2
Name: Sex, dtype: category
Categories (2, int64): [1, 2]
Rename the categories:
df["Sex"] = df["Sex"].cat.rename_categories(["Male", "Female"])
print(df)
Out[36]:
Sex
0 Male
1 Female
2 Male
3 Male
4 Female
5 Male
6 Female
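The same two steps can be wrapped in a helper that handles many columns at once. The following is only a sketch: the CODEBOOK dictionary, the SMOKER column, and the decode_columns function are hypothetical names made up for illustration, and it relies on rename_categories accepting a dict that maps old categories to new ones.

```python
import pandas as pd

# Hypothetical codebook: one {code: label} mapping per encoded column.
CODEBOOK = {
    "SEX": {1: "Male", 2: "Female"},
    "SMOKER": {0: "No", 1: "Yes"},
}


def decode_columns(df, codebook):
    """Return a copy of df with each listed column converted to a
    categorical whose categories are renamed using the codebook."""
    out = df.copy()
    for col, mapping in codebook.items():
        # Convert to category first, then rename the (few) categories
        # instead of touching every row.
        out[col] = out[col].astype("category").cat.rename_categories(mapping)
    return out


df = pd.DataFrame({"SEX": [1, 2, 1], "SMOKER": [0, 0, 1]})
decoded = decode_columns(df, CODEBOOK)
print(decoded)
```

Codes that are missing from a column's mapping are passed through unchanged, so an incomplete codebook leaves those values as numbers rather than raising an error.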
I tried it on a dataset of about 75k rows (the 30 most-reviewed beers from a beer reviews dataset):
# Build a dictionary that assigns each beer name a number from 0 to 29.
rep_dict = dict(zip(df.beer_name.unique(), range(len(df.beer_name.unique()))))
replace is quite slow:
%timeit df["beer_name"].replace(rep_dict)
10 loops, best of 3: 139 ms per loop
map is faster, as expected (because it looks for exact matches):
%timeit df["beer_name"].map(rep_dict)
100 loops, best of 3: 2.78 ms per loop
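Speed is not the only difference between the two calls. As a small sketch (the series and the codes dictionary here are made-up illustration data), map turns any value that is missing from the dictionary into NaN, while replace leaves such values untouched:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
codes = {1: "Male", 2: "Female"}  # note: no entry for 3

mapped = s.map(codes)        # the unmapped value 3 becomes NaN
replaced = s.replace(codes)  # the unmapped value 3 is kept as-is

print(mapped)
print(replaced)
```

If the codebook might be incomplete, checking mapped.isna() after map is a cheap way to spot codes that were not covered.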
Converting the column to the category dtype takes almost as long as map:
%timeit df["beer_name"].astype("category")
100 loops, best of 3: 2.57 ms per loop
However, once the column is categorical, renaming the categories is far faster:
df["beer_name"] = df["beer_name"].astype("category")
%timeit df["beer_name"].cat.rename_categories(range(30))
10000 loops, best of 3: 149 µs per loop
So a second map would take as much time as the first, but once you have converted the column to a category, rename_categories will be faster. Unfortunately, in older pandas versions the category dtype cannot be assigned while reading the file; you need to change the types afterwards.
As of version 0.19.0, you can pass dtype='category' to read_csv (or use a dictionary to specify which columns should be parsed as categories). (docs)
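A minimal sketch of that approach, using an in-memory CSV with made-up SEX and AGE columns in place of a real file. One caveat: with dtype='category' the categories are parsed as strings, so the renaming dictionary needs string keys such as "1" rather than the integer 1. The example uses read_csv as the answer does; whether the same dtype argument behaves identically in read_fwf is an assumption to verify against your pandas version.

```python
import io

import pandas as pd

# Made-up sample data standing in for the real file.
raw = io.StringIO("SEX,AGE\n1,34\n2,28\n1,51\n")

# Parse only the SEX column as a categorical while reading.
df = pd.read_csv(raw, dtype={"SEX": "category"})

# Categories read this way are strings, so the keys are "1" and "2".
df["SEX"] = df["SEX"].cat.rename_categories({"1": "Male", "2": "Female"})

print(df)
```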