pandas equivalent of Stata's encode

Question

I'm looking for a way to replicate the encode behaviour in Stata, which will convert a categorical string column into a number column.

import pandas as pd

x = pd.DataFrame({'cat': ['A', 'A', 'B'], 'val': [10, 20, 30]})
x = x.set_index('cat')

This results in:

     val
cat     
A     10
A     20
B     30

I'd like to convert the cat column from strings to integers, mapping each unique string to an (arbitrary) integer 1-to-1. It would result in:

     val
cat     
1     10
1     20
2     30

Or, equally good:

  cat  val
0   1   10
1   1   20
2   2   30

Any suggestions?

Many thanks, Rob

Recommended answer

Stata's encode command starts with a string variable and creates a new integer variable with labels mapped to the original string variable. The direct analog of this in pandas would now be the categorical variable type which became a full-fledged part of pandas starting in 0.15 (which was released after this question was originally asked and answered).

See the pandas documentation on categorical data.

To demonstrate for this example, the Stata command would be something like:

encode cat, generate(cat2)

whereas the pandas command would be:

x['cat2'] = x['cat'].astype('category')

  cat  val cat2
0   A   10    A
1   A   20    A
2   B   30    B
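
Note that the question's snippet moves `cat` into the index, in which case `x['cat']` would raise a `KeyError`; the output above assumes `cat` is still an ordinary column. A minimal self-contained sketch, assuming you start from the indexed frame in the question and restore the column with `reset_index()` first:

```python
import pandas as pd

x = pd.DataFrame({'cat': ['A', 'A', 'B'], 'val': [10, 20, 30]})
x = x.set_index('cat')   # as in the question: 'cat' is now the index

x = x.reset_index()      # move 'cat' back to a regular column
x['cat2'] = x['cat'].astype('category')

print(x)
print(x.dtypes)          # 'cat2' is reported with dtype 'category'
```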

Just as Stata does with encode, the data are stored as integers, but display as strings in the default output.

You can verify this by using the categorical accessor cat to see the underlying integer. (And for that reason you probably don't want to use 'cat' as a column name.)

x['cat2'].cat.codes

0    0
1    0
2    1
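
To get the integer column the question asks for, the codes can be stored directly. The sketch below is one possible way to do that, not the only one: it shifts the codes by one because pandas codes start at 0 while Stata's encode numbers from 1, and it keeps an integer-to-string lookup in the spirit of a Stata value label. The names `cat_code` and `labels` are illustrative and do not come from the original answer.

```python
import pandas as pd

x = pd.DataFrame({'cat': ['A', 'A', 'B'], 'val': [10, 20, 30]})
x['cat2'] = x['cat'].astype('category')

# pandas codes are 0-based; add 1 to mimic Stata's 1-based numbering.
# Caveat: missing values have code -1, so they would become 0 here.
x['cat_code'] = x['cat2'].cat.codes + 1

# Integer-to-string lookup, playing the role of a Stata value label.
labels = dict(enumerate(x['cat2'].cat.categories, start=1))

print(x[['cat_code', 'val']])
print(labels)
```

If the string labels are not needed afterwards, `pd.factorize(x['cat'])` is an alternative that returns the integer codes and the unique values in order of first appearance.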
