How to encode two Pandas dataframes according to the same dummy vectors?

Problem description

I'm trying to encode categorical values as dummy vectors. pandas.get_dummies does the job well, but the resulting dummy columns depend on the values present in the DataFrame. How can I encode a second DataFrame using the same dummy columns as the first DataFrame?

import pandas as pd

df = pd.DataFrame({'cat1': ['A', 'N', 'K', 'P'], 'cat2': ['C', 'S', 'T', 'B']})
b = pd.get_dummies(df['cat1'], prefix='cat1').astype('int')
print(b)



   cat1_A  cat1_K  cat1_N  cat1_P
0       1       0       0       0
1       0       0       1       0
2       0       1       0       0
3       0       0       0       1



df_test = pd.DataFrame({'cat1': ['A', 'N'], 'cat2': ['T', 'B']})
c = pd.get_dummies(df_test['cat1'], prefix='cat1').astype('int')
print(c)

   cat1_A  cat1_N
0       1       0
1       0       1

How can I get this output instead?

   cat1_A  cat1_K  cat1_N  cat1_P
0       1       0       0       0
1       0       0       1       0

I was thinking of manually computing the unique values for each column and then building a dictionary to map the second DataFrame, but I'm sure a function for this already exists. Thanks!

Recommended answer

I had the same problem before. This is what I did; it is not necessarily the best way to do it, but it works for me.

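df = pd.DataFrame({'cat1': ['A', 'N'], 'cat2': ['C', 'S']})

# Declare the full set of categories up front. Older pandas versions accepted
# astype('category', categories=[...]); current versions expect a CategoricalDtype.
df['cat1'] = df['cat1'].astype(pd.CategoricalDtype(categories=['A', 'N', 'K', 'P']))
# then run get_dummies
b = pd.get_dummies(df['cat1'], prefix='cat1').astype('int')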

This uses astype with the category values passed in explicitly, so get_dummies creates a column for every declared category, including those that do not occur in the data.

To apply the same categories to all DataFrames, it is better to store the category values in variables, like so:

cat1_categories = ['A','N','K','P']
cat2_categories = ['C','S','T','B']
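
If hard-coding these lists is impractical, they could instead be derived from the first DataFrame, which is close to the manual approach mentioned in the question. The following is only a minimal sketch under that assumption; the names df_train and categories are hypothetical and do not appear in the original answer.

```python
import pandas as pd

# Hypothetical "training" DataFrame that defines the known category values.
df_train = pd.DataFrame({'cat1': ['A', 'N', 'K', 'P'], 'cat2': ['C', 'S', 'T', 'B']})

# Collect the sorted unique values of every column into a dict of lists.
categories = {
    col: sorted(df_train[col].dropna().unique())
    for col in df_train.columns
}

cat1_categories = categories['cat1']
cat2_categories = categories['cat2']
```

Note that lists derived this way are sorted alphabetically, so the resulting dummy columns would follow that order rather than the hand-written order above.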

Then use astype in the same way:

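df_test = pd.DataFrame({'cat1': ['A', 'N'], 'cat2': ['T', 'B']})
# astype('category', categories=...) no longer works in current pandas; use a CategoricalDtype
df_test['cat1'] = df_test['cat1'].astype(pd.CategoricalDtype(categories=cat1_categories))
c = pd.get_dummies(df_test['cat1'], prefix='cat1').astype('int')
print(c)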

   cat1_A  cat1_N  cat1_K  cat1_P
0       1       0       0       0
1       0       1       0       0
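
As an alternative that is not part of the original answer, the dummy columns of the second DataFrame can be aligned to those of the first with DataFrame.reindex. This is a sketch only; df_train, train_dummies and test_dummies are hypothetical names chosen for illustration.

```python
import pandas as pd

df_train = pd.DataFrame({'cat1': ['A', 'N', 'K', 'P'], 'cat2': ['C', 'S', 'T', 'B']})
df_test = pd.DataFrame({'cat1': ['A', 'N'], 'cat2': ['T', 'B']})

# Encode the first DataFrame; its columns define the target layout.
train_dummies = pd.get_dummies(df_train['cat1'], prefix='cat1').astype('int')

# Encode the second DataFrame, then align it to the first one's columns.
# Columns missing from the second DataFrame are added and filled with 0.
test_dummies = (
    pd.get_dummies(df_test['cat1'], prefix='cat1')
    .astype('int')
    .reindex(columns=train_dummies.columns, fill_value=0)
)
print(test_dummies)
```

Here the columns come out in the order produced for the first DataFrame (cat1_A, cat1_K, cat1_N, cat1_P). One caveat of this approach: reindex also silently drops any dummy column for a value that appears only in the second DataFrame.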
