Pandas: variable number of columns to binary matrix


Question

I am currently working with a data set (a CSV file) that doesn't have a fixed number of columns. However, I want to convert it to a binary matrix that has a fixed number of columns.

As an example, the current data set looks like this (no headers):

a,b,x,z,y
b,e,w,t,u,o,s,z,i
z,o,w
o,p,w,z,a

I want this to be converted to the following (the first row is the header):

a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z

1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1
0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,0,1,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,1

The main problem I am experiencing is the varying number of columns in the data set. The pseudocode or logic I was considering is this:

header = [a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z]
data_frame = csv file
df = new data frame
for each row in data_frame:
    create pandas Series
    for each item in header:
        if item in row:
            append '1' to Series
        else:
            append '0' to Series
    append Series to df

Finally, the matrix should be written to another CSV file.

I have a fair knowledge of Python but not of pandas. I would therefore appreciate some help with this, as I can't seem to find a way to do it. Thank you!

Recommended answer

Here is one way to do it using pd.get_dummies().

import pandas as pd

# Read your CSV data. The separator must NOT be ',' so that each line
# lands in a single column; for example, set it to tab `\t`.
# =======================================================================
# Here I just read from the clipboard
df = pd.read_clipboard(header=None, sep='\t')

df
                   0
0          a,b,x,z,y
1  b,e,w,t,u,o,s,z,i
2              z,o,w
3          o,p,w,z,a

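# step 1: split each line on ',' and stack the items into one long
# Series indexed by (original row, position within the row)
# =========================
df1 = df.groupby(level=0).apply(lambda group: pd.Series(group.values.ravel().tolist()[0].split(',')))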

df1

0  0    a
   1    b
   2    x
   3    z
   4    y
1  0    b
   1    e
   2    w
   3    t
   4    u
       ..
   7    z
   8    i
2  0    z
   1    o
   2    w
3  0    o
   1    p
   2    w
   3    z
   4    a
dtype: object


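# step 2: one indicator column per item, then collapse back to one row
# per original line (dtype=int keeps 0/1 instead of True/False)
# =========================
pd.get_dummies(df1, dtype=int).groupby(level=0).max()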

   a  b  e ...  x  y  z
0  1  1  0 ...  1  1  1
1  0  1  1 ...  0  0  1
2  0  0  0 ...  0  0  1
3  1  0  0 ...  0  0  1

[4 rows x 13 columns]
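The result above has only the 13 letters that actually occur in the sample, while the question asks for a fixed a-to-z header. A minimal sketch of one way to pad it, assuming the fixed header is the 26 lowercase ASCII letters and using `DataFrame.reindex`; the sample frame is rebuilt inline (via `str.split` and `explode` rather than the `groupby.apply` of step 1) so the block runs on its own:

```python
import string

import pandas as pd

# Sample rows from the question, one comma-joined string per line.
df = pd.DataFrame({0: ["a,b,x,z,y", "b,e,w,t,u,o,s,z,i", "z,o,w", "o,p,w,z,a"]})

# One row per item; the original line number is kept as the index.
items = df[0].str.split(",").explode()

# Indicator columns, collapsed back to one row per original line.
dummies = pd.get_dummies(items, dtype=int).groupby(level=0).max()

# Assumed fixed header: the 26 lowercase ASCII letters.
header = list(string.ascii_lowercase)

# Add the missing letters as all-zero columns and fix the column order.
matrix = dummies.reindex(columns=header, fill_value=0)
```

Note that `reindex` would also silently drop any item that is not in `header`, which may or may not be what you want for real data.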

# step 3, to_csv()
# =========================
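Step 3 is left without code above. As a rough end-to-end sketch, not the answer's exact method, the following reads the lines as plain strings, builds the indicators with `Series.str.get_dummies(sep=",")` (a shorter route than steps 1 and 2), pads to an assumed a-to-z header, and writes the result with `to_csv`. The inline `raw` string and the output file name are placeholders for illustration.

```python
import string
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the input file's contents; with a real file you might use
# Path("input.csv").read_text() instead (hypothetical file name).
raw = "a,b,x,z,y\nb,e,w,t,u,o,s,z,i\nz,o,w\no,p,w,z,a\n"

# One string per input line, so the ragged column count never matters.
lines = pd.Series(raw.splitlines())

# Split on ',' and build 0/1 indicator columns, then pad to a fixed
# header (assumed here to be the 26 lowercase ASCII letters).
matrix = (
    lines.str.get_dummies(sep=",")
    .reindex(columns=list(string.ascii_lowercase), fill_value=0)
)

# Hypothetical output location; index=False omits the row-number column.
out_path = Path(tempfile.gettempdir()) / "binary_matrix.csv"
matrix.to_csv(out_path, index=False)
```

Reading the file as plain lines sidesteps the need to pick a separator that never occurs in the data, which the `sep='\t'` trick above relies on.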
