Pandas: variable number of columns to binary matrix
Question
I am currently working with a data set (a CSV file) that doesn't have a fixed number of columns. However, I want to convert it to a binary matrix that has a fixed number of columns.
As an example, the current data set looks like this (no headers):
a,b,x,z,y
b,e,w,t,u,o,s,z,i
z,o,w
o,p,w,z,a
I want this converted to the following (the first row is the header):
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1
0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,0,1,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,1
The main problem I am experiencing is the varying number of columns in the data set. The pseudocode, or logic, I was considering is this:
header=[a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z]
data_frame=csv file
df=new data frame
for each row in data_frame:
    for each item in row:
        create pandas Series
        if item in header:
            append '1' to Series
        else:
            append '0' to Series
    append series to df
Finally, the matrix should be written to another CSV file.
I have fair knowledge of Python but not of pandas. Therefore, I am kindly asking for some help with this, as I can't seem to find a way to do it. Thank you!
Recommended answer
Here is one way to do it using pd.get_dummies().
import pandas as pd
# read your csv data; the separator must not be ',' (for example, use a tab, '\t')
# =======================================================================
# I just read from clipboard
df = pd.read_clipboard(header=None, sep='\t')
df
                   0
0          a,b,x,z,y
1  b,e,w,t,u,o,s,z,i
2              z,o,w
3          o,p,w,z,a
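When the data lives in a file rather than on the clipboard, the same single-column read can be sketched with pd.read_csv. This is only a minimal sketch: the inline string `raw` is a stand-in for the real file (an assumption for illustration), and it assumes no line contains a tab character.

```python
import io
import pandas as pd

# Stand-in for the real CSV file; with a file on disk, pass its path instead.
raw = "a,b,x,z,y\nb,e,w,t,u,o,s,z,i\nz,o,w\no,p,w,z,a\n"

# A tab separator never matches here, so each whole line lands in column 0.
df = pd.read_csv(io.StringIO(raw), header=None, sep="\t")
print(df)
```

Reading every line as one string sidesteps the ragged row lengths; the splitting on ',' happens afterwards.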
# step 1
# =========================
df1 = df.groupby(level=0).apply(lambda group: pd.Series(group.values.ravel().tolist()[0].split(',')))
df1
0  0    a
   1    b
   2    x
   3    z
   4    y
1  0    b
   1    e
   2    w
   3    t
   4    u
       ..
   7    z
   8    i
2  0    z
   1    o
   2    w
3  0    o
   1    p
   2    w
   3    z
   4    a
dtype: object
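A shorter route to a similar long format is to split each string and explode the resulting lists. This is a sketch rather than the answer's method, and it assumes pandas 0.25 or later (for Series.explode). Unlike the output above, the result carries the original row label repeated in a flat index instead of a two-level index, which is still enough for a later groupby(level=0).

```python
import pandas as pd

# Same sample rows as in the question, held as one string per row.
df = pd.DataFrame({0: ["a,b,x,z,y", "b,e,w,t,u,o,s,z,i", "z,o,w", "o,p,w,z,a"]})

# Split each row into a list of tokens, then emit one token per output row.
tokens = df[0].str.split(",").explode()
print(tokens.head())
```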
# step 2
# =========================
# dtype=int keeps the 0/1 output on pandas versions where get_dummies defaults to booleans
pd.get_dummies(df1, dtype=int).groupby(level=0).max()
   a  b  e  ...  x  y  z
0  1  1  0  ...  1  1  1
1  0  1  1  ...  0  0  1
2  0  0  0  ...  0  0  1
3  1  0  0  ...  0  0  1

[4 rows x 13 columns]
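The result above has 13 columns, one per letter that actually occurs, whereas the question asks for a fixed a to z header. One possible way to get there is sketched below. It swaps in Series.str.get_dummies, which splits and encodes in one call, and then reindexes the columns; the names `header` and `matrix` are chosen here for illustration.

```python
import string
import pandas as pd

# Same sample rows as in the question, held as one string per row.
df = pd.DataFrame({0: ["a,b,x,z,y", "b,e,w,t,u,o,s,z,i", "z,o,w", "o,p,w,z,a"]})

# One indicator column per distinct token found in the data.
dummies = df[0].str.get_dummies(sep=",")

# Force the fixed a..z header; letters that never occur become all-zero columns.
header = list(string.ascii_lowercase)
matrix = dummies.reindex(columns=header, fill_value=0)
print(matrix.shape)
```

Reindexing also drops any token that is not in `header`, so the column count stays fixed whatever the input contains.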
# step 3, to_csv()
# =========================
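Step 3 is left without code above. A minimal sketch of writing the matrix out follows, assuming the result of step 2 is held in a DataFrame; the small `matrix` and the file name binary_matrix.csv are placeholders for illustration, not from the original answer.

```python
import os
import tempfile
import pandas as pd

# Small stand-in for the binary matrix produced in step 2.
matrix = pd.DataFrame([[1, 1, 0], [0, 1, 1]], columns=["a", "b", "e"])

# Hypothetical output path; index=False keeps the row labels out of the file.
out_path = os.path.join(tempfile.gettempdir(), "binary_matrix.csv")
matrix.to_csv(out_path, index=False)

# Read the file back to confirm the header row and the 0/1 rows.
with open(out_path) as fh:
    written = fh.read()
print(written)
```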