Need to create a pandas DataFrame by reading a csv file with random columns
Problem description
I have the following csv file with records:
- A 1, B 2, C 10, D 15
- A 5, D 10, G 2
- D 6, E 7
- H 7, G 8
My column headers/names are: A, B, C, D, E, F, G
So my initial dataframe after calling read_csv becomes:
A    B     C     D     E    F    G
A 1  B 2   C 10  D 15  NaN  NaN  NaN
A 5  D 10  G 2   NaN   NaN  NaN  NaN
D 6  E 7   NaN   NaN   NaN  NaN  NaN
H 7  G 8   NaN   NaN   NaN  NaN  NaN
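For anyone who wants to reproduce this starting frame, here is a minimal sketch. It assumes the records are inlined in a string (the name raw is hypothetical and stands in for the real file) and that skipinitialspace=True is an acceptable way to drop the blank that follows each comma.

```python
import io

import pandas as pd

# Hypothetical stand-in for the csv file from the question.
raw = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8
"""

# No header row in the data, so the seven column names are supplied by hand.
# Rows with fewer fields than names are padded with NaN on the right.
df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=list("ABCDEFG"),
    skipinitialspace=True,  # strip the space after each comma
)
print(df)
```

Each cell still holds a "name value" string at this point, placed by position rather than by name, which is exactly the problem described here.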
Each value can be split into [column name] [column value], so A 1 means col=A and value=1, D 15 means col=D and value=15, and so on.
What I want is to assign each numeric value to the appropriate column based on the column name, and end up with a dataframe that looks like this:
A    B    C     D     E    F    G
A 1  B 2  C 10  D 15  NaN  NaN  NaN
A 5  NaN  NaN   D 10  NaN  NaN  G 2
NaN  NaN  NaN   D 6   E 7  NaN  NaN
NaN  NaN  NaN   NaN   NaN  NaN  G 8
Or even better, just the values:
A    B    C    D    E    F    G
1    2    10   15   NaN  NaN  NaN
5    NaN  NaN  10   NaN  NaN  2
NaN  NaN  NaN  6    7    NaN  NaN
NaN  NaN  NaN  NaN  NaN  NaN  8
Recommended answer
Solution with apply:
For each row, use split by whitespace, remove the NaN rows with dropna, move the parsed names into the index with set_index, and convert the one-column DataFrame to a Series with DataFrame.squeeze. Finally, reindex by the new column names:
print(df.apply(lambda x: x.str.split(expand=True)
                          .dropna()
                          .set_index(0)
                          .squeeze(), axis=1)
        .reindex(columns=list('ABCDEFGH')))
     A    B    C    D    E    F    G    H
0    1    2   10   15  NaN  NaN  NaN  NaN
1    5  NaN  NaN   10  NaN  NaN    2  NaN
2  NaN  NaN  NaN    6    7  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN  NaN  NaN    8    7
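To make the chain inside the lambda easier to follow, here is a sketch that runs the same steps on a single row. The row is built by hand to mimic what apply(..., axis=1) passes for the second record; the variable names row, parts, series and aligned are illustrative only.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the second row of the initial frame (an assumption for illustration).
row = pd.Series(
    ["A 5", "D 10", "G 2", np.nan, np.nan, np.nan, np.nan],
    index=list("ABCDEFG"),
)

# Split each cell on whitespace: column 0 holds the name, column 1 the value.
parts = row.str.split(expand=True)

# Cells that were NaN produce all-NaN rows; drop them.
parts = parts.dropna()

# Use the parsed names as the index and collapse the single remaining column to a Series.
series = parts.set_index(0).squeeze()

# Aligning to the full set of names shows where this row's values end up.
aligned = series.reindex(list("ABCDEFGH"))
print(aligned)
```

One caveat with this pattern: squeeze returns a scalar rather than a Series when a row contains only one pair, so selecting column 1 explicitly (parts.set_index(0)[1]) may be the safer choice for such data. The values also remain strings here; nothing in this chain converts them to numbers.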
Solution with stack:
Use stack to create a Series, split by whitespace to create new columns, append the column holding the new column names (A, B, ...) to the index with set_index, convert the one-column DataFrame to a Series with DataFrame.squeeze, remove the index level holding the old column names with reset_index, reshape with unstack, reindex by the new column names (this adds the missing columns, filled with NaN), convert the values to float with astype, and finally remove the columns name with rename_axis (new in pandas 0.18.0):
print(df.stack()
        .str.split(expand=True)
        .set_index(0, append=True)
        .squeeze()
        .reset_index(level=1, drop=True)
        .unstack()
        .reindex(columns=list('ABCDEFGH'))
        .astype(float)
        .rename_axis(None, axis=1))
     A    B     C     D    E   F    G    H
0  1.0  2.0  10.0  15.0  NaN NaN  NaN  NaN
1  5.0  NaN   NaN  10.0  NaN NaN  2.0  NaN
2  NaN  NaN   NaN   6.0  7.0 NaN  NaN  NaN
3  NaN  NaN   NaN   NaN  NaN NaN  8.0  7.0
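The same chain can be unrolled into named intermediate steps, which may help when checking what each call contributes. This is a sketch under assumptions: the data is inlined in a hypothetical raw string, and the explicit dropna() after stack() is an addition of mine. stack() has historically dropped missing cells on its own, and the extra call is there in case that default differs in your pandas version.

```python
import io

import pandas as pd

# Hypothetical stand-in for the csv file from the question.
raw = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8
"""
df = pd.read_csv(io.StringIO(raw), header=None,
                 names=list("ABCDEFG"), skipinitialspace=True)

# Long format: one "name value" string per (row, old column) pair.
cells = df.stack().dropna()

# Column 0 = parsed name, column 1 = parsed value (still strings).
parts = cells.str.split(expand=True)

# Add the parsed name as a third index level, then collapse to a Series.
values = parts.set_index(0, append=True).squeeze()

# Drop the level with the old, positional column names: index is now (row, name).
values = values.reset_index(level=1, drop=True)

# Back to wide format, one column per parsed name.
wide = values.unstack()

result = (wide.reindex(columns=list("ABCDEFGH"))  # adds F, which never occurs
              .astype(float)
              .rename_axis(None, axis=1))         # remove the leftover columns name
print(result)
```

Because the values are converted with astype(float), this variant yields numeric columns, unlike the apply version, which leaves strings.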