Need to create a Pandas dataframe by reading csv file with random columns


Problem description

I have the following CSV file with records:

  • A 1, B 2, C 10, D 15
  • A 5, D 10, G 2
  • D 6, E 7
  • H 7, G 8

My column headers/names are: A, B, C, D, E, F, G

So my initial dataframe after using read_csv becomes:

A     B     C      D       E      F      G   
A 1   B 2   C 10   D 15   NaN    NaN    NaN
A 5   D 10  G 2    NaN    NaN    NaN    NaN
D 6   E 7   NaN    NaN    NaN    NaN    NaN
H 7   G 8   NaN    NaN    NaN    NaN    NaN
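
For readers who want to reproduce this starting point, here is a minimal sketch. It assumes the file contents are inlined through io.StringIO (the variable name raw is only illustrative) and that cells are separated by a bare comma. If the real file has a space after each comma, passing skipinitialspace=True to read_csv should strip it.

```python
import io

import pandas as pd

# Assumption: the file contents are inlined here instead of read from disk.
raw = "A 1,B 2,C 10,D 15\nA 5,D 10,G 2\nD 6,E 7\nH 7,G 8\n"

# Passing names= tells pandas there is no header row; rows with fewer
# fields than names are padded on the right with NaN.
df = pd.read_csv(io.StringIO(raw), names=list("ABCDEFG"))
print(df)
```

Each cell is still a "name value" string sitting in whichever column it happened to land in, which is the problem the question describes.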

Each value can be split into [column name] [column value], so A 1 means col=A and value=1, D 15 means col=D and value=15, and so on.

What I want is to assign the numeric value to the appropriate column based on the column name, and have a dataframe that looks like this:

A     B     C      D       E      F      G   
A 1   B 2   C 10   D 15   NaN    NaN    NaN
A 5   NaN   NaN    D 10   NaN    NaN    G 2
NaN   NaN   NaN    D 6    E 7    NaN    NaN
NaN   NaN   NaN    NaN    NaN    NaN    G 8

Or even better, just the values:

A     B     C      D       E      F      G   
1     2     10     15      NaN    NaN    NaN
5     NaN   NaN    10      NaN    NaN    2
NaN   NaN   NaN    6       7      NaN    NaN
NaN   NaN   NaN    NaN     NaN    NaN    8
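
As an illustration of the target layout, the table above can also be built without read_csv by parsing each line into a dict first. This is an alternative sketch, not the method from the answer; the helper name parse_line and the inlined raw string are assumptions made for the example.

```python
import io

import pandas as pd

# Assumption: the file contents are inlined here instead of read from disk.
raw = "A 1,B 2,C 10,D 15\nA 5,D 10,G 2\nD 6,E 7\nH 7,G 8\n"


def parse_line(line):
    """Turn 'A 1,B 2' into {'A': 1.0, 'B': 2.0} (hypothetical helper)."""
    record = {}
    for cell in line.strip().split(","):
        name, value = cell.split()  # split() also tolerates padding spaces
        record[name] = float(value)
    return record


records = [parse_line(line) for line in io.StringIO(raw) if line.strip()]

# DataFrame fills keys missing from a record with NaN; reindex fixes the
# column order, adds the empty F column and drops names outside A..G (H).
result = pd.DataFrame(records).reindex(columns=list("ABCDEFG"))
print(result)
```

Reindexing to A..G mirrors the desired output in the question, where the H 7 entry does not appear.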


Recommended answer

Solution with apply:

Use split by whitespace, remove NaN rows with dropna, set the names as the index with set_index, and convert the one-column DataFrame to a Series with DataFrame.squeeze. Finally, reindex by the new column names:

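# For each row: split "A 1" into name and value, drop the empty cells,
# index by name, then collapse the single value column into a Series.
# squeeze(axis=1) keeps a Series even when a row holds only one cell.
print(df.apply(lambda x: x.str.split(expand=True)
                          .dropna()
                          .set_index(0)
                          .squeeze(axis=1), axis=1)
        .reindex(columns=list('ABCDEFGH')))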

     A    B    C    D    E   F    G    H
0    1    2   10   15  NaN NaN  NaN  NaN
1    5  NaN  NaN   10  NaN NaN    2  NaN
2  NaN  NaN  NaN    6    7 NaN  NaN  NaN
3  NaN  NaN  NaN  NaN  NaN NaN    8    7
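
The values in the output above are still strings. A self-contained variant of the apply approach that also converts them to floats might look like the sketch below. The helper name row_to_series and the inlined raw data are assumptions for illustration. Building the Series explicitly instead of calling squeeze is a deliberate change, so that a row with a single cell still yields a Series.

```python
import io

import pandas as pd

# Assumption: the file contents are inlined here instead of read from disk.
raw = "A 1,B 2,C 10,D 15\nA 5,D 10,G 2\nD 6,E 7\nH 7,G 8\n"
df = pd.read_csv(io.StringIO(raw), names=list("ABCDEFG"))


def row_to_series(row):
    """Map one row of 'name value' cells to a Series keyed by name."""
    parts = row.dropna().str.split(expand=True)  # column 0: name, 1: value
    return pd.Series(parts[1].to_numpy(), index=parts[0].to_numpy())


out = (df.apply(row_to_series, axis=1)
         .reindex(columns=list("ABCDEFGH"))
         .astype(float))
print(out)
```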

Solution with stack:

Use stack to create a Series, split by whitespace to create new columns, append the column holding the new column names (A, B, ...) to the index with set_index, convert the one-column DataFrame to a Series with DataFrame.squeeze, remove the index level holding the old column names with reset_index, unstack, reindex by the new column names (this adds the missing columns filled with NaN), convert the values to float with astype, and finally remove the columns name with rename_axis (new in pandas 0.18.0):

print (df.stack()
         .str.split(expand=True)
         .set_index(0, append=True)
         .squeeze()
         .reset_index(level=1, drop=True)
         .unstack()
         .reindex(columns=list('ABCDEFGH'))
         .astype(float)
         .rename_axis(None, axis=1))

     A    B     C     D    E   F    G    H
0  1.0  2.0  10.0  15.0  NaN NaN  NaN  NaN
1  5.0  NaN   NaN  10.0  NaN NaN  2.0  NaN
2  NaN  NaN   NaN   6.0  7.0 NaN  NaN  NaN
3  NaN  NaN   NaN   NaN  NaN NaN  8.0  7.0
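
The chain above relies on stack dropping the NaN cells. As far as I know, the newer stack implementation (future_stack=True, introduced in pandas 2.1) keeps them, so the following sketch adds an explicit dropna to behave the same either way. The names raw, parts, "name" and "value" are illustrative assumptions, not part of the answer.

```python
import io

import pandas as pd

# Assumption: the file contents are inlined here instead of read from disk.
raw = "A 1,B 2,C 10,D 15\nA 5,D 10,G 2\nD 6,E 7\nH 7,G 8\n"
df = pd.read_csv(io.StringIO(raw), names=list("ABCDEFG"))

# One "name value" string per (row, original column); drop the empty cells.
stacked = df.stack().dropna()

parts = stacked.str.split(expand=True)
parts.columns = ["name", "value"]

out = (parts.set_index("name", append=True)["value"]  # index: row, old col, name
            .reset_index(level=1, drop=True)          # drop the old column level
            .unstack()                                # names become the columns
            .reindex(columns=list("ABCDEFGH"))
            .astype(float)
            .rename_axis(None, axis=1))
print(out)
```

Selecting the "value" column by name replaces squeeze here, which avoids getting a scalar back if the frame ever contains a single cell.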
