How to parse labeled values of columns into a Pandas DataFrame (some column values are missing)?


Problem description

The following are two rows from my unlabeled dataset, a small subset:

random1 147 sub1    95  34  dewdfa3 15000   -1238   SBAASBAQSBARSBATSBAUSBAXBELAAX  AAA:COL:UVTWUVWDUWDUWDWW    BBB:COL:F   CCC:COL:GTATGTCA    DDD:COL:K20 EEE:COL:54T GGG:COL:-30.5   HHH:COL:000.1   III:COL:2   JJJ:COL:0   

random2 123 sub1    996 12  kwnc239 10027    144        LBPRLBPSLBRDLBSDLBSLLBWB    AAA:COL:UWTTUTUVVUWWUUU BBB:COL:F   DDD:COL:CACGTCGG    EEE:COL:K19 FFF:COL:HCC16   GGG:COL:873 III:COL:-77 JJJ:COL:0   KKK:COL:0   LLL:COL:1   MMM:COL:212

The first nine columns are consistent throughout the dataset, and could be labeled.

My problem is with the columns that follow. Each remaining value in the row is prefixed with its column label, e.g. AAA:COL:UVTWUVWDUWDUWDWW belongs to column AAA, BBB:COL:F belongs to column BBB, etc.

However, (1) the rows do not all have the same number of columns and (2) some columns are "missing". The first row is missing column FFF, and the second row skips columns CCC and HHH.

Also, notice that the first row stops at column JJJ, while the second row stops at column MMM.

How would one allocate the 9 + 13 columns of a dataframe and parse these values so that, if a column:value pair doesn't exist in a row, that column gets a NaN value?

Would something like pandas.read_table() have the functionality for this?

This is the "correct" format for the first row:

random    int     sub    int2    int3    string1    int4    int5    string2                         AAA            BBB    CCC    DDD    EEE    FFF    GGG .... MMM
random1   147    sub1    95      34      dewdfa3    15000   -1238   SBAASBAQSBARSBATSBAUSBAXBELAAX  UVTWUVWDUWDUWDWW    F   GTATGTCA   K20 54T 'NaN' -30.5 .... 'NaN'

Related (and unanswered) question here: How to import unlabeled and missing columns into a pandas dataframe?

Solution

This will do it:

text = """random1 147 sub1    95  34  dewdfa3 15000   -1238   SBAASBAQSBARSBATSBAUSBAXBELAAX  AAA:COL:UVTWUVWDUWDUWDWW    BBB:COL:F   CCC:COL:GTATGTCA    DDD:COL:K20 EEE:COL:54T GGG:COL:-30.5    HHH:COL:000.1   III:COL:2  JJJ:COL:0   
random2 123 sub1    996 12  kwnc239 10027    144        LBPRLBPSLBRDLBSDLBSLLBWB    AAA:COL:UWTTUTUVVUWWUUU BBB:COL:F   DDD:COL:CACGTCGG    EEE:COL:K19 FFF:COL:HCC16   GGG:COL:873 III:COL:-77 JJJ:COL:0   KKK:COL:0   LLL:COL:1   MMM:COL:212"""

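import pandas as pd

# split every line on whitespace, then separate the nine fixed fields from the labeled ones
data = [line.split() for line in text.split('\n')]
data1 = [line[:9] for line in data]
data2 = [line[9:] for line in data]

# list of dictionaries from data2, where I parse the columns
dict2 = [dict([d.split(':COL:') for d in d1]) for d1 in data2]

# labels missing from a row become NaN when the dicts are turned into a DataFrame
result = pd.concat([pd.DataFrame(data1),
                    pd.DataFrame(dict2)],
                   axis=1)

# show only the labeled columns
result.iloc[:, 9:]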
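
The answer's approach can also be wrapped in a small helper that names the columns and forces all thirteen labeled columns into a fixed order. What follows is only a sketch under assumptions that are not part of the original answer: the function name parse_labeled_lines is hypothetical, the nine fixed column names are taken from the header row in the question, and the labels are assumed to be exactly AAA through MMM. The sample rows are shortened versions of the two rows above.

```python
import pandas as pd

# Assumed names for the nine fixed columns (from the header row in the question).
FIXED_COLUMNS = ["random", "int", "sub", "int2", "int3",
                 "string1", "int4", "int5", "string2"]
# Assumed label set: the thirteen labels AAA, BBB, ..., MMM.
LABELED_COLUMNS = [chr(c) * 3 for c in range(ord("A"), ord("M") + 1)]


def parse_labeled_lines(lines, sep=":COL:"):
    """Hypothetical helper: turn whitespace-separated lines into a DataFrame."""
    n_fixed = len(FIXED_COLUMNS)
    fixed_rows, labeled_rows = [], []
    for line in lines:
        tokens = line.split()
        if not tokens:  # skip blank lines
            continue
        fixed_rows.append(tokens[:n_fixed])
        # split each "LABEL:COL:value" token once into a (label, value) pair
        labeled_rows.append(dict(tok.split(sep, 1) for tok in tokens[n_fixed:]))
    fixed = pd.DataFrame(fixed_rows, columns=FIXED_COLUMNS)
    # reindex adds labels that never occur (all NaN) and fixes the column order
    labeled = pd.DataFrame(labeled_rows).reindex(columns=LABELED_COLUMNS)
    return pd.concat([fixed, labeled], axis=1)


# Shortened sample rows based on the data in the question.
sample = [
    "random1 147 sub1 95 34 dewdfa3 15000 -1238 SBAASBAQ "
    "AAA:COL:UVTW BBB:COL:F CCC:COL:GTATGTCA DDD:COL:K20 EEE:COL:54T "
    "GGG:COL:-30.5 HHH:COL:000.1 III:COL:2 JJJ:COL:0",
    "random2 123 sub1 996 12 kwnc239 10027 144 LBPRLBPS "
    "AAA:COL:UWTT BBB:COL:F DDD:COL:CACGTCGG EEE:COL:K19 FFF:COL:HCC16 "
    "GGG:COL:873 III:COL:-77 JJJ:COL:0 KKK:COL:0 LLL:COL:1 MMM:COL:212",
]
df = parse_labeled_lines(sample)
print(df.iloc[:, 9:])
```

Every parsed value is still a string at this point. If numeric columns are needed, pd.to_numeric can be applied to the relevant columns afterwards. A token that does not contain the :COL: separator raises a ValueError in this sketch, so malformed lines would need separate handling.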
