How to parse labeled values of columns into a Pandas Dataframe (some column values are missing)?
Problem description
The following are two rows from my unlabeled dataset, a small subset:
random1 147 sub1 95 34 dewdfa3 15000 -1238 SBAASBAQSBARSBATSBAUSBAXBELAAX AAA:COL:UVTWUVWDUWDUWDWW BBB:COL:F CCC:COL:GTATGTCA DDD:COL:K20 EEE:COL:54T GGG:COL:-30.5 HHH:COL:000.1 III:COL:2 JJJ:COL:0
random2 123 sub1 996 12 kwnc239 10027 144 LBPRLBPSLBRDLBSDLBSLLBWB AAA:COL:UWTTUTUVVUWWUUU BBB:COL:F DDD:COL:CACGTCGG EEE:COL:K19 FFF:COL:HCC16 GGG:COL:873 III:COL:-77 JJJ:COL:0 KKK:COL:0 LLL:COL:1 MMM:COL:212
The first nine columns are consistent throughout the dataset, and could be labeled.
My problem is with the columns that follow. Each value in these columns is prefixed with its column name, e.g. AAA:COL:UVTWUVWDUWDUWDWW is column AAA, BBB:COL:F is column BBB, etc.
However, (1) the rows do not all have the same number of columns and (2) some columns are "missing". The first row is missing column FFF, and the second row skips columns CCC and HHH.
Also, notice that the first row stops at column JJJ, while the second row stops at column MMM.
How would one allocate the 9 + 13 columns of a dataframe and parse these values such that, if a column:value pair doesn't exist in a row, that column gets a NaN value?
Would something like pandas.read_table() have the functionality for this?
This is the "correct" format for the first row:
random int sub int2 int3 string1 int4 int5 string2 AAA BBB CCC DDD EEE FFF GGG .... MMM
random1 147 sub1 95 34 dewdfa3 15000 -1238 SBAASBAQSBARSBATSBAUSBAXBELAAX UVTWUVWDUWDUWDWW F GTATGTCA K20 54T 'NaN' -30.5 .... 'NaN'
Related (and unanswered) question here: How to import unlabeled and missing columns into a pandas dataframe?
This will do it:
import pandas as pd
text = """random1 147 sub1 95 34 dewdfa3 15000 -1238 SBAASBAQSBARSBATSBAUSBAXBELAAX AAA:COL:UVTWUVWDUWDUWDWW BBB:COL:F CCC:COL:GTATGTCA DDD:COL:K20 EEE:COL:54T GGG:COL:-30.5 HHH:COL:000.1 III:COL:2 JJJ:COL:0
random2 123 sub1 996 12 kwnc239 10027 144 LBPRLBPSLBRDLBSDLBSLLBWB AAA:COL:UWTTUTUVVUWWUUU BBB:COL:F DDD:COL:CACGTCGG EEE:COL:K19 FFF:COL:HCC16 GGG:COL:873 III:COL:-77 JJJ:COL:0 KKK:COL:0 LLL:COL:1 MMM:COL:212"""
# split every line into whitespace-separated fields
data = [line.split() for line in text.split('\n')]
data1 = [line[:9] for line in data]  # the nine fixed columns
data2 = [line[9:] for line in data]  # the labeled NAME:COL:value fields
# list of dictionaries from data2, where I parse the columns
dict2 = [dict(d.split(':COL:') for d in d1) for d1 in data2]
# labels absent from a row's dictionary become NaN in the DataFrame
result = pd.concat([pd.DataFrame(data1),
                    pd.DataFrame(dict2)],
                   axis=1)
result.iloc[:, 9:]
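As a possible extension of the approach above, here is a minimal sketch that also names the nine fixed columns (using the names from the question's expected header) and forces all thirteen labeled columns AAA through MMM to exist, so a label that never appears in any row still shows up as an all-NaN column. The function name parse_labeled and the constants FIXED and LABELED are made up for illustration; they are not part of pandas.

```python
import pandas as pd

# Names of the nine fixed columns, taken from the question's expected header.
FIXED = ["random", "int", "sub", "int2", "int3",
         "string1", "int4", "int5", "string2"]
# The thirteen labeled columns AAA..MMM (assumed to be the complete set).
LABELED = [c * 3 for c in "ABCDEFGHIJKLM"]


def parse_labeled(lines, sep=":COL:"):
    """Hypothetical helper: one dict per line, then one DataFrame."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # the first nine fields are positional
        record = dict(zip(FIXED, fields[:len(FIXED)]))
        # the remaining fields carry their own column label
        for item in fields[len(FIXED):]:
            label, _, value = item.partition(sep)
            record[label] = value
        records.append(record)
    # reindex fixes the column order and adds absent labels as NaN columns;
    # it also drops any label that is not listed in LABELED
    return pd.DataFrame(records).reindex(columns=FIXED + LABELED)


text = """random1 147 sub1 95 34 dewdfa3 15000 -1238 SBAASBAQSBARSBATSBAUSBAXBELAAX AAA:COL:UVTWUVWDUWDUWDWW BBB:COL:F CCC:COL:GTATGTCA DDD:COL:K20 EEE:COL:54T GGG:COL:-30.5 HHH:COL:000.1 III:COL:2 JJJ:COL:0
random2 123 sub1 996 12 kwnc239 10027 144 LBPRLBPSLBRDLBSDLBSLLBWB AAA:COL:UWTTUTUVVUWWUUU BBB:COL:F DDD:COL:CACGTCGG EEE:COL:K19 FFF:COL:HCC16 GGG:COL:873 III:COL:-77 JJJ:COL:0 KKK:COL:0 LLL:COL:1 MMM:COL:212"""

df = parse_labeled(text.splitlines())
print(df)
```

All values come back as strings. If numeric columns are needed, something like pd.to_numeric(df["GGG"], errors="coerce") can be applied per column afterwards. For a real file, the lines of an open file object could be passed in place of text.splitlines().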