How would I insert missing rows into this data set?
What I am trying to do is insert records into a dataset whenever a row is missing.
If you look at the data set above, it contains 3 attribute columns followed by 2 numeric columns. The third column, TTF, is incremental and should not skip any values. In this example 2 rows are missing; they are shown at the bottom. What I want my code to do is insert those 2 rows into the result set (i.e. Computer - Display is missing a TTF of 5, and Television - Power Supply is missing a TTF of 6). For each inserted row I would set the Repair value to 0 and the Running Total to the same value as the previous row.
I was thinking I would approach it by taking the distinct values of the first 2 columns and looping over them, then looping from 1 to 8 for the third.
for i in range(len(Product)):
    for j in range(len(Module)):
        for k in range(1, 9):  # TTF runs from 1 to 8 inclusive
            # Check if the Repair value is there; if not, make it 0
            # If the Repair value is missing, look up the previous Running Total
Does this seem like the best approach? Any help with the actual code to accomplish this would really be appreciated.
EDIT: Here is the code that reads in the DataFrame, since the Excel screenshot seems to be confusing.
>>> import pandas as pd
>>>
>>> df = pd.read_csv('minimal.csv')
>>>
>>> df
       Product        Module  TTF  Repair  Running Total
0     Computer       Display    1       3              3
1     Computer       Display    2       2              5
2     Computer       Display    3       1              6
3     Computer       Display    4       5             11
4     Computer       Display    6       4             15
5     Computer       Display    7       3             18
6     Computer       Display    8       2             20
7   Television  Power Supply    1       7              7
8   Television  Power Supply    2       6             13
9   Television  Power Supply    3       4             17
10  Television  Power Supply    4       5             22
11  Television  Power Supply    5       6             28
12  Television  Power Supply    7       7             35
13  Television  Power Supply    8       8             43
Let's use reindex with np.arange to create the TTF values that are missing from the sequence:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Product': ['Computer']*7 + ['Television']*7,
                   'Module': ['Display']*7 + ['Power Supply']*7,
                   'TTF': [1,2,3,4,6,7,8,1,2,3,4,5,7,8],
                   'Repair': np.random.randint(1,8,14)})
df['Running Total'] = df['Repair'].cumsum()
print(df)
Input DataFrame (the Repair values are random, so they will differ from run to run):
          Module     Product  Repair  TTF  Running Total
0        Display    Computer       6    1              6
1        Display    Computer       2    2              8
2        Display    Computer       2    3             10
3        Display    Computer       4    4             14
4        Display    Computer       2    6             16
5        Display    Computer       3    7             19
6        Display    Computer       6    8             25
7   Power Supply  Television       3    1             28
8   Power Supply  Television       3    2             31
9   Power Supply  Television       5    3             36
10  Power Supply  Television       6    4             42
11  Power Supply  Television       4    5             46
12  Power Supply  Television       2    7             48
13  Power Supply  Television       2    8             50
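To see what reindex does before it is applied per group, here is a minimal sketch on a single hypothetical Series. The values are chosen for illustration and are not taken from minimal.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical single group: TTF 5 is absent from the index.
repair = pd.Series([6, 2, 2, 4, 2, 3, 6],
                   index=pd.Index([1, 2, 3, 4, 6, 7, 8], name='TTF'),
                   name='Repair')

# reindex aligns the data to the full 1..8 range. Labels that were not
# in the original index get NaN, so the dtype becomes float.
full = repair.reindex(np.arange(1, 9))
print(full)
```

The NaN that appears at TTF 5 is the placeholder that fillna and ffill later replace.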
df_out = (df.set_index('TTF')
            .groupby(['Product', 'Module'], group_keys=False)
            .apply(lambda x: x.reindex(np.arange(1, 9))))
# Inserted rows have no repair, so fill Repair with 0 before forward-filling
df_out['Repair'] = df_out['Repair'].fillna(0)
# Carry Product, Module and Running Total forward from the previous row
df_out = df_out.ffill().reset_index()
print(df_out)
Output:
    TTF        Module     Product  Repair  Running Total
0     1       Display    Computer     6.0            6.0
1     2       Display    Computer     2.0            8.0
2     3       Display    Computer     2.0           10.0
3     4       Display    Computer     4.0           14.0
4     5       Display    Computer     0.0           14.0
5     6       Display    Computer     2.0           16.0
6     7       Display    Computer     3.0           19.0
7     8       Display    Computer     6.0           25.0
8     1  Power Supply  Television     3.0           28.0
9     2  Power Supply  Television     3.0           31.0
10    3  Power Supply  Television     5.0           36.0
11    4  Power Supply  Television     6.0           42.0
12    5  Power Supply  Television     4.0           46.0
13    6  Power Supply  Television     0.0           46.0
14    7  Power Supply  Television     2.0           48.0
15    8  Power Supply  Television     2.0           50.0
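The forward fill in this answer runs over the whole frame, so a group whose first TTF is missing would inherit values from the previous group. The behaviour of groupby.apply with respect to the grouping columns has also changed across pandas releases. As a hedged alternative, the sketch below loops over the groups explicitly and fills each one on its own. It uses the question's data typed in by hand, and fill_missing_ttf with its start and stop parameters is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd

# Data copied from the DataFrame printed in the question.
df = pd.DataFrame({
    'Product': ['Computer'] * 7 + ['Television'] * 7,
    'Module': ['Display'] * 7 + ['Power Supply'] * 7,
    'TTF': [1, 2, 3, 4, 6, 7, 8, 1, 2, 3, 4, 5, 7, 8],
    'Repair': [3, 2, 1, 5, 4, 3, 2, 7, 6, 4, 5, 6, 7, 8],
    'Running Total': [3, 5, 6, 11, 15, 18, 20, 7, 13, 17, 22, 28, 35, 43],
})

def fill_missing_ttf(frame, start=1, stop=8):
    """Hypothetical helper: add a zero-repair row for every absent TTF."""
    full_range = pd.Index(np.arange(start, stop + 1), name='TTF')
    pieces = []
    for (product, module), group in frame.groupby(['Product', 'Module'], sort=False):
        g = group.set_index('TTF')[['Repair', 'Running Total']].reindex(full_range)
        # Inserted rows carry no repair.
        g['Repair'] = g['Repair'].fillna(0)
        # Forward-fill inside this group only, so totals never leak
        # from one Product/Module pair into the next.
        g['Running Total'] = g['Running Total'].ffill()
        g.insert(0, 'Module', module)
        g.insert(0, 'Product', product)
        pieces.append(g.reset_index())
    out = pd.concat(pieces, ignore_index=True)
    return out[['Product', 'Module', 'TTF', 'Repair', 'Running Total']]

result = fill_missing_ttf(df)
print(result)
```

If a group's first TTF is missing, its Running Total stays NaN in this sketch, because there is no earlier row in the group to copy from. Whether that should become 0 is a decision the question does not settle.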