Clean wrong header inside Dataframe with Python/Pandas
Problem description
I've got a corrupt data file in which copies of the header row appear at random positions among the data rows. How can I ignore or delete these rows while loading the data frame?
Because these stray header rows sit inside the data, pandas raises an error while loading. I would like to skip them while loading with pandas, or delete them somehow before loading the file with pandas.
The file looks like this:
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3 <- this is the random copy of the header inside the dataframe
0, 1, 1
0, 0, 0
1, 1, 1
I want:
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
0, 1, 1
0, 0, 0
1, 1, 1
Recommended answer
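Pass na_filter = False to read_csv so that pandas does not turn empty fields into NaN. Because the stray header row is text, each column is read as strings rather than integers (passing dtype = str as well makes this explicit). Then locate all rows with bad data and filter them out of your dataframe.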
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('sample.csv', header = 0, na_filter = False)
>>> df
col1 col2 col3
0 0 1 1
1 0 0 0
2 1 1 1
3 col1 col2 col3
4 0 1 1
5 0 0 0
6 1 1 1
>>> type(df.iloc[0,0])
<class 'str'>
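If the string typing should not depend on how pandas infers the column types, the load step above can be made explicit. This is a minimal sketch rather than part of the original answer. It assumes the file content from the question is held in a string (io.StringIO stands in for 'sample.csv') and that the blank after each comma is unwanted. dtype and skipinitialspace are standard read_csv parameters.

```python
import io

import pandas as pd

# Stand-in for the file on disk; the contents mirror the sample in the question.
raw = (
    "col1, col2, col3\n"
    "0, 1, 1\n"
    "0, 0, 0\n"
    "1, 1, 1\n"
    "col1, col2, col3\n"
    "0, 1, 1\n"
    "0, 0, 0\n"
    "1, 1, 1\n"
)

# dtype=str keeps every cell as text, so nothing depends on type inference.
# skipinitialspace=True drops the blank after each comma, so names and values
# come out as "col2" rather than " col2".
df = pd.read_csv(io.StringIO(raw), dtype=str, skipinitialspace=True, na_filter=False)
print(df)
```

The stray header is still present as row 3 at this point; it is only guaranteed to be text like every other cell.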
Now that every column is parsed as strings, locate the rows that still contain the col1, col2, and col3 header values, that is, rows where no column holds '0' or '1', and tag them in a new column using np.where(), as such:
>>> df['Tag'] = np.where(((df['col1'] != '0') & (df['col1'] != '1')) & ((df['col2'] != '0') & (df['col2'] != '1')) & ((df['col3'] != '0') & (df['col3'] != '1')), ['Remove'], ['Don\'t remove'])
>>> df
col1 col2 col3 Tag
0 0 1 1 Don't remove
1 0 0 0 Don't remove
2 1 1 1 Don't remove
3 col1 col2 col3 Remove
4 0 1 1 Don't remove
5 0 0 0 Don't remove
6 1 1 1 Don't remove
Now, filter out the row tagged as Remove in the Tag column using isin():
>>> df2 = df[~df['Tag'].isin(['Remove'])]
>>> df2
col1 col2 col3 Tag
0 0 1 1 Don't remove
1 0 0 0 Don't remove
2 1 1 1 Don't remove
4 0 1 1 Don't remove
5 0 0 0 Don't remove
6 1 1 1 Don't remove
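The tagging and isin() steps above rely on the data being only '0' and '1'. When the valid values are not known in advance, one alternative is to compare each row against the column names themselves. This is a sketch of a swapped-in technique, not the original answer's method; the frame is rebuilt inline so the example runs on its own, and is_header is a name chosen for illustration.

```python
import pandas as pd

# Same frame as in the walkthrough: every cell is a string, row 3 is the stray header.
df = pd.DataFrame(
    {
        "col1": ["0", "0", "1", "col1", "0", "0", "1"],
        "col2": ["1", "0", "1", "col2", "1", "0", "1"],
        "col3": ["1", "0", "1", "col3", "1", "0", "1"],
    }
)

# True where a cell equals the name of its own column; a row counts as a
# repeated header only if that holds in every column.
is_header = df.eq(list(df.columns)).all(axis=1)

# Keep everything else. No helper column is created, so there is nothing to drop later.
df2 = df[~is_header]
print(df2)
```

Because the mask is built from df.columns, the same lines work for any number of columns and any data values that differ from the header text.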
Drop the Tag column:
>>> df2 = df2[['col1', 'col2', 'col3']]
>>> df2
col1 col2 col3
0 0 1 1
1 0 0 0
2 1 1 1
4 0 1 1
5 0 0 0
6 1 1 1
Finally, typecast your dataframe to int if you need integer values:
>>> df2 = df2.astype(int)
>>> df2
col1 col2 col3
0 0 1 1
1 0 0 0
2 1 1 1
4 0 1 1
5 0 0 0
6 1 1 1
>>> type(df2['col1'][0])
<class 'numpy.int32'>
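The element type printed above depends on what plain int maps to on the platform and NumPy version in use, so other machines may show numpy.int64 instead. If a fixed width matters, it can be named explicitly. A small sketch, assuming a three-row string frame in place of the full df2:

```python
import pandas as pd

# String frame as it looks once the stray header row has been removed.
df2 = pd.DataFrame(
    {"col1": ["0", "0", "1"], "col2": ["1", "0", "1"], "col3": ["1", "0", "1"]}
)

# Naming the width explicitly avoids depending on what plain `int` maps to.
df2 = df2.astype("int64")
print(df2.dtypes)
```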
Note: If you want a standard index, use:
>>> df2.reset_index(inplace = True, drop = True)
>>> df2
col1 col2 col3
0 0 1 1
1 0 0 0
2 1 1 1
3 0 1 1
4 0 0 0
5 1 1 1
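The question also asks about deleting the stray rows before pandas sees them. One way to sketch that is to drop every later line that repeats the first line, then hand the remaining text to read_csv. The helper name read_csv_skip_repeated_headers is hypothetical, and the temporary file only stands in for 'sample.csv'. The sketch assumes a repeated header matches the first line exactly, apart from surrounding whitespace.

```python
import io
import os
import tempfile

import pandas as pd


def read_csv_skip_repeated_headers(path, **kwargs):
    """Hypothetical helper: read a CSV, ignoring later copies of its first line."""
    with open(path) as handle:
        header = handle.readline()
        # Keep the header once, then every line that is not another copy of it.
        kept = [header] + [
            line for line in handle if line.strip() != header.strip()
        ]
    return pd.read_csv(io.StringIO("".join(kept)), **kwargs)


# Usage: write the sample from the question to a temporary file and load it.
sample = (
    "col1, col2, col3\n"
    "0, 1, 1\n"
    "0, 0, 0\n"
    "1, 1, 1\n"
    "col1, col2, col3\n"
    "0, 1, 1\n"
    "0, 0, 0\n"
    "1, 1, 1\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w") as handle:
        handle.write(sample)
    clean = read_csv_skip_repeated_headers(path, skipinitialspace=True)
print(clean)
```

Since the header never reaches the parser as data, the columns are inferred as integers directly and the index is already consecutive, so the astype and reset_index steps are not needed on this path.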