Clean wrong header inside DataFrame with Python/Pandas

Problem description

I've got a corrupt data frame with random duplicates of the header inside it. How can I ignore or delete these rows while loading the data frame?

Because this random header sits inside the data, pandas raises an error while loading. I would like to ignore this row while loading with pandas, or delete it somehow before loading.

The file looks like this:

col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3  <- this is the random copy of the header inside the dataframe
0, 1, 1
0, 0, 0
1, 1, 1

What I want:

col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
0, 1, 1
0, 0, 0
1, 1, 1

Recommended answer

Pass na_filter=False so that pandas does not convert any value to NaN while reading. Because the stray header row is text, each column is then parsed as strings. Next, locate all rows with bad data and filter them out of your dataframe.

>>> import pandas as pd
>>> df = pd.read_csv('sample.csv', header=0, na_filter=False)
>>> df
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
3  col1  col2  col3
4     0     1     1
5     0     0     0
6     1     1     1
>>> type(df.iloc[0,0])
<class 'str'>
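The string parsing above relies on type inference. As a minimal sketch, and not part of the original answer, the same load can be made explicit with dtype=str. The inline RAW text is an assumed stand-in for sample.csv, and skipinitialspace=True is added on the assumption that the file has a space after each comma, as in the listing in the question.

```python
import io

import pandas as pd

# Assumed stand-in for sample.csv, copied from the layout in the question.
RAW = """col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
"""

df = pd.read_csv(
    io.StringIO(RAW),
    header=0,
    dtype=str,              # read every column as strings, no inference
    na_filter=False,        # do not turn any value into NaN
    skipinitialspace=True,  # drop the space that follows each comma
)
print(df)
```

With the spaces stripped, comparisons such as df['col1'] != '0' behave as expected, and the stray header still shows up as an ordinary row at index 3.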

Now that the data in every column is parsed as strings, flag the rows that hold bad data, meaning rows where every column contains something other than '0' or '1', in a new Tag column using np.where():

>>> import numpy as np
>>> df['Tag'] = np.where(((df['col1'] != '0') & (df['col1'] != '1')) & ((df['col2'] != '0') & (df['col2'] != '1')) & ((df['col3'] != '0') & (df['col3'] != '1')), 'Remove', "Don't remove")
>>> df
   col1  col2  col3           Tag
0     0     1     1  Don't remove
1     0     0     0  Don't remove
2     1     1     1  Don't remove
3  col1  col2  col3        Remove
4     0     1     1  Don't remove
5     0     0     0  Don't remove
6     1     1     1  Don't remove
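The condition above hard-codes the valid values '0' and '1'. If the data can hold other values, one possible variation (a sketch, not the original answer's method) is to flag a row only when every cell equals its own column name. The small frame below is a hypothetical example that mirrors the string-typed df above.

```python
import pandas as pd

# Hypothetical string-typed frame with one repeated header row at index 3.
df = pd.DataFrame(
    {
        "col1": ["0", "0", "1", "col1", "0"],
        "col2": ["1", "0", "1", "col2", "0"],
        "col3": ["1", "0", "1", "col3", "0"],
    }
)

# Compare each row against the list of column names, position by position.
# A row is a repeated header only if all of its cells match.
is_header = df.eq(list(df.columns)).all(axis=1)

# Keep everything that is not a repeated header.
clean = df[~is_header]
print(clean)
```

This avoids the helper Tag column, so there is nothing to drop afterwards.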

Now filter out the row tagged Remove in the Tag column using isin():

>>> df2 = df[~df['Tag'].isin(['Remove'])]
>>> df2
  col1 col2 col3           Tag
0    0    1    1  Don't remove
1    0    0    0  Don't remove
2    1    1    1  Don't remove
4    0    1    1  Don't remove
5    0    0    0  Don't remove
6    1    1    1  Don't remove

Drop the Tag column:

>>> df2 = df2[['col1', 'col2', 'col3']]
>>> df2
  col1 col2 col3
0    0    1    1
1    0    0    0
2    1    1    1
4    0    1    1
5    0    0    0
6    1    1    1

Finally, typecast your dataframe to int if you need integers:

>>> df2 = df2.astype(int)
>>> df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
4     0     1     1
5     0     0     0
6     1     1     1
>>> type(df2['col1'][0])
<class 'numpy.int32'>
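As an alternative sketch that merges the filtering and the cast, pd.to_numeric with errors='coerce' turns anything non-numeric, including a repeated header row, into NaN, which dropna() then removes. This is not the original answer's approach, and the frame below is hypothetical. Note that it also drops rows with genuinely missing or malformed values, which may or may not be wanted.

```python
import pandas as pd

# Hypothetical string-typed frame; index 2 is a repeated header row.
df = pd.DataFrame(
    {
        "col1": ["0", "2", "col1", "7"],
        "col2": ["1", "5", "col2", "3"],
    }
)

# Convert column by column; non-numeric text becomes NaN instead of raising.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Drop rows containing NaN (here, the header row), then cast to int.
clean = numeric.dropna().astype(int)
print(clean)
```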

Note: If you want a standard index, use:

>>> df2.reset_index(inplace = True, drop = True)
>>> df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
3     0     1     1
4     0     0     0
5     1     1     1
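The question also asks about deleting the stray header before pandas sees it. A minimal sketch of that idea follows: filter the raw lines first, then parse what is left. The helper name read_csv_skip_repeated_header and the inline RAW text are hypothetical, and the sketch assumes a repeated header is character-for-character the same as the first line apart from surrounding whitespace.

```python
import io

import pandas as pd


def read_csv_skip_repeated_header(text):
    """Hypothetical helper: drop later copies of the first line, then parse."""
    lines = text.splitlines()
    header = lines[0].strip()
    # Keep the first line, skip any later line identical to it.
    kept = [lines[0]] + [ln for ln in lines[1:] if ln.strip() != header]
    return pd.read_csv(io.StringIO("\n".join(kept)), skipinitialspace=True)


# Assumed stand-in for the file shown in the question.
RAW = """col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
"""

df = read_csv_skip_repeated_header(RAW)
print(df)
print(df.dtypes)
```

Because the bad row never reaches the parser, the columns are inferred as integers directly and the index is already contiguous, so no tagging, casting, or reset_index step is needed. For a file on disk, the text could be obtained with open(path).read(), at the cost of holding the whole file in memory.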
