How to remove error values in large df with 1000 columns

Problem Description

I have a large dataset with more than 1000 columns; it is messy, with mixed dtypes. There are 2 int64 columns, 119 float columns and 1266 object columns.

I would like to begin data cleaning but realised there are several issues. As there are too many columns, visual inspection of the data to locate errors is too tedious. A sample of the dataset is below:

Company ID  Year    Date         Actual Loan Loss  Depreciation          Accounts Payable
001         2001    19 Oct 2001  100000.00         40000                 $$ER: 4540,NO DATA VALUES FOUND
002         2002    18 Sept 2001 NaN               $$ER: E100,NO WORLDSCOPE DATA FOR THIS CODE
003         2004    01 Aug 2000  145000.00         5000                  Finance Dept

I would like to remove all the error values before dropping the null rows. The error values typically start with "$$ER:".

I tried the following:

import pandas as pd

# load the dataset
df = pd.read_excel("path/file1.xlsx", sheet_name="DATA_TS")
# examine the data
df.head(20)
# check number of rows, cols and dtypes
df.info()

# create a function to replace the error values
# (with df.apply, each column is passed in as a Series, and Series.replace
# swaps out exact matches of the two error strings)
def convert_datatypes(val):
    new_val = val.replace('$$ER: 4540,NO DATA VALUES FOUND', '') \
                 .replace('$$ER: E100,NO WORLDSCOPE DATA FOR THIS CODE', '')
    return new_val

# assign the result back, otherwise df is left unchanged
df = df.apply(convert_datatypes)

The code worked, but when I checked again I realised that there were other error values, such as "$$ER: E100,INVALID CODE OR EXPRESSION ENTERED".
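
One way to see every distinct "$$ER" message that actually appears in the data before deciding how to clean it (a minimal sketch, not from the original post; it assumes df has been loaded as above):

obj_cols = df.columns[df.dtypes == 'object']
error_values = set()
for col in obj_cols:
    # collect the distinct cell values that start with the error prefix
    mask = df[col].str.startswith('$$ER', na=False)
    error_values.update(df.loc[mask, col].unique())
print(error_values)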

I am pretty sure there are other error values as well. Is there a more efficient way to remove all the error values and, at the same time, change the dtype of each column to the supposedly correct dtype (i.e., from object to either int or str)?

Appreciate any form of help, thank you in advance!

Recommended Answer

This should do the trick:

# for every object column, blank out any cell that starts with the '$$ER' prefix
for col in df.columns[df.dtypes == 'object']:
    df.loc[df[col].str.startswith('$$ER', na=False), col] = ''

You can also use contains(), but you will have to specify regex=False, since "$$" would otherwise be interpreted as regex metacharacters:

for col in df.columns[df.dtypes == 'object']:
    df.loc[df[col].str.contains('$$ER', na=False, regex=False), col] = ''
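
For the dtype part of the question, which the answer above does not cover, a possible follow-up (a sketch, not from the original answer) is to treat the cleared cells as missing and convert a column to numeric only when every remaining value parses cleanly, so genuinely textual columns such as "Finance Dept" stay as object:

import numpy as np
import pandas as pd

df = df.replace('', np.nan)  # treat the cleared cells as missing values
for col in df.columns[df.dtypes == 'object']:
    converted = pd.to_numeric(df[col], errors='coerce')
    # keep the numeric version only if no non-missing value failed to parse
    if converted.notna().sum() == df[col].notna().sum():
        df[col] = converted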
