How to remove error values in large df with 1000 columns
Question
I have a large dataset with more than 1000 columns, and it is a messy mix of dtypes: there are 2 int64 columns, 119 float columns and 1266 object columns.
I would like to begin data cleaning, but I have realised there are several issues. As there are too many columns, visual inspection of the data to locate errors is too tedious. A sample of the dataset is below:
Company ID  Year  Date          Actual Loan Loss  Depreciation  Accounts Payable
001         2001  19 Oct 2001   100000.00         40000         $$ER: 4540,NO DATA VALUES FOUND
002         2002  18 Sept 2001  NaN               $$ER: E100,NO WORLDSCOPE DATA FOR THIS CODE
003         2004  01 Aug 2000   145000.00         5000          Finance Dept
I would like to remove all the error values before dropping the null rows. The error values typically start with "$$ER:".
I tried the following:
# load the dataset
import pandas as pd

df = pd.read_excel("path/file1.xlsx", sheet_name="DATA_TS")

# examine the data
df.head(20)

# check number of rows, cols and dtypes
df.info()

# create a function to replace the error values
def convert_datatypes(val):
    new_val = val.replace('$$ER: 4540,NO DATA VALUES FOUND', '').replace('$$ER: E100,NO WORLDSCOPE DATA FOR THIS CODE', '')
    return new_val

df = df.apply(convert_datatypes)
The code worked, but when I checked again I realised that there were other error values as well, such as "$$ER: E100,INVALID CODE OR EXPRESSION ENTERED".
I am pretty sure there are yet more error values. Is there an efficient way to remove all the error values and, at the same time, change the dtype of each column to the supposedly correct dtype (i.e., from object to either int or str)?
Appreciate any form of help, thank you in advance!
Answer
This should do the trick:
for col in df.columns[df.dtypes == 'object']:
    df.loc[df[col].str.startswith('$$ER', na=False), col] = ''
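Since the question also asks about converting the cleaned columns to their proper dtypes, one possible extension is to blank the error cells to NaN (rather than '') and then attempt a numeric conversion with pd.to_numeric, keeping it only when every remaining value converts. This is a sketch on a small hypothetical frame, not the asker's actual data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real dataset
df = pd.DataFrame({
    "Depreciation": ["40000", "$$ER: E100,NO WORLDSCOPE DATA FOR THIS CODE", "5000"],
    "Accounts Payable": ["$$ER: 4540,NO DATA VALUES FOUND", "12000", "Finance Dept"],
})

for col in df.columns[df.dtypes == "object"]:
    # Replace every error marker with NaN, whatever text follows "$$ER"
    df.loc[df[col].str.startswith("$$ER", na=False), col] = np.nan
    # Attempt a numeric conversion; non-numeric strings become NaN
    converted = pd.to_numeric(df[col], errors="coerce")
    # Keep the numeric version only if no real value was lost in conversion
    if converted.notna().sum() >= df[col].notna().sum():
        df[col] = converted
```

With this input, "Depreciation" ends up float64 while "Accounts Payable" stays object, because "Finance Dept" cannot be converted to a number.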
You can also use contains(), but you will have to specify regex=False:
for col in df.columns[df.dtypes == 'object']:
    df.loc[df[col].str.contains('$$ER', na=False, regex=False), col] = ''
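The regex=False matters because "$" is a regex metacharacter (an end-of-string anchor), so a literal substring match is what you want here. A quick check on a throwaway Series (hypothetical values, for illustration only):

```python
import pandas as pd

s = pd.Series(["$$ER: 4540,NO DATA VALUES FOUND", "Finance Dept", None])

# regex=False performs a plain substring search, so the error row matches;
# na=False treats missing values as non-matches instead of NaN
mask = s.str.contains("$$ER", na=False, regex=False)
```

Only the first element matches; the None entry is reported as False thanks to na=False.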