Access specifics of ValueError in pandas.read_excel() converters

Question

I'm using the following to ensure a dataframe column has the correct data type before I proceed with operations:

>>> cfun = lambda x: float(x)
>>> df = pd.read_excel(xl, converters={'column1': cfun})

I'm using converters instead of dtype so that the traceback tells me explicitly which value caused the issue:

ValueError: could not convert string to float: '100%'

What I would like to do is take that information (that the string "100%" was the problem) and tell the user where it occurred in the dataframe/file. How can I get that information from the exception in order to get a row index and, say, print the entire row?

Note: adding a percent sign is not the only mistake my users make; otherwise I would simply replace any '%' with ''.

Recommended answer

I think you can check by first reading in the file without converters, and then checking which rows wouldn't convert. This finds all of the bad rows at once, instead of one at a time with each ValueError.

Just remember that Python begins numbering at 0 and the header row is not counted, so the row indices of the df will be off from the row numbers in the file (by 1 or 2).

import pandas as pd
df = pd.read_excel(xl)

# Example df
#    column1 column2
# 0      100       A
# 1     100%       B
# 2  112,312       C
# 3      171       D
# 4  123.123       E
# 5      NaN       F

df['column1_num'] = pd.to_numeric(df.column1, errors='coerce')
bad_mask = (df.column1_num.isnull()) & ~(df.column1.astype('str').str.lower().isin(['nan']))

bad_rows = df[bad_mask].index.values
#array([1, 2], dtype=int64)

df[bad_mask]
#   column1 column2  column1_num
#1     100%       B          NaN
#2  112,312       C          NaN
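
To tell the user where each problem sits in the file, the index of a bad row can be shifted into a spreadsheet row number. The following is only a sketch: it assumes a single header row, the default 0-based index that read_excel produces, and no skipped rows. The helper name describe_bad_rows and the sample frame are made up for illustration and stand in for the real data.

```python
import pandas as pd

# Stand-in for the frame returned by pd.read_excel(xl) (assumed sample data)
df = pd.DataFrame({
    "column1": [100, "100%", "112,312", 171, 123.123, float("nan")],
    "column2": list("ABCDEF"),
})

# Same idea as the answer: coerce, then flag values that became NaN
# although the original cell was not a missing value.
numeric = pd.to_numeric(df["column1"], errors="coerce")
bad_mask = numeric.isna() & ~df["column1"].astype(str).str.lower().isin(["nan"])


def describe_bad_rows(frame, mask, column, header_rows=1):
    """Hypothetical helper: one message per bad row, using 1-based file rows."""
    messages = []
    for idx, row in frame[mask].iterrows():
        # 0-based df index -> 1-based sheet row, after skipping the header
        file_row = idx + header_rows + 1
        messages.append(
            f"Row {file_row}: cannot convert {row[column]!r} in '{column}' "
            f"(full row: {row.to_dict()})"
        )
    return messages


messages = describe_bad_rows(df, bad_mask, "column1")
for message in messages:
    print(message)
```

If the sheet has extra title rows or the frame was re-indexed, the offset has to be adjusted accordingly.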

I updated the mask because float() accepts the string 'NaN', so that value won't actually show up as an issue in your read, even though pd.to_numeric still coerces it to NaN.

float('NaN')
#nan
pd.to_numeric('NaN')
#ValueError: Unable to parse string "NaN" at position 0
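
Because float() and pd.to_numeric do not agree on every input, another option is to test each cell with float() itself, so the flagged rows are exactly the ones the converter in the question would reject. This is a minimal sketch rather than a tuned solution: the function name fails_float and the sample frame are assumptions for illustration, and a Python-level loop over cells will be slower than pd.to_numeric on large sheets.

```python
import pandas as pd


def fails_float(value):
    """Return True if float(value) raises, mirroring the question's converter."""
    try:
        float(value)
    except (TypeError, ValueError):
        return True
    return False


# Stand-in for the frame returned by pd.read_excel(xl) (assumed sample data)
df = pd.DataFrame({
    "column1": [100, "100%", "112,312", 171, 123.123, float("nan")],
    "column2": list("ABCDEF"),
})

# Apply the check cell by cell; missing values pass because float(nan) works.
bad_mask = df["column1"].map(fails_float).astype(bool)
bad_rows = df[bad_mask]
print(bad_rows)
```

With this mask no special handling of 'NaN' strings is needed, since anything float() accepts is left alone.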
