Ignoring non-numerical string values in pandas dataframe

Question

I have a DataFrame in which a column might have three kinds of values, integers (12331), integers as strings ('345') or some other string ('text').

Is there a way to drop all rows with the last kind of value ('text') from the DataFrame, and convert the integers stored as strings into actual integers? Or at least some way to ignore the rows that cause type errors when I'm summing the column?

This dataframe is from reading a pretty big CSV file (25 GB), so I'd like some solution that would work when reading in chunks.

Recommended answer

Pandas has some tools for converting these kinds of columns, but they may not suit your needs exactly. pd.to_numeric with errors='coerce' converts mixed columns like yours, turning non-numeric strings into NaN. This means you'll get a float column rather than an integer one, because NaN is a floating-point value and the default integer dtypes cannot hold it. That usually doesn't matter much, but it's good to be aware of.

import pandas as pd

df = pd.DataFrame({'mixed_types': [12331, '345', 'text']})

pd.to_numeric(df['mixed_types'], errors='coerce')
Out[7]: 
0    12331.0
1      345.0
2        NaN
Name: mixed_types, dtype: float64
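
For the summing case in the question, the coerced column can be summed directly, since Series.sum() skips NaN by default. The sketch below also shows one possible way to get integers back without dropping rows, using pandas' nullable 'Int64' dtype. This is an illustration rather than part of the original answer, and it assumes every numeric value in the column is a whole number.

```python
import pandas as pd

df = pd.DataFrame({'mixed_types': [12331, '345', 'text']})

# Coerce non-numeric strings to NaN; the result is a float64 Series
converted = pd.to_numeric(df['mixed_types'], errors='coerce')

# Series.sum() uses skipna=True by default, so the 'text' row is ignored
total = converted.sum()

# Optional: the nullable integer dtype holds integers alongside missing values
# (assumes all non-missing values are whole numbers)
as_nullable_int = converted.astype('Int64')
```
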

If you want to drop all the NaN rows:

# Replace the column with the converted values
df['mixed_types'] = pd.to_numeric(df['mixed_types'], errors='coerce')

# Drop NA values, listing the converted columns explicitly
#   so NA values in other columns aren't dropped
df.dropna(subset=['mixed_types'])
Out[11]: 
   mixed_types
0      12331.0
1        345.0
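
Since the question mentions a 25 GB CSV read in chunks, the same steps can be applied per chunk. The following is a minimal sketch, not taken from the original answer: the in-memory csv_data string is a hypothetical stand-in for the real file, and chunksize=2 is chosen only to make the example small. It assumes the numeric values are whole numbers, so that casting to int64 after dropping NaN rows is safe.

```python
import io
import pandas as pd

# Hypothetical stand-in for the large CSV; a real run would pass a file path
csv_data = io.StringIO("mixed_types\n12331\n345\ntext\n7\nmore text\n")

total = 0
kept_rows = 0

# With chunksize set, read_csv yields DataFrames of at most that many rows
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Turn non-numeric strings into NaN within this chunk
    chunk['mixed_types'] = pd.to_numeric(chunk['mixed_types'], errors='coerce')

    # Drop only the rows where the converted column is NaN
    clean = chunk.dropna(subset=['mixed_types'])

    # With NaN rows gone, the column can be cast back to integers
    clean = clean.astype({'mixed_types': 'int64'})

    total += int(clean['mixed_types'].sum())
    kept_rows += len(clean)
```

Accumulating a running total and row count keeps memory use bounded by the chunk size; if the cleaned rows themselves are needed, each clean chunk could instead be appended to an output file.
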
