Ignoring non-numerical string values in pandas dataframe
Question
I have a DataFrame in which a column might have three kinds of values, integers (12331), integers as strings ('345') or some other string ('text').
Is there a way to drop all rows with the last kind of string from the dataframe, and convert the first kind of string into integers? Or at least some way to ignore the rows that cause type errors if I'm summing the column.
This dataframe is from reading a pretty big CSV file (25 GB), so I'd like some solution that would work when reading in chunks.
Recommended answer
Pandas has some tools for converting these kinds of columns, but they may not suit your needs exactly. pd.to_numeric converts mixed columns like yours and, when called with errors='coerce', turns non-numeric strings into NaN. This means you'll get a float column, not an integer one, because NaN is a floating-point value that a plain integer column cannot hold. That usually doesn't matter too much, but it's good to be aware of.
import pandas as pd

df = pd.DataFrame({'mixed_types': [12331, '345', 'text']})
# errors='coerce' turns values that cannot be parsed into NaN
pd.to_numeric(df['mixed_types'], errors='coerce')
Out[7]:
0    12331.0
1      345.0
2        NaN
Name: mixed_types, dtype: float64
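If the float result is a problem, one option in reasonably recent pandas versions is the nullable integer dtype "Int64" (capital I), which can hold missing values. The following is a minimal sketch under that assumption, not part of the original answer; the name converted is only illustrative, and the cast is expected to fail if the column contains non-integral numbers such as 3.5.

```python
import pandas as pd

df = pd.DataFrame({"mixed_types": [12331, "345", "text"]})

# Coerce unparseable strings to NaN, then cast to the nullable integer dtype.
# Assumption: every numeric value in the column is a whole number.
converted = pd.to_numeric(df["mixed_types"], errors="coerce").astype("Int64")

print(converted.dtype)  # nullable integer dtype
print(converted.sum())  # missing values are skipped by default
```

The missing entry shows up as <NA> instead of NaN, and the remaining values stay integers.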
If you want to drop all the NaN rows:
# Replace the column with the converted values
df['mixed_types'] = pd.to_numeric(df['mixed_types'], errors='coerce')
# Drop NA values, listing the converted columns explicitly
# so NA values in other columns aren't dropped
df.dropna(subset = ['mixed_types'])
Out[11]:
   mixed_types
0      12331.0
1        345.0
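Since the question mentions a 25 GB file read in chunks, the same conversion can be applied chunk by chunk. The sketch below is only an illustration: the in-memory csv_data stands in for the real file, the chunksize of 2 is chosen just to exercise several chunks, and the variable names are hypothetical. It also assumes the numeric values are whole numbers, because astype("int64") would silently truncate something like 3.5.

```python
import io

import pandas as pd

# Hypothetical stand-in for the large CSV file; pass the real path instead.
csv_data = io.StringIO("mixed_types\n12331\n345\ntext\n7\nmore text\n10\n")

total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Non-numeric strings become NaN, which dropna() then removes.
    values = pd.to_numeric(chunk["mixed_types"], errors="coerce").dropna()
    # With no NaN left, the values can be cast back to plain integers.
    total += int(values.astype("int64").sum())

print(total)
```

Each chunk is converted independently, so it does not matter that pandas may infer an integer dtype for one chunk and an object dtype for another.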