Change column type in pandas


Problem description


I want to convert a table, represented as a list of lists, into a Pandas DataFrame. As an extremely simplified example:

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 into floats? Is there a way to specify the types while converting to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the type for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.

Solution

You have four main options for converting types in pandas:

  1. to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta(); a short sketch of both follows this list.)

  2. astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).

  3. infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.

  4. convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas' object to indicate a missing value).
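As mentioned in option 1, to_datetime() and to_timedelta() follow the same pattern as to_numeric() but target datetime and timedelta values. A minimal sketch (output shown as I'd expect it; the exact formatting can vary between pandas versions):

>>> pd.to_datetime(pd.Series(['2021-01-01', '2021-06-15']))
0   2021-01-01
1   2021-06-15
dtype: datetime64[ns]

>>> pd.to_timedelta(pd.Series(['1 days', '2 hours']))
0   1 days 00:00:00
1   0 days 02:00:00
dtype: timedelta64[ns]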

Read on for more detailed explanations and usage of each of these methods.


1. to_numeric()

The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().

This function will try to change non-numeric objects (such as strings) into integers or floating-point numbers as appropriate.

Basic usage

The input to to_numeric() is a Series or a single column of a DataFrame.

>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])

You can also use it to convert multiple columns of a DataFrame via the apply() method:

# convert all columns of DataFrame
df = df.apply(pd.to_numeric) # convert all columns of DataFrame

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)

As long as your values can all be converted, that's probably all you need.

Error handling

But what if some values can't be converted to a numeric type?

to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.

Here's an example using a Series of strings s which has the object dtype:

>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string 'pandas':

>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string

Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:

>>> pd.to_numeric(s, errors='coerce')
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64

The third option for errors is just to ignore the operation if an invalid value is encountered:

>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched

This last option is particularly useful when you want to convert your entire DataFrame but don't know which of its columns can be converted reliably to a numeric type. In that case, just write:

df.apply(pd.to_numeric, errors='ignore')

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
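As a small sketch of that behaviour (the column names are purely illustrative, and newer pandas releases may warn that errors='ignore' is deprecated):

>>> df = pd.DataFrame({'num': ['1', '2', '3'], 'text': ['x', 'y', 'z']})
>>> df.apply(pd.to_numeric, errors='ignore').dtypes
num      int64
text    object
dtype: object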

Downcasting

By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).

That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?

to_numeric() gives you the option to downcast to 'integer', 'signed', 'unsigned', or 'float'. Here's an example for a simple Series s of integer type:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

Downcasting to 'integer' uses the smallest possible integer that can hold the values:

>>> pd.to_numeric(s, downcast='integer')
0    1
1    2
2   -7
dtype: int8

Downcasting to 'float' similarly picks a smaller than normal floating type:

>>> pd.to_numeric(s, downcast='float')
0    1.0
1    2.0
2   -7.0
dtype: float32
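If you want to check the memory saving for yourself, Series.memory_usage() gives a quick comparison (a rough sketch; the byte counts assume the usual 8-byte default integer):

>>> s = pd.Series(range(1000))
>>> s.memory_usage(index=False)                                       # int64: 8 bytes per value
8000
>>> pd.to_numeric(s, downcast='integer').memory_usage(index=False)    # int16 is enough here
2000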


2. astype()

The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try and go from one type to any other.

Basic usage

Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and astype() will try and convert it for you:

# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype('category')

Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example, if you have a NaN or inf value you'll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be returned untouched.
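For example, a small sketch with a NaN in the data:

>>> s = pd.Series([1.0, 2.5, np.nan])
>>> s.astype(int, errors='ignore')   # plain s.astype(int) would raise because of the NaN
0    1.0
1    2.5
2    NaN
dtype: float64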

Be careful

astype() is powerful, but it will sometimes convert values "incorrectly". For example:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2⁸ − 7)!

Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.
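As I understand the behaviour, a quick sketch of that safeguard: when the values don't fit the requested unsigned type, to_numeric() simply skips the downcast rather than wrapping the negative value round:

>>> pd.to_numeric(s, downcast='unsigned')   # -7 doesn't fit, so the data is left as int64
0    1
1    2
2   -7
dtype: int64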


3. infer_objects()

Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

>>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
>>> df.dtypes
a    object
b    object
dtype: object

Using infer_objects(), you can change the type of column 'a' to int64:

>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object

Column 'b' has been left alone since its values were strings, not integers. If you wanted to force both columns to an integer type, you could use df.astype(int) instead.
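For example, a quick sketch continuing with the same df:

>>> df.astype(int).dtypes
a    int64
b    int64
dtype: object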


4. convert_dtypes()

pandas versions 1.0 and above include a method convert_dtypes() to convert Series and DataFrame columns to the best possible dtype that supports the pd.NA missing value.

Here "best possible" means the type most suited to hold the values. For example, this a pandas integer type, if all of the values are integers (or missing values): an object column of Python integer objects are converted to Int64, a column of NumPy int32 values, will become the pandas dtype Int32.

With our object DataFrame df, we get the following result:

>>> df.convert_dtypes().dtypes                                             
a     Int64
b    string
dtype: object

Since column 'a' held integer values, it was converted to the Int64 type (which is capable of holding missing values, unlike int64).

Column 'b' contained string objects, so was changed to pandas' string dtype.
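To see why the nullable Int64 type matters, here is a small sketch with a missing value (the Series is purely illustrative):

>>> s = pd.Series([1, 2, None])   # the None forces a float64 dtype with NaN
>>> s.convert_dtypes()            # recovers an integer type that can hold pd.NA
0       1
1       2
2    <NA>
dtype: Int64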

By default, this method will infer the type from object values in each column. We can change this by passing infer_objects=False:

>>> df.convert_dtypes(infer_objects=False).dtypes                          
a    object
b    string
dtype: object

Now column 'a' has remained an object column: pandas knows it can be described as an 'integer' column (internally it ran infer_dtype), but didn't infer exactly which integer dtype it should have, so it did not convert it. Column 'b' was again converted to the 'string' dtype as it was recognised as holding 'string' values.
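If you're curious about that inference step, pd.api.types.infer_dtype() can be called directly on a column (a minimal sketch using the same df):

>>> pd.api.types.infer_dtype(df['a'])
'integer'
>>> pd.api.types.infer_dtype(df['b'])
'string'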
