Change column type from string to float in Pandas

Question

I want to convert a table, represented as a list of lists, into a Pandas DataFrame. As an extremely simplified example:

import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 into floats? Is there a way to specify the types while converting to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the type for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.

Recommended answer

You have four main options for converting types in pandas:

1. to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)

2. astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).

3. infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.

4. convert_dtypes() - convert DataFrame columns to the "best possible" dtype that supports pd.NA (pandas' object to indicate a missing value).

Read on for more detailed explanations and usage of each of these methods.

1. to_numeric()

The best way to convert one or more columns of a DataFrame to numeric values is to use pandas.to_numeric().

This function will try to change non-numeric objects (such as strings) into integers or floating point numbers as appropriate.

The input to to_numeric() is a Series or a single column of a DataFrame.

>>> s = pd.Series(["8", 6, "7.5", 3, "0.9"]) # mixed string and numeric values
>>> s
0      8
1      6
2    7.5
3      3
4    0.9
dtype: object

>>> pd.to_numeric(s) # convert everything to float values
0    8.0
1    6.0
2    7.5
3    3.0
4    0.9
dtype: float64

As you can see, a new Series is returned. Remember to assign this output to a variable or column name to continue using it:

# convert Series
my_series = pd.to_numeric(my_series)

# convert column "a" of a DataFrame
df["a"] = pd.to_numeric(df["a"])

You can also use it to convert multiple columns of a DataFrame via the apply() method:

# convert all columns of DataFrame
df = df.apply(pd.to_numeric)

# convert just columns "a" and "b"
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)

As long as your values can all be converted, that's probably all you need.
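Applied to the list-of-lists table from the question, this might look like the following minimal sketch. The column names name, x and y are made up here for illustration; the question's DataFrame uses the default integer labels.

```python
import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]

# 'name', 'x' and 'y' are hypothetical labels, not part of the question
df = pd.DataFrame(a, columns=['name', 'x', 'y'])

# convert only the two columns known to hold numeric strings
df[['x', 'y']] = df[['x', 'y']].apply(pd.to_numeric)

print(df.dtypes)
```

The first column is left as it was, since it is not passed to to_numeric().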

But what if some values can't be converted to a numeric type?

to_numeric() also takes an errors keyword argument that allows you to force non-numeric values to be NaN, or simply ignore columns containing these values.

Here's an example using a Series of strings s which has the object dtype:

>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

The default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string 'pandas':

>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string

Rather than fail, we might want 'pandas' to be considered a missing/bad numeric value. We can coerce invalid values to NaN as follows using the errors keyword argument:

>>> pd.to_numeric(s, errors='coerce')
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64
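One practical use of errors='coerce' is to find out which values failed to parse. The sketch below reuses the Series s from above; the names converted and bad are illustrative only.

```python
import pandas as pd

s = pd.Series(['1', '2', '4.7', 'pandas', '10'])

converted = pd.to_numeric(s, errors='coerce')

# values that became NaN after coercion but were not missing originally
bad = s[converted.isna() & s.notna()]
print(bad)
```

Inspecting bad before discarding anything can help catch values that were silently turned into NaN.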

The third option for errors is to ignore the operation if an invalid value is encountered. Note that errors='ignore' is deprecated as of pandas 2.2, so this form is best avoided in new code:

>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched

This last option is particularly useful when you want to convert your entire DataFrame but don't know which of your columns can be converted reliably to a numeric type. In that case, just write:

df.apply(pd.to_numeric, errors='ignore')

The function will be applied to each column of the DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
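Because errors='ignore' is deprecated in recent pandas releases, roughly the same effect can be had by catching the exception per column. This is a minimal sketch, not part of the original answer; the helper name to_numeric_where_possible is hypothetical.

```python
import pandas as pd

def to_numeric_where_possible(frame):
    """Hypothetical helper: convert each column that parses cleanly
    as numeric and leave every other column untouched."""
    out = frame.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # column holds values that are not numeric; keep it as-is
            pass
    return out

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = to_numeric_where_possible(pd.DataFrame(a))
print(df.dtypes)
```

This addresses the "dynamic" requirement from the question: no column needs to be named up front.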

By default, conversion with to_numeric() will give you either an int64 or float64 dtype (or whatever integer width is native to your platform).

That's usually what you want, but what if you wanted to save some memory and use a more compact dtype, like float32, or int8?

to_numeric() gives you the option to downcast to 'integer', 'signed', 'unsigned', or 'float'. Here's an example for a simple Series s of integer type:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

Downcasting to 'integer' uses the smallest possible integer that can hold the values:

>>> pd.to_numeric(s, downcast='integer')
0    1
1    2
2   -7
dtype: int8

Downcasting to 'float' similarly picks a smaller than normal floating type:

>>> pd.to_numeric(s, downcast='float')
0    1.0
1    2.0
2   -7.0
dtype: float32
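To see the memory saving that motivates downcasting, the sizes before and after can be compared with Series.memory_usage(). This is a small sketch on the same three-element Series; the saving only matters in practice for much larger data.

```python
import pandas as pd

s = pd.Series([1, 2, -7], dtype='int64')
small = pd.to_numeric(s, downcast='integer')

# index=False counts only the bytes used by the values themselves
print(s.memory_usage(index=False))      # 3 values * 8 bytes
print(small.memory_usage(index=False))  # 3 values * 1 byte
```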




2. astype()

The astype() method enables you to be explicit about the dtype you want your DataFrame or Series to have. It's very versatile in that you can try to go from one type to any other.

Just pick a type: you can use a NumPy dtype (e.g. np.int16), some Python types (e.g. bool), or pandas-specific types (like the categorical dtype).

Call the method on the object you want to convert and astype() will try and convert it for you:

# convert all DataFrame columns to the int64 dtype
df = df.astype(int)

# convert column "a" to int64 dtype and "b" to complex type
df = df.astype({"a": int, "b": complex})

# convert Series to float16 type
s = s.astype(np.float16)

# convert Series to Python strings
s = s.astype(str)

# convert Series to categorical type - see docs for more details
s = s.astype('category')

Notice I said "try" - if astype() does not know how to convert a value in the Series or DataFrame, it will raise an error. For example if you have a NaN or inf value you'll get an error trying to convert it to an integer.

As of pandas 0.20.0, this error can be suppressed by passing errors='ignore'. Your original object will be returned untouched.
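The NaN case can be reproduced as follows. The sketch also shows one possible workaround, casting to the nullable 'Int64' dtype that is discussed under convert_dtypes(); that choice is a suggestion here, not something the answer prescribes at this point.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

try:
    s.astype(int)          # NaN has no representation in a NumPy integer dtype
    raised = False
except ValueError:
    raised = True

# the pandas nullable integer dtype stores the missing value as pd.NA
nullable = s.astype('Int64')
print(raised)
print(nullable)
```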

astype() is powerful, but it will sometimes convert values "incorrectly". For example:

>>> s = pd.Series([1, 2, -7])
>>> s
0    1
1    2
2   -7
dtype: int64

These are small integers, so how about converting to an unsigned 8-bit type to save memory?

>>> s.astype(np.uint8)
0      1
1      2
2    249
dtype: uint8

The conversion worked, but the -7 was wrapped round to become 249 (i.e. 2^8 - 7)!

Trying to downcast using pd.to_numeric(s, downcast='unsigned') instead could help prevent this error.
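A short sketch of that difference: when the data contains a negative value, downcast='unsigned' leaves the values and the signed dtype alone instead of wrapping them, and only downcasts when every value fits. The variable names kept and shrunk are illustrative.

```python
import numpy as np
import pandas as pd

with_negative = pd.Series([1, 2, -7])
all_positive = pd.Series([1, 2, 7])

# -7 cannot be held by an unsigned type, so no downcast happens
kept = pd.to_numeric(with_negative, downcast='unsigned')

# every value is non-negative, so the smallest unsigned type is chosen
shrunk = pd.to_numeric(all_positive, downcast='unsigned')

print(kept.dtype, kept.tolist())
print(shrunk.dtype, shrunk.tolist())
```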

3. infer_objects()

Version 0.21.0 of pandas introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type (soft conversions).

For example, here's a DataFrame with two columns of object type. One holds actual integers and the other holds strings representing integers:

>>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
>>> df.dtypes
a    object
b    object
dtype: object

Using infer_objects(), you can change the type of column 'a' to int64:

>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object

Column 'b' has been left alone since its values were strings, not integers. If you wanted to try and force the conversion of both columns to an integer type, you could use df.astype(int) instead.
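For comparison, here is a sketch of the forced conversion on the same DataFrame. Unlike infer_objects(), astype(int) parses the strings in column 'b'; the exact integer width of the result may depend on the platform.

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3', '2', '1']}, dtype='object')

# hard conversion: every value must be convertible to an integer
forced = df.astype(int)
print(forced.dtypes)
```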

4. convert_dtypes()

Version 1.0 and above includes a method convert_dtypes() to convert Series and DataFrame columns to the best possible dtype that supports the pd.NA missing value.

Here "best possible" means the type most suited to hold the values. For example, this is a pandas integer type if all of the values are integers (or missing values): an object column of Python integer objects is converted to Int64, and a column of NumPy int32 values will become the pandas dtype Int32.

With our object DataFrame df, we get the following result:

>>> df.convert_dtypes().dtypes
a     Int64
b    string
dtype: object

列'a'保留整数值,它被转换为 Int64 类型(与 int64 )。

Since column 'a' held integer values, it was converted to the Int64 type (which is capable of holding missing values, unlike int64).

列'b'包含字符串对象,因此已更改为pandas的 string dtype。

Column 'b' contained string objects, so was changed to pandas' string dtype.
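The benefit of these pd.NA-aware dtypes shows up once a column has missing values. In this sketch (the data is made up for illustration), a column that pandas would otherwise store as float64 because of a missing entry becomes Int64 after convert_dtypes().

```python
import pandas as pd

# 'a' is stored as float64 because of the missing value
df = pd.DataFrame({'a': [1, None, 3], 'b': ['x', None, 'z']})
before = df.dtypes['a']

converted = df.convert_dtypes()
print(before)
print(converted.dtypes)
print(converted['a'])
```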

By default, this method will infer the type from object values in each column. We can change this by passing infer_objects=False:

>>> df.convert_dtypes(infer_objects=False).dtypes
a    object
b    string
dtype: object

Now column 'a' remained an object column: pandas knows it can be described as an 'integer' column (internally it ran infer_dtype) but didn't infer exactly what dtype of integer it should have so did not convert it. Column 'b' was again converted to 'string' dtype as it was recognised as holding 'string' values.
