Pandas: change data type of columns


Problem description


I want to convert a table, represented as a list of lists, into a Pandas DataFrame. As an extremely simplified example:

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 into floats? Is there a way to specify the types while converting to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the type for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.
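For reference, checking df.dtypes after the construction above shows every column with the default object dtype, since all the values are strings (assuming pandas is imported as pd):

>>> df = pd.DataFrame(a)
>>> df.dtypes
0    object
1    object
2    object
dtype: object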

Solution

You can use pd.to_numeric (introduced in version 0.17) to convert a column or a Series to a numeric type. The function can also be applied over multiple columns of a DataFrame using apply.

Importantly, the function also takes an errors keyword argument that lets you force non-numeric values to be NaN, or simply ignore columns containing these values.

Example uses are shown below.

Individual column / Series

Here's an example using a Series of strings s which has the object dtype:

>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

The function's default behaviour is to raise if it can't convert a value. In this case, it can't cope with the string 'pandas':

>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string

Rather than fail, we might want 'pandas' to be considered a missing/bad value. We can coerce invalid values to NaN as follows:

>>> pd.to_numeric(s, errors='coerce')
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64

The third option is just to ignore the operation if an invalid value is encountered:

>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched

Multiple columns / entire DataFrames

We might want to apply this operation to multiple columns. Processing each column in turn is tedious, so we can use DataFrame.apply to have the function act on each column.

Borrowing the DataFrame from the question:

>>> a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
>>> df = pd.DataFrame(a, columns=['col1','col2','col3'])
>>> df
  col1 col2  col3
0    a  1.2   4.2
1    b   70  0.03
2    x    5     0

Then we can write:

df[['col2','col3']] = df[['col2','col3']].apply(pd.to_numeric)

and now 'col2' and 'col3' have dtype float64 as desired.
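We can confirm this by inspecting the dtypes (a quick sketch of the expected output; 'col1' stays as object since it holds strings):

>>> df.dtypes
col1     object
col2    float64
col3    float64
dtype: object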

However, we might not know which of our columns can be converted reliably to a numeric type. In that case we can just write:

df.apply(lambda x: pd.to_numeric(x, errors='ignore'))

Then the function will be applied to the whole DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
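As a rough illustration with a made-up mixed DataFrame (hypothetical column names and values), the string column comes back unchanged while the numeric-looking one is converted:

>>> mixed = pd.DataFrame({'label': ['a', 'b'], 'amount': ['1.5', '2']})
>>> mixed.apply(lambda x: pd.to_numeric(x, errors='ignore')).dtypes
label      object
amount    float64
dtype: object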

There are also pd.to_datetime and pd.to_timedelta for conversion to dates and timedeltas.
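For completeness, a minimal sketch of those two converters (the example values here are made up):

>>> pd.to_datetime(pd.Series(['2015-01-01', '2015-02-01']))
0   2015-01-01
1   2015-02-01
dtype: datetime64[ns]

>>> pd.to_timedelta(pd.Series(['1 days', '2 days']))
0   1 days
1   2 days
dtype: timedelta64[ns]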
