Pandas: change data type of columns
I want to convert a table, represented as a list of lists, into a Pandas DataFrame. As an extremely simplified example:
import pandas as pd
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)
What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 into floats? Is there a way to specify the types while converting to DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the type for each column? Ideally I would like to do this in a dynamic way because there can be hundreds of columns and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.
Solution
You can use pd.to_numeric (introduced in version 0.17) to convert a column or a Series to a numeric type. The function can also be applied over multiple columns of a DataFrame using apply.
Importantly, the function also takes an errors keyword argument that lets you force non-numeric values to be NaN, or simply ignore columns containing these values.
Example uses are shown below.
Individual column / Series
Here's an example using a Series of strings s which has the object dtype:
>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object
The function's default behaviour is to raise an exception if it can't convert a value. In this case, it can't cope with the string 'pandas':
>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string
Rather than fail, we might want 'pandas' to be considered a missing/bad value. We can coerce invalid values to NaN as follows:
>>> pd.to_numeric(s, errors='coerce')
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64
The third option is just to ignore the operation if an invalid value is encountered:
>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched
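As a rough, self-contained sketch of the raise and coerce modes described above: the snippet below checks that the default call fails, then uses the coerced result to find out which entries could not be parsed. The variable names `raised`, `coerced` and `bad_values` are only illustrative and are not part of the original answer.

```python
import pandas as pd

s = pd.Series(['1', '2', '4.7', 'pandas', '10'])

# Default behaviour (errors='raise'): an unparseable string raises ValueError.
try:
    pd.to_numeric(s)
    raised = False
except ValueError:
    raised = True

# errors='coerce': unparseable entries become NaN and the result is float64.
coerced = pd.to_numeric(s, errors='coerce')

# Entries that were present in the input but are NaN after coercion
# are the ones that failed to parse.
bad_values = s[coerced.isna() & s.notna()].tolist()

print(raised)
print(coerced.dtype)
print(bad_values)
```

Inspecting the rejected values like this can be a useful sanity check before deciding whether coercing to NaN is acceptable for a given column.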
Multiple columns / entire DataFrames
We might want to apply this operation to multiple columns. Processing each column in turn is tedious, so we can use DataFrame.apply to have the function act on each column.
Borrowing the DataFrame from the question:
>>> a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
>>> df = pd.DataFrame(a, columns=['col1','col2','col3'])
>>> df
  col1 col2  col3
0    a  1.2   4.2
1    b   70  0.03
2    x    5     0
Then we can write:
df[['col2','col3']] = df[['col2','col3']].apply(pd.to_numeric)
and now 'col2' and 'col3' have dtype float64 as desired.
However, we might not know which of our columns can be converted reliably to a numeric type. In that case we can just write:
df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
Then the function will be applied to the whole DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.
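One caveat worth hedging: `errors='ignore'` was deprecated in pandas 2.2, so on recent versions the call above may emit a warning or stop working. A minimal sketch of the same "convert what you can, leave the rest alone" idea is to catch the exception yourself. The helper name `to_numeric_or_unchanged` is hypothetical (it is not a pandas function), and this is one possible approach rather than an official replacement.

```python
import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col1', 'col2', 'col3'])


def to_numeric_or_unchanged(column):
    """Hypothetical helper: return a numeric column if every value parses,
    otherwise return the column exactly as it was passed in."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column


# apply() calls the helper once per column of the DataFrame.
converted = df.apply(to_numeric_or_unchanged)

print(converted.dtypes)
```

Here 'col2' and 'col3' should come back as floating-point columns, while 'col1' keeps its string values because 'a', 'b' and 'x' cannot be parsed as numbers.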
There are also pd.to_datetime and pd.to_timedelta for conversion to datetimes and timedeltas.
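For completeness, here is a small sketch of those two functions applied to string data. The sample values and variable names are invented for illustration; `errors='coerce'` is used so that unparseable entries become NaT instead of raising.

```python
import pandas as pd

# Made-up sample data: the last entry in each Series is deliberately invalid.
dates = pd.Series(['2021-01-31', '2021-02-28', 'not a date'])
durations = pd.Series(['1 days', '2 hours', 'soon'])

# Invalid entries are turned into NaT (the datetime/timedelta missing value).
parsed_dates = pd.to_datetime(dates, errors='coerce')
parsed_durations = pd.to_timedelta(durations, errors='coerce')

print(parsed_dates)
print(parsed_durations)
```

As with pd.to_numeric, either function can be passed to DataFrame.apply to convert several columns at once.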