Convert pandas dataframe to NumPy array
Problem description
I am interested in knowing how to convert a pandas dataframe into a NumPy array.
dataframe:
import numpy as np
import pandas as pd
index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index)
df = df.rename_axis('ID')
gives
label A B C
ID
1 NaN 0.2 NaN
2 NaN NaN 0.5
3 NaN 0.2 0.5
4 0.1 0.2 NaN
5 0.1 0.2 0.5
6 0.1 NaN 0.5
7 0.1 NaN NaN
I would like to convert this to a NumPy array, like so:
array([[ nan, 0.2, nan],
[ nan, nan, 0.5],
[ nan, 0.2, 0.5],
[ 0.1, 0.2, nan],
[ 0.1, 0.2, 0.5],
[ 0.1, nan, 0.5],
[ 0.1, nan, nan]])
How can I do this?
As a bonus, is it possible to preserve the dtypes, like this?
array([[ 1, nan, 0.2, nan],
[ 2, nan, nan, 0.5],
[ 3, nan, 0.2, 0.5],
[ 4, 0.1, 0.2, nan],
[ 5, 0.1, 0.2, 0.5],
[ 6, 0.1, nan, 0.5],
[ 7, 0.1, nan, nan]],
dtype=[('ID', '<i4'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
or similar?
df.to_numpy() is better than df.values, here's why.*
It's time to deprecate your usage of values and as_matrix().
pandas v0.24.0 introduced two new methods for obtaining NumPy arrays from pandas objects:
to_numpy(), which is defined on Index, Series, and DataFrame objects, and
array, which is defined on Index and Series objects only.
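As a rough illustration of the difference between the two (a minimal sketch; the sample Series and the variable names are made up for this example), to_numpy() hands back a plain NumPy ndarray, while array returns the pandas-level array that backs the object:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the question.
s = pd.Series([1, 2, 3], name="A")

as_ndarray = s.to_numpy()   # a NumPy ndarray
as_pd_array = s.array       # a pandas ExtensionArray wrapping the data

print(type(as_ndarray).__name__)  # ndarray
print(isinstance(as_pd_array, pd.api.extensions.ExtensionArray))  # True

# DataFrame defines to_numpy() but has no array attribute.
frame = s.to_frame()
print(hasattr(frame, "to_numpy"), hasattr(frame, "array"))  # True False
```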
If you visit the v0.24 docs for .values, you will see a big red warning that says:
Warning: We recommend using DataFrame.to_numpy() instead.
See this section of the v0.24.0 release notes, and this answer for more information.
* - to_numpy() is my recommended method for any production code that needs to run reliably for many versions into the future. However, if you're just making a scratchpad in Jupyter or the terminal, using .values to save a few milliseconds of typing is a permissible exception. You can always add the fit and finish later.
Towards Better Consistency: to_numpy()
In the spirit of better consistency throughout the API, a new method to_numpy has been introduced to extract the underlying NumPy array from DataFrames.
# Setup
df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
index=['a', 'b', 'c'])
# Convert the entire DataFrame
df.to_numpy()
# array([[1, 4, 7],
# [2, 5, 8],
# [3, 6, 9]])
# Convert specific columns
df[['A', 'C']].to_numpy()
# array([[1, 7],
# [2, 8],
# [3, 9]])
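Applied to the DataFrame from the question, the same call should give the array the asker wanted. The sketch below rebuilds that frame; using reset_index() to pull the ID index in as a column is my own addition for illustration, and note that it upcasts ID to float because a plain ndarray has a single dtype:

```python
import numpy as np
import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index).rename_axis('ID')

# Values only: the index is not part of the result.
arr = df.to_numpy()
print(arr.shape)  # (7, 3)
print(arr.dtype)  # float64

# Include the index as the first column (ID becomes float64).
with_id = df.reset_index().to_numpy()
print(with_id.shape)  # (7, 4)
```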
As mentioned above, this method is also defined on Index and Series objects (see here).
df.index.to_numpy()
# array(['a', 'b', 'c'], dtype=object)
df['A'].to_numpy()
# array([1, 2, 3])
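With the default copy=False, a view is returned where possible (for instance, when every column shares a single dtype, as here), so modifications made to the array can affect the original. Note that copy=False does not guarantee a view, and with pandas' Copy-on-Write mode enabled the returned view is read-only.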
v = df.to_numpy()
v[0, 0] = -1
df
A B C
a -1 4 7
b 2 5 8
c 3 6 9
If you need a copy instead, use to_numpy(copy=True).
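A minimal sketch of the copy=True behaviour (the small DataFrame here is made up for the example): writing to the copied array leaves the DataFrame untouched.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

independent = df.to_numpy(copy=True)  # guaranteed to own its data
independent[0, 0] = -1                # modify the copy only

print(independent[0, 0])  # -1
print(df.loc[0, 'A'])     # 1
```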
pandas >= 1.0 update for ExtensionTypes
If you're using pandas 1.x, chances are you'll be dealing with extension types a lot more. You'll have to be a little more careful that these extension types are correctly converted.
a = pd.array([1, 2, None], dtype="Int64")
a
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
# Wrong
a.to_numpy()
# array([1, 2, <NA>], dtype=object) # yuck, objects
# Correct
a.to_numpy(dtype='float', na_value=np.nan)
# array([ 1., 2., nan])
# Also correct
a.to_numpy(dtype='int', na_value=-1)
# array([ 1, 2, -1])
This is called out in the docs.
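The same arguments can be passed when converting a whole DataFrame with nullable columns. This is a sketch under the assumption that you are on pandas 1.1 or later, where DataFrame.to_numpy accepts na_value; the column names and data are invented for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two nullable-integer columns.
df_nullable = pd.DataFrame({
    "x": pd.array([1, 2, None], dtype="Int64"),
    "y": pd.array([None, 5, 6], dtype="Int64"),
})

# Ask for float64 and map pd.NA to np.nan to avoid an object array.
arr = df_nullable.to_numpy(dtype="float64", na_value=np.nan)
print(arr.dtype)  # float64
```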
If you need the dtypes in the result...
As shown in another answer, DataFrame.to_records is a good way to do this.
df.to_records()
# rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
# dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
This cannot be done with to_numpy, unfortunately. However, as an alternative, you can use np.rec.fromrecords:
v = df.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
# dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
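For the bonus part of the question, here is a sketch of to_records on the asker's DataFrame. The index_dtypes argument (available since pandas 0.24) is used to request a 32-bit ID field like the one in the question; treat the exact dtype choice as an assumption about what the asker wanted:

```python
import numpy as np
import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index).rename_axis('ID')

# The named index becomes the first field; each column keeps its own dtype.
rec = df.to_records(index_dtypes="<i4")
print(rec.dtype.names)  # ('ID', 'A', 'B', 'C')
print(rec["ID"].dtype)  # int32
print(rec["A"].dtype)   # float64
```

Fields are accessed by name (rec["ID"], rec["A"]), so this is a one-dimensional array of records rather than a two-dimensional array.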
Performance-wise, it's nearly the same (actually, using rec.fromrecords is a bit faster).
df2 = pd.concat([df] * 10000)
%timeit df2.to_records()
%%timeit
v = df2.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
12.9 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.56 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Rationale for Adding a New Method
to_numpy() (in addition to array) was added as a result of discussions under two GitHub issues, GH19954 and GH23623.
Specifically, the docs mention the rationale:
[...] with .values it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each time. [...]
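The ambiguity described in that quote can be seen with a categorical Series. This is a small sketch with made-up data: .values returns a pandas Categorical, whereas to_numpy() returns an ndarray.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical data for illustration.
s = pd.Series(["x", "y", "x"], dtype="category")

print(type(s.values).__name__)      # Categorical
print(type(s.to_numpy()).__name__)  # ndarray
```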
to_numpy aims to improve the consistency of the API, which is a major step in the right direction. .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate to the newer API as soon as they can.
Critique of Other Solutions
DataFrame.values has inconsistent behaviour, as already noted.
DataFrame.get_values() is simply a wrapper around DataFrame.values, so everything said above applies.
DataFrame.as_matrix() is deprecated now, do NOT use!