Convert pandas dataframe to NumPy array


Problem description


I am interested in knowing how to convert a pandas dataframe into a NumPy array.

dataframe:

import numpy as np
import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index)
df = df.rename_axis('ID')

gives

      A    B    C
ID
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN

I would like to convert this to a NumPy array, like so:

array([[ nan,  0.2,  nan],
       [ nan,  nan,  0.5],
       [ nan,  0.2,  0.5],
       [ 0.1,  0.2,  nan],
       [ 0.1,  0.2,  0.5],
       [ 0.1,  nan,  0.5],
       [ 0.1,  nan,  nan]])

How can I do this?


As a bonus, is it possible to preserve the dtypes, like this?

array([[ 1, nan,  0.2,  nan],
       [ 2, nan,  nan,  0.5],
       [ 3, nan,  0.2,  0.5],
       [ 4, 0.1,  0.2,  nan],
       [ 5, 0.1,  0.2,  0.5],
       [ 6, 0.1,  nan,  0.5],
       [ 7, 0.1,  nan,  nan]],
     dtype=[('ID', '<i4'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

or similar?

Solution

df.to_numpy() is better than df.values; here's why.*

It's time to deprecate your usage of values and as_matrix().

pandas v0.24.0 introduced two new methods for obtaining NumPy arrays from pandas objects:

  1. to_numpy(), which is defined on Index, Series, and DataFrame objects, and
  2. array, which is defined on Index and Series objects only.
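As a rough illustration of how the two differ, here is a minimal sketch. The variable names are mine, and the concrete class behind .array varies between pandas versions, so the example only checks the base class.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], name="A")
df = s.to_frame()

# to_numpy() hands back a plain NumPy ndarray.
as_ndarray = s.to_numpy()
print(type(as_ndarray))

# .array hands back a pandas ExtensionArray wrapping the data
# (the concrete class name differs between pandas versions).
as_extension = s.array
print(type(as_extension))

# .array exists on Series and Index, but not on DataFrame.
print(hasattr(df.index, "array"), hasattr(df, "array"))
```

In short, reach for to_numpy() when you want an ndarray, and for .array when you want whatever array pandas actually stores.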

If you visit the v0.24 docs for .values, you will see a big red warning that says:

Warning: We recommend using DataFrame.to_numpy() instead.

See this section of the v0.24.0 release notes, and this answer for more information.

* - to_numpy() is my recommended method for any production code that needs to run reliably for many versions into the future. However, if you're just making a scratchpad in Jupyter or the terminal, using .values to save a few milliseconds of typing is a permissible exception. You can always add the fit 'n' finish later.



Towards Better Consistency: to_numpy()

In the spirit of better consistency throughout the API, a new method to_numpy has been introduced to extract the underlying NumPy array from DataFrames.

# Setup
df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, 
                  index=['a', 'b', 'c'])

# Convert the entire DataFrame
df.to_numpy()
# array([[1, 4, 7],
#        [2, 5, 8],
#        [3, 6, 9]])

# Convert specific columns
df[['A', 'C']].to_numpy()
# array([[1, 7],
#        [2, 8],
#        [3, 9]])

As mentioned above, this method is also defined on Index and Series objects (see here).

df.index.to_numpy()
# array(['a', 'b', 'c'], dtype=object)

df['A'].to_numpy()
#  array([1, 2, 3])
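Applied to the DataFrame from the question, the same calls might look like this. It is a sketch that rebuilds the question's frame; `values` and `ids` are illustrative names only.

```python
import numpy as np
import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
df = pd.DataFrame({"A": a, "B": b, "C": c}, index=index).rename_axis("ID")

values = df.to_numpy()      # 2-D float array; the index is not included
ids = df.index.to_numpy()   # the ID labels as a separate 1-D array

print(values.shape, values.dtype)
print(ids)
```

The NaN entries carry over unchanged, since the columns are already floating point.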

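By default (copy=False), to_numpy() returns a view where it can, for example when every column shares a single dtype as here, so modifications to the array may affect the original DataFrame. A DataFrame with mixed dtypes is converted to a common dtype first, which yields a copy. With Copy-on-Write enabled (the default from pandas 3.0), the returned view is read-only, so the in-place write below raises an error instead.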

v = df.to_numpy()
v[0, 0] = -1
 
df
   A  B  C
a -1  4  7
b  2  5  8
c  3  6  9

If you need a copy instead, use to_numpy(copy=True).
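A small sketch of the copy behaviour follows. The names are illustrative, and whether the default result shares memory with the DataFrame depends on the dtypes and the pandas version, so nothing here relies on it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

default = df.to_numpy()               # may or may not share memory with df
independent = df.to_numpy(copy=True)  # always a separate buffer

# Writing to the explicit copy never touches the DataFrame.
independent[0, 0] = -1
print(df.loc[0, "A"])

# Mixed dtypes are converted to a common dtype, which cannot be a view.
mixed = pd.DataFrame({"A": [1, 2], "B": [0.5, 1.5]})
print(mixed.to_numpy().dtype)
```

If you intend to mutate the array, asking for copy=True up front avoids depending on version-specific view semantics.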


pandas >= 1.0 update for ExtensionTypes

If you're using pandas 1.x, chances are you'll be dealing with extension types a lot more. You'll have to be a little more careful that these extension types are correctly converted.

a = pd.array([1, 2, None], dtype="Int64")                                  
a                                                                          

<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64 

# Wrong
a.to_numpy()                                                               
# array([1, 2, <NA>], dtype=object)  # yuck, objects

# Correct
a.to_numpy(dtype='float', na_value=np.nan)                                 
# array([ 1.,  2., nan])

# Also correct
a.to_numpy(dtype='int', na_value=-1)
# array([ 1,  2, -1])

This is called out in the docs.
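The same pattern works on a Series, and for other nullable dtypes. This is a minimal sketch; the choice of False as the fill value for missing booleans is an assumption of this example, not a pandas default.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, None, 3], dtype="Int64")         # nullable integer
flags = pd.Series([True, None, False], dtype="boolean")  # nullable boolean

# Name both the target dtype and the value that should replace <NA>.
as_float = numbers.to_numpy(dtype="float64", na_value=np.nan)
as_bool = flags.to_numpy(dtype=bool, na_value=False)

print(as_float.dtype, as_bool.dtype)
```

Spelling out dtype and na_value together is what keeps the result from falling back to an object array.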


If you need the dtypes in the result...

As shown in another answer, DataFrame.to_records is a good way to do this.

df.to_records()
# rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
#           dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])

This cannot be done with to_numpy, unfortunately. However, as an alternative, you can use np.rec.fromrecords:

v = df.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
#           dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])

Performance-wise, the two are nearly the same (in this measurement, rec.fromrecords is a bit faster).

df2 = pd.concat([df] * 10000)

%timeit df2.to_records()
%%timeit
v = df2.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())

12.9 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.56 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
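Returning to the bonus part of the question, to_records on the question's DataFrame could look like the sketch below. It assumes the index_dtypes parameter of to_records (available since pandas 0.24) to get the '<i4' ID field that was asked for. Note that a record array is one-dimensional, with one record per row, so this is the "or similar" form rather than the exact 2-D layout shown in the question.

```python
import numpy as np
import pandas as pd

index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
df = pd.DataFrame({"A": a, "B": b, "C": c}, index=index).rename_axis("ID")

# index_dtypes overrides the dtype stored for the index field.
rec = df.to_records(index_dtypes="<i4")
print(rec.dtype)

# Fields are addressed by name, and each keeps its own dtype.
print(rec["ID"])
print(rec["A"])
```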



Rationale for Adding a New Method

to_numpy() (in addition to array) was added as a result of discussions under two GitHub issues, GH19954 and GH23623.

Specifically, the docs mention the rationale:

[...] with .values it was unclear whether the returned value would be the actual array, some transformation of it, or one of pandas custom arrays (like Categorical). For example, with PeriodIndex, .values generates a new ndarray of period objects each time. [...]

to_numpy aims to improve the consistency of the API, which is a major step in the right direction. .values will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate to the newer API as soon as they can.
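The Categorical case mentioned in the quote can be seen in a short sketch (the example Series is made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(["x", "y", "x"], dtype="category")

# .values returns a pandas Categorical here, not an ndarray.
print(type(s.values))

# to_numpy() returns an ndarray for the same Series.
print(type(s.to_numpy()))
```

Code that expects ndarray methods will therefore behave differently depending on the dtype when it goes through .values, which is the inconsistency to_numpy() is meant to remove.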



Critique of Other Solutions

DataFrame.values has inconsistent behaviour, as already noted.

DataFrame.get_values() is simply a wrapper around DataFrame.values, so everything said above applies. It was deprecated in pandas 0.25 and removed in 1.0.

DataFrame.as_matrix() is deprecated (and was removed in pandas 1.0); do NOT use it!
