Creating a Pandas DataFrame with a numpy array containing multiple types

Question

I want to create a pandas DataFrame filled with zeros by default, with one column of integers and the others of floats. I am able to create a numpy array with the correct types (see the values variable below). However, when I pass it to the DataFrame constructor, the result contains only NaN values (see df below). I have also included the untyped code, which returns an array of floats (see df2).

import pandas as pd
import numpy as np

values = np.zeros((2,3), dtype='int32,float32')
index = ['x', 'y']
columns = ['a','b','c']

df = pd.DataFrame(data=values, index=index, columns=columns)
df.values.dtype

values2 = np.zeros((2,3))
df2 = pd.DataFrame(data=values2, index=index, columns=columns)
df2.values.dtype

Any suggestions on how to construct the DataFrame?

Recommended answer

Here are a few options you could choose from:

import numpy as np
import pandas as pd

index = ['x', 'y']
columns = ['a','b','c']

# Option 1: Set the column names in the structured array's dtype 
dtype = [('a','int32'), ('b','float32'), ('c','float32')]
values = np.zeros(2, dtype=dtype)
df = pd.DataFrame(values, index=index)

# Option 2: Alter the structured array's column names after it has been created
values = np.zeros(2, dtype='int32, float32, float32')
values.dtype.names = columns
df2 = pd.DataFrame(values, index=index, columns=columns)

# Option 3: Alter the DataFrame's column names after it has been created
values = np.zeros(2, dtype='int32, float32, float32')
df3 = pd.DataFrame(values, index=index)
df3.columns = columns

# Option 4: Use a dict of arrays, each of the right dtype:
df4 = pd.DataFrame(
    {'a': np.zeros(2, dtype='int32'),
     'b': np.zeros(2, dtype='float32'),
     'c': np.zeros(2, dtype='float32')}, index=index, columns=columns)

# Option 5: Concatenate DataFrames of the simple dtypes:
df5 = pd.concat([
    pd.DataFrame(np.zeros((2,), dtype='int32'), index=index, columns=['a']),
    pd.DataFrame(np.zeros((2,2), dtype='float32'), index=index, columns=['b','c'])], axis=1)

# Option 6: Alter the dtypes after the DataFrame has been formed. (This is not very efficient)
values2 = np.zeros((2, 3))
df6 = pd.DataFrame(values2, index=index, columns=columns)
for col, dtype in zip(df6.columns, 'int32 float32 float32'.split()):
    df6[col] = df6[col].astype(dtype)

Each of the options above produces the same result:

   a    b    c
x  0  0.0  0.0
y  0  0.0  0.0

with dtypes:

a      int32
b    float32
c    float32
dtype: object
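
As a compact variant of Option 6, the per-column loop can be replaced by a single astype call that takes a dict mapping column names to dtypes. This is only a sketch: the name df7 is introduced here for illustration and is not part of the original answer, and like Option 6 it first builds a float64 frame and then casts it.

```python
import numpy as np
import pandas as pd

index = ['x', 'y']
columns = ['a', 'b', 'c']

# Build a float64 frame of zeros, then cast every column in one call.
# df7 is a hypothetical name used only for this sketch.
df7 = pd.DataFrame(np.zeros((2, 3)), index=index, columns=columns).astype(
    {'a': 'int32', 'b': 'float32', 'c': 'float32'})

# Inspect the resulting per-column dtypes.
print(df7.dtypes)
```

Checking df7.dtypes this way is also a quick means of confirming that any of the options above produced the intended column types.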


Why pd.DataFrame(values, index=index, columns=columns) produces a DataFrame with NaNs:

When the dtype does not name its fields (as in np.zeros(2, dtype='int32, float32, float32')), values is a structured array whose fields get the default names f0, f1, f2:

In [171]:  values
Out[171]: 
array([(0, 0.0, 0.0), (0, 0.0, 0.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<f4')])

If you pass the argument columns=['a', 'b', 'c'] to pd.DataFrame, pandas looks for fields with those names in the structured array values. When those fields are not found, pandas fills the corresponding columns with NaN to represent missing values.
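
The behaviour described above can be reproduced with a short sketch. The variable names mismatched and matched are made up for this illustration; the example assumes a structured array with the default field names f0, f1, f2.

```python
import numpy as np
import pandas as pd

index = ['x', 'y']

# No field names given, so NumPy assigns the defaults f0, f1, f2.
values = np.zeros(2, dtype='int32, float32, float32')
print(values.dtype.names)

# None of 'a', 'b', 'c' is a field of `values`, so every column is missing.
mismatched = pd.DataFrame(values, index=index, columns=['a', 'b', 'c'])
print(mismatched)

# Names that do match existing fields select (and order) those fields.
matched = pd.DataFrame(values, index=index, columns=['f2', 'f0'])
print(matched.dtypes)
```

In other words, for a structured array the columns argument acts as a selection of existing fields and not as a renaming, which is why Options 1 to 3 put the desired names on the array or on the finished DataFrame.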
