Convert dataframe to a rec array (and objects to strings)

Problem description

I have a pandas dataframe with a mix of datatypes (dtypes) that I wish to convert to a numpy structured array (or record array, basically the same thing in this case). For purely numeric dataframes, this is easy to do with the to_records() method. I also need the dtypes of pandas columns to be converted to strings rather than objects so that I can use the numpy method tofile() which will output numbers and strings to a binary file, but will not output objects.

In a nutshell, I need to convert pandas columns with dtype=object to numpy structured arrays of string or unicode dtype.

Here's an example, with code that would be sufficient if all columns had a numerical (float or int) dtype.

import numpy as np
import pandas as pd

df=pd.DataFrame({'f_num': [1.,2.,3.], 'i_num':[1,2,3],
                 'char': ['a','bb','ccc'], 'mixed':['a','bb',1]})

struct_arr=df.to_records(index=False)

print('struct_arr',struct_arr.dtype,'\n')

# struct_arr (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'), 
#                            ('char', 'O'), ('mixed', 'O')]) 

But because I want to end up with string dtypes, I need to add this additional and somewhat involved code:

lst=[]
for col in struct_arr.dtype.names:  # this was the only iterator I 
                                    # could find for the column labels
    dt=struct_arr[col].dtype

    if dt == 'O':   # this is 'O', meaning 'object'

        # it appears an explicit string length is required
        # so I calculate with pandas len & max methods
        dt = 'U' + str( df[col].astype(str).str.len().max() )

    lst.append((col,dt))

struct_arr = struct_arr.astype(lst)

print('struct_arr',struct_arr.dtype)

# struct_arr (numpy.record, [('f_num', '<f8'), ('i_num', '<i8'), 
#                            ('char', '<U3'), ('mixed', '<U2')])

See also: How to change the dtype of certain columns of a numpy recarray?

This seems to work, as the character and mixed dtypes are now <U3 and <U2 rather than 'O' or 'object'. I'm just checking if there is a simpler or more elegant approach. But since pandas does not have a native string type as numpy does, maybe there is not?

Recommended answer

Combining suggestions from @jpp (list comp for conciseness) & @hpaulj (cannibalize to_records for speed), I came up with the following, which is cleaner code and also about 5x faster than my original code (tested by expanding the sample dataframe above to 10,000 rows):

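names = list(df.columns)
# to_numpy() (pandas >= 0.24) returns each column as a numpy array
arrays = [ df[col].to_numpy() for col in names ]

# keep numeric dtypes; give object columns a fixed-width unicode dtype
formats = [ array.dtype if array.dtype != 'O'
            else array.astype(str).dtype for array in arrays ]

rec_array = np.rec.fromarrays( arrays, dtype={'names': names, 'formats': formats} )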
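
As a quick check of the approach above, here is a minimal, self-contained sketch run against the sample dataframe from the question. The helper name `frame_to_records` is hypothetical (it is not a pandas or numpy function), and the sketch assumes a pandas version that provides `Series.to_numpy()`.

```python
import numpy as np
import pandas as pd


def frame_to_records(df):
    """Hypothetical helper: build a record array with unicode string fields."""
    names = list(df.columns)
    arrays = [df[col].to_numpy() for col in names]
    # Numeric columns keep their dtype; object columns get the fixed-width
    # unicode dtype that numpy infers after converting every value to str.
    formats = [a.astype(str).dtype if a.dtype == object else a.dtype
               for a in arrays]
    return np.rec.fromarrays(arrays, dtype={'names': names, 'formats': formats})


df = pd.DataFrame({'f_num': [1., 2., 3.], 'i_num': [1, 2, 3],
                   'char': ['a', 'bb', 'ccc'], 'mixed': ['a', 'bb', 1]})

rec = frame_to_records(df)
print(rec.dtype)
print(rec['mixed'])
```

Note that the integer in the `mixed` column ends up as the text `'1'`, so the original Python types of mixed columns are not preserved.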

The above will output unicode rather than byte strings, which is probably better in general, but in my case I need byte strings because I'm reading the binary file in Fortran and byte strings seem to read in more easily. Hence, it may be better to replace the "formats" line above with this:

formats = [ array.dtype if array.dtype != 'O' 
            else array.astype(str).dtype.str.replace('<U','S') for array in arrays ]

E.g. a dtype of <U4 becomes S4.
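
To illustrate the end goal (writing the records with tofile()), the sketch below round-trips the sample dataframe through a binary file. It is a variation on the code above, not the exact method: it casts the object columns to byte strings up front and lets np.rec.fromarrays infer the formats. It assumes the text is pure ASCII, and the file name `records.bin` is made up for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({'f_num': [1., 2., 3.], 'i_num': [1, 2, 3],
                   'char': ['a', 'bb', 'ccc'], 'mixed': ['a', 'bb', 1]})

names = list(df.columns)
arrays = [df[col].to_numpy() for col in names]
# Object columns: convert to str, then to fixed-width byte strings ('S').
# Assumption: ASCII-only text; other characters make this cast fail.
arrays = [a.astype(str).astype('S') if a.dtype == object else a
          for a in arrays]
rec_array = np.rec.fromarrays(arrays, names=names)

# Hypothetical output location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), 'records.bin')
rec_array.tofile(path)

# The file holds raw bytes with no metadata, so the reader must be
# given the same record layout that was used for writing.
restored = np.fromfile(path, dtype=rec_array.dtype)
print(rec_array.dtype)
print(rec_array.dtype.itemsize)  # bytes per record
print(restored['char'])
```

Any reader on the Fortran side would need to match the field order and widths reported by `rec_array.dtype`, along with its `itemsize`, so it is worth printing both before writing the reading code.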
