Is there any possibility to convert all columns in rows of pandas/numpy to arrays of bytes?


Question

I need to convert each row of a pandas/NumPy array into one new column, and I need the fastest method. I tried to find a way to extract a full row as a byte array, but could not find any option that avoids iterating over all columns, converting each column value to bytes and concatenating.

In the function row_to_bytes I use the hashlib library and the md5 function, but I don't need cryptography. Should I implement it in C/C++, or is there some library I can use?

For now, this is the best method I have, but it is very slow (my table has 5 million records and 40 attributes).

hashed = df.apply(lambda row: self.row_to_bytes(row), axis=1)
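
The exact body of row_to_bytes is not shown here; a minimal sketch of what such a helper might look like, based on the description above (iterate the columns, convert each value to bytes, concatenate and hash with md5), purely as an assumed reconstruction:

import hashlib

def row_to_bytes(row):
    # Assumed reconstruction: join the byte form of every column value
    # and hash the result with md5 (cryptographic strength is not needed;
    # md5 is only used as a cheap fixed-width digest).
    joined = b"".join(str(value).encode() for value in row)
    return hashlib.md5(joined).digest()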

Thanks for every suggestion.

I created this test code:

import pandas as pd
import numpy as np
df = pd.DataFrame([["1",1],["2",2]])
x = df.values

def compute(x):
    dtype = np.dtype('S{:d}'.format(x.shape[1] * x.dtype.itemsize))
    y = np.frombuffer(x.tobytes(), dtype=dtype)
    print(y)
compute(x)

When I run the code from the command line several times, I get different results:

python test.py
[b'\xb0\x8a\xbb\x8c\xf3\x01\x00\x000\x80og'
 b'p%\xc1\x8c\xf3\x01\x00\x00P\x80og'] 

python test.py     
[b'\xb0\x8aCr,\x02\x00\x000\x80og' b'p%^r,\x02\x00\x00P\x80og'] 

python test.py
[b'\xb0\x8a"\xb7\xc9\x01\x00\x000\x80og' b'p%=\xb7\xc9\x01\x00\x00P\x80og'] 

What could be causing these different results?

Answer

No need to loop. Since you want the bytes of each row, and arrays are row-major, the bytes as they are laid out in memory are exactly the bytes you want in each element of your result, just chunked differently. This is, by definition, a reshape of the resulting array. You can do:

>>> x = np.arange(1000 * 2).reshape(1000, 2)
>>> dtype = np.dtype('S{:d}'.format(x.shape[1] * x.dtype.itemsize))
>>> y = np.frombuffer(x.tobytes(), dtype=dtype)
>>> print(y[:5])
[b'\x00\x00\x00\x00\x00\x00\x00\x00\x01'
 b'\x02\x00\x00\x00\x00\x00\x00\x00\x03'
 b'\x04\x00\x00\x00\x00\x00\x00\x00\x05'
 b'\x06\x00\x00\x00\x00\x00\x00\x00\x07'
 b'\x08\x00\x00\x00\x00\x00\x00\x00\t']

This reinterprets the entire underlying buffer as bytestrings. Each such bytestring (the dtype) has a length equal to the number of bytes in each row.
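
To apply this to the original mixed-type DataFrame, note that df.values in the test code above has object dtype, and tobytes() on an object array serializes the PyObject pointers rather than the values, which is why the script printed different bytes on every run. One way around this, sketched under the assumption that all columns can be cast to a common fixed-width dtype ('S8' below is an illustrative choice, not from the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame([["1", 1], ["2", 2]])

# Cast everything to one fixed-width dtype first; only then does tobytes()
# contain the actual values instead of object pointers.
x = df.to_numpy(dtype='S8')

row_bytes = np.frombuffer(
    x.tobytes(),
    dtype='S{:d}'.format(x.shape[1] * x.dtype.itemsize),
)
print(row_bytes)  # one bytestring per row, stable across runs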

There are many other, loop-based ways to do this; one would be using np.fromiter. My first solution is orders of magnitude faster, however, as IPython's timeit magic shows:

In [32]: %timeit np.frombuffer(x.tobytes(), dtype='S16')
2.8 µs ± 318 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [33]: %timeit np.fromiter((row.tobytes() for row in x), dtype='S16')
614 µs ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
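
As a side note on the trade-off: tobytes() copies the whole buffer before frombuffer reinterprets it. For a C-contiguous array whose row width in bytes matches the string dtype, a plain view should give the same per-row bytestrings without that copy; a small sketch (not part of the original answer):

import numpy as np

x = np.arange(1000 * 2, dtype=np.int64).reshape(1000, 2)  # 16 bytes per row
y = x.view('S16')  # reinterpret in place; shape becomes (1000, 1), no copy
print(y[:3].ravel())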
