Split a dataframe into correspondingly named arrays or series (then recombine)

Question

Let's say I have a dataframe with columns x and y. I'd like to automatically split it into arrays (or series) that have the same names as the columns, process the data, and then later rejoin them. It's pretty straightforward to do this manually:

x, y = df.x, df.y
z = x + y   # in actual use case, there are hundreds of lines like this
df = pd.concat([x,y,z],axis=1)

But I'd like to automate this. It's easy to get a list of strings with df.columns, but I really want [x,y] rather than ['x','y']. The best I can do so far is to work around that with exec:

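import numpy as np
from pandas import DataFrame

df_orig = DataFrame({ 'x':range(1000), 'y':range(1000,2000),  'z':np.zeros(1000) })

def method1( df ):

   # split: bind each column's array to a local named after the column
   # (relies on Python 2 exec behavior; under Python 3, exec() does not create function locals)
   for col in df.columns:
      exec( col + ' = df.' + col + '.values')

   z = x + y   # in actual use case, there are hundreds of lines like this

   # recombine: write each local back to its column
   for col in df.columns:
      exec( 'df.' + col + '=' + col )

df = df_orig.copy()
method1( df )         # df appears to be view of global df, no need to return it
df1 = df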

So, two questions:

1) Using exec like this is generally not a good idea (and has already caused me a problem when I tried to combine this with numba) --or is that bad? It seems to work fine for series and arrays.

2) I'm not sure the best way to take advantage of views here. Ideally all that I really want to do here is use x as a view of df.x. I assume that is not possible where x is an array but maybe it is if x is a series?

The example above is for arrays, but ideally I'm looking for a solution that also applies to series. In lieu of that, solutions that work with one or the other are welcome of course.
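One exec-free way to get bare names is to let function parameters do the naming. The sketch below is not from the original post: `process` is a hypothetical function whose parameter names are assumed to match the column names, and the columns are passed in by keyword unpacking. Whether the arrays are views or copies of the dataframe's data depends on the pandas version and dtypes, so the result is rebuilt explicitly rather than relying on in-place mutation.

```python
import numpy as np
import pandas as pd


def process(x, y, z):
    # Hypothetical processing function: parameter names mirror the column
    # names, so the body can use bare x, y, z without exec.
    z = x + y
    return {"x": x, "y": y, "z": z}


df = pd.DataFrame({"x": np.arange(5), "y": np.arange(5, 10), "z": np.zeros(5)})

# Split: one NumPy array per column, keyed by column name.
arrays = {col: df[col].to_numpy() for col in df.columns}

# Process with named locals, then recombine on the original index.
result = pd.DataFrame(process(**arrays), index=df.index)
```

To work with series instead of arrays, the dictionary can hold `df[col]` rather than `df[col].to_numpy()`; the rest stays the same.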

Motivation:

1) Readability, which can partially be achieved with eval, but I don't believe eval can be used over multiple lines?

2) With multiple lines like z=x+y, this method is a little faster with series (2x or 3x in examples I've tried) and even faster with arrays (over 10x). See here: Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba
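On the multi-line point in motivation 1: in reasonably recent pandas versions, `DataFrame.eval` does accept several assignment statements in one string, one per line, and later lines can use columns assigned by earlier ones. A minimal sketch follows; the column `w` is an illustrative addition, not from the post. This only addresses readability and makes no claim about speed relative to plain arrays.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(5), "y": np.arange(5, 10)})

# One assignment per line; w is a hypothetical extra column that uses z
# from the previous line.
expr = "\n".join([
    "z = x + y",
    "w = z * 2",
])

# With the default inplace=False, eval returns a new DataFrame and leaves
# df unchanged.
out = df.eval(expr)
```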

Recommended answer

This doesn't do exactly what you want, but it's another path to think about.

There's a gist here that defines a context manager that allows you to reference columns as if they were locals. I didn't write this, and it's a little old, but still seems to work with the current version of pandas.

In [45]: df = pd.DataFrame({'x': np.random.randn(100000), 'y': np.random.randn(100000)})

In [46]: with DataFrameContextManager(df):
    ...:     z = x + y
    ...:     

In [47]: z.head()
Out[47]: 
0   -0.821079
1    0.035018
2    1.180576
3   -0.155916
4   -2.253515
dtype: float64
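The gist's code is not reproduced here, so the following is only a rough sketch of how a context manager of this kind might work, not the gist's actual implementation. `columns_as_names` is a hypothetical helper that temporarily binds each column to its name in a namespace dictionary and restores the previous bindings on exit. It assumes column names are valid Python identifiers and that the caller passes `globals()` so that bare names resolve.

```python
from contextlib import contextmanager

import numpy as np
import pandas as pd


@contextmanager
def columns_as_names(df, namespace):
    """Temporarily bind each column of df to its name in namespace.

    Hypothetical helper for illustration; not the gist's code.
    """
    missing = object()  # sentinel for names that did not exist before
    saved = {col: namespace.get(col, missing) for col in df.columns}
    namespace.update({col: df[col] for col in df.columns})
    try:
        yield df
    finally:
        # Restore whatever the names pointed to before entering the block.
        for col, old in saved.items():
            if old is missing:
                namespace.pop(col, None)
            else:
                namespace[col] = old


df = pd.DataFrame({"x": np.arange(5), "y": np.arange(5, 10)})

with columns_as_names(df, globals()):
    z = x + y  # x and y resolve to df["x"] and df["y"] inside the block
```

Names bound this way are series, so `z` here is a series aligned on the dataframe's index. Because the helper writes into module globals, it can shadow existing variables for the duration of the block, which is the main trade-off against passing columns explicitly.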
