Finding cumulative features in a dataframe?

Question

I have a dataframe with around 200 features and 3000 rows. The samples are logged at different times, basically once per month, as shown by "col101" in the example below:

   0   col1 (id)   col2   col3   …   col100   col101 (date)   …   col2000 (target value)
   1   001         653    675    …   343.3    01-02-2017      …   1
   2   001         673    432    …   387.3    01-03-2017      …   0
   3   001         679    528    …   401.2    01-04-2017      …   1
   4   001         685    223    …   503.4    01-05-2017      …   1
   5   002         343    428    …   432.5    01-02-2017      …   0
   6   002         479    421    …   455.3    01-03-2017      …   0
   7   …           …      …      …   …        …               …   …

Some of these features are cumulative, so their values increase every month. For example, col2 and col100 are cumulative features in my dataframe. For each cumulative feature I want to add one more column holding the difference with respect to the previous month. My desired dataframe would look something like this:

   0   col1 (id)   col2   col2c   …   col100   col100c   col101 (date)   …   col2000 (target value)
   1   001         653    653     …   343.3    343.3     01-02-2017      …   1
   2   001         673    20      …   387.3    44        01-03-2017      …   0
   3   001         679    6       …   401.2    13.9      01-04-2017      …   1
   4   001         685    6       …   503.4    102.2     01-05-2017      …   1
   5   002         343    343     …   432.5    432.5     01-02-2017      …   0
   6   002         479    136     …   455.3    22.8      01-03-2017      …   0
   7   …           …      …       …   …        …         …               …   …

Now, I have two problems here: 1) how can I automatically recognize the cumulative features among the 200 features, and 2) how can I add the extra column (e.g., col2c and col100c) for each cumulative feature? Does anyone know how I can handle this?

Recommended answer

As for differencing the columns, you can use the pandas built-in diff() method. diff() calculates the difference between each element and the previous one. Note that because the first element has no previous element, the first element in the result of diff() is NaN. So we use the built-in dropna() method to drop all NaN values.
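The desired table in the question restarts the difference for each id and keeps the original value in the first row of each id. A minimal sketch of that, assuming the rows are already sorted by date within each id and using illustrative column names ("id", "col2") and toy values modelled on the question's sample:

```python
import pandas as pd

# Toy data modelled on the question; column names are illustrative.
df = pd.DataFrame({
    "id": ["001", "001", "001", "001", "002", "002"],
    "col2": [653.0, 673.0, 679.0, 685.0, 343.0, 479.0],
})

# Difference from the previous row within each id
# (assumes rows are sorted by date inside each id).
per_id_diff = df.groupby("id")["col2"].diff()

# The first row of each id has no predecessor, so diff() gives NaN there.
# Instead of dropping it, fill it with the original value, as in the
# desired table from the question.
df["col2c"] = per_id_diff.fillna(df["col2"])

print(df)
```

Grouping by id is an assumption drawn from the question's expected output; without it, the first row of id 002 would be differenced against the last row of id 001.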

But as for detecting the cumulative columns, I don't think there is a reliable way. You CAN find all the columns that are always increasing (monotonic), but that doesn't necessarily mean they are cumulative.

Anyway, to detect the monotonic columns, you can first take their diff().dropna() and then check whether all of the resulting values are positive:

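df = some_data_frame  # placeholder: your DataFrame
col_diff = df['some_column'].diff().dropna()  # row-to-row changes, leading NaN removed
is_monotonic = (col_diff > 0).all()  # True only if every change is positive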

Note that if you forget the dropna(), the check would always be False: the first element of diff() is NaN, and NaN > 0 evaluates to False (any ordering comparison with NaN is False).
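The check above can be applied to many columns at once. The following is only a sketch under stated assumptions: the helper name find_increasing_columns, the column names, and the toy values are hypothetical, and the differences are taken within each id because the question's data holds several ids. As the answer says, a column that passes is monotonic, not necessarily cumulative.

```python
import pandas as pd

def find_increasing_columns(df, id_col, candidates):
    """Return the candidate columns that strictly increase within every id.

    Hypothetical helper: a passing column is monotonic, which does not
    prove that it is cumulative.
    """
    increasing = []
    for col in candidates:
        # diff() within each id; the first row of each id is NaN and is dropped.
        steps = df.groupby(id_col)[col].diff().dropna()
        # Use (steps >= 0) instead if flat months should also be accepted.
        if (steps > 0).all():
            increasing.append(col)
    return increasing

# Toy data modelled on the question; names and values are illustrative.
df = pd.DataFrame({
    "id": ["001", "001", "001", "002", "002"],
    "col2": [653.0, 673.0, 679.0, 343.0, 479.0],  # increases within each id
    "col3": [675.0, 432.0, 528.0, 428.0, 421.0],  # goes up and down
})

cumulative_like = find_increasing_columns(df, "id", ["col2", "col3"])
print(cumulative_like)

# Add a "<name>c" difference column for each detected column, keeping the
# original value in the first row of each id.
for col in cumulative_like:
    df[col + "c"] = df.groupby("id")[col].diff().fillna(df[col])
```

With 200 features, the candidates list could be restricted to the numeric columns; reviewing the detected names by hand is still advisable, since a steadily growing but non-cumulative measurement would pass the same test.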
