Add column to dataframe depending on specific row values (2)


Problem description




I have to adapt code that works perfectly on a different data.frame with similar conditions.

Here is an example of my data.frame:

df <- read.table(text = 'ID    Day Count
    33012   9526    4
    35004   9526    4
    37006   9526    4
    37008   9526    4
    21009   1913    3
    24005   1913    3
    25009   1913    3
    22317   2286    2
    37612   2286    2
    25009   14329   1
    48007   9527    0
    88662   9528    0
    1845    9528    0
    8872    2287    0
    49002   1914    0
    1664    1915    0', header = TRUE)

I need to add a new column (new_col) to my data.frame containing values from 1 to 4. Each new_col value must cover day x, day x+1 and day x+2, where x = 9526, 1913, 2286, 14329 (column Day).

My output should be the following:

   ID    Day Count  new_col
33012   9526    4     1
35004   9526    4     1
37006   9526    4     1
37008   9526    4     1
21009   1913    3     2
24005   1913    3     2
25009   1913    3     2
22317   2286    2     3
37612   2286    2     3
25009   14329   1     4
48007   9527    0     1
88662   9528    0     1
1845    9528    0     1
8872    2287    0     3
49002   1914    0     2
1664    1915    0     2

The data.frame ordered by new_col would then be:

   ID    Day Count  new_col
33012   9526    4     1
35004   9526    4     1
37006   9526    4     1
37008   9526    4     1
48007   9527    0     1
88662   9528    0     1
1845    9528    0     1
21009   1913    3     2
24005   1913    3     2
25009   1913    3     2
49002   1914    0     2
1664    1915    0     2
22317   2286    2     3
37612   2286    2     3
8872    2287    0     3
25009   14329   1     4

My real data.frame is more complex than the example (i.e. more columns and more values in the Count column).

The code that @mrbrick suggested to me in my previous question (Add column to dataframe depending on specific row values) is the following:

x <- c(1913, 2286, 9526, 14329) 
df$new_col <- cut(df$Day, c(-Inf, x, Inf))
df$new_col <- as.numeric(factor(df$new_col, levels=unique(df$new_col)))

But it works only for day x, day x-1 and day x-2.

Any suggestions would be really helpful.

Solution

Assuming that the Day values in the different sequential groups are such that dropping the last two digits of Day identifies each group, convert what is left to a factor with sequence numbers as labels. No packages are used.

 g <- df$Day %/% 100
 u <- unique(g)
 transform(df, new_col = factor(g, levels = u, labels = seq_along(u)))

giving:

      ID   Day Count new_col
1  33012  9526     4       1
2  35004  9526     4       1
3  37006  9526     4       1
4  37008  9526     4       1
5  21009  1913     3       2
6  24005  1913     3       2
7  25009  1913     3       2
8  22317  2286     2       3
9  37612  2286     2       3
10 25009 14329     1       4
11 48007  9527     0       1
12 88662  9528     0       1
13  1845  9528     0       1
14  8872  2287     0       3
15 49002  1914     0       2
16  1664  1915     0       2
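
As a quick check of the result above, the following sketch rebuilds the example data and compares the labels against the output requested in the question. It also uses `match(g, u)`, which is not part of the answer above but an assumed shorthand: it yields the same first-appearance numbering as a plain integer vector rather than a factor. The names `res`, `new_col_int` and `expected` are illustrative only.

```r
# Example data from the question
df <- read.table(text = 'ID Day Count
33012 9526 4
35004 9526 4
37006 9526 4
37008 9526 4
21009 1913 3
24005 1913 3
25009 1913 3
22317 2286 2
37612 2286 2
25009 14329 1
48007 9527 0
88662 9528 0
1845 9528 0
8872 2287 0
49002 1914 0
1664 1915 0', header = TRUE)

g <- df$Day %/% 100   # drop the last two digits: 9526 -> 95, 1913 -> 19, ...
u <- unique(g)        # group keys in order of first appearance
res <- transform(df, new_col = factor(g, levels = u, labels = seq_along(u)))

# new_col above is a factor; match() gives the same numbering as integers
new_col_int <- match(g, u)

# Labels requested in the question, row by row
expected <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 1, 1, 1, 3, 2, 2)
stopifnot(all(new_col_int == expected))
stopifnot(all(as.integer(as.character(res$new_col)) == expected))
```

To get the ordered view shown in the question, `res[order(res$new_col), ]` can be used.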

Another possibility is to replace the g <- ... line with one of the following:

(a) Known number of groups. Use kmeans with the appropriate number of clusters:

g <- kmeans(df$Day, 4)$cluster

(b) Known starting days. Set the centers manually and use them to initialize kmeans:

centers <- c(1913, 2286, 9526, 14329) + 1
g <- kmeans(df$Day, centers)$cluster

(c) Check x-1 and x-2. Alternatively, derive the centers from the data. If for a day x neither x-1 nor x-2 is present, then x must be the first day in its sequence, so we pick out such values and add 1 to get the centers. Unlike (a), which requires that we know the number of clusters, and (b), which requires that we know the actual sequences, this one requires neither.

centers <- with(df, unique(Day[ ! ((Day-1) %in% Day) & ! ((Day-2) %in% Day) ]) + 1)
g <- kmeans(df$Day, centers)$cluster
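
For the example data, the centers that (c) derives can be inspected directly. The sketch below assumes only the Day column from the question and reuses the relabeling idea from the first solution, so the final numbering follows order of first appearance regardless of how kmeans numbers its clusters. The name `new_col` for the standalone vector is illustrative.

```r
# Day column from the question's example
df <- data.frame(Day = c(9526, 9526, 9526, 9526, 1913, 1913, 1913, 2286, 2286,
                         14329, 9527, 9528, 9528, 2287, 1914, 1915))

# A day x starts a sequence if neither x-1 nor x-2 occurs in the data;
# adding 1 puts the initial center in the middle of x, x+1, x+2
centers <- with(df, unique(Day[!((Day - 1) %in% Day) & !((Day - 2) %in% Day)]) + 1)
print(centers)

# Cluster around the derived centers, then renumber by first appearance
g <- kmeans(df$Day, centers)$cluster
new_col <- match(g, unique(g))
print(new_col)
```

Because the starting days are picked up in the order they occur in Day, the derived centers are already in first-appearance order here.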

(d) Simplification of (c). If we are guaranteed that whenever x is the first day in a sequence, x, x+1 and x+2 all appear, then x is the first in its sequence exactly when x-1 is absent, so (c) simplifies to:

# assumes x, x+1, x+2 all appear for each sequence
centers <- with(df, unique(Day[ ! (Day-1) %in% Day ]) + 1)
g <- kmeans(df$Day, centers)$cluster

The kmeans solutions should work if the groups are sufficiently separated, and based on the data shown in the question it seems that they are.
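
One way to check the separation assumption is to compare the gaps between the sorted distinct Day values. This is a rough diagnostic and not part of the solutions above. It assumes, as in the question, that days within one sequence never differ by more than 2; `run_id`, `within` and `between` are hypothetical names.

```r
# Day column from the question's example
df <- data.frame(Day = c(9526, 9526, 9526, 9526, 1913, 1913, 1913, 2286, 2286,
                         14329, 9527, 9528, 9528, 2287, 1914, 1915))

days <- sort(unique(df$Day))   # distinct days in increasing order
gaps <- diff(days)             # distance from each day to the next one

# A gap larger than 2 starts a new sequence (assumption: x, x+1, x+2 only)
run_id <- cumsum(c(TRUE, gaps > 2))

within  <- max(gaps[gaps <= 2])   # largest gap inside a sequence
between <- min(gaps[gaps > 2])    # smallest gap between two sequences

print(data.frame(day = days, run = run_id))
print(c(within = within, between = between))
```

If `between` is much larger than `within`, as it is for this data, the sequences are far apart relative to their width, which is the situation in which kmeans with sensible starting centers is expected to recover them.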
