滚动总和在不同大小的窗口与更改组R [英] Rolling sums over different size windows with changing groups R

查看:146
本文介绍了滚动总和在不同大小的窗口与更改组R的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经阅读了关于本网站上所有Q& A的总结,但是我无法理解大部分复杂的代码,所以我的调整技巧是有限的。
我尝试实施一些提出的解决方案,这里 here 这里等等,但是我收到错误或我的电脑崩溃,即使我只使用1,000行和3列。很清楚,我搞砸了代码。



我的数据看起来像这样(前50行通过dput)。总数据集约10万行

 结构(list(pnum = c(4778744,4778744,4778744 4832724,
4840655,4854957,4952026,4832724,4832724,4840655,
4952026,4854957,4952026 5062877,5062877,
4979975,4979975,4979975,5093287,5148510,5093287,
5148510,5093287 5148510,5093287,5148510,5093287,
5148510,5093287,5148510,5093287,5148510,5212120,
5375012 5168079,5375012,5212120,5212120,5168079,
4811345,4851990,4947366,5142672,5317715,4878166
4851990,5142672,5317715,4878166,5142672,5317715,
4878166,5142672,5317715,4878166 ,bb $ b4878166,5185878,4926323,4926323,4926323,4926323,
5185878,4926323,4926323 4926323,4926323,4926323,
4926323,5129067,5136697,5210841,5237700,5237 700,
5237700,5247644,5805912,5828869,5357626,5247644,
5805912,5828869,5357626 c(03859643-1,04488864-4,
04560399-1,03859643-1,03859643-1,03859643-1,03859643-1,
03901719-2,04086089-2,04086089-2,04407934-2,04488864-4,
04952026-3,03859643-1,03859643 -1,03901719-2,03912481-3,
03940277-1,04979975-2,03859643-1,03859643-1,03864113-1
03864113-1,04877300-1,04877300-1,04877300-3,04877300-3,
05040862-3,05040862-3 05093287-4,05093287-4,05093287-6,
05093287-6,03859643-1,03859643-1,03859643-1,03870399-2 ,
03901719-2,03923529-1,04784976-1,03860454-2,03860454-2,
03860454-2,03860454-2 03860454-2,03860454-2,04761567-2,
04870622-2,04870622-2,04870622-2,04878166-2,04878166 -2,
04878166-2,04878166-3,04878166-3,04878166-3,04878166-5,
0487 8166-5,04878166-5,03860454-2,03860454-2,04610004-1,
04734852-2,04734852-3,04761567-2 ,04761567-2,04777587-1,
04835414-1,04878166-2,04926323-10,04926323-5,03860454-2,
03860454-2,03860454-2,03860454-2,05237700-2,05237700-3,
03860454-2,03860454-2,03860454-2 2,03860454-2,04731737-1,
04731737-1,04731737-1,04731737-1),时间= c(1986L,1986L,
1986L,1988L,1988L,1988L,1988L,1988L,1988L,1988L,1988L,
1988L,1988L,1989L,1989L,1989L,1989L,1989L,1989L,1990L,
1990L,1990L, 1990L,1990L,1990L,1990L,1990L,1990L,
1990L,1990L,1990L,1990L,1991L,1991L,1991L,1991L,1991L,
1991L,1991L,1986L,1987L,1987L, 1987L,1987L,1987L,
1987L,1987L,1987L,1987L,1987L,1987L,1987L,1987L,1987L,
1987L,1987L,1987L,1988L,1988L,1988L,1988L,1988L,
1988L,1988L,1988L,1988L,1988L,1988L,1989L,1989L,1990L,
1990,1990L,1990L,1991L,1991L,1991L,1991L,1991L,1991L,
1991L,1991L)).Names = c(pnum,发明者,pryear),row.names = c(1L,
2L,3L,4L,5L,6L,7L,8L,9L,10L,11L,12L,13L,14L,15L,
16L,17L,18L,19L,20L, 21L,22L,23L,24L,25L,26L,27L,28L,
29L,30L,31L,32L,33L,34L,35L,36L,37L,38L,39L,40L,325L,
326L,327L,328L,329L,330L,331L,332L,333L,334L,335L,336L,
337L,338L,339L,340L,341L,342L,343L,344L,345L,346L,347L,
348L,349L,350L,351L,352L,353L,354L,355L,356L,357L,358L,
359L,360L,361L,362L,363L,364L,365L,366L,367L,368L,369L
),class =data.frame)

多个发明家在特定年份 pryear 中的项目 pnum 之间进行协作。我正在寻找三件事:



在@Thierry的评论之后,我更改了数据样本,以确保他指出的问题得到处理。


  1. 每个发明人在当前 pryear 因此,如果当前项目的一年是1977年,我想要从1974年到1976年进行的项目数量包括在内。如果以前没有发生任何事情,理想情况下,结果将为0。 @Alex 此处提供的答案可用于实现此第一目标。但正如评论中讨论的那样,效率并不高(尤其是我的时间范围是从1952年到2010年,超过50,000个发明者)。

  2. 每位发明家的不同发明家的总数在同一时间窗口工作

  3. 如果一个项目有多个发明者,我正在寻找每个发明人与其他发明家合作的次数,这些发明家在相同的过去时间窗口


解决方案

这是一个第一个问题的解决方案。您可以解决其他操作。



第一个解决方案仅使用 dplyr 。您可能遇到大型数据集的问题。

  library(dplyr)
df%>%
inner_join(
df%> ;%
select(发明人,oldyear = pryear),
by =发明人)%>%
过滤器(pryear - 3< = oldyear,oldyear< pryear)%> ;%
group_by(发明人,pryear)%>%
总结(projects = n())

第二个解决方案使用数据库后端使用 dplyr 。这应该能够应对更大的数据集。请注意,代码非常相似。

 库(RSQLite)
库(dplyr)
conn< - dbConnect(SQLite() test)
dbWriteTable(conn,project,df)
src< - src_sqlite(test)
tbl(src,project)%>%
inner_join(
tbl(src,project)%>%
select(发明人,oldyear = pryear),
by =发明人)%>%
过滤器(pryear - 3 <= oldyear,oldyear< pryear)%>%
group_by(发明人,pryear)%>%
总汇(projects = n())%>%
collect()


I have read about all Q&A on rolling sums on this website but I can't make sense of most of the complex code so my tweaking skills are limited. I tried implementing a few solutions proposed, here, here, and here amongst others but either I get errors or my computer crashes, even when I only use 1,000 rows and 3 columns. Hence clearly, I mess up the code.

My data looks like this (first 50 rows via dput). Total dataset is about 100,000 rows

           structure(list(pnum = c("4778744", "4778744", "4778744", "4832724", 
"4840655", "4854957", "4952026", "4832724", "4832724", "4840655", 
"4952026", "4854957", "4952026", "4979975", "5062877", "5062877", 
"4979975", "4979975", "4979975", "5093287", "5148510", "5093287", 
"5148510", "5093287", "5148510", "5093287", "5148510", "5093287", 
"5148510", "5093287", "5148510", "5093287", "5148510", "5212120", 
"5375012", "5168079", "5375012", "5212120", "5212120", "5168079", 
"4811345", "4851990", "4947366", "5142672", "5317715", "4878166", 
"4851990", "5142672", "5317715", "4878166", "5142672", "5317715", 
"4878166", "5142672", "5317715", "4878166", "5142672", "5317715", 
"4878166", "5185878", "4926323", "4926323", "4926323", "4926323", 
"5185878", "4926323", "4926323", "4926323", "4926323", "4926323", 
"4926323", "5129067", "5136697", "5210841", "5237700", "5237700", 
"5237700", "5247644", "5805912", "5828869", "5357626", "5247644", 
"5805912", "5828869", "5357626"), ID = c("03859643-1", "04488864-4", 
"04560399-1", "03859643-1", "03859643-1", "03859643-1", "03859643-1", 
"03901719-2", "04086089-2", "04086089-2", "04407934-2", "04488864-4", 
"04952026-3", "03859643-1", "03859643-1", "03901719-2", "03912481-3", 
"03940277-1", "04979975-2", "03859643-1", "03859643-1", "03864113-1", 
"03864113-1", "04877300-1", "04877300-1", "04877300-3", "04877300-3", 
"05040862-3", "05040862-3", "05093287-4", "05093287-4", "05093287-6", 
"05093287-6", "03859643-1", "03859643-1", "03859643-1", "03870399-2", 
"03901719-2", "03923529-1", "04784976-1", "03860454-2", "03860454-2", 
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04761567-2", 
"04870622-2", "04870622-2", "04870622-2", "04878166-2", "04878166-2", 
"04878166-2", "04878166-3", "04878166-3", "04878166-3", "04878166-5", 
"04878166-5", "04878166-5", "03860454-2", "03860454-2", "04610004-1", 
"04734852-2", "04734852-3", "04761567-2", "04761567-2", "04777587-1", 
"04835414-1", "04878166-2", "04926323-10", "04926323-5", "03860454-2", 
"03860454-2", "03860454-2", "03860454-2", "05237700-2", "05237700-3", 
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04731737-1", 
"04731737-1", "04731737-1", "04731737-1"), Time = c(1986L, 1986L, 
1986L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 
1988L, 1988L, 1989L, 1989L, 1989L, 1989L, 1989L, 1989L, 1990L, 
1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 
1990L, 1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L, 
1991L, 1991L, 1986L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 
1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 
1987L, 1987L, 1987L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 
1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1989L, 1989L, 1990L, 
1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L, 1991L, 
1991L, 1991L)), .Names = c("pnum", "inventor", "pryear"), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 325L, 
326L, 327L, 328L, 329L, 330L, 331L, 332L, 333L, 334L, 335L, 336L, 
337L, 338L, 339L, 340L, 341L, 342L, 343L, 344L, 345L, 346L, 347L, 
348L, 349L, 350L, 351L, 352L, 353L, 354L, 355L, 356L, 357L, 358L, 
359L, 360L, 361L, 362L, 363L, 364L, 365L, 366L, 367L, 368L, 369L
), class = "data.frame")

Multiple inventors collaborate on a project pnum in a specific year called pryear. I am looking for three things:

After comments from @Thierry I changed the data sample to ensure that the problem he pointed out was dealt with.

  1. The number of projects conducted by each individual inventors in an x (say 3) year window before the current pryear, thus if year of current project is 1977, I want the number of projects conducted from 1974 until 1976 included. If there are no occurrences before, ideally the result would be '0'. the answer provided by @Alex here can be used to achieve this first goal. But as discussed in the comments, it is not highly efficient (especially as my time range is from 1952 to 2010 with over 50,000 inventors).
  2. The total number of different inventors with whom each inventor has worked in that same time window
  3. If a project has multiple inventors, I am looking for the number of times each inventor has collaborated with the other inventors who are working on the current project during the same past time window

解决方案

Here is a solution for you first question. You can solve the other ones as an exercise.

The first solutions uses only dplyr. You will probably run into problems with large datasets.

library(dplyr)
df %>% 
  inner_join(
    df %>% 
      select(inventor, oldyear = pryear), 
    by = "inventor") %>% 
  filter(pryear - 3 <= oldyear, oldyear < pryear) %>% 
  group_by(inventor, pryear) %>% 
  summarise(projects = n())

The second solutions use dplyr with a database back-end. That should be able to cope with larger datasets. Note that the code is very similar.

library(RSQLite)
library(dplyr)
conn <- dbConnect(SQLite(), "test")
dbWriteTable(conn, "project", df)
src <- src_sqlite("test")
tbl(src, "project") %>% 
  inner_join(
    tbl(src, "project") %>% 
      select(inventor, oldyear = pryear), 
    by = "inventor") %>% 
  filter(pryear - 3 <= oldyear, oldyear < pryear) %>% 
  group_by(inventor, pryear) %>% 
  summarise(projects = n()) %>% 
  collect()

这篇关于滚动总和在不同大小的窗口与更改组R的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆