在不同大小的窗口上滚动总和与更改组R [英] Rolling sums over different size windows with changing groups R

查看:151
本文介绍了在不同大小的窗口上滚动总和与更改组R的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经在这个网站上阅读了所有的Q& A,但我不能理解大部分复杂的代码,所以我的调整技巧有限。
我尝试实现一些建议的解决方案,这里这里这里,但是我收到错误或我的电脑崩溃,即使我只使用1000行和3列。因此,显然,我搞砸了代码。

I have read about all Q&A on rolling sums on this website but I can't make sense of most of the complex code so my tweaking skills are limited. I tried implementing a few solutions proposed, here, here, and here amongst others but either I get errors or my computer crashes, even when I only use 1,000 rows and 3 columns. Hence clearly, I mess up the code.

我的数据看起来像这样(前50行通过dput)。总数据集约为100,000行

My data looks like this (first 50 rows via dput). Total dataset is about 100,000 rows

           structure(list(pnum = c("4778744", "4778744", "4778744", "4832724", 
"4840655", "4854957", "4952026", "4832724", "4832724", "4840655", 
"4952026", "4854957", "4952026", "4979975", "5062877", "5062877", 
"4979975", "4979975", "4979975", "5093287", "5148510", "5093287", 
"5148510", "5093287", "5148510", "5093287", "5148510", "5093287", 
"5148510", "5093287", "5148510", "5093287", "5148510", "5212120", 
"5375012", "5168079", "5375012", "5212120", "5212120", "5168079", 
"4811345", "4851990", "4947366", "5142672", "5317715", "4878166", 
"4851990", "5142672", "5317715", "4878166", "5142672", "5317715", 
"4878166", "5142672", "5317715", "4878166", "5142672", "5317715", 
"4878166", "5185878", "4926323", "4926323", "4926323", "4926323", 
"5185878", "4926323", "4926323", "4926323", "4926323", "4926323", 
"4926323", "5129067", "5136697", "5210841", "5237700", "5237700", 
"5237700", "5247644", "5805912", "5828869", "5357626", "5247644", 
"5805912", "5828869", "5357626"), ID = c("03859643-1", "04488864-4", 
"04560399-1", "03859643-1", "03859643-1", "03859643-1", "03859643-1", 
"03901719-2", "04086089-2", "04086089-2", "04407934-2", "04488864-4", 
"04952026-3", "03859643-1", "03859643-1", "03901719-2", "03912481-3", 
"03940277-1", "04979975-2", "03859643-1", "03859643-1", "03864113-1", 
"03864113-1", "04877300-1", "04877300-1", "04877300-3", "04877300-3", 
"05040862-3", "05040862-3", "05093287-4", "05093287-4", "05093287-6", 
"05093287-6", "03859643-1", "03859643-1", "03859643-1", "03870399-2", 
"03901719-2", "03923529-1", "04784976-1", "03860454-2", "03860454-2", 
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04761567-2", 
"04870622-2", "04870622-2", "04870622-2", "04878166-2", "04878166-2", 
"04878166-2", "04878166-3", "04878166-3", "04878166-3", "04878166-5", 
"04878166-5", "04878166-5", "03860454-2", "03860454-2", "04610004-1", 
"04734852-2", "04734852-3", "04761567-2", "04761567-2", "04777587-1", 
"04835414-1", "04878166-2", "04926323-10", "04926323-5", "03860454-2", 
"03860454-2", "03860454-2", "03860454-2", "05237700-2", "05237700-3", 
"03860454-2", "03860454-2", "03860454-2", "03860454-2", "04731737-1", 
"04731737-1", "04731737-1", "04731737-1"), Time = c(1986L, 1986L, 
1986L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 
1988L, 1988L, 1989L, 1989L, 1989L, 1989L, 1989L, 1989L, 1990L, 
1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 1990L, 
1990L, 1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L, 
1991L, 1991L, 1986L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 
1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 1987L, 
1987L, 1987L, 1987L, 1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 
1988L, 1988L, 1988L, 1988L, 1988L, 1988L, 1989L, 1989L, 1990L, 
1990L, 1990L, 1990L, 1991L, 1991L, 1991L, 1991L, 1991L, 1991L, 
1991L, 1991L)), .Names = c("pnum", "inventor", "pryear"), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 325L, 
326L, 327L, 328L, 329L, 330L, 331L, 332L, 333L, 334L, 335L, 336L, 
337L, 338L, 339L, 340L, 341L, 342L, 343L, 344L, 345L, 346L, 347L, 
348L, 349L, 350L, 351L, 352L, 353L, 354L, 355L, 356L, 357L, 358L, 
359L, 360L, 361L, 362L, 363L, 364L, 365L, 366L, 367L, 368L, 369L
), class = "data.frame")

code> publisher 在特定年份 pryear 上合作项目 pnum 。我正在寻找三件事:

Multiple inventors collaborate on a project pnum in a specific year called pryear. I am looking for three things:

来自@Thierry的意见后,我更改了数据样本,以确保他指出的问题得到了解决。

After comments from @Thierry I changed the data sample to ensure that the problem he pointed out was dealt with.


  1. 每个发明人在当前 pryear ,因此如果当前项目的年份是1977年,我想要从1974年到1976年进行的项目数量包括。如果之前没有出现,理想情况下结果将为0。由@Alex 此处提供的答案可用于实现此第一个目标。但是如评论中所讨论的,它不是高效的(特别是因为我的时间范围是从1952到2010年,超过50,000发明者)。

  2. 每个发明者

  3. 如果一个项目有多个发明人,我正在寻找每个发明人与正在处理当前项目的其他发明人合作的次数过去的时间窗口

  1. The number of projects conducted by each individual inventors in an x (say 3) year window before the current pryear, thus if year of current project is 1977, I want the number of projects conducted from 1974 until 1976 included. If there are no occurrences before, ideally the result would be '0'. the answer provided by @Alex here can be used to achieve this first goal. But as discussed in the comments, it is not highly efficient (especially as my time range is from 1952 to 2010 with over 50,000 inventors).
  2. The total number of different inventors with whom each inventor has worked in that same time window
  3. If a project has multiple inventors, I am looking for the number of times each inventor has collaborated with the other inventors who are working on the current project during the same past time window


推荐答案

第一个解决方案只使用 dplyr 。你可能会遇到大型数据集的问题。

The first solutions uses only dplyr. You will probably run into problems with large datasets.

library(dplyr)
df %>% 
  inner_join(
    df %>% 
      select(inventor, oldyear = pryear), 
    by = "inventor") %>% 
  filter(pryear - 3 <= oldyear, oldyear < pryear) %>% 
  group_by(inventor, pryear) %>% 
  summarise(projects = n())

第二个解决方案使用 dplyr 和数据库后端。这应该能够处理更大的数据集。注意代码非常相似。

The second solutions use dplyr with a database back-end. That should be able to cope with larger datasets. Note that the code is very similar.

library(RSQLite)
library(dplyr)
conn <- dbConnect(SQLite(), "test")
dbWriteTable(conn, "project", df)
src <- src_sqlite("test")
tbl(src, "project") %>% 
  inner_join(
    tbl(src, "project") %>% 
      select(inventor, oldyear = pryear), 
    by = "inventor") %>% 
  filter(pryear - 3 <= oldyear, oldyear < pryear) %>% 
  group_by(inventor, pryear) %>% 
  summarise(projects = n()) %>% 
  collect()

这篇关于在不同大小的窗口上滚动总和与更改组R的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆