handling enormous nested datasets in R / tidyr


Problem Description

I have a complex data set that I've previously handled in C and Perl (I'm old) and that I've been working on with Python. I'd like to compare the fairly serial approaches I've taken there with potentially more parallel / vectorized approaches in R, and dplyr/tidyr look promising, but I'm spinning my wheels.

The dataset comes from a neural network simulation of word recognition. The model has a lexicon of 1000 words. For a given set of parameters, we run 1000 simulations -- one with each word as the input. The output is the activation values of each of the 1000 words given the input for 100 time steps. A further complication is that there are 30 copies of each word detector at different spatial alignments. So a full dataset is 1000 simulations (one per word as input), with each simulation having 100,000 rows (100 time steps for each of the 1000 words) and 30 columns (one for each copy of a word). For example, some rows might look like this:

Input,Time,Word,Copy1,Copy2,Copy3..Copy30
ark,1,ark,0.00,0.00,0.00..0.00
ark,1,ad,0.00,0.00,0.00..0.00
ark,1,bark,0.00,0.00,0.00..0.00
...
ark,50,ark,0.00,0.10,0.25..0.00
ark,50,ad,0.00,0.01,0.05..0.00
ark,50,bark,0.01,0.07,0.00..0.00
...
ark,100,ark,0.00,0.17,0.55..0.00
ark,100,ad,0.00,0.03,0.11..0.00
ark,100,bark,0.05,0.20,0.00..0.00

So in these lines, ark is the input, and activation values for ark, ad, and bark are selected at Times 1, 50, and 100. The 2nd and 3rd copies of ark and ad eventually become at least somewhat activated, as the input is aligned at position 3. The 1st and 2nd copies for bark become activated because that's where the 'ark' portion of bark aligns best with the bark templates.

When I do this serially in C or Perl, the approach is basically like this pseudocode:

for each input 1..1000 {
    read in the 1000x100 lines for the simulation with this input
    for each word 1..1000 {
        select the column for word with maximum value
    }
    reduce to top X words based on max values over all time steps
}
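
For reference, a rough base-R translation of that serial loop might look like this (the per-input file names and the small word list are made up for illustration; this is the style I'm trying to get away from):

# Hypothetical serial version: one CSV per input, columns Input,Time,Word,Copy1..Copy30.
inputs <- c("ark", "ad", "bark")   # in reality, the full 1000-word lexicon
top_x  <- 2                        # X: number of words to retain

results <- lapply(inputs, function(inp) {
  sim <- read.csv(sprintf("sim_%s.csv", inp))    # 100 time steps x 1000 words
  copy_cols <- grep("^Copy", names(sim), value = TRUE)

  per_word <- lapply(split(sim, sim$Word), function(w) {
    # pick the single copy column whose maximum over all time steps is largest
    best <- copy_cols[which.max(sapply(w[copy_cols], max))]
    data.frame(Input = w$Input, Time = w$Time, Word = w$Word,
               CopyKept = best, Value = w[[best]])
  })
  out <- do.call(rbind, per_word)

  # keep only the top X words, ranked by their maxima over all time steps
  word_max <- tapply(out$Value, out$Word, max)
  keep <- names(sort(word_max, decreasing = TRUE))[seq_len(top_x)]
  out[out$Word %in% keep, ]
})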

So the first goal is to reduce from 30 values per time step per word to 1 value per time step per word, while coding which copy was kept. In the example data above, data selected would be reduced to the following, with a new field indicating which Copy was retained (Copy3 for ark and ad, as that's the column with max values for those 2, but Copy2 for bark):

Input,Time,Word,CopyKept,Value
ark,1,ark,Copy3,0.00
ark,1,ad,Copy3,0.00
ark,1,bark,Copy2,0.00
...
ark,50,ark,Copy3,0.25
ark,50,ad,Copy3,0.05
ark,50,bark,Copy2,0.07
...
ark,100,ark,Copy3,0.55
ark,100,ad,Copy3,0.11
ark,100,bark,Copy2,0.20

If X (number of words to retain) were set to 2, the data would be further reduced to only include ark and bark (since max values in the rows included in the example were ark=0.55, bark=0.20, ad=0.11).

Input,Time,Word,CopyKept,Value
ark,1,ark,Copy3,0.00
ark,1,bark,Copy2,0.00
...
ark,50,ark,Copy3,0.25
ark,50,bark,Copy2,0.07
...
ark,100,ark,Copy3,0.55
ark,100,bark,Copy2,0.20

The data files are enormous, but it looks as though vroom can handle them (so I can process chunks from the vroom object rather than reading the CSV input in chunks). I've tried using group_by kinds of approaches, but I don't seem to be getting it right, and I can't find examples that are similarly complex. I'd be grateful for any advice.
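
For what it's worth, the read step I have in mind looks something like this (the file name is hypothetical, and I haven't verified the memory behaviour at full scale):

library(vroom)
library(dplyr)

# vroom indexes the file lazily, so values are only materialized when touched
dat <- vroom("all_simulations.csv")   # Input,Time,Word,Copy1..Copy30

# then work on one simulation (one Input) at a time
ark_only <- dat %>% filter(Input == "ark")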

UPDATE: @Dan Chatiel's solution is very close to what I need, but not quite. His final output is:

#> # A tibble: 6 x 5
#> # Groups:   Time [3]
#>   Input  Time Word  copy_name copy_value
#>   <chr> <int> <chr> <chr>          <dbl>
#> 1 ark       1 ad    Copy1           0   
#> 2 ark       1 bark  Copy1           0   
#> 3 ark      50 ark   Copy3           0.25
#> 4 ark      50 bark  Copy2           0.07
#> 5 ark     100 ark   Copy3           0.55
#> 6 ark     100 bark  Copy2           0.2

Here, after finding the top copy for each word at each time step, the solution finds the top two items at each time step. This means we get different CopyXs for a particular word at different time steps, and different words at different time steps.

What I am trying to do is first select 1 column (copy) of each word for the entire time series based on column maxima. So for example, based on the entire time series, we would choose Copy3 for ark and Copy2 for bark, and then reduce the data to those copies of ark and bark at every time step.

When we get to the point of choosing the top 2 items, it should be based on maxima over the whole time series. So because ark and bark achieve the top two maximal values over the entire time series, we want to select those at every Time. The final output would be:

#> # A tibble: 6 x 5
#> # Groups:   Time [3]
#>   Input  Time Word  copy_name copy_value
#>   <chr> <int> <chr> <chr>          <dbl>
#> 1 ark       1 ark   Copy3           0   
#> 2 ark       1 bark  Copy2           0   
#> 3 ark      50 ark   Copy3           0.25
#> 4 ark      50 bark  Copy2           0.07
#> 5 ark     100 ark   Copy3           0.55
#> 6 ark     100 bark  Copy2           0.2
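
To make "maxima over the whole time series" concrete: with the sample data defined in the code block below, the per-word summary I have in mind is just this (a quick check, not the solution itself):

df %>%
  pivot_longer(starts_with("Copy"), names_to="copy_name", values_to="copy_value") %>%
  group_by(Input, Word) %>%
  summarise(overall_max = max(copy_value), .groups = "drop")
#> for Input == "ark": ark = 0.55, bark = 0.20, ad = 0.11,
#> so ark and bark are the two words to keep at every Time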

Also, the solution doesn't seem to behave consistently if we add another input:

library(tidyverse)
library(furrr) # parallel processing speeds things up a lot
library(tictoc)
future::plan(multiprocess) # for parallel

df = read.table(header=T, sep=",", text="
Input,Time,Word,Copy1,Copy2,Copy3,Copy30
ark,1,ark,0.00,0.00,0.00,0.00
ark,1,ad,0.00,0.00,0.00,0.00
ark,1,bark,0.00,0.00,0.00,0.00
ark,50,ark,0.00,0.10,0.25,0.00
ark,50,ad,0.00,0.01,0.05,0.00
ark,50,bark,0.01,0.07,0.00,0.00
ark,100,ark,0.00,0.17,0.55,0.00
ark,100,ad,0.00,0.03,0.11,0.00
ark,100,bark,0.05,0.20,0.00,0.00
bark,1,ark,0.00,0.00,0.00,0.00 # corrected Word bark to ark; thanks,Dan
bark,1,ad,0.00,0.00,0.00,0.00
bark,1,bark,0.00,0.00,0.00,0.00
bark,50,ark,0.00,0.03,0.09,0.00
bark,50,ad,0.00,0.01,0.05,0.00
bark,50,bark,0.09,0.28,0.00,0.00
bark,100,ark,0.00,0.08,0.32,0.00
bark,100,ad,0.00,0.03,0.11,0.00
bark,100,bark,0.21,0.60,0.00,0.00
")

#step 1: select the column

#or maybe try to use max.col(ties.method="last") if it is important
tic()
df_red = df %>% 
  select(starts_with("Copy")) %>% 
  # future_pmap_dfr gives us parallel version of pmap_dfr
  future_pmap_dfr(~{
    name = names(which.max(c(...)))
    tibble(copy_name=name, copy_value=max(..., na.rm=TRUE))
  })
toc()

df2 = df %>% 
  select(-starts_with("Copy")) %>% 
  cbind(df_red)

df2
# Input Time Word copy_name copy_value
# 1    ark    1  ark     Copy1       0.00
# 2    ark    1   ad     Copy1       0.00
# 3    ark    1 bark     Copy1       0.00
# 4    ark   50  ark     Copy3       0.25
# 5    ark   50   ad     Copy3       0.05
# 6    ark   50 bark     Copy2       0.07
# 7    ark  100  ark     Copy3       0.55
# 8    ark  100   ad     Copy3       0.11
# 9    ark  100 bark     Copy2       0.20
# 10  bark    1  ark     Copy1       0.00
# 11  bark    1   ad     Copy1       0.00
# 12  bark    1 bark     Copy1       0.00
# 13  bark   50  ark     Copy3       0.09
# 14  bark   50   ad     Copy3       0.05
# 15  bark   50 bark     Copy2       0.28
# 16  bark  100  ark     Copy3       0.32
# 17  bark  100   ad     Copy3       0.11
# 18  bark  100 bark     Copy2       0.60

#step 2: select words
n_words = 2
df2 %>%
  group_by(Input) %>% 
  filter(rank(copy_value, ties.method="first") >= n_words) %>%
  group_by(Input)
# A tibble: 16 x 5
# Groups:   Input [2]
# Input  Time Word  copy_name copy_value
# <fct> <int> <fct> <chr>          <dbl>
# 1 ark       1 ad    Copy1           0   
# 2 ark       1 bark  Copy1           0   
# 3 ark      50 ark   Copy3           0.25
# 4 ark      50 ad    Copy3           0.05
# 5 ark      50 bark  Copy2           0.07
# 6 ark     100 ark   Copy3           0.55
# 7 ark     100 ad    Copy3           0.11
# 8 ark     100 bark  Copy2           0.2 
# 9 bark      1 ad    Copy1           0   
# 10 bark     1 bark  Copy1           0   
# 11 bark    50 ark   Copy3           0.09
# 12 bark    50 ad    Copy3           0.05
# 13 bark    50 bark  Copy2           0.28
# 14 bark   100 ark   Copy3           0.32
# 15 bark   100 ad    Copy3           0.11
# 16 bark   100 bark  Copy2           0.6 

But we can get back to 2 selections per time step by grouping by Input and Time:

df2 %>%
  group_by(Input, Time) %>% 
  filter(rank(copy_value, ties.method="first") >= n_words) %>%
  group_by(Input)
# A tibble: 12 x 5
# Groups:   Input [2]
# Input  Time Word  copy_name copy_value
# <fct> <int> <fct> <chr>          <dbl>
# 1 ark       1 ad    Copy1           0   
# 2 ark       1 bark  Copy1           0   
# 3 ark      50 ark   Copy3           0.25
# 4 ark      50 bark  Copy2           0.07
# 5 ark     100 ark   Copy3           0.55
# 6 ark     100 bark  Copy2           0.2 
# 7 bark      1 ad    Copy1           0   
# 8 bark      1 bark  Copy1           0   
# 9 bark     50 ark   Copy3           0.09
# 10 bark     50 bark  Copy2           0.28
# 11 bark    100 ark   Copy3           0.32
# 12 bark    100 bark  Copy2           0.6 

But again, for both Inputs in this case, the final output I'm looking for would be Copy2 of bark and Copy3 of ark at every time step.

EDIT-- I realized the sample data needs to include conditions that catch the possible ways this could go wrong. In this new sample data, at Time 50, the rank order based on maxima is different from what it is at Time 100. Note that maxima may occur earlier than Time 100 in real data, so we can't just check the maxima at the last time step.

df = read.table(header=T, sep=",", text="
Input,Time,Word,Copy1,Copy2,Copy3,Copy30
ark,1,ark,0.00,0.00,0.00,0.00
ark,1,ad,0.00,0.00,0.00,0.00
ark,1,bark,0.00,0.00,0.00,0.00
ark,50,ark,0.00,0.10,0.05,0.00
ark,50,ad,0.00,0.05,0.03,0.00
ark,50,bark,0.07,0.06,0.00,0.00
ark,100,ark,0.00,0.17,0.55,0.00  # max for ark:ark, Time 100, Copy3
ark,100,ad,0.00,0.03,0.11,0.00   # max for ark:ad, Time 100, Copy3
ark,100,bark,0.05,0.20,0.00,0.00 # max for ark:bark, Time 100, Copy2
bark,1,ark,0.00,0.00,0.00,0.00        
bark,1,ad,0.00,0.00,0.00,0.00
bark,1,bark,0.00,0.00,0.00,0.00
bark,50,ark,0.00,0.03,0.09,0.00
bark,50,ad,0.00,0.05,0.03,0.00
bark,50,bark,0.2,0.75,0.00,0.00   # max for bark:bark, Time 50, Copy2
bark,100,ark,0.00,0.08,0.32,0.00  # max for bark:ark, Time 100, Copy3
bark,100,ad,0.00,0.03,0.11,0.00   # max for bark:ad, Time 100, Copy3
bark,100,bark,0.21,0.60,0.00,0.00 
") %>% arrange(Input,Time,Word)

For this data, the desired output is:

# A tibble: 12 x 5
# Groups:   Input, Time [6]
#   Input  Time Word  copy_name copy_value
#   <fct> <int> <fct> <chr>          <dbl>
#  1 ark       1 ark   Copy3          0   
#  2 ark       1 bark  Copy2          0   
#  3 ark      50 ark   Copy3          0.05 
#  4 ark      50 bark  Copy2          0.06
#  5 ark     100 ark   Copy3          0.55
#  6 ark     100 bark  Copy2          0.2 
#  7 bark      1 ark   Copy3          0   
#  8 bark      1 bark  Copy2          0   
#  9 bark     50 ark   Copy3          0.09
# 10 bark     50 bark  Copy2          0.75 
# 11 bark    100 ark   Copy3          0.32
# 12 bark    100 bark  Copy2          0.6 

In the solutions suggested so far, there are two problems: at Time 1, since all values are 0, either Copy1 or Copy30 gets selected for all items; and at later Times, the Copy with the maximum for a word at that Time is selected, rather than the Copy that corresponds to the maximum for the word over all Times.

Note that the following identifies the maxima. Now to figure out how to select CopyXs based on this...

df %>% 
  pivot_longer(starts_with("Copy"), names_to="copy_name", values_to="copy_value") %>% 
  group_by(Input, Word) %>% 
  filter(rank(copy_value, ties.method="first") == n()) %>%
  group_by(Input, Time)

# A tibble: 6 x 5
# Groups:   Input, Time [3]
#  Input  Time Word  copy_name copy_value
#  <fct> <int> <fct> <chr>          <dbl>
# 1 ark     100 ad    Copy3           0.11
# 2 ark     100 ark   Copy3           0.55
# 3 ark     100 bark  Copy2           0.2 
# 4 bark     50 bark  Copy2           0.75
# 5 bark    100 ad    Copy3           0.11
# 6 bark    100 ark   Copy3           0.32
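
Maybe the missing piece is joining that maxima table back onto the long data, so that the (Word, copy_name) pair chosen by the overall maximum is kept at every Time, and then ranking words by that overall maximum. A sketch of what I mean (using slice_max instead of the rank trick; untested beyond this toy data):

n_words <- 2

df_long <- df %>%
  pivot_longer(starts_with("Copy"), names_to="copy_name", values_to="copy_value")

# which copy of each word has the largest value anywhere in its time series?
best_copies <- df_long %>%
  group_by(Input, Word) %>%
  slice_max(copy_value, n = 1, with_ties = FALSE) %>%  # exact ties (e.g. all zeros) broken arbitrarily
  ungroup() %>%
  select(Input, Word, copy_name, overall_max = copy_value)

df_long %>%
  # keep only the chosen copy of each word, at every time step
  inner_join(best_copies, by = c("Input", "Word", "copy_name")) %>%
  # keep the top n_words words per Input, ranked by their overall maxima
  group_by(Input) %>%
  filter(dense_rank(desc(overall_max)) <= n_words) %>%
  ungroup() %>%
  arrange(Input, Time, Word) %>%
  select(Input, Time, Word, copy_name, copy_value)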

Any suggestions? Thanks.

Answer

This answer has been rewritten; see the edit history for the record.

This solution fits your logic better, but the call to pivot_longer() might perform poorly on very large datasets, since the vector length can become gigantic. Maybe multidplyr would come in handy, but I've never used it, so I can't help you with that.

After pivoting, I guess your problem is just adjusting the grouping:

library(tidyverse)

df = read.table(header=T, sep=",", text="
Input,Time,Word,Copy1,Copy2,Copy3,Copy30
ark,1,ark,0.00,0.00,0.00,0.00
ark,1,ad,0.00,0.00,0.00,0.00
ark,1,bark,0.00,0.00,0.00,0.00
ark,50,ark,0.00,0.10,0.25,0.00
ark,50,ad,0.00,0.01,0.05,0.00
ark,50,bark,0.01,0.07,0.00,0.00
ark,100,ark,0.00,0.17,0.55,0.00
ark,100,ad,0.00,0.03,0.11,0.00
ark,100,bark,0.05,0.20,0.00,0.00
bark,1,ark,0.00,0.00,0.00,0.00        #NB: you had another bark here
bark,1,ad,0.00,0.00,0.00,0.00
bark,1,bark,0.00,0.00,0.00,0.00
bark,50,ark,0.00,0.03,0.09,0.00
bark,50,ad,0.00,0.01,0.05,0.00
bark,50,bark,0.09,0.28,0.00,0.00
bark,100,ark,0.00,0.08,0.32,0.00
bark,100,ad,0.00,0.03,0.11,0.00
bark,100,bark,0.21,0.60,0.00,0.00
") %>% arrange(Input,Time,Word)


n_words = 2
df %>% 
  pivot_longer(starts_with("Copy"), names_to="copy_name", values_to="copy_value") %>% 
  group_by(Input, Time, Word) %>% 
  # slice_max(copy_value, with_ties=FALSE) %>% #alternative to filter+rank
  filter(rank(copy_value, ties.method="first") == n()) %>% # keep the single max copy per (Input, Time, Word)
  group_by(Input, Time) %>% 
  # keep the top n_words rows per (Input, Time); with a full 1000-word lexicon,
  # rank(-copy_value, ties.method="first") <= n_words would be the general form
  filter(rank(copy_value, ties.method="first") >= n_words)
#> # A tibble: 12 x 5
#> # Groups:   Input, Time [6]
#>    Input  Time Word  copy_name copy_value
#>    <chr> <int> <chr> <chr>          <dbl>
#>  1 ark       1 ark   Copy30          0   
#>  2 ark       1 bark  Copy30          0   
#>  3 ark      50 ark   Copy3           0.25
#>  4 ark      50 bark  Copy2           0.07
#>  5 ark     100 ark   Copy3           0.55
#>  6 ark     100 bark  Copy2           0.2 
#>  7 bark      1 ark   Copy30          0   
#>  8 bark      1 bark  Copy30          0   
#>  9 bark     50 ark   Copy3           0.09
#> 10 bark     50 bark  Copy2           0.28
#> 11 bark    100 ark   Copy3           0.32
#> 12 bark    100 bark  Copy2           0.6

Created on 2021-05-01 by the reprex package (v2.0.0)
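
As for multidplyr: I've never used it, but judging from its documentation the same pipeline (using the df defined above) would presumably be partitioned roughly like this; a completely untested sketch, with the worker count chosen arbitrarily:

library(tidyverse)
library(multidplyr)

n_words <- 2
cluster <- new_cluster(4)            # arbitrary number of workers
cluster_library(cluster, "dplyr")

best_per_word <- df %>%
  pivot_longer(starts_with("Copy"), names_to="copy_name", values_to="copy_value") %>%
  group_by(Input, Time, Word) %>%
  partition(cluster) %>%             # groups are spread across the workers
  filter(rank(copy_value, ties.method="first") == n()) %>%
  collect()                          # bring the (much smaller) result back

# the second step is cheap and can stay local
best_per_word %>%
  group_by(Input, Time) %>%
  filter(rank(copy_value, ties.method="first") >= n_words)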
