dplyr: how to programmatically full_join dataframes contained in a list of lists?

Problem description

I'll share with you a simplified version of my huge dataset. This simplified version fully respects the structure of my original dataset but contains fewer list elements, dataframes, variables and observations than the original one.

Following the most upvoted answer to the question "How to make a great R reproducible example?", I share my dataset as the output of dput(query1), so that you can use it immediately by copy/pasting the following code block into the R console:

    query1 <- structure(list(plu = structure(list(year = structure(list(id = 1:3,
    station = 100:102, pluMean = c(0.509068994778059, 1.92866478959912,
    1.09517453602154), pluMax = c(0.0146962179957886, 0.802984389130343,
    2.48170762478472)), .Names = c("id", "station", "pluMean",
"pluMax"), row.names = c(NA, -3L), class = "data.frame"), month = structure(list(
    id = 1:3, station = 100:102, pluMean = c(0.66493845927034,
    -1.3559338786041, 0.195600637750077), pluMax = c(0.503424623872161,
    0.234402501255681, -0.440264545434053)), .Names = c("id",
"station", "pluMean", "pluMax"), row.names = c(NA, -3L), class = "data.frame"),
    week = structure(list(id = 1:3, station = 100:102, pluMean = c(-0.608295829330578,
    -1.10256919591373, 1.74984007126193), pluMax = c(0.969668266601551,
    0.924426323739882, 3.47460867665884)), .Names = c("id", "station",
    "pluMean", "pluMax"), row.names = c(NA, -3L), class = "data.frame")), .Names = c("year",
"month", "week")), tsa = structure(list(year = structure(list(
    id = 1:3, station = 100:102, tsaMean = c(-1.49060721773042,
    -0.684735418997484, 0.0586655881113975), tsaMax = c(0.25739838787582,
    0.957634817758648, 1.37198023881125)), .Names = c("id", "station",
"tsaMean", "tsaMax"), row.names = c(NA, -3L), class = "data.frame"),
    month = structure(list(id = 1:3, station = 100:102, tsaMean = c(-0.684668662999479,
    -1.28087846387974, -0.600175481941456), tsaMax = c(0.962916941685075,
    0.530773351897188, -0.217143593955998)), .Names = c("id",
    "station", "tsaMean", "tsaMax"), row.names = c(NA, -3L), class = "data.frame"),
    week = structure(list(id = 1:3, station = 100:102, tsaMean = c(0.376481732842365,
    0.370435880636005, -0.105354927593471), tsaMax = c(1.93833635147645,
    0.81176751708868, 0.744932493064975)), .Names = c("id", "station",
    "tsaMean", "tsaMax"), row.names = c(NA, -3L), class = "data.frame")), .Names = c("year",
"month", "week"))), .Names = c("plu", "tsa"))

After executing this, str(query1) shows the structure of my example dataset:

    > str(query1)
List of 2
 $ plu:List of 3
  ..$ year :'data.frame':   3 obs. of  4 variables:
  .. ..$ id     : int [1:3] 1 2 3
  .. ..$ station: int [1:3] 100 101 102
  .. ..$ pluMean: num [1:3] 0.509 1.929 1.095
  .. ..$ pluMax : num [1:3] 0.0147 0.803 2.4817
  ..$ month:'data.frame':   3 obs. of  4 variables:
  .. ..$ id     : int [1:3] 1 2 3
  .. ..$ station: int [1:3] 100 101 102
  .. ..$ pluMean: num [1:3] 0.665 -1.356 0.196
  .. ..$ pluMax : num [1:3] 0.503 0.234 -0.44
  ..$ week :'data.frame':   3 obs. of  4 variables:
  .. ..$ id     : int [1:3] 1 2 3
  .. ..$ station: int [1:3] 100 101 102
  .. ..$ pluMean: num [1:3] -0.608 -1.103 1.75
  .. ..$ pluMax : num [1:3] 0.97 0.924 3.475
 $ tsa:List of 3
  ..$ year :'data.frame':   3 obs. of  4 variables:
  .. ..$ id     : int [1:3] 1 2 3
  .. ..$ station: int [1:3] 100 101 102
  .. ..$ tsaMean: num [1:3] -1.4906 -0.6847 0.0587
  .. ..$ tsaMax : num [1:3] 0.257 0.958 1.372
  ..$ month:'data.frame':   3 obs. of  4 variables:
  .. ..$ id     : int [1:3] 1 2 3
  .. ..$ station: int [1:3] 100 101 102
  .. ..$ tsaMean: num [1:3] -0.685 -1.281 -0.6
  .. ..$ tsaMax : num [1:3] 0.963 0.531 -0.217
  ..$ week :'data.frame':   3 obs. of  4 variables:
  .. ..$ id     : int [1:3] 1 2 3
  .. ..$ station: int [1:3] 100 101 102
  .. ..$ tsaMean: num [1:3] 0.376 0.37 -0.105
  .. ..$ tsaMax : num [1:3] 1.938 0.812 0.745

So how does it read? I have a big list (query1) made of 2 parameter elements (plu & tsa). Each of these 2 parameter elements is a list made of 3 elements (year, month, week), and each of these 3 elements is a timeInterval dataframe made of the same 4 variable columns (id, station, mean, max) and exactly the same number of observations (3).

I want to programmatically full_join by id & station all the timeInterval dataframes with the same name (year, month, week). This means that I should end up with a new list (query1Changed) containing 3 dataframes (year, month, week), each of them containing 6 columns (id, station, pluMean, pluMax, tsaMean, tsaMax) and 3 observations. Schematically, I need to arrange the data as follows:

Full join by station and id:

  • df query1$plu$year with df query1$tsa$year
  • df query1$plu$month with df query1$tsa$month
  • df query1$plu$week with df query1$tsa$week

Or, expressed with another representation:

  • df query1[[1]][[1]] with df query1[[2]][[1]]
  • df query1[[1]][[2]] with df query1[[2]][[2]]
  • df query1[[1]][[3]] with df query1[[2]][[3]]

And expressed programmatically (n being the total number of elements of the big list):

  • df query1[[i]][[1]] with df query1[[i+1]][[1]] ... with df query1[[n]][[1]]
  • df query1[[i]][[2]] with df query1[[i+1]][[2]] ... with df query1[[n]][[2]]
  • df query1[[i]][[3]] with df query1[[i+1]][[3]] ... with df query1[[n]][[3]]

I need to achieve this programmatically because in my real project I could encounter another big list with more than 2 parameter elements and more than 4 variable columns in each of their timeInterval dataframes.

In my analysis, what will always remain the same is that all the parameter elements of any other big list will have the same number of timeInterval dataframes with the same names, and that each of these timeInterval dataframes will have the same number of observations and share 2 columns with exactly the same names and values (id & station).
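As a minimal sketch, these invariants could be checked before attempting any join. The helper name check_structure and the toy list below are hypothetical and use invented values; they only mirror the shape of query1.

```r
# Hypothetical helper: checks the invariants described above on a list of lists.
check_structure <- function(big_list, keys = c("id", "station")) {
  ref <- big_list[[1]]
  all(vapply(big_list, function(param) {
    # Same timeInterval names, in the same order, as the first parameter
    if (!identical(names(param), names(ref))) return(FALSE)
    # Every dataframe must carry the key columns with identical values
    all(mapply(function(df, ref_df) {
      all(keys %in% names(df)) &&
        all(keys %in% names(ref_df)) &&
        identical(df[keys], ref_df[keys])
    }, param, ref))
  }, logical(1)))
}

# Toy data (invented values) with the same shape as query1
toy <- list(
  plu = list(
    year = data.frame(id = 1:3, station = 100:102, pluMean = c(0.5, 1.9, 1.1)),
    week = data.frame(id = 1:3, station = 100:102, pluMean = c(-0.6, -1.1, 1.7))
  ),
  tsa = list(
    year = data.frame(id = 1:3, station = 100:102, tsaMean = c(-1.5, -0.7, 0.1)),
    week = data.frame(id = 1:3, station = 100:102, tsaMean = c(0.4, 0.4, -0.1))
  )
)

ok <- check_structure(toy)

# Breaking one key column should make the check fail
broken <- toy
broken$tsa$week$station <- 200:202
not_ok <- check_structure(broken)
```

Here ok is expected to be TRUE and not_ok FALSE. The check uses identical(), so it is strict about column types as well as values.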

Executing the following code:

> query1Changed <- do.call(function(...) mapply(bind_cols, ..., SIMPLIFY=F), args = query1)

arranges the data as expected. However, this is not a neat solution since we end up with repeated key columns (id & station reappear as id1 & station1):

> str(query1Changed)
List of 3
 $ year :'data.frame':  3 obs. of  8 variables:
  ..$ id      : int [1:3] 1 2 3
  ..$ station : int [1:3] 100 101 102
  ..$ pluMean : num [1:3] 0.509 1.929 1.095
  ..$ pluMax  : num [1:3] 0.0147 0.803 2.4817
  ..$ id1     : int [1:3] 1 2 3
  ..$ station1: int [1:3] 100 101 102
  ..$ tsaMean : num [1:3] -1.4906 -0.6847 0.0587
  ..$ tsaMax  : num [1:3] 0.257 0.958 1.372
 $ month:'data.frame':  3 obs. of  8 variables:
  ..$ id      : int [1:3] 1 2 3
  ..$ station : int [1:3] 100 101 102
  ..$ pluMean : num [1:3] 0.665 -1.356 0.196
  ..$ pluMax  : num [1:3] 0.503 0.234 -0.44
  ..$ id1     : int [1:3] 1 2 3
  ..$ station1: int [1:3] 100 101 102
  ..$ tsaMean : num [1:3] -0.685 -1.281 -0.6
  ..$ tsaMax  : num [1:3] 0.963 0.531 -0.217
 $ week :'data.frame':  3 obs. of  8 variables:
  ..$ id      : int [1:3] 1 2 3
  ..$ station : int [1:3] 100 101 102
  ..$ pluMean : num [1:3] -0.608 -1.103 1.75
  ..$ pluMax  : num [1:3] 0.97 0.924 3.475
  ..$ id1     : int [1:3] 1 2 3
  ..$ station1: int [1:3] 100 101 102
  ..$ tsaMean : num [1:3] 0.376 0.37 -0.105
  ..$ tsaMax  : num [1:3] 1.938 0.812 0.745

We could add a second process to "clean" the data but this would not be the most efficient solution. So I don't want to use this workaround.

Next, I tried doing the same using dplyr's full_join, but with no success. Executing the following code:

> query1Changed <- do.call(function(...) mapply(full_join(..., by = c("station", "id")), ..., SIMPLIFY=F), args = query1)

returns the following error:

Error in UseMethod("full_join") :
  no applicable method for 'full_join' applied to an object of class "list"

So, how should I write my full_join expression to make it run on the dataframes?

Or is there another way to perform my data transformation efficiently?

I've found related questions, but I still can't figure out how to adapt their solutions to my problem.

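On Stack Overflow:

  • Merging a data frame from a list of data frames [duplicate]
  • Simultaneously merge multiple data.frames in a list
  • Joining a list of data.frames from a map() call
  • Combine elements of a list of lists by index

On blogs:

  • Joining a list of data frames with purrr::reduce()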

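Any help would be greatly appreciated. I hope I have stated my problem clearly. I only started programming in R 2 months ago, so please be indulgent if the solution is obvious ;)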

Recommended answer

First of all, thanks for posting a really great description of what your problem is and which requirements you need for your solution.

First, I'd use purrr::map2 to create a function that takes two lists of data frames and joins them in parallel. That is, it joins the first data frame of plu with the first of tsa ... the last of plu with the last of tsa, and returns the results as a list.

> join_each = function(x, y) map2(x, y, full_join)
> join_each(query1$plu, query1$tsa)
Joining, by = c("id", "station")
Joining, by = c("id", "station")
Joining, by = c("id", "station")
$year
  id station  pluMean     pluMax     tsaMean    tsaMax
1  1     100 0.509069 0.01469622 -1.49060722 0.2573984
2  2     101 1.928665 0.80298439 -0.68473542 0.9576348
3  3     102 1.095175 2.48170762  0.05866559 1.3719802

$month
  id station    pluMean     pluMax    tsaMean     tsaMax
1  1     100  0.6649385  0.5034246 -0.6846687  0.9629169
2  2     101 -1.3559339  0.2344025 -1.2808785  0.5307734
3  3     102  0.1956006 -0.4402645 -0.6001755 -0.2171436

$week
  id station    pluMean    pluMax    tsaMean    tsaMax
1  1     100 -0.6082958 0.9696683  0.3764817 1.9383364
2  2     101 -1.1025692 0.9244263  0.3704359 0.8117675
3  3     102  1.7498401 3.4746087 -0.1053549 0.7449325

Well, this works when there are only two of them, but you want it to work when there are n lists of data.frames. Now you are going to need purrr::reduce:

> reduce(query1, join_each)
Joining, by = c("id", "station")
Joining, by = c("id", "station")
Joining, by = c("id", "station")
$year
  id station  pluMean     pluMax     tsaMean    tsaMax
1  1     100 0.509069 0.01469622 -1.49060722 0.2573984
2  2     101 1.928665 0.80298439 -0.68473542 0.9576348
3  3     102 1.095175 2.48170762  0.05866559 1.3719802

$month
  id station    pluMean     pluMax    tsaMean     tsaMax
1  1     100  0.6649385  0.5034246 -0.6846687  0.9629169
2  2     101 -1.3559339  0.2344025 -1.2808785  0.5307734
3  3     102  0.1956006 -0.4402645 -0.6001755 -0.2171436

$week
  id station    pluMean    pluMax    tsaMean    tsaMax
1  1     100 -0.6082958 0.9696683  0.3764817 1.9383364
2  2     101 -1.1025692 0.9244263  0.3704359 0.8117675
3  3     102  1.7498401 3.4746087 -0.1053549 0.7449325

It computes join_each(query1[[1]], query1[[2]]) %>% join_each(query1[[3]]) ... %>% join_each(query1[[n]]).
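To make that chain concrete, here is a minimal sketch that writes out as an explicit loop what reduce() does, using a list with three parameters. The third parameter (hum), the helpers make_df and make_param, and all values are invented for illustration. Unlike the version above, join_each here passes by explicitly, which should also silence the "Joining, by = ..." messages.

```r
library(dplyr)
library(purrr)

# Hypothetical helpers building toy data shaped like query1 (values invented)
make_df <- function(prefix) {
  df <- data.frame(id = 1:3, station = 100:102)
  df[[paste0(prefix, "Mean")]] <- c(0.1, 0.2, 0.3)
  df[[paste0(prefix, "Max")]] <- c(1.1, 1.2, 1.3)
  df
}
make_param <- function(prefix) {
  list(year = make_df(prefix), month = make_df(prefix), week = make_df(prefix))
}
query3 <- list(plu = make_param("plu"), tsa = make_param("tsa"), hum = make_param("hum"))

# Same idea as join_each above, with the join keys stated explicitly
join_each <- function(x, y) {
  map2(x, y, function(a, b) full_join(a, b, by = c("id", "station")))
}

# What reduce() does, step by step: start from the first parameter
# and fold each remaining parameter into the accumulated result
unrolled <- query3[[1]]
for (param in query3[-1]) {
  unrolled <- join_each(unrolled, param)
}

reduced <- reduce(query3, join_each)
identical(unrolled, reduced)
```

Each of the three resulting dataframes should carry the two key columns plus two columns per parameter, so the result grows by two columns for every parameter added to the big list.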

Update: The following one-liner does the same: reduce(query1, map2, full_join). It isn't as readable, though.
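If avoiding the purrr and dplyr dependency matters, the same fold can probably be sketched in base R with Reduce(), Map() and merge(..., all = TRUE), which performs a full outer join. This is a swapped-in technique, not the approach used in this answer; the helper name merge_each and the toy values are invented. One difference to keep in mind: merge() sorts rows by the key columns, whereas full_join keeps the row order of its inputs.

```r
# Toy data (invented values) shaped like query1
toy <- list(
  plu = list(
    year  = data.frame(id = 1:3, station = 100:102, pluMean = c(0.5, 1.9, 1.1)),
    month = data.frame(id = 1:3, station = 100:102, pluMean = c(0.7, -1.4, 0.2))
  ),
  tsa = list(
    year  = data.frame(id = 1:3, station = 100:102, tsaMean = c(-1.5, -0.7, 0.1)),
    month = data.frame(id = 1:3, station = 100:102, tsaMean = c(-0.7, -1.3, -0.6))
  )
)

# Hypothetical base R counterpart of join_each: merge two lists of
# dataframes element by element, keeping all rows from both sides
merge_each <- function(x, y) {
  Map(function(a, b) merge(a, b, by = c("id", "station"), all = TRUE), x, y)
}

# Fold merge_each over every parameter of the big list
result <- Reduce(merge_each, toy)
str(result)
```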
