How to specify names of columns for x and y when joining in dplyr?
I have two data frames that I want to join using dplyr. One is a data frame containing first names.
test_data <- data.frame(first_name = c("john", "bill", "madison", "abby", "zzz"),
stringsAsFactors = FALSE)
The other data frame contains a cleaned up version of the Kantrowitz names corpus, identifying gender. Here is a minimal example:
kantrowitz <- structure(list(name = c("john", "bill", "madison", "abby", "thomas"),
                             gender = c("M", "either", "M", "either", "M")),
                        .Names = c("name", "gender"),
                        row.names = c(NA, 5L),
                        class = c("tbl_df", "tbl", "data.frame"))
I essentially want to look up the gender of the name from the test_data table using the kantrowitz table. Because I'm going to abstract this into a function encode_gender, I won't know the name of the column in the data set that's going to be used, and so I can't guarantee that it will be name, as in kantrowitz$name.
In base R I would perform the merge this way:
merge(test_data, kantrowitz, by.x = "first_name", by.y = "name", all.x = TRUE)
That returns the correct output:
  first_name gender
1       abby either
2       bill either
3       john      M
4    madison      M
5        zzz   <NA>
But I want to do this in dplyr because I'm using that package for all my other data manipulation. The dplyr by option to the various *_join functions only lets me specify one column name, but I need to specify two. I'm looking for something like this:
library(dplyr)
# either
left_join(test_data, kantrowitz, by.x = "first_name", by.y = "name")
# or
left_join(test_data, kantrowitz, by = c("first_name", "name"))
What is the way to perform this kind of join using dplyr?
(Never mind that the Kantrowitz corpus is a bad way to identify gender. I'm working on a better implementation, but I want to get this working first.)
This feature has been added in dplyr v0.3. You can now pass a named character vector to the by argument in left_join (and other joining functions) to specify which columns to join on in each data frame. With the example given in the original question, the code would be:
left_join(test_data, kantrowitz, by = c("first_name" = "name"))
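Since the question mentions wrapping this in an encode_gender function where the column name is not known in advance, here is a minimal sketch of how the named vector could be built at run time. The function signature (the name_col and lookup arguments) is an assumption made for illustration, not something given in the question; setNames() is from base R's stats package.

```r
library(dplyr)

test_data <- data.frame(first_name = c("john", "bill", "madison", "abby", "zzz"),
                        stringsAsFactors = FALSE)
kantrowitz <- data.frame(name = c("john", "bill", "madison", "abby", "thomas"),
                         gender = c("M", "either", "M", "either", "M"),
                         stringsAsFactors = FALSE)

# Hypothetical helper: `name_col` is the name (as a string) of the column
# in `data` that holds the first names; `lookup` must have a `name` column.
encode_gender <- function(data, name_col, lookup = kantrowitz) {
  # setNames("name", name_col) builds the equivalent of c(<name_col> = "name"),
  # i.e. the named character vector that `by` expects.
  join_cols <- setNames("name", name_col)
  left_join(data, lookup, by = join_cols)
}

result <- encode_gender(test_data, "first_name")
print(result)
```

Unlike merge(), left_join() keeps the rows in the order of the left-hand table, so the result here starts with john rather than abby.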