Mutate with a list column function in dplyr


Question

I am trying to calculate the Jaccard similarity between a source vector and each of several comparison vectors stored in a tibble.

First, I create a tibble with a names_ field (a vector of strings). In the same tibble() call I create names_vec, a list-column in which each row holds a character vector (each element of the vector is a letter).

Then I create a new tibble with a column jaccard_sim that is supposed to hold the Jaccard similarity.

library(dplyr)

source_vec <- c('a', 'b', 'c')

df_comp <- tibble(names_ = c("b d f", "u k g", "m o c"),
                  names_vec = strsplit(names_, ' '))

df_comp_jaccard <- df_comp %>%
  dplyr::mutate(jaccard_sim = length(intersect(names_vec, source_vec))/
                              length(union(names_vec, source_vec)))

All the values in jaccard_sim are zero. However, if I run the same computation on the first entry alone, I get the correct Jaccard similarity of 0.2:

a <- length(intersect(source_vec, df_comp[[1,2]]))
b <- length(union(source_vec, df_comp[[1,2]]))
a/b

Recommended answer

You can simply add rowwise():

df_comp_jaccard <- df_comp %>%
  rowwise() %>%
  dplyr::mutate(jaccard_sim = length(intersect(names_vec, source_vec))/
                              length(union(names_vec, source_vec)))

# A tibble: 3 x 3
  names_ names_vec jaccard_sim
   <chr>    <list>       <dbl>
1  b d f <chr [3]>         0.2
2  u k g <chr [3]>         0.0
3  m o c <chr [3]>         0.2

With rowwise you get the intuitive behavior that some people expect from mutate: "do this operation for every row".
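One practical detail, shown here as a minimal sketch rather than as part of the original answer: the result of rowwise() stays row-wise until you call ungroup(), so later verbs would also run once per row. The data below is copied from the question; nothing else is assumed.

```r
library(dplyr)

source_vec <- c("a", "b", "c")
df_comp <- tibble(names_ = c("b d f", "u k g", "m o c"),
                  names_vec = strsplit(names_, " "))

df_comp_jaccard <- df_comp %>%
  rowwise() %>%
  mutate(jaccard_sim = length(intersect(names_vec, source_vec)) /
                       length(union(names_vec, source_vec))) %>%
  ungroup()  # drop the row-wise grouping so later verbs see whole columns again
```

After ungroup(), df_comp_jaccard is an ordinary tibble again.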

Not using rowwise means you take advantage of vectorized functions, which is much faster. That is why it is the default, but it may yield unexpected results if you're not careful.

The impression that mutate (or other dplyr functions) works row by row is an illusion created by vectorized functions. In fact, every expression is always evaluated on full columns.

I'll illustrate with a couple of examples:

Sometimes the result is the same with or without rowwise, as with a vectorized function such as paste:

tibble(a=1:5,b=5:1) %>% mutate(X = paste(a,b,sep="_"))
tibble(a=1:5,b=5:1) %>% rowwise %>% mutate(X = paste(a,b,sep="_"))
# # A tibble: 5 x 3
#       a     b     X
#   <int> <int> <chr>
# 1     1     5   1_5
# 2     2     4   2_4
# 3     3     3   3_3
# 4     4     2   4_2
# 5     5     1   5_1

And sometimes it is different, as with a function that is not vectorized element-wise, such as max, which collapses all of its arguments into a single value:

tibble(a=1:5,b=5:1) %>% mutate(max(a,b))
# # A tibble: 5 x 3
#       a     b `max(a, b)`
#   <int> <int>       <int>
# 1     1     5           5
# 2     2     4           5
# 3     3     3           5
# 4     4     2           5
# 5     5     1           5

tibble(a=1:5,b=5:1) %>% rowwise %>% mutate(max(a,b))
# # A tibble: 5 x 3
#       a     b `max(a, b)`
#   <int> <int>       <int>
# 1     1     5           5
# 2     2     4           4
# 3     3     3           3
# 4     4     2           4
# 5     5     1           5

Note that in this case you shouldn't use rowwise in a real-life situation. Use pmax instead, which is vectorized for exactly this purpose:

tibble(a=1:5,b=5:1) %>% mutate(pmax(a,b))
# # A tibble: 5 x 3
#       a     b `pmax(a, b)`
#   <int> <int>        <int>
# 1     1     5            5
# 2     2     4            4
# 3     3     3            3
# 4     4     2            4
# 5     5     1            5

intersect, like max, does not work element by element on a list column. Without rowwise, you passed it the whole list column (a list of character vectors) and a plain character vector. These two objects have no elements in common, so the intersection is empty and jaccard_sim is 0 for every row.
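To make that concrete, here is a minimal sketch that first reproduces the empty whole-column intersection and then computes the similarity per list element with base R's vapply() instead of rowwise(). The helper name jaccard is hypothetical and not part of the original answer; the data is copied from the question.

```r
library(dplyr)

source_vec <- c("a", "b", "c")
df_comp <- tibble(names_ = c("b d f", "u k g", "m o c"),
                  names_vec = strsplit(names_, " "))

# Without rowwise, intersect() receives the entire list column at once.
# None of the list's elements (each one a character vector) equals
# "a", "b" or "c", so the intersection is empty.
whole_column_overlap <- length(intersect(df_comp$names_vec, source_vec))

# Hypothetical helper: Jaccard similarity of two character vectors.
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))

# vapply() calls jaccard() once per list element and returns a plain
# numeric vector, so mutate() can use it without rowwise().
df_comp_jaccard <- df_comp %>%
  mutate(jaccard_sim = vapply(names_vec, jaccard, numeric(1), y = source_vec))
```

This should give the same jaccard_sim values as the rowwise() version (0.2, 0, 0.2) while leaving the tibble ungrouped. purrr::map_dbl() would be a similar option if purrr is already in use.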
