Calculate function on a column of nested tibbles?
I have a dataframe with a column of tibbles. Here is a portion of my data:
date time uuid data
2018-06-23 18:25:24 0b27ea5fad61c99d <tibble>
2018-06-23 18:25:38 0b27ea5fad61c99d <tibble>
2018-06-23 18:26:01 0b27ea5fad61c99d <tibble>
2018-06-23 18:26:23 0b27ea5fad61c99d <tibble>
2018-06-23 18:26:37 0b27ea5fad61c99d <tibble>
2018-06-23 18:27:00 0b27ea5fad61c99d <tibble>
2018-06-23 18:27:22 0b27ea5fad61c99d <tibble>
2018-06-23 18:27:39 0b27ea5fad61c99d <tibble>
2018-06-23 18:28:06 0b27ea5fad61c99d <tibble>
2018-06-23 18:28:30 0b27ea5fad61c99d <tibble>
And this is my function:
jaccard <- function(vector1, vector2) {
return(length(intersect(vector1, vector2)) /
length(union(vector1, vector2)))
}
My data column consists of tibbles with one column of characters:
contacts
5646
65748
115
498456
35135
My goal is to calculate the Jaccard similarity between each pair of consecutive tibbles in the data column.
I have tried:
df %>% mutate(j = jaccard(data, lag(data, 1)))
but it doesn't seem to work for some reason.
I know I am close, please advise.
The reason is that the jaccard function is not written to handle vector arguments. Functions used inside mutate receive a whole column as a vector (a vector of 10 tibbles in the OP's example). Since jaccard is not written to handle a vector of tibbles, the result does not meet expectations.
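To make this concrete, here is a minimal base-R sketch. The three-element list `contacts` and its shifted copy `previous` are hypothetical stand-ins for the data column and `lag(data, 1)`; they are not the OP's actual data. Called directly on the two lists, jaccard returns a single number for the whole column, which mutate would then recycle to every row.

```r
jaccard <- function(vector1, vector2) {
  length(intersect(vector1, vector2)) / length(union(vector1, vector2))
}

# Hypothetical stand-in for the list column and its lagged copy
contacts <- list(c(5646, 65748, 115), c(5646, 65748), c(5646))
previous <- c(list(NA), contacts[-length(contacts)])

# Called directly, each list is treated as one set whose members are
# whole vectors, so one number comes back instead of one value per row.
whole_column <- jaccard(contacts, previous)
length(whole_column)
```

The result has length 1 regardless of how many rows the column has, which is why the per-row similarities never appear.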
The easiest fix is to vectorise the jaccard function so that it can handle vector arguments. One can use Vectorize to convert the function as:
# Function
jaccard <- function(vector1, vector2) {
  return(length(intersect(vector1, vector2)) /
           length(union(vector1, vector2)))
}
# Vectorised version of jaccard function
jaccardV <- Vectorize(jaccard)
library(dplyr)
df %>%
  mutate(j = jaccardV(data, lag(data, 1)))
# date time uuid data j
# 1 2018-06-23 18:25:24 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135 0.0000000
# 2 2018-06-23 18:25:38 0b27ea5fad61c99d 5646, 65748 0.4000000
# 3 2018-06-23 18:26:01 0b27ea5fad61c99d 5646, 65748, 115 0.6666667
# 4 2018-06-23 18:26:23 0b27ea5fad61c99d 5646 0.3333333
# 5 2018-06-23 18:26:37 0b27ea5fad61c99d 5646, 65748 0.5000000
# 6 2018-06-23 18:27:00 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135 0.4000000
# 7 2018-06-23 18:27:22 0b27ea5fad61c99d 5646, 65748 0.4000000
# 8 2018-06-23 18:27:39 0b27ea5fad61c99d 5646, 65748, 115 0.6666667
# 9 2018-06-23 18:28:06 0b27ea5fad61c99d 5646 0.3333333
# 10 2018-06-23 18:28:30 0b27ea5fad61c99d 5646, 65748 0.5000000
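As a rough illustration of what the vectorised wrapper does, the sketch below applies jaccard element by element with base R's mapply, which Vectorize uses internally, and compares that with the Vectorize result. The list `contacts` and the hand-built `previous` (playing the role of `lag(data, 1)`, with NA in the first slot) are hypothetical toy data, not the OP's data frame.

```r
jaccard <- function(vector1, vector2) {
  length(intersect(vector1, vector2)) / length(union(vector1, vector2))
}

# Hypothetical toy list column and its one-step lag
contacts <- list(c(5646, 65748, 115), c(5646, 65748), c(5646))
previous <- c(list(NA), contacts[-length(contacts)])

# Element-wise application: jaccard(contacts[[i]], previous[[i]]) for each i
j_mapply <- mapply(jaccard, contacts, previous)

# Same computation through the Vectorize wrapper
jaccardV <- Vectorize(jaccard)
j_vectorized <- jaccardV(contacts, previous)

j_mapply
```

Both calls return one value per element. The first value is 0 because the first element has no predecessor and is compared against NA.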
Data:
df <- read.table(text="
date time uuid
2018-06-23 18:25:24 0b27ea5fad61c99d
2018-06-23 18:25:38 0b27ea5fad61c99d
2018-06-23 18:26:01 0b27ea5fad61c99d
2018-06-23 18:26:23 0b27ea5fad61c99d
2018-06-23 18:26:37 0b27ea5fad61c99d
2018-06-23 18:27:00 0b27ea5fad61c99d
2018-06-23 18:27:22 0b27ea5fad61c99d
2018-06-23 18:27:39 0b27ea5fad61c99d
2018-06-23 18:28:06 0b27ea5fad61c99d
2018-06-23 18:28:30 0b27ea5fad61c99d",
header = TRUE, stringsAsFactors = FALSE)
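# tibble() is available once dplyr is loaded
t1 <- tibble(contacts = c(5646, 65748, 115, 498456, 35135))
t2 <- tibble(contacts = c(5646, 65748))
t3 <- tibble(contacts = c(5646, 65748, 115))
t4 <- tibble(contacts = c(5646))
t5 <- tibble(contacts = c(5646, 65748))
# c() flattens the five one-column tibbles into a plain list of five numeric
# vectors; assigning it to the 10-row data frame recycles the list twice.
df$data <- c(t1, t2, t3, t4, t5)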
df
# date time uuid data
# 1 2018-06-23 18:25:24 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135
# 2 2018-06-23 18:25:38 0b27ea5fad61c99d 5646, 65748
# 3 2018-06-23 18:26:01 0b27ea5fad61c99d 5646, 65748, 115
# 4 2018-06-23 18:26:23 0b27ea5fad61c99d 5646
# 5 2018-06-23 18:26:37 0b27ea5fad61c99d 5646, 65748
# 6 2018-06-23 18:27:00 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135
# 7 2018-06-23 18:27:22 0b27ea5fad61c99d 5646, 65748
# 8 2018-06-23 18:27:39 0b27ea5fad61c99d 5646, 65748, 115
# 9 2018-06-23 18:28:06 0b27ea5fad61c99d 5646
# 10 2018-06-23 18:28:30 0b27ea5fad61c99d 5646, 65748