Calculate function on a column of nested tibbles?


Problem description

I have a dataframe with a column of tibbles. Here is a portion of my data:

date        time        uuid                data
2018-06-23  18:25:24    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:25:38    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:26:01    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:26:23    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:26:37    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:27:00    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:27:22    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:27:39    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:28:06    0b27ea5fad61c99d    <tibble>    
2018-06-23  18:28:30    0b27ea5fad61c99d    <tibble>

And this is my function:

jaccard <- function(vector1, vector2) {

  return(length(intersect(vector1, vector2)) / 
        length(union(vector1, vector2)))

}

My data column consists of tibbles with one column of characters:

contacts
5646
65748
115
498456
35135

My goal is to calculate the Jaccard index between each pair of consecutive tibbles in the data column.

I have tried:

df %>% mutate(j = jaccard(data, lag(data, 1)))

but it doesn't seem to work for some reason.

I know I am close, please advise.

Solution

The reason is that the jaccard function is not written to handle vector arguments. A function used inside mutate receives the whole column at once (a vector of 10 tibbles in the OP's example), not one row at a time. Since jaccard is not written to work element by element over such a vector of tibbles, the result does not meet expectations.
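To make the difference concrete, here is a minimal base R sketch. The lists `a` and `b` are made-up illustration values, not the OP's data. Called on the two lists directly, jaccard compares the lists as wholes and returns a single number; applied pair by pair with `mapply`, it returns one value per element.

```r
jaccard <- function(vector1, vector2) {
  length(intersect(vector1, vector2)) / length(union(vector1, vector2))
}

# Hypothetical list columns: each element is one set of contacts
a <- list(c(1, 2, 3), c(1, 2))
b <- list(c(1, 2), c(1, 2))

# Called on the lists themselves: one value for the whole column
whole <- jaccard(a, b)

# Called element by element: one value per pair (a[[i]], b[[i]])
elementwise <- mapply(jaccard, a, b)

length(whole)        # 1
length(elementwise)  # 2
```

The second form is what a per-row calculation inside mutate needs.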

The easiest fix is to vectorise the jaccard function so that it can handle vector arguments. One can use Vectorize to convert the function:

# Function 
jaccard <- function(vector1, vector2) {
  return(length(intersect(vector1, vector2)) / 
           length(union(vector1, vector2)))
}
# Vectorised version of jaccard function
jaccardV <- Vectorize(jaccard)


library(dplyr)
df %>%
  mutate(j = jaccardV(data, lag(data, 1)))

#          date     time             uuid                            data         j
# 1  2018-06-23 18:25:24 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135 0.0000000
# 2  2018-06-23 18:25:38 0b27ea5fad61c99d                     5646, 65748 0.4000000
# 3  2018-06-23 18:26:01 0b27ea5fad61c99d                5646, 65748, 115 0.6666667
# 4  2018-06-23 18:26:23 0b27ea5fad61c99d                            5646 0.3333333
# 5  2018-06-23 18:26:37 0b27ea5fad61c99d                     5646, 65748 0.5000000
# 6  2018-06-23 18:27:00 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135 0.4000000
# 7  2018-06-23 18:27:22 0b27ea5fad61c99d                     5646, 65748 0.4000000
# 8  2018-06-23 18:27:39 0b27ea5fad61c99d                5646, 65748, 115 0.6666667
# 9  2018-06-23 18:28:06 0b27ea5fad61c99d                            5646 0.3333333
# 10 2018-06-23 18:28:30 0b27ea5fad61c99d                     5646, 65748 0.5000000
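For comparison, the same consecutive-pair calculation can be sketched in base R without Vectorize or lag. This is only an illustration under assumptions: `contacts` is a hypothetical stand-in for the first five elements of `df$data`, and the first row is set to NA explicitly because it has no predecessor (the output above shows 0 there instead).

```r
jaccard <- function(vector1, vector2) {
  length(intersect(vector1, vector2)) / length(union(vector1, vector2))
}

# Hypothetical stand-in for the list column df$data
contacts <- list(
  c(5646, 65748, 115, 498456, 35135),
  c(5646, 65748),
  c(5646, 65748, 115),
  c(5646),
  c(5646, 65748)
)

# Indices of all elements that have a predecessor
idx <- seq_along(contacts)[-1]

# Compare each element with the one before it; the first row gets NA
j <- c(
  NA_real_,
  vapply(idx, function(i) jaccard(contacts[[i]], contacts[[i - 1]]), numeric(1))
)

j
```

Using `vapply` with `numeric(1)` makes the function fail loudly if jaccard ever returns something other than a single number, which is a small safety gain over the simplification that `Vectorize` relies on.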

Data:

df <- read.table(text="
date        time        uuid                
2018-06-23  18:25:24    0b27ea5fad61c99d    
2018-06-23  18:25:38    0b27ea5fad61c99d    
2018-06-23  18:26:01    0b27ea5fad61c99d    
2018-06-23  18:26:23    0b27ea5fad61c99d    
2018-06-23  18:26:37    0b27ea5fad61c99d    
2018-06-23  18:27:00    0b27ea5fad61c99d    
2018-06-23  18:27:22    0b27ea5fad61c99d    
2018-06-23  18:27:39    0b27ea5fad61c99d    
2018-06-23  18:28:06    0b27ea5fad61c99d    
2018-06-23  18:28:30    0b27ea5fad61c99d",
header = TRUE, stringsAsFactors = FALSE)

t1 <- tibble(contacts = c(5646,65748,115,498456,35135))
t2 <- tibble(contacts = c(5646,65748))
t3 <- tibble(contacts = c(5646,65748,115))
t4 <- tibble(contacts = c(5646))
t5 <- tibble(contacts = c(5646,65748))


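# c() concatenates the tibbles as plain lists, giving a list of 5 "contacts" vectors;
# assigning it to the 10-row df recycles that list twice
df$data <- c(t1, t2, t3, t4, t5)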

df
#          date     time             uuid                            data
# 1  2018-06-23 18:25:24 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135
# 2  2018-06-23 18:25:38 0b27ea5fad61c99d                     5646, 65748
# 3  2018-06-23 18:26:01 0b27ea5fad61c99d                5646, 65748, 115
# 4  2018-06-23 18:26:23 0b27ea5fad61c99d                            5646
# 5  2018-06-23 18:26:37 0b27ea5fad61c99d                     5646, 65748
# 6  2018-06-23 18:27:00 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135
# 7  2018-06-23 18:27:22 0b27ea5fad61c99d                     5646, 65748
# 8  2018-06-23 18:27:39 0b27ea5fad61c99d                5646, 65748, 115
# 9  2018-06-23 18:28:06 0b27ea5fad61c99d                            5646
# 10 2018-06-23 18:28:30 0b27ea5fad61c99d                     5646, 65748
