用dplyr总结逻辑数据框 [英] summarise logical dataframe with dplyr

查看:41
本文介绍了用dplyr总结逻辑数据框的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用两个变量来总结一个数据帧-我基本上想将变量1分解为变量2,以便将结果绘制在100%堆叠的条形图中.

I'm trying to summarise a dataframe using two variables - I basically want to break down variable 1 by variable 2 in order to plot the results in a 100% stacked bar chart.

我有多个逻辑类型的列,可以将其分为两个主要类别,以用于创建细分.

I have multiple columns of type logical, which can be split between two main categories that will be used to create the breakdown.

我尝试使用 dplyr 中的 gather 将数据帧转换为长格式,但是输出不是我期望的.

I have tried to use gather from dplyr to transform the dataframe to longform, however the output is not what I expect.

topics_by_variable <- function (dataset, variable_1, variable_2) {

  #select variables columns
  variable_1_columns <- dataset[, data.table::`%like%`(names(dataset), variable_1)]
  variable_2_columns <- dataset[, data.table::`%like%`(names(dataset), variable_2)]
  #create new dataframe including only relevant columns
  df <- cbind(variable_1_columns, variable_2_columns)
  #transform df to long form
  new_df <- tidyr::gather(df, variable_2, count, names(variable_2_columns[1]):names(variable_2_columns)[length(names(variable_2_columns))], factor_key=FALSE)

  #count topics
  topic_count <- function (x) {
                  t <- sum(x == TRUE)
  }
  #group by variable 2 and count
  new_df <- new_df %>%
            dplyr::group_by(variable_2) %>%
            dplyr::summarise_at(topic_names, .funs = topic_count)

  #transform new_df to longform
  final_df <- tidyr::gather(new_df, topic, volume, names(variable_1_columns[1]):names(variable_1_columns)[length(names(variable_1_columns))], factor_key=FALSE)
  final_df <- data.frame(final_df)

这是我正在使用的数据集:

Here is the dataset I'm using:

structure(list(topic_su = c("TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE"), topic_so = c("FALSE", 
"FALSE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE"), topic_cl = c("FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE"
), topic_in = c("FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE"), topic_qu = c("FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE"), topic_re = c("FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE"), brands_ne = c("TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE"
), brands_st = c("FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE"), brands_co = c("FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE"
), brands_seg = c("FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE"), brands_sen = c("TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", 
"TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE"), brands_ta = c("FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "TRUE"), brands_tc = c("FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", 
"FALSE", "FALSE")), class = "data.frame", row.names = c(NA, -39L
))

所需的输出将是以下内容,但是当我使用collect时,体积图是行的总数,并且在所有品牌中都重复出现.

The desired output would be the following, however when I use gather the volume figure is the total number of rows and is repeated across all brands.

variable_2       topic                volume
   <chr>            <chr>              <int>
 1 brands_co     topic_su               10
 2 brands_ne     topic_su               17
 3 brands_seg    topic_su               10 
 4 brands_sen    topic_su               18
 5 brands_st     topic_su                0
 6 brands_ta     topic_su                1
 7 brands_tc     topic_su                0
 8 brands_co     topic_so               22
 9 brands_ne     topic_so               17
10 brands_seg    topic_so               11 
11 brands_sen    topic_so               23
12 brands_st     topic_so                0
13 brands_ta     topic_so                0
14 brands_tc     topic_so                0

推荐答案

假设您的数据集是 dt ,则可以执行以下操作:

Assuming that your dataset is dt you can do something like this:

library(dplyr)

expand.grid(brand = names(dt)[grepl("brands", names(dt))],         
            topic = names(dt)[grepl("topic", names(dt))],
            stringsAsFactors = F) %>%
  rowwise() %>%
  mutate(volume = sum(dt[brand] == "TRUE" & dt[topic] == "TRUE")) %>%
  ungroup()

# # A tibble: 42 x 3
#   brand      topic    volume
#   <chr>      <chr>     <int>
# 1 brands_ne  topic_su     17
# 2 brands_st  topic_su      0
# 3 brands_co  topic_su     10
# 4 brands_seg topic_su     10
# 5 brands_sen topic_su     18
# 6 brands_ta  topic_su      1
# 7 brands_tc  topic_su      0
# 8 brands_ne  topic_so     17
# 9 brands_st  topic_so      0
#10 brands_co  topic_so     22
# # ... with 32 more rows

该过程执行以下操作:

您将从原始数据集中获得与品牌"和主题"匹配的所有列名,并在它们之间创建所有可能的组合.

You get all column names (from original dataset) that match "brands" and "topic" and create all possible combinations between them.

对于每种组合,您都将获得原始数据集的相应列,并计算它们都为TRUE的次数.

For each combination, you get the corresponding columns of your original dataset and count how many times they are both TRUE.

另一种选择是使用向量化函数代替 rowwise ,这可能会更快:

An alternative could be to use a vectorised function instead of rowwise, which might be faster:

# vectorised function
GetVolume = function(x,y) sum(dt[x] == "TRUE" & dt[y] == "TRUE")
GetVolume = Vectorize(GetVolume)

expand.grid(brand = names(dt)[grepl("brands", names(dt))],         
            topic = names(dt)[grepl("topic", names(dt))],
            stringsAsFactors = F) %>%
  mutate(volume = GetVolume(brand, topic)) 

这篇关于用dplyr总结逻辑数据框的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆