Count specific characters from column associated with dual categories of other column. Do it iteratively based on frequency bins

Problem description

I have a huge data frame df1; a simplified version has 3 columns, "Words", "Frequency" and "Letters":

Words           Frequency   Letters
flower/tree     0.15        a(0.1)
tree            0.67        a(0.4)
planet          0.85        b(0.4)
tree/planet     0.42        c(0.5)
tree            0.89        a(0.6)
flower          0.21        b(0.4)
flower/planet   0.53        b
planet          0.07        a

Using R (dplyr, apply family functions, etc.) I would like to count the number of times each letter (a, b, c) of the "Letters" column is associated with each single word from the "Words" column (flower, tree, planet), separately for each frequency bin of the "Frequency" column values. There are 4 bins: [0, 0.25], [0.25, 0.5], [0.5, 0.75], [0.75, 1].
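As written, the four bins share their endpoints, so it may help to see how base R's cut assigns boundary values. The following is a small sketch with made-up boundary values (the vector x is not part of df1); by default the intervals are right-closed and exclude the lowest break, while include.lowest = TRUE closes the first interval on the left:

```r
# Hypothetical boundary values, chosen only to illustrate interval edges
x <- c(0, 0.25, 0.5, 1)

# Default: intervals are (0,0.25], (0.25,0.5], ... so exactly 0 falls outside
default_bins <- cut(x, breaks = seq(0, 1, 0.25))

# include.lowest = TRUE turns the first interval into [0,0.25]
closed_bins <- cut(x, breaks = seq(0, 1, 0.25), include.lowest = TRUE)

print(data.frame(x, default_bins, closed_bins))
```

In the sample data no Frequency is exactly 0, so both variants would bin those rows the same way; the difference only matters if the full data contains zeros.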

I expect an output dataframe df2 that looks something like this:

Bin       Word    Letters    count_letters
0-0.25    flower  a          1
0-0.25    flower  b          1
0-0.25    tree    a          1
0-0.25    planet  a          1
0.25-0.5  tree    c          1
0.25-0.5  planet  c          1
0.5-0.75  flower  b          1
0.5-0.75  tree    a          1
0.5-0.75  planet  b          1
0.75-1    tree    a          1
0.75-1    planet  b          1

Recommended answer

You can use cut to bin Frequency, substr to clean Letters, and tidyr::separate_rows to unnest Words. Aggregate with dplyr::count, and you're set:

library(tidyverse)

df1 %>% 
    separate_rows(Words, sep = "/") %>%    # one row per word
    count(Frequency = cut(Frequency, breaks = seq(0, 1, .25)),    # four right-closed bins
          Words, 
          Letters = substr(Letters, 1, 1))    # use regex if more than one letter

## Source: local data frame [11 x 4]
## Groups: Frequency, Words [?]
## 
##     Frequency  Words Letters     n
##        <fctr>  <chr>   <chr> <int>
## 1    (0,0.25] flower       a     1
## 2    (0,0.25] flower       b     1
## 3    (0,0.25] planet       a     1
## 4    (0,0.25]   tree       a     1
## 5  (0.25,0.5] planet       c     1
## 6  (0.25,0.5]   tree       c     1
## 7  (0.5,0.75] flower       b     1
## 8  (0.5,0.75] planet       b     1
## 9  (0.5,0.75]   tree       a     1
## 10   (0.75,1] planet       b     1
## 11   (0.75,1]   tree       a     1
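To get closer to the df2 layout requested in the question, the same idea can be sketched end to end as below. This is an illustrative variant, not the answer's exact code: the custom bin labels, include.lowest = TRUE, the regular expression that drops the parenthesised suffix, and the output column names are assumptions taken from the desired output in the question.

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question
df1 <- data.frame(
  Words = c("flower/tree", "tree", "planet", "tree/planet",
            "tree", "flower", "flower/planet", "planet"),
  Frequency = c(0.15, 0.67, 0.85, 0.42, 0.89, 0.21, 0.53, 0.07),
  Letters = c("a(0.1)", "a(0.4)", "b(0.4)", "c(0.5)",
              "a(0.6)", "b(0.4)", "b", "a"),
  stringsAsFactors = FALSE
)

df2 <- df1 %>%
  separate_rows(Words, sep = "/") %>%           # one row per single word
  mutate(
    # Assumed labels matching the question's "Bin" column
    Bin = cut(Frequency, breaks = seq(0, 1, 0.25),
              labels = c("0-0.25", "0.25-0.5", "0.5-0.75", "0.75-1"),
              include.lowest = TRUE),
    # Drop everything from the first "(" onward, e.g. "a(0.1)" -> "a"
    Letters = sub("\\(.*$", "", Letters)
  ) %>%
  count(Bin, Word = Words, Letters, name = "count_letters")

print(df2)
```

On the sample data this yields eleven rows, each with a count of 1, the same combinations as the output above. Using sub instead of substr keeps letter codes longer than one character intact.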
