Count number of occurrences of a string from a column inside another column, with conditions
Question
I would like to count the number of times the words from the string in column animals.1 occur in column animals.2 within the past five years:
> df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"), animals.2 = c("cat; dog; bird","dog; bird; seal", "bird", ""),year= c("2001","2005","2010","2018"), stringsAsFactors = F)
> df
       animals.1       animals.2 year
1 cat; dog; bird  cat; dog; bird 2001
2      dog; bird dog; bird; seal 2005
3           bird            bird 2010
4            dog                 2018
Desired output
> df
       animals.1       animals.2 year count
1 cat; dog; bird  cat; dog; bird 2001     3
2      dog; bird dog; bird; seal 2005     4
3           bird            bird 2010     1
4            dog                 2018     0
Edit
In row 2, animals.1 = dog; bird. Its appearances in column animals.2 over the previous five years are dog; bird (in 2005) and dog; bird (in 2001). Total count = 4.
In row 3, animals.1 = bird. Its only appearance in column animals.2 over the previous five years is bird (in 2010), because the year 2005 is outside my five-year range. Total count = 1.
I have asked a similar question, only without the year condition, in a previous post. However, the year condition cannot be added to the solutions provided there.
Any help would be appreciated :)
Recommended answer
Your data is not yet in a machine-friendly shape. Machines are much better at reading data that is "long" and at performing grouping and joining operations on it.
When you check x %in% y, you are performing lots of comparisons. String operations also slow you down (splitting a string means first finding where to split it). I would suggest converting all your data to long format and leaving it in long format until you need a wide format for a human to look at. That said, I'm giving you the output in your format because the question asks for it.
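For contrast, the row-by-row approach that the paragraph above argues against might look roughly like the following. This is only a sketch in base R, not the asker's original code; the names `words_1`, `words_2`, and `rowwise_count` are made up for illustration. It splits both columns once, then for every row rescans all rows whose year falls in the window from year - 4 to year.

```r
df <- data.frame(
  animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"),
  animals.2 = c("cat; dog; bird", "dog; bird; seal", "bird", ""),
  year = c("2001", "2005", "2010", "2018"),
  stringsAsFactors = FALSE
)

yr <- as.numeric(df$year)
# Split each cell into its words (hypothetical helper names)
words_1 <- strsplit(df$animals.1, "; ", fixed = TRUE)
words_2 <- strsplit(df$animals.2, "; ", fixed = TRUE)

# For every row, look at all rows whose year lies in [year - 4, year]
# and count the animals.2 words that also occur in this row's animals.1.
rowwise_count <- vapply(seq_len(nrow(df)), function(i) {
  in_window <- yr >= yr[i] - 4 & yr <= yr[i]
  sum(unlist(words_2[in_window]) %in% words_1[[i]])
}, integer(1))

rowwise_count
```

On the four example rows this reproduces the counts from the question, but every row triggers a fresh scan of the whole table, which is the cost the long-format join avoids.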
Most of the code below converts your data into a long format. I've put a few extra steps in the code to show what the data looks like going into the computation.
library(dplyr)
library(tidyr)
library(stringr)
df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"), animals.2 = c("cat; dog; bird","dog; bird; seal", "bird", ""),year= c("2001","2005","2010","2018"), stringsAsFactors = F)
# Convert the animals.1 column to long data
animals_1_long <- df %>%
  mutate(animals_1 = str_split(animals.1, "; ")) %>%  # list-column of words
  select(year, animals_1) %>%
  unnest(cols = animals_1)                            # one animal/year per row
# # A tibble: 7 x 2
#   year  animals_1
#   <chr> <chr>
# 1 2001  cat
# 2 2001  dog
# 3 2001  bird
# 4 2005  dog
# 5 2005  bird
# 6 2010  bird
# 7 2018  dog
# Similarly convert the animals.2 column to long data
animals_2_long <- df %>%
  mutate(animals_2 = str_split(animals.2, "; ")) %>%  # list-column of words
  select(year, animals_2) %>%
  unnest(cols = animals_2)                            # one animal/year per row
# Since we want to match the last 5 years, create a match index running from year - 4 to year.
animals_2_long_extend_5yrs <- animals_2_long %>%
  rename(index_year = year) %>%
  rowwise() %>%
  mutate(match_year = list(as.character((as.numeric(index_year) - 4):as.numeric(index_year)))) %>%
  ungroup() %>%
  unnest(cols = match_year)
# # A tibble: 40 x 3
#    index_year animals_2 match_year
#    <chr>      <chr>     <chr>
#  1 2001       cat       1997
#  2 2001       cat       1998
#  3 2001       cat       1999
#  4 2001       cat       2000
#  5 2001       cat       2001
#  6 2001       dog       1997
#  7 2001       dog       1998
#  8 2001       dog       1999
#  9 2001       dog       2000
# 10 2001       dog       2001
At this point the animals_1 data is in long format with one animal/year per row. The animals_2 data is in long format with one animal/match_year/index_year per row. This lets the second dataset cover all of the last 5 years in a single join, and then be summed up to the year we are originally interested in.
Joining the two long datasets leaves only the rows where year matches match_year and the animal name matches. Counting the rows that remain for each index_year is then trivial.
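To make the join concrete, here is a small worked example for the single index year 2005, written with base R's `merge()` rather than the dplyr join used in this answer. The data frames are re-created by hand so the snippet runs on its own, and `animals_2_2005` and `joined` are illustrative names only.

```r
# Long animals.1 data for 2001 and 2005 (one animal/year per row)
animals_1_long <- data.frame(
  animals_1  = c("cat", "dog", "bird", "dog", "bird"),
  match_year = c("2001", "2001", "2001", "2005", "2005"),
  stringsAsFactors = FALSE
)

# The 2005 animals.2 words, repeated once for each year of the window 2001..2005
animals_2_2005 <- expand.grid(
  animals_2  = c("dog", "bird", "seal"),
  match_year = as.character(2001:2005),
  stringsAsFactors = FALSE
)
animals_2_2005$index_year <- "2005"

# Keep only rows where both the year and the animal name agree
joined <- merge(
  animals_1_long, animals_2_2005,
  by.x = c("animals_1", "match_year"),
  by.y = c("animals_2", "match_year")
)

nrow(joined)  # 4 rows survive: dog and bird, each in 2001 and in 2005
```

The four surviving rows all carry index_year 2005, which is where the count of 4 for that year comes from; cat (2001) and seal find no partner and drop out.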
# Join the long data to the long data with the extended match index.
# Joining on both the year and the animal name keeps only the matching rows.
animal_check <- animals_1_long %>%
  rename(match_year = year) %>%
  inner_join(animals_2_long_extend_5yrs,
             by = c("match_year", "animals_1" = "animals_2")) %>%
  # group by the index year and summarize the count
  group_by(index_year) %>%
  summarise(count = n()) %>%
  rename(year = index_year)
# # A tibble: 3 x 2
#   year  count
#   <chr> <int>
# 1 2001      3
# 2 2005      4
# 3 2010      1
At this point the calculation is done. All that is left is adding the count back to the data with the animals.
# Join the yearly result back to the original data frame
df <- df %>%
  left_join(animal_check, by = "year")
df
#        animals.1       animals.2 year count
# 1 cat; dog; bird  cat; dog; bird 2001     3
# 2      dog; bird dog; bird; seal 2005     4
# 3           bird            bird 2010     1
# 4            dog                 2018    NA
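The question's desired output shows 0 rather than NA for 2018, because a year with no matches simply has no row in the summary to join back. If a zero is preferred, one option is to fill the missing counts after the join. A minimal sketch in base R, with the joined columns re-created by hand so that it runs on its own:

```r
# Stand-in for the joined result above (only the relevant columns)
df <- data.frame(
  year  = c("2001", "2005", "2010", "2018"),
  count = c(3L, 4L, 1L, NA),
  stringsAsFactors = FALSE
)

# Years without any match have no row in the summary, hence NA; treat them as 0
df$count[is.na(df$count)] <- 0L
df$count
```

This leaves the counts for the other years untouched and only rewrites the missing entries.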
Update:
# Data for benchmark:
df = data.frame(animals.1 = c("cat; dog; bird", "dog; bird", "bird", "dog"),
                animals.2 = c("cat; dog; bird", "dog; bird; seal", "bird", ""),
                stringsAsFactors = FALSE)
# Stack 10000 copies of the four rows and give every row its own year
df <- replicate(10000, {df}, simplify = FALSE) %>% do.call(rbind, .)
df$year <- as.character(seq(2000, 2000 + nrow(df) - 1))
# microbenchmark results
     min       lq     mean   median       uq      max neval
5.785196 5.950748 6.642028 6.981055 7.001854 7.491287     5