Count comma-separated unique values in a string
Question
The first two columns of my data frame form a composite key, and a third column of type character contains comma-separated integers. My objective is to add a column that holds the count of unique integers in each string.
I know the approach of splitting the string into columns with str_split_fixed and then counting the unique values, but because the strings are long, this adds a large number of columns and everything lags. Is there another method?
The actual dataset contains 500k rows and 53 columns.
Sample dataset:
df
c1 c2 c3
aa 11 1,13,4,5,4,7,9
bb 22 2,5,2,4,5,7,11,
cc 33 11,14,3,1,
dd 44 1,1,2,4,5,6,15,
ee 55 4,3,3,1,14,17,
Expected output:
c1 | c2 | c3 | c4
------ | ------ | ------ | -----
aa | 11 | 1,13,4,5,4,7,9 | 6
bb | 22 | 2,5,2,4,5,7,11, | 5
cc | 33 | 11,14,3,1, | 4
dd | 44 | 1,1,2,4,5,6,15, | 6
ee | 55 | 4,3,3,1,7,17,7, | 5
Any help would be appreciated!
Recommended answer
We can use stri_extract_all_regex from the stringi package to extract all the numbers from each string, then loop through the resulting list and take the length of the unique elements:
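library(stringi)
# Pull every run of digits out of each string (one character vector per row),
# then count the distinct values in each vector.
df$c4 <- sapply(stri_extract_all_regex(df$c3, "[0-9]+"),
                function(x) length(unique(x)))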
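For comparison, here is a minimal base-R sketch of the same idea that can be run without installing anything. It is not the answer's stringi method: it swaps in strsplit() for the regex extraction and vapply() for sapply(). The data frame is rebuilt from the sample in the question, and the column name c4 is taken from the expected output; both are assumptions about the real data.

```r
# Sample data from the question (trailing commas kept as posted)
df <- data.frame(
  c1 = c("aa", "bb", "cc", "dd", "ee"),
  c2 = c(11, 22, 33, 44, 55),
  c3 = c("1,13,4,5,4,7,9", "2,5,2,4,5,7,11,", "11,14,3,1,",
         "1,1,2,4,5,6,15,", "4,3,3,1,14,17,"),
  stringsAsFactors = FALSE
)

# Split each string on literal commas; the result is a list with one
# character vector per row, so no extra columns are created
parts <- strsplit(df$c3, ",", fixed = TRUE)

# Count the distinct non-empty tokens in each vector;
# vapply() enforces that every result is a single integer
df$c4 <- vapply(parts, function(x) length(unique(x[nzchar(x)])), integer(1))

print(df$c4)
# [1] 6 5 4 6 5
```

Both versions keep the split values in a list rather than widening the data frame, which is what avoids the column blow-up described in the question. Whether strsplit() or stringi is faster on 500k rows would need to be measured on the actual data.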