Get unique string from a vector of similar strings
Question
I don't quite know how to phrase the question. I have just started working on a bunch of tweets. I've done some basic cleaning, and now some of the tweets look like this:
x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"omg it is friday and so sunny",
"omg it is friday and so",
"arggh how annoying")
Basically, I want to remove the repetitions by checking whether the first parts of the strings match, and return the longest of each matching group. In this case my result should be:
[1]"stackoverflow is a great site"
[2]"omg it is friday and so sunny"
[3]"arggh how annoying"
because all the others are truncated repetitions of the above. I've tried using the unique() function, but it doesn't return the results I want because it compares the strings over their whole length. Any pointers, please?
I'm using R version 3.1.1 on Mac OS X 10.7.
Thanks!
Recommended answer
This is another option. I've added one string to your sample data.
x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"stackoverflow is an OK site",
"omg it is friday and so sunny",
"omg it is friday and so",
"arggh how annoying")
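# Keep y only if no other string in x begins with y
Filter(function(y) {
  # Truncate every other string to the length of y
  x2 <- sapply(setdiff(x, y), substr, start=1, stop=nchar(y))
  # y is a truncated copy if it reappears among those truncations
  ! duplicated(c(y, x2), fromLast=TRUE)[1]
}, x)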
# [1] "stackoverflow is a great site" "stackoverflow is an OK site"   "omg it is friday and so sunny"
# [4] "arggh how annoying"
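If you want this as a reusable function, here is one way it might be sketched. The name drop_truncated is made up for illustration and is not part of the answer above. It uses only substr() and nchar(), so it should run on R 3.1.1. It is a variant rather than an exact equivalent, because it collapses exact duplicates with unique() first, whereas the Filter() version appears to keep both copies of an exact duplicate, since setdiff() removes every copy of y before the comparison.

```r
# Hypothetical helper (not from the original answer): drop every string that
# is a leading substring of some other, different string in the vector.
drop_truncated <- function(x) {
  x <- unique(x)  # assumption: exact duplicates should collapse to one copy
  is_prefix <- vapply(seq_along(x), function(i) {
    others <- x[-i]
    # TRUE if any other string, cut to the length of x[i], equals x[i]
    any(substr(others, 1, nchar(x[i])) == x[i])
  }, logical(1))
  x[!is_prefix]
}

x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "stackoverflow is an OK site",
       "omg it is friday and so sunny",
       "omg it is friday and so",
       "arggh how annoying")

result <- drop_truncated(x)
print(result)
```

Like the answer above, this compares every string against every other one, so the work grows quadratically with the number of tweets. That should be fine for a few thousand strings but may get slow well beyond that.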