Intersecting texts to find common words
Problem description
I'm trying to find the most efficient way of intersecting a set of texts and finding the words they have in common. Given this scenario:
var t1 = 'My name is Mary-Ann, and I come from Kansas!';
var t2 = 'John, meet Mary, she comes from far away';
var t3 = 'Hi Mary-Ann, come here, nice to meet you!';
The result of the intersection should be:
var result = ["Mary"];
It should be able to ignore punctuation marks like .,!?-
Would a solution with regular expressions be optimal?
Recommended answer
Here's a tested solution:
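function intersect() {
    var set = {}; // plain object used as a set: the keys are the words
    [].forEach.call(arguments, function(a, i) {
        // match returns null when there is no word at all, hence the fallback
        var tokens = a.match(/\w+/g) || [];
        if (!i) {
            // first text: every token becomes a key of the set
            tokens.forEach(function(t) { set[t] = 1; });
        } else {
            // other texts: drop the keys that this text doesn't contain
            for (var k in set) {
                if (tokens.indexOf(k) < 0) delete set[k];
            }
        }
    });
    return Object.keys(set);
}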
This function is variadic: you can call it with any number of texts:
console.log(intersect(t1, t2, t3)) // -> ["Mary"]
console.log(intersect(t1, t2)) // -> ["Mary", "from"]
console.log(intersect()) // -> []
If you need to support non-English languages, this regex won't be enough: \w matches only ASCII letters, digits and the underscore, and Unicode support in JavaScript regexes has traditionally been poor. Either use a regex library or define your regex by explicitly excluding characters, as in a.match(/[^\s\-.,!?]+/g); (this will probably be enough for you).
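As a third option, engines that implement ES2018 accept Unicode property escapes, which may remove the need for a library. The following is only a sketch under that assumption; tokenizeUnicode is a hypothetical helper name, not part of the original answer, and the sample sentence is made up for illustration.

```javascript
// Hypothetical helper: tokenize with Unicode property escapes.
// Assumes an ES2018+ engine (the u flag is required for \p{...}).
function tokenizeUnicode(text) {
  // \p{L} = letters, \p{M} = combining marks, \p{N} = numbers.
  // match() returns null when nothing matches, so fall back to [].
  return text.match(/[\p{L}\p{M}\p{N}_]+/gu) || [];
}

// For comparison, \w stops at the first non-ASCII letter.
console.log('Zoé'.match(/\w+/g));   // -> ["Zo"]
console.log(tokenizeUnicode('Zoé rencontre Mary-Ann à Orléans !'));
// -> ["Zoé", "rencontre", "Mary", "Ann", "à", "Orléans"]
```

The hyphen is not in the character class, so Mary-Ann is still split into Mary and Ann, as the question requires.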
Detailed explanation:
The idea is to fill a set with the tokens of the first text and then remove from the set the tokens missing in the other texts.
- The set is a JavaScript object used as a map. Some purists would have used Object.create(null) to avoid a prototype; I like the simplicity of {}.
- As I want my function to be variadic, I use arguments instead of defining the passed texts as explicit parameters.
- arguments isn't a real array, so to iterate over it you need either a for loop or a trick like [].forEach.call. It works because arguments is "array-like".
- To tokenize, I simply use match to match words; nothing special here (see the note above regarding better support of other languages, though).
- I use !i to check whether it's the first text. In that case, I simply copy the tokens as properties in the set. A value must be used, so I use 1. In the future, ES6 sets will make the intent more obvious here.
- For the following texts, I iterate over the elements of the set (the keys) and remove the ones which are not in the array of tokens (tokens.indexOf(k)<0).
- Finally, I return the elements of the set, because we want an array. The simplest solution is to use Object.keys.
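The remark about ES6 sets can be made concrete. Below is a minimal sketch of the same algorithm written with Set and rest parameters, assuming an ES2015+ engine; intersectWithSet is a hypothetical name chosen for this illustration, not code from the original answer.

```javascript
var t1 = 'My name is Mary-Ann, and I come from Kansas!';
var t2 = 'John, meet Mary, she comes from far away';
var t3 = 'Hi Mary-Ann, come here, nice to meet you!';

// Hypothetical ES2015+ variant of intersect(): same idea, but with a real Set.
function intersectWithSet(...texts) {
  let common = null; // null until the first text has been seen
  for (const text of texts) {
    // match() returns null when the text contains no word
    const tokens = new Set(text.match(/\w+/g) || []);
    if (common === null) {
      common = tokens; // first text: start from all of its tokens
    } else {
      // other texts: remove the words this text doesn't contain
      // (deleting from a Set while iterating over it is allowed)
      for (const word of common) {
        if (!tokens.has(word)) common.delete(word);
      }
    }
  }
  return common === null ? [] : [...common];
}

console.log(intersectWithSet(t1, t2, t3)); // -> ["Mary"]
console.log(intersectWithSet(t1, t2));     // -> ["Mary", "from"]
console.log(intersectWithSet());           // -> []
```

Compared with the object-based version, a Set needs no dummy value such as 1, and tokens.has(word) avoids the linear scan that tokens.indexOf(k) performs for every key.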