Find out which words in a large list occur in a small string


Question

I have a static "large" list of words, about 300-500 words, called list1.

Given a relatively short string str of about 40 words, what is the fastest way in Ruby to get:

  1. the number of times a word in list1 occurs in str (counting multiple occurrences)
  2. a list of which words in list1 occur one or more times in the string str
  3. the number of words in (2)

"Occurring" in str means either as a whole word in str, or as part of a word in str. So if 'fred' is in list1 and str contains 'fred' and 'freddie', that would be two matches.

Everything is lowercase, so the matching does not have to care about case.

For example:

list1 = "fred sam sandy jack sue bill"
str = "and so sammy went with jack to see fred and freddie"

So str contains sam, jack, and fred (twice).

For part (1) the expression would return 4 (sam + jack + fred + fred),
for part (2) the expression would return "sam jack fred",
and part (3) is 3.

The "Ruby way" to do this eludes me after 4 hours... with explicit iteration it's easy enough (but slow). Any help would be appreciated!

Recommended answer

Here's my shot at it:

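def match_freq(exprs, strings)
  # rs: one Regexp per list word; ss: the words of the input string; f: match counts per Regexp
  rs, ss, f = exprs.split.map { |x| Regexp.new(x) }, strings.split, {}
  # For every list word, count the words of the string that it matches (whole or partial)
  rs.each { |r| ss.each { |s| f[r] = f[r] ? f[r] + 1 : 1 if s =~ r } }
  # Return [total matches, counts per Regexp, number of distinct matching words]
  [f.values.inject(0) { |a, x| a + x }, f, f.size]
end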

list1 = "fred sam sandy jack sue bill"
str = "and so sammy went with jack to see fred and freddie"
x = match_freq(list1, str)
x # => [4, {/sam/=>1, /fred/=>2, /jack/=>1}, 3]

The output of match_freq is an array of your three output items, in the order (1), (2), (3). The algorithm itself is O(n*m), where n is the number of items in list1 and m is the size of the input string; I don't think you can do better than that in big-O terms. There are smaller optimizations that might pay off, though, such as keeping a running counter for the total number of matches instead of computing it afterwards. This was just my quick hack at it.
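As a rough sketch of the running-counter idea mentioned above (not the answer's own code), the variant below keeps the total in a separate variable and uses plain strings as hash keys instead of Regexp objects. The name match_counts is hypothetical. It relies on String#scan with a String pattern, which returns every non-overlapping literal occurrence, so a list word that appears twice inside a single word of str is counted twice.

```ruby
# Hypothetical helper, shown only as an illustration.
# Returns [total matches, list of matching words, number of matching words].
def match_counts(list, str)
  counts = Hash.new(0) # word => number of occurrences in str
  total = 0            # running total, maintained while scanning
  list.split.each do |word|
    n = str.scan(word).size # literal, non-overlapping occurrences of word
    next if n.zero?
    counts[word] = n
    total += n
  end
  [total, counts.keys, counts.size]
end

list1 = "fred sam sandy jack sue bill"
str = "and so sammy went with jack to see fred and freddie"
total, words, distinct = match_counts(list1, str)
total    # => 4
distinct # => 3
```

This is still O(n*m) overall; it only avoids the second pass over the counts and the Regexp-to-string conversion.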

You can extract just the matching words from the output as follows:

matches = x[1].keys.map{|x|x.source}.join(" ") # => "sam fred jack"

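Note that the order in which the words were found is not necessarily preserved: on Ruby 1.8 a Hash has no defined iteration order, and on Ruby 1.9 and later the keys come out in the order they were first inserted (list1 order here), not the order in which they appear in str. If that's important, you'll have to keep a separate list of the order in which they were found.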
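If the order of appearance in str matters, one possible approach (an assumption about what "order" means here, not part of the original answer) is to sort the matching words by the position of their first occurrence. The helper name in_order_of_appearance is hypothetical.

```ruby
# Hypothetical helper: keeps only the words that occur in str and
# orders them by the index of their first occurrence.
def in_order_of_appearance(words, str)
  words.select { |w| str.include?(w) }
       .sort_by { |w| str.index(w) }
end

str = "and so sammy went with jack to see fred and freddie"
in_order_of_appearance(%w[fred sam sandy jack sue bill], str)
# => ["sam", "jack", "fred"]
```

Words that start at the same position (for example 'fred' and 'freddie') tie on the sort key, so their relative order is not guaranteed.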
