Given a list of search terms, how can I tell which ones my string contains?


Problem description

A lot of software will take a search string and find all of the text in your database that contains it (MySQL's WHERE MATCH(string_column) AGAINST('searchterm'), Google, etc.), but is there a good algorithm for going the other way?

Say I have a list of search terms:

Toyota Prius, Toyota Tacoma, Honda Civic, Chevy Nova, Chevy Volt

And I have a string, like:

1962 Chevy Nova convertable

Is there a good algorithm where I can put the list and the string in, and get Chevy Nova out?

If they're all easily tokenized, I could tokenize them and do an inner join, but I'm interested in the case where I can't tell which part of the input string is the "important" part.

Recommended answer

If you tokenize "1962 Chevy Nova convertable" [sic], you'll end up with four tokens that are all important or interesting enough to care about. If you keep track of all of the possible words in your language, you'll have an index for each of those words.

On the other hand, you've got your search terms. For each of them, you've tokenized and indexed the interesting words, so each search term can be thought of as a pair of token indexes.

Then, if you take your input and look for search terms that match, you're asking: which of the search terms contain any of the words of the input?
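As a rough illustration of that idea outside a database, here is a minimal Python sketch of an inverted index from tokens to search terms. The function names (tokenize, build_index, count_matches) are made up for this example, and it assumes a naive lowercase whitespace tokenizer, which the answer does not specify.

```python
from collections import Counter, defaultdict

SEARCH_TERMS = ["Toyota Prius", "Toyota Tacoma", "Honda Civic", "Chevy Nova", "Chevy Volt"]


def tokenize(text):
    # Assumption: tokens are whitespace-separated words, compared case-insensitively.
    return text.lower().split()


def build_index(terms):
    # Map each token to the set of search terms that contain it.
    index = defaultdict(set)
    for term in terms:
        for token in tokenize(term):
            index[token].add(term)
    return index


def count_matches(index, text):
    # For every distinct input token, credit each search term that uses it.
    counts = Counter()
    for token in set(tokenize(text)):
        for term in index.get(token, ()):
            counts[term] += 1
    return counts


index = build_index(SEARCH_TERMS)
matches = count_matches(index, "1962 Chevy Nova convertable")
print(matches.most_common())  # [('Chevy Nova', 2), ('Chevy Volt', 1)]
```

Tokens such as "1962" and "convertable" appear in no search term, so they contribute nothing to the counts.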

Since I'm a database guy at heart, I can imagine creating the token list like so:

CREATE TABLE aa_tokens (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY ,
  word VARCHAR( 40 ) NOT NULL 
);

insert into aa_tokens (word) values
  ('1962'),           -- 1
  ('Chevy'),          -- 2
  ('Civic'),          -- 3
  ('Honda'),          -- 4
  ('Nova'),           -- 5
  ('Prius'),          -- 6
  ('Tacoma'),         -- 7
  ('Toyota'),         -- 8
  ('Volt'),           -- 9
  ('convertable');    -- 10

And a table of searches, so that each can have an id:

CREATE TABLE aa_search (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY ,
  text VARCHAR( 255 ) NOT NULL
);

insert into aa_search (text) values
  ('Toyota Prius'),   -- 1
  ('Toyota Tacoma'),  -- 2
  ('Honda Civic'),    -- 3
  ('Chevy Nova'),     -- 4
  ('Chevy Volt');     -- 5

And then a table combining the searches and tokens:

CREATE TABLE aa_searchToks (
  search INT NOT NULL,
  token INT NOT NULL
);

insert into aa_searchToks (search, token) values
  (1, 8),
  (1, 6),
  (2, 8),
  (2, 7),
  (3, 4),
  (3, 3),
  (4, 2),
  (4, 5),
  (5, 2),
  (5, 9);

Now if we take the input string "1962 Chevy Nova convertable" and turn it into tokens (1, 2, 5, 10), we can write a query that looks at the tokens of the search terms:

select search, count(*) from aa_searchToks
  where token in (1, 2, 5, 10) group by search;

The result is:

+--------+----------+
| search | count(*) |
+--------+----------+
|      4 |        2 |
|      5 |        1 |
+--------+----------+
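The step of turning the input string into token ids is not spelled out above. One way it might be done is sketched below in Python, using the standard sqlite3 module as a stand-in for MySQL. That is an assumption made so the example runs without a server, and it is why the column definitions differ slightly from the CREATE TABLE statements above. The helper names token_ids and matching_searches are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE aa_tokens (id INTEGER PRIMARY KEY AUTOINCREMENT, word TEXT NOT NULL);
CREATE TABLE aa_searchToks (search INTEGER NOT NULL, token INTEGER NOT NULL);
INSERT INTO aa_tokens (word) VALUES
  ('1962'), ('Chevy'), ('Civic'), ('Honda'), ('Nova'),
  ('Prius'), ('Tacoma'), ('Toyota'), ('Volt'), ('convertable');
INSERT INTO aa_searchToks (search, token) VALUES
  (1, 8), (1, 6), (2, 8), (2, 7), (3, 4),
  (3, 3), (4, 2), (4, 5), (5, 2), (5, 9);
""")


def token_ids(conn, text):
    # Look up the id of every whitespace-separated word that is a known token.
    words = text.split()
    if not words:
        return []
    placeholders = ", ".join("?" * len(words))
    rows = conn.execute(
        f"SELECT id FROM aa_tokens WHERE word IN ({placeholders}) ORDER BY id",
        words,
    ).fetchall()
    return [row[0] for row in rows]


def matching_searches(conn, ids):
    # Same query as above, with the token ids passed as bound parameters.
    if not ids:
        return []
    placeholders = ", ".join("?" * len(ids))
    return conn.execute(
        f"SELECT search, COUNT(*) FROM aa_searchToks "
        f"WHERE token IN ({placeholders}) GROUP BY search ORDER BY search",
        ids,
    ).fetchall()


ids = token_ids(conn, "1962 Chevy Nova convertable")
rows = matching_searches(conn, ids)
print(ids)   # [1, 2, 5, 10]
print(rows)  # [(4, 2), (5, 1)]
```

Only the placeholder list is interpolated into the SQL text; the words and ids themselves are bound as parameters.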

Or a slightly different query:

select search, (select text from aa_search s where st.search = s.id) as text, 
  count(*) from aa_searchToks st where token in (1, 2, 5, 10) group by search;

which results in:

+--------+------------+----------+
| search | text       | count(*) |
+--------+------------+----------+
|      4 | Chevy Nova |        2 |
|      5 | Chevy Volt |        1 |
+--------+------------+----------+

We can see that "Chevy Nova" matches two tokens and is the best match, which, of course, it is.
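If the goal is strictly "which search terms does my string contain", rather than the best partial match, one possible refinement is to keep only the terms whose every token appears in the input. The Python sketch below assumes the same naive lowercase whitespace tokenizer as before. The name contained_terms is invented for this example and is not part of the answer above.

```python
SEARCH_TERMS = ["Toyota Prius", "Toyota Tacoma", "Honda Civic", "Chevy Nova", "Chevy Volt"]


def tokenize(text):
    # Assumption: whitespace-separated words, compared case-insensitively.
    return text.lower().split()


def contained_terms(terms, text):
    # Keep a term only if all of its tokens occur somewhere in the input.
    words = set(tokenize(text))
    return [term for term in terms if set(tokenize(term)) <= words]


print(contained_terms(SEARCH_TERMS, "1962 Chevy Nova convertable"))  # ['Chevy Nova']
```

This treats both strings as bags of words, so it ignores word order and adjacency: an input containing "Nova" and "Chevy" far apart would still count as containing "Chevy Nova".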
