Most efficient way to index words in a document?
Problem description
This came up in another question, but I figured it is best to ask it as a separate question. Given a large list of sentences (on the order of hundreds of thousands):
[
"This is sentence 1 as an example",
"This is sentence 1 as another example",
"This is sentence 2",
"This is sentence 3 as another example ",
"This is sentence 4"
]
What is the best way to code the following function?
def GetSentences(word1, word2, position):
    return ""
where, given two words, word1 and word2, and a position, position, the function should return the list of all sentences satisfying that constraint. For example:
GetSentences("sentence", "another", 3)
should return sentences 1 and 3, identified by their indices in the list. My current approach uses a dictionary like this:
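from collections import defaultdict

# Index[word][word2][offset + 1] -> list of indices of matching sentences
Index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for sentenceIndex, sentence in enumerate(sentences):
    words = sentence.split()
    for index, word in enumerate(words):
        # Pair each word with itself and with every word after it
        for i, word2 in enumerate(words[index:]):
            Index[word][word2][i+1].append(sentenceIndex)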
But this quickly blows everything out of proportion: on a dataset of about 130 MB, my 48 GB of RAM is exhausted in less than 5 minutes. I have a feeling this is a common problem, but I can't find any references on how to solve it efficiently. Any suggestions on how to approach this?
Recommended answer
Use a database to store the values.
- First, add all the sentences to one table (they should have IDs). You may call it, for example, sentences.
- Second, create a table with the words contained in all the sentences (call it, for example, words, and give each word an ID). Save the connections between the records of the sentences table and the records of the words table in a separate table (call it, for example, sentences_words; it should have two columns, preferably word_id and sentence_id).
When searching for sentences containing all the mentioned words, your job will be simplified:
- First, find the records in the words table whose words are exactly the ones you are searching for. The query could look like this:
SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3');
- Second, find the sentence_id values in the sentences_words table that have the required word_id values (corresponding to the words from the words table). The initial query could look like this:
SELECT `sentence_id`, `word_id` FROM `sentences_words`
WHERE `word_id` IN ([here goes list of words' ids]);
which can be simplified to:
SELECT `sentence_id`, `word_id` FROM `sentences_words`
WHERE `word_id` IN (
SELECT `id` FROM `words` WHERE `word` IN ('word1', 'word2', 'word3')
);
- Third, filter the result in Python to return only the sentence_id values that have all the required word_id values.
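The steps above can be sketched with Python's built-in sqlite3 module. This is only an illustration under some assumptions: the answer does not name a DBMS, so SQLite with an in-memory database is my choice, the helper names build_index and sentences_with_all_words are hypothetical, and words are split on whitespace as in the question.

```python
import sqlite3
from collections import defaultdict


def build_index(sentences):
    """Load the sentences into an in-memory SQLite database (assumed DBMS)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sentences (id INTEGER PRIMARY KEY, sentence TEXT);
        CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE sentences_words (word_id INTEGER, sentence_id INTEGER);
    """)
    for sentence_id, sentence in enumerate(sentences):
        conn.execute("INSERT INTO sentences (id, sentence) VALUES (?, ?)",
                     (sentence_id, sentence))
        # One link row per distinct word in the sentence
        for word in set(sentence.split()):
            conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
            word_id = conn.execute("SELECT id FROM words WHERE word = ?",
                                   (word,)).fetchone()[0]
            conn.execute(
                "INSERT INTO sentences_words (word_id, sentence_id) VALUES (?, ?)",
                (word_id, sentence_id))
    conn.commit()
    return conn


def sentences_with_all_words(conn, search_words):
    """Return the ids of sentences that contain every word in search_words."""
    search_words = list(set(search_words))
    if not search_words:
        return []
    placeholders = ", ".join("?" for _ in search_words)
    rows = conn.execute(
        "SELECT sentence_id, word_id FROM sentences_words "
        "WHERE word_id IN (SELECT id FROM words WHERE word IN ("
        + placeholders + "))",
        search_words).fetchall()
    # Filter in Python: collect the word ids found for each sentence
    found = defaultdict(set)
    for sentence_id, word_id in rows:
        found[sentence_id].add(word_id)
    # Keep sentences that matched as many distinct words as were requested
    return sorted(sid for sid, ids in found.items()
                  if len(ids) == len(search_words))
```

With the five example sentences from the question, sentences_with_all_words(conn, ["sentence", "another"]) yields the sentences at indices 1 and 3.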
This is basically a solution based on storing a large amount of data in the form best suited for it: a database.
- If you only ever search for two words, you can do even more (almost everything) on the DBMS side.
- Since you also need the position difference, you should store the position of each word in a third column of the sentences_words table (let's just call it position), and when searching for the appropriate words, you should calculate the difference between the position values associated with the two words.
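The two-word case with a position column might look like the sketch below, which pushes the whole lookup into the database with a self-join. SQLite (via Python's sqlite3 module), the index name idx_word, and the helper names build_positional_index and get_sentences are assumptions made for illustration; positions are zero-based word offsets obtained by splitting on whitespace.

```python
import sqlite3


def build_positional_index(sentences):
    """Store one row per word occurrence, including its position."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE sentences_words (
            word_id INTEGER, sentence_id INTEGER, position INTEGER);
        CREATE INDEX idx_word
            ON sentences_words (word_id, sentence_id, position);
    """)
    for sentence_id, sentence in enumerate(sentences):
        for position, word in enumerate(sentence.split()):
            conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
            # Look up the word id and insert the occurrence in one statement
            conn.execute(
                "INSERT INTO sentences_words (word_id, sentence_id, position) "
                "SELECT id, ?, ? FROM words WHERE word = ?",
                (sentence_id, position, word))
    conn.commit()
    return conn


def get_sentences(conn, word1, word2, position):
    """Ids of sentences where word2 occurs `position` words after word1."""
    rows = conn.execute("""
        SELECT DISTINCT a.sentence_id
        FROM sentences_words AS a
        JOIN sentences_words AS b ON b.sentence_id = a.sentence_id
        JOIN words AS w1 ON w1.id = a.word_id
        JOIN words AS w2 ON w2.id = b.word_id
        WHERE w1.word = ? AND w2.word = ?
          AND b.position - a.position = ?
        ORDER BY a.sentence_id
    """, (word1, word2, position)).fetchall()
    return [row[0] for row in rows]
```

For the example in the question, get_sentences(conn, "sentence", "another", 3) returns the sentences at indices 1 and 3. Unlike the nested dictionary, this layout stores one row per word occurrence rather than one entry per pair of words in a sentence, so storage grows linearly with the size of the text.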