Add progress bar (verbose) when creating gensim dictionary
Question
I want to create a gensim dictionary from the rows of a dataframe. Each entry of df.preprocessed_text is a list of words.
from gensim.models.phrases import Phrases, Phraser
from gensim.corpora.dictionary import Dictionary

def create_dict(df, bigram=True, min_occ_token=3):
    token_ = df.preprocessed_text.values
    if not bigram:
        return Dictionary(token_)
    bigram = Phrases(token_,
                     min_count=3,
                     threshold=1,
                     delimiter=b' ')
    bigram_phraser = Phraser(bigram)
    bigram_token = []
    for sent in token_:
        bigram_token.append(bigram_phraser[sent])
    dictionary = Dictionary(bigram_token)
    dictionary.filter_extremes(no_above=0.8, no_below=min_occ_token)
    dictionary.compactify()
    return dictionary
I couldn't find a progress bar option for it, and the callbacks don't seem to work for it either. Since my corpus is huge, I would really appreciate a way to show the progress. Is there one?
Recommended answer
I'd recommend against changing prune_at for monitoring purposes, as it changes the behavior around which bigrams/words are remembered, possibly discarding many more than is strictly required for capping memory usage.
Wrapping tqdm around the iterables used (including the token_ use in the Phrases constructor and the bigram_token use in the Dictionary constructor) should work.
Alternatively, enabling logging at the INFO level (or a more verbose one) should display log messages that, while not as pretty or accurate as a progress bar, will give some indication of progress.
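A minimal way to turn that logging on uses only the standard library. The format string is just one common choice. The sketch assumes gensim's modules log under names beginning with "gensim", and that Phrases and Dictionary emit periodic INFO messages while scanning a corpus, so the exact wording and frequency of the messages may vary by version.

```python
import logging

# Print timestamped INFO (and more severe) messages to the console.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)

# basicConfig does nothing if logging was already configured elsewhere,
# so also set the level explicitly on the "gensim" logger hierarchy.
logging.getLogger("gensim").setLevel(logging.INFO)
```

Run this once before calling create_dict, so that the messages are emitted while the corpus is being processed.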
Further, if, as shown in the code, bigram_token is only used to feed the next Dictionary, it need not be created as a full in-memory list. You should be able to use layered iterators to transform the text and tally the Dictionary item by item. For example:
# ...
dictionary = Dictionary(tqdm(bigram_phraser[token_]))
# ...
(Also, if you're only using the Phraser once, you may not be getting any benefit from creating it at all. It's an optional memory optimization for when you want to keep applying the same phrase-creation operation without the full overhead of the original Phrases survey object. But if the Phrases is still in scope, and all of it will be discarded immediately after this step, it might be just as fast to use the Phrases object directly without ever taking a detour to create the Phraser, so give that a try.)