Add progress bar (verbose) when creating gensim dictionary


Question

I want to create a gensim dictionary from the rows of a dataframe. Each entry of df.preprocessed_text is a list of words.

from gensim.models.phrases import Phrases, Phraser
from gensim.corpora.dictionary import Dictionary


def create_dict(df, bigram=True, min_occ_token=3):

    token_ = df.preprocessed_text.values
    if not bigram:
        return Dictionary(token_)
    
    bigram = Phrases(token_,
                     min_count=3,
                     threshold=1,
                     delimiter=b' ')

    bigram_phraser = Phraser(bigram)

    bigram_token = []
    for sent in token_:
        bigram_token.append(bigram_phraser[sent])
    
    dictionary = Dictionary(bigram_token)
    dictionary.filter_extremes(no_above=0.8, no_below=min_occ_token)
    dictionary.compactify() 
    
    return dictionary

I couldn't find a progress bar option for it, and the callbacks don't seem to work for it either. Since my corpus is huge, I would really appreciate a way to show the progress. Is there one?

Recommended answer

I'd recommend against changing prune_at for monitoring purposes, as it changes the behavior around which bigrams/words are remembered, possibly discarding many more than is strictly required for capping memory usage.

Wrapping tqdm around the iterables used (including the token_ passed to the Phrases constructor and the bigram_token passed to the Dictionary constructor) should work.

Alternatively, enabling logging at the INFO level (or a more verbose one) should display log messages that, while not as pretty or accurate as a progress bar, will give some indication of progress.
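A minimal way to switch that logging on, using only the standard library, could look like this. The format string is just one reasonable choice, and the explicit setLevel call on the "gensim" logger is a precaution I am adding in case the root logger was already configured elsewhere.

```python
import logging

# Route log records to the console with a timestamp. gensim's modules create
# their loggers under the "gensim" name, and those propagate to the root logger.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)

# basicConfig does nothing if the root logger already has handlers, so also
# set the level on the "gensim" logger itself.
logging.getLogger("gensim").setLevel(logging.INFO)
```

Run this before constructing Phrases or Dictionary; gensim's periodic INFO messages should then appear on the console as the corpus is scanned.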

Further, if, as shown in the code, bigram_token is used only to feed the subsequent Dictionary, it need not be created as a full in-memory list. You should be able to just use layered iterators to transform the text and tally the Dictionary item by item. For example:

    # ...
    dictionary = Dictionary(tqdm(bigram_phraser[token_]))
    # ...

(Also, if you're only using the Phraser once, you may not be getting any benefit from creating it at all. It's an optional memory optimization for when you want to keep applying the same phrase-creation operation without the full overhead of the original Phrases survey object. But if the Phrases object is still in scope, and all of it will be discarded immediately after this step, it might be just as fast to use the Phrases object directly without ever taking a detour to create the Phraser, so give that a try.)
