用python有效提取1-5克 [英] Effective 1-5 grams extraction with python

查看:89
本文介绍了用python有效提取1-5克的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有3,000,000行的庞大文件,每行20-40个字.我必须从语料库中提取1至5克.我的输入文件是标记化的纯文本,例如:

I have a huge files of 3,000,000 lines and each line have 20-40 words. I have to extract 1 to 5 ngrams from the corpus. My input files are tokenized plain text, e.g.:

This is a foo bar sentence .
There is a comma , in this sentence .
Such is an example text .

目前,我正在按以下方式进行操作,但这似乎并不是提取1-5克的有效方法:

Currently, I am doing it as below but this don't seem to be a efficient way to extract the 1-5grams:

#!/usr/bin/env python -*- coding: utf-8 -*-

import io, os
from collections import Counter
import sys; reload(sys); sys.setdefaultencoding('utf-8')

with io.open('train-1.tok.en', 'r', encoding='utf8') as srcfin, \
io.open('train-1.tok.jp', 'r', encoding='utf8') as trgfin:
    # Extract words from file. 
    src_words = ['<s>'] + srcfin.read().replace('\n', ' </s> <s> ').split()
    del src_words[-1] # Removes the final '<s>'
    trg_words = ['<s>'] + trgfin.read().replace('\n', ' </s> <s> ').split()
    del trg_words[-1] # Removes the final '<s>'

    # Unigrams count.
    src_unigrams = Counter(src_words) 
    trg_unigrams = Counter(trg_words) 
    # Sum of unigram counts.
    src_sum_unigrams = sum(src_unigrams.values())
    trg_sum_unigrams = sum(trg_unigrams.values())

    # Bigrams count.
    src_bigrams = Counter(zip(src_words,src_words[1:]))
    trg_bigrams = Counter(zip(trg_words,trg_words[1:]))
    # Sum of bigram counts.
    src_sum_bigrams = sum(src_bigrams.values())
    trg_sum_bigrams = sum(trg_bigrams.values())

    # Trigrams count.
    src_trigrams = Counter(zip(src_words,src_words[1:], src_words[2:]))
    trg_trigrams = Counter(zip(trg_words,trg_words[1:], trg_words[2:]))
    # Sum of trigram counts.
    src_sum_trigrams = sum(src_bigrams.values())
    trg_sum_trigrams = sum(trg_bigrams.values())

还有其他方法可以更有效地做到这一点吗?

如何同时优化提取不同的N个词?

快速/优化python中的N-gram实现,本质上是这样:

zip(*[words[i:] for i in range(n)])

当针对双字母组对此进行硬编码时,n=2:

when hard-coded is this for bigrams, n=2:

zip(src_words,src_words[1:])

这是用于三字母组的n=3:

zip(src_words,src_words[1:],src_words[2:])

推荐答案

如果您只对最常见(常见)的n -gram(我想是这种情况)感兴趣,则可以重用 Apriori算法.在给定s_min的情况下,可以将其视为包含给定n -gram的行数的最小支持,它可以有效地搜索所有此类n -grams.

If you are interested only in the most common (frequent) n-grams (which is your case I suppose), you can reuse the central idea of the Apriori algorithm. Given s_min, a minimal support which can be thought as the number of lines that a given n-gram is contained in, it efficiently searches for all such n-grams.

想法如下:编写一个查询函数,该函数接受一个n -gram并测试它在语料库中包含多少次.准备好此类功能后(可以稍后进行优化),扫描整个主体并获取所有1 -gram(即裸令牌),并选择至少包含s_min次的令牌.这为您提供了频繁出现的1 -grams的子集F1.然后通过组合来自F1的所有1 -gram来测试所有可能的2 -gram.再次,选择那些符合s_min标准的内容,您将得到F2.通过组合F2中的所有2 -gram并选择常用的3 -gram,您将得到F3.重复Fn为非空.

The idea is as follows: write a query function which takes an n-gram and tests how many times it is contained in the corpus. After you have such a function prepared (may be optimized as discussed later), scan the whole corpus and get all the 1-grams, i.e. bare tokens, and select those which are contained at least s_min times. This gives you subset F1 of frequent 1-grams. Then test all the possible 2-grams by combining all the 1-grams from F1. Again, select those which hold the s_min criterion and you'll get F2. By combining all the 2-grams from F2 and selecting the frequent 3-grams, you'll get F3. Repeat for as long as Fn is non-empty.

许多优化可以在这里完成.从Fn组合n -gram时,您可以利用以下事实:n -grams xy只能组合形成(n+1) -gram iff x[1:] == y[:-1](可以在如果使用正确的散列,则任何n的恒定时间).此外,如果您有足够的RAM(对于您的主体,很多GB),则可以极大地加快查询功能.对于每个1 -gram,存储包含给定1 -gram的行索引的哈希集.将两个n -gram组合为(n+1) -gram时,请使用两个对应集合的交集,以获得可能包含(n+1) -gram的一组线.

Many optimizations can be done here. When combining n-grams from Fn, you can exploit the fact that n-grams x and y may only be combined to form (n+1)-gram iff x[1:] == y[:-1] (may be checked in constant time for any n if proper hashing is used). Moreover, if you have enough RAM (for your corpus, many GBs), you can extremely speed up the query function. For each 1-gram, store a hash-set of line indices containing the given 1-gram. When combining two n-grams into an (n+1)-gram, use intersection of the two corresponding sets, obtaining a set of lines where the (n+1)-gram may be contained.

时间复杂度随着s_min的减少而增加.这样做的好处是,在算法运行时,不频繁的(因此无趣的)n -gram被完全过滤掉,从而仅节省了频繁计算的时间.

The time complexity grows as s_min decreases. The beauty is that infrequent (and hence uninteresting) n-grams are completely filtered as the algorithm runs, saving computational time for the frequent ones only.

这篇关于用python有效提取1-5克的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆