extracting n grams from huge text

Question

For example, we have the following text:

"Spark是用于编写快速,分布式程序的框架.Spark 解决了与Hadoop MapReduce相似的问题,但是速度很快 内存方法和干净的功能样式API. ..."

"Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. ..."

I need all possible sections of this text: one word at a time, then two by two, three by three, up to five by five. Like this:

ones:["Spark","is","a","framework","for","writing","fast", 分布式",程序",...]

ones : ['Spark', 'is', 'a', 'framework', 'for', 'writing', 'fast', 'distributed', 'programs', ...]

twos : ['Spark is', 'is a', 'a framework', 'framework for', 'for writing' ...]

threes:["Spark是一个",是一个框架",一个框架", 写作框架",快速写作",...]

threes : ['Spark is a', 'is a framework', 'a framework for', 'framework for writing', 'for writing fast', ...]

...

fives : ['Spark is a framework for', 'is a framework for writing', 'a framework for writing fast', 'framework for writing fast distributed', ...]

Please note that the text to be processed is huge (about 100 GB), so I need the best solution for this process. Maybe it should be processed with multiple threads in parallel.

I don't need the whole list at once; it can be streamed.

Answer

First of all, make sure that you have lines in your file; then with no worries you can read it line by line (discussed here):

with open('my100GBfile.txt') as corpus:
    for line in corpus:  # iterates lazily, one line at a time
        sequence = preprocess(line)
        extract_n_grams(sequence)
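
The question also asks about parallel processing. If lines can be handled independently, one rough sketch (my assumption, not the only way; multiprocessing rather than threads, since CPU-bound tokenizing gains little from threads under CPython's GIL) fans lines out to worker processes and merges the per-line counts in the parent:

from collections import Counter
from multiprocessing import Pool

from nltk.util import ngrams

def count_line(line):
    # tokenize one line (mirroring preprocess() below) and
    # count its 1- to 5-grams locally inside the worker
    tokens = line.split()
    counts = Counter()
    for n in range(1, 6):
        counts.update(ngrams(tokens, n))
    return counts

if __name__ == '__main__':
    totals = Counter()
    with open('my100GBfile.txt') as corpus, Pool() as pool:
        # chunksize batches lines per task to cut inter-process overhead
        for counts in pool.imap_unordered(count_line, corpus, chunksize=1000):
            totals.update(counts)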

Let's assume that your corpus doesn't need any special treatment. I guess you can find a suitable treatment for your text; I only want it to be chunked into the desired tokens:

def preprocess(string):
    # do whatever preprocessing needs to be done
    # e.g. convert to lowercase: string = string.lower()
    # return the sequence of tokens
    return string.split()
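
For example, on the sample sentence this returns the tokens with punctuation still attached, which is exactly the kind of thing the preprocessing step is there to clean up:

>>> preprocess("Spark is a framework for writing fast, distributed programs.")
['Spark', 'is', 'a', 'framework', 'for', 'writing', 'fast,', 'distributed', 'programs.']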

I don't know what you want to do with the n-grams. Let's assume that you want to count them as a language model which fits in your memory (it usually does, but I'm not sure about 4- and 5-grams). The easy way is to use the off-the-shelf nltk library:

from nltk.util import ngrams

lm = {n: dict() for n in range(1, 6)}  # one count table per n-gram order

def extract_n_grams(sequence):
    for n in range(1, 6):
        ngram = ngrams(sequence, n)
        # now you have an n-gram generator and you can do whatever you want
        # yield ngram
        # or count them for your language model:
        for item in ngram:
            lm[n][item] = lm[n].get(item, 0) + 1
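
And since the question says the n-grams can be streamed rather than returned as one whole list, a small generator variant (just a sketch, reusing preprocess() and nltk's lazy ngrams() from above) yields one (n, ngram) pair at a time without materializing anything:

def stream_n_grams(corpus_path):
    # lazily yield (n, ngram) pairs; no list is ever built in memory
    with open(corpus_path) as corpus:
        for line in corpus:
            sequence = preprocess(line)
            for n in range(1, 6):
                for gram in ngrams(sequence, n):
                    yield n, gram

for n, gram in stream_n_grams('my100GBfile.txt'):
    ...  # consume each n-gram as it arrives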
