Python: replacing words in a huge text

Problem description

I have a huge text and a list of about 10K words. What is the fastest way in Python to replace all of these words in the text with some other word?

The text is larger than 1 GB, human-written, and "extremely tokenized" (every run of alphanumeric characters and every other single symbol has been split into its own token).

The number of words is over 10K, each word's frequency in the text is 1, and the replacement word is the same in all replacements. Python 2.5-2.7.

Recommended answer

More information about the input format and the search/replace pairings will be necessary to refine this answer, if it is close to begin with, but this would be my initial stab at it (assuming some regularity in the input data; my example code below assumes space-delimited words).

replacements = {
  's1': 'r1',
  's2': 'r2',
  # ... remaining search/replace pairs
}

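# Two context managers in one `with` statement require Python 2.7 or later.
with open('input.txt') as fhi, open('output.txt', 'w') as fho:
  for line in fhi:
    # Note: the last word keeps the line's trailing newline, so it will not
    # match a dictionary key as written.
    words = line.split(' ')

    fho.write(' '.join(map(lambda w: replacements.get(w, w), words)))

    # Or as a list comprehension from the comments (use one form, not both):
    # fho.write(' '.join([replacements.get(w, w) for w in words]))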

The idea here is to stream the data from the input file into an output file. For each word of each line, we check whether it is in our replacements dictionary: the dict.get(key[, default]) method returns the new value if the word is a key, and the word unchanged otherwise. This may not be ideal: it doesn't handle punctuation, would probably have trouble with an input file that isn't broken into lines, and so on, but it may be a way to get started.
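
Since the question says that every word maps to the same replacement, the dictionary lookup could probably be reduced to a set-membership test. Below is a minimal sketch under that assumption, not a benchmarked solution. The names replace_words, targets and the sample lines are hypothetical stand-ins for the real word list and file. It also splits off the line ending before tokenizing, so that the last word on a line can still match.

```python
def replace_words(lines, targets, replacement):
    """Yield each line with every token found in `targets` replaced."""
    for line in lines:
        # Separate the line ending so the last token can still match.
        body = line.rstrip('\r\n')
        ending = line[len(body):]
        tokens = body.split(' ')
        # Set membership is a hash lookup, so the cost per token does not
        # grow with the number of target words.
        replaced = [replacement if t in targets else t for t in tokens]
        yield ' '.join(replaced) + ending


# Hypothetical sample data standing in for the real word list and input file.
targets = frozenset(['foo', 'bar'])
sample_lines = ['a foo , b\n', 'bar baz\n']
result = ''.join(replace_words(sample_lines, targets, 'X'))
```

With real files, an open file object can be passed as lines and each yielded line written straight to the output file, so only one line is held in memory at a time.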
