Repeated phrases in a text (Python)


Problem description

I have a problem and I have no idea how to solve it. Please give me some advice.

I have a text. A big, big text. The task is to find all the repeated phrases of length 3 (that is, consisting of three words) in the text.

Solution

You have, it seems to me, two problems.

The first is coming up with an efficient way of normalizing the input. You say you want to find all of the three-word phrases in the input, but what constitutes a phrase? For instance, are "the black dog" and "The black, dog?" the same phrase?

A way of doing this, as marcog suggests, is by using something like re.findall. But this is pretty inefficient: it traverses your entire input and copies the words into a list, and then you have to process that list. If your input text is very long, that's going to be wasteful of both time and space.
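To make the difference concrete, here is a small sketch contrasting the two calls. The sample string and variable names are illustrative assumptions, not part of the original answer; the point is only that re.findall hands back a fully built list, while re.finditer hands back an iterator that produces matches on demand.

```python
import re

sample = "O'er the bright blue sea"  # illustrative input, not from the question

# re.findall materializes every match in a list before returning.
all_words = re.findall(r"\S+", sample)

# re.finditer returns a lazy iterator; each match is produced only when asked for.
lazy_matches = re.finditer(r"\S+", sample)
first_word = next(lazy_matches).group()

print(all_words)
print(first_word)
```

On a short string the difference is negligible; it only starts to matter when the input is large enough that the intermediate list becomes a noticeable cost.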

A better approach would be to treat the input as a stream, and build a generator that pulls off one word at a time. Here's an example, which uses whitespace as the delimiter between words, then strips non-alphabetic characters out of each word and converts it to lower case:

>>> import re
>>> def words(text):
...     pattern = re.compile(r"\S+")  # runs of non-whitespace characters
...     non_alpha = re.compile(r"[^a-z]", re.IGNORECASE)
...     for match in pattern.finditer(text):
...         # drop everything but letters, then lowercase
...         nxt = non_alpha.sub("", match.group()).lower()
...         if nxt:  # skip tokens that contained no letters
...             yield nxt
...
>>> text = "O'er the bright blue sea, for Sir Joseph Porter K.C.B."
>>> list(words(text))
['oer', 'the', 'bright', 'blue', 'sea', 'for', 'sir', 'joseph', 'porter', 'kcb']
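If the text is too large to hold in memory as a single string, the same streaming idea can be applied line by line. The sketch below is an assumption layered on top of the answer: words_from_lines is a hypothetical helper, and io.StringIO stands in for an open file object.

```python
import io
import re

def words_from_lines(lines):
    """Yield normalized words from any iterable of lines (hypothetical helper)."""
    non_alpha = re.compile(r"[^a-z]", re.IGNORECASE)
    for line in lines:
        for token in line.split():  # split on whitespace
            word = non_alpha.sub("", token).lower()
            if word:  # skip tokens that contained no letters
                yield word

# io.StringIO stands in for a real file opened with open(...).
fake_file = io.StringIO("O'er the bright blue sea,\nfor Sir Joseph Porter K.C.B.\n")
result = list(words_from_lines(fake_file))
print(result)
```

Because the generator yields a flat stream of words, a phrase that straddles a line break is still seen as three consecutive words by whatever consumes the stream.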

The second problem is grouping the normalized words into three-word phrases. Again, here is a place where a generator will perform efficiently:

>>> def phrases(words):
...     phrase = []  # sliding window of the most recent words
...     for word in words:
...         phrase.append(word)
...         if len(phrase) > 3:
...             phrase.pop(0)  # drop the oldest word
...         if len(phrase) == 3:
...             yield tuple(phrase)
...
>>> list(phrases(words(text)))
[('oer', 'the', 'bright'), ('the', 'bright', 'blue'), ('bright', 'blue', 'sea'), ('blue', 'sea', 'for'), ('sea', 'for', 'sir'), ('for', 'sir', 'joseph'), ('sir', 'joseph', 'porter'), ('joseph', 'porter', 'kcb')]

There's almost certainly a simpler version of that function possible, but this one's efficient, and it's not hard to understand.
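One candidate for that simpler version, offered as a sketch rather than as the answer's own code, uses collections.deque with maxlen so that the oldest word falls off the window automatically. The n parameter is an added assumption for generality; the question only asks for three-word phrases.

```python
from collections import deque

def phrases(words, n=3):
    """Yield every run of n consecutive words as a tuple."""
    window = deque(maxlen=n)  # appending to a full deque discards the oldest item
    for word in words:
        window.append(word)
        if len(window) == n:
            yield tuple(window)

result = list(phrases(["the", "bright", "blue", "sea"]))
print(result)
```

An input shorter than n words simply yields nothing, which matches the behaviour of the list-based version.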

Significantly, chaining the generators together traverses the input only once, and it doesn't build any large temporary data structures in memory. You can use the result to build a defaultdict keyed by phrase:

>>> import collections
>>> counts = collections.defaultdict(int)
>>> for phrase in phrases(words(text)):
...     counts[phrase] += 1
...

This makes a single pass over text as it counts the phrases. When it's done, find every entry in the dictionary whose value is greater than one.
