How to make this random text generator more efficient in Python?


Problem description

I'm working on a random text generator, without using Markov chains, and currently it works without too many problems. First, here is my code flow:

  1. Enter a sentence as input. This is called the trigger string and is assigned to a variable.
  2. Get the longest word in the trigger string.
  3. Search the entire Project Gutenberg database for sentences that contain this word, regardless of case.
  4. Return the longest sentence that contains the word mentioned in step 3.
  5. Append the sentences from step 1 and step 4 together.
  6. Assign the sentence from step 4 as the new trigger sentence and repeat the process. Note that I then have to get the longest word in the second sentence, and so on.

Here is my code:

import nltk
from nltk.corpus import gutenberg
from random import choice

triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str
longestLength = 0
longestString = ""
listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of  list format-
listOfWords = gutenberg.words()# all words in gutenberg books -list format-

while triggerSentence:
    #so this is run every time through the loop
    split_str = triggerSentence.split()#split the sentence into words

    #code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)

    #code to get the sentences containing the longest word, then selecting
    #random one of these sentences that are longer than 40 characters
    sets = []
    for sentence in listOfSents:
        if sentence.count(longestString):
            sents= " ".join(sentence)
            if len(sents) > 40:
                sets.append(sents)

    triggerSentence = choice(sets)
    print triggerSentence

My concern is that the loop usually reaches a point where the same sentence is printed over and over again, since it is the longest sentence that contains the longest word. To avoid getting the same sentence repeatedly, I thought of the following:

* If the longest word in the current sentence is the same as it was in the last sentence, simply delete this longest word from the current sentence and look for the next longest word.

I tried some implementations of this but failed to apply the idea above, because the words and sentences from the gutenberg module come as lists and lists of lists. Any suggestions on how to find the second longest word? I can't seem to do this by parsing a simple string input, since the .sents() and .words() functions of NLTK's Gutenberg module yield a list of lists and a list respectively. Thanks in advance.
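One way to approach the "next longest word" idea is to rank the words of the current sentence by length and take the first one that has not already been used as a search word. The sketch below is written for Python 3 and works on a plain list of tokens, which is the shape of each sentence returned by gutenberg.sents(). The helper name `next_longest_word` and the `used` set are illustrative assumptions, not part of NLTK.

```python
def next_longest_word(tokens, used):
    """Return the longest token not in `used` (compared in lowercase), or None."""
    # sorted() is stable, so words of equal length keep their original order.
    for word in sorted(tokens, key=len, reverse=True):
        if word.lower() not in used:
            return word
    return None


used = set()
sentence = "The extraordinary gentleman walked home".split()

first = next_longest_word(sentence, used)    # longest word overall
used.add(first.lower())                      # remember it so it is skipped next time
second = next_longest_word(sentence, used)   # longest word not used yet
print(first, second)
```

In the question's code, `longestLength` and `longestString` are set before the `while` loop and never reset, so the search word only changes when a strictly longer word turns up. Recomputing the word from the current sentence on every pass, as in this sketch, avoids that.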

Recommended answer

Some suggested improvements:

  1. The while loop will run forever; you should probably remove it.
  2. Use max and generator expressions to find the longest word in a memory-efficient manner.
  3. Use a list comprehension to build the list of sentences that are longer than 40 characters and contain longestWord. This should also be moved out of the while loop, as it only needs to happen once.

sents = [" ".join(sent) for sent in listOfSents if longestWord in sent and len(" ".join(sent)) > 40]

If you want to print every sentence that was found in random order, you could shuffle the list you just created:

random.shuffle(sents)  # shuffles the list in place and returns None
for sent in sents: print sent
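Keep in mind that random.shuffle reorders the list in place and returns None, so the shuffle has to be its own statement before the loop. Here is a small Python 3 illustration with made-up sentences. random.sample is shown as an alternative that returns a new list and leaves the original untouched.

```python
import random

sents = ["first sentence", "second sentence", "third sentence"]

result = random.shuffle(sents)   # reorders `sents` in place
print(result)                    # None

# random.sample returns a new list in random order instead.
shuffled_copy = random.sample(sents, len(sents))
for sent in shuffled_copy:
    print(sent)
```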

This is how the code could look with these changes:

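import nltk
from nltk.corpus import gutenberg
from random import shuffle

listOfSents = gutenberg.sents()
triggerSentence = raw_input("Please enter the trigger sentence: ")

# longest word in the trigger sentence
longestWord = max(triggerSentence.split(), key=len)

# join each matching sentence once and keep those longer than 40 characters
longSents = [text for text in (" ".join(sent) for sent in listOfSents
                               if longestWord in sent)
             if len(text) > 40]

# shuffle() reorders the list in place and returns None
shuffle(longSents)
for sent in longSents:
    print sent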
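If the chaining behaviour from the question is still wanted, one possible approach is to scan the corpus once, build a lookup from each lowercased word to the sentences containing it, and then track which words and sentences have already been used so the loop cannot get stuck. This is only a sketch in Python 3, and it swaps in a dictionary index that the answer above does not use. `build_index`, `generate`, `min_chars`, and `max_steps` are hypothetical names, and the tiny corpus stands in for gutenberg.sents() so the example runs without the NLTK data.

```python
import random
from collections import defaultdict


def build_index(sentences, min_chars=40):
    """Map each lowercased word to the joined sentences (longer than min_chars) containing it."""
    index = defaultdict(list)
    for tokens in sentences:
        text = " ".join(tokens)
        if len(text) > min_chars:
            for word in set(w.lower() for w in tokens):
                index[word].append(text)
    return index


def generate(trigger, index, max_steps=5, rng=random):
    """Chain sentences without reusing a search word or a sentence."""
    used_words, output = set(), [trigger]
    current = trigger
    for _ in range(max_steps):
        # Try the words of the current sentence from longest to shortest.
        for word in sorted(current.split(), key=len, reverse=True):
            key = word.lower()
            if key in used_words:
                continue
            matches = [s for s in index.get(key, []) if s not in output]
            if matches:
                used_words.add(key)
                current = rng.choice(matches)
                output.append(current)
                break
        else:
            break  # no unused word leads to a new sentence, so stop
    return output


# Stand-in corpus: a list of token lists, the same shape as gutenberg.sents().
corpus = [
    "the extraordinary gentleman wandered through the quiet village".split(),
    "an extraordinary storm destroyed the village harbour overnight".split(),
    "the harbour master counted every ship before sunrise came".split(),
]
index = build_index(corpus)
for line in generate("what an extraordinary day", index):
    print(line)
```

Lowercasing the keys gives the case-insensitive matching the question asks for, and the `max_steps` limit together with the `else: break` branch means the loop always terminates.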
