Reading sentences from a text file and appending into a list with Python 3

Question

I'm having trouble figuring out how I would take a text file of a lengthy document, and append each sentence within that text file to a list. Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.' within a sentence, so I couldn't just cutoff searching through a sentence at a period. I'm assuming this could be fixed by also adding a condition where after the period it should be followed by a space, but I have no idea how to set this up so I get each sentence from the text file put into a list as an element.

The program I'm writing is essentially going to allow for user input of a keyword search (key), and input for a number of sentences to be returned (value) before and after the sentence where the keyword is found. So it's more or less a research assistant, so that the user won't have to read a massive text file to find the information they want.

From what I've learned so far, putting the sentences into a list would be the easiest way to go about this, but I can't figure out the first part of it. If I could figure out this part, the rest should be easy to put together.

So to put it simply, if I have a document of:

Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence.

I need a list of the document contents in the form of:

sentence_list = [Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence]

Answer

That's a pretty hard problem, and it doesn't have an easy answer. You could try to write a regular expression that captures all of the known cases, but complex regular expressions tend to be hard to maintain and debug. There are a number of existing libraries that may help you with this. Most notable is the Natural Language Toolkit (NLTK), which has many tokenizers built in. You can install it with pip, e.g.

pip install nltk
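Depending on your NLTK version, you may also need a one-time download of the Punkt tokenizer data before sent_tokenize will work:

import nltk
nltk.download('punkt')  # fetch the pre-trained Punkt sentence tokenizer models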

And then getting your sentences would be a fairly straightforward (although highly customizable) affair. Here's a simple example using the provided sentence tokenizer:

import nltk

# Read the whole document into one string, then let NLTK split it into sentences.
with open('text.txt', 'r') as in_file:
    text = in_file.read()
    sents = nltk.sent_tokenize(text)

I'm not entirely clear how your sentences are delimited if not by normal punctuation, but running the above code on your text I get:

[ "I'm having trouble figuring out how I would take a text file of a lengthy document, and append each sentence within that text file to a list.",
  "Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.'",
  "within a sentence, so I couldn't just cutoff searching through a sentence at a period.",
  "I'm assuming this could be fixed by also adding a condition where after the period it should be followed by a space, but I have no idea how to set this up so I get each sentence from the text file put into a list as an element.\n\n" ]

But fails on inputs like: ["This is a sentence with.", "a period right in the middle."]

while passing on inputs like: ["This is a sentence wit.h a period right in the middle"]

I don't know if you're going to get much better than that right out of the box, though. From the nltk code:

A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.
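If the pre-trained English model keeps tripping over domain-specific abbreviations, the same Punkt machinery can be trained on a sample of your own documents. A minimal sketch (training_sample.txt is a placeholder for such a sample, and text is the document read in the earlier snippet):

from nltk.tokenize import PunktSentenceTokenizer

# Build an unsupervised Punkt model from a representative text sample,
# then use it to split the actual document into sentences.
with open('training_sample.txt', 'r') as f:
    training_text = f.read()

tokenizer = PunktSentenceTokenizer(training_text)  # learns abbreviations, collocations, etc.
sents = tokenizer.tokenize(text)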

So the nltk solution is actually using machine learning to build a model of a sentence. Much better than a regular expression, but still not perfect. Damn natural languages. >:(
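As for the keyword-and-context part of the question, once the sentences are in a list it's mostly index arithmetic. A rough sketch (the function name and parameters are just illustrative, not part of nltk):

def sentences_around(sentence_list, key, value):
    """Return each matching sentence plus `value` sentences of context on either side."""
    results = []
    for i, sentence in enumerate(sentence_list):
        if key.lower() in sentence.lower():               # case-insensitive keyword match
            start = max(0, i - value)                     # clamp at the start of the list
            end = min(len(sentence_list), i + value + 1)  # clamp at the end of the list
            results.append(sentence_list[start:end])
    return results

# e.g. sentences_around(sents, 'keyword', 2) returns a list of up-to-5-sentence windows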

Hope this helps :)
