NLTK corpus reader paragraph

Views: 115  Published: 2020/5/18 1:22:00  Tags: text-files, nltk

Problem description

I copied and pasted content from a Word document (.docx) into a .txt file and read it with an NLTK corpus reader to count the paragraphs. It returns almost 30 paragraphs as a single paragraph. When I manually entered line breaks in the .txt file, it returned 30 paragraphs.

import nltk

# Read d.txt from the current directory and count its paragraphs.
corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "d.txt")
print("Paragraphs =", len(corpusReader.paras()))

Is it possible for PlaintextCorpusReader to read .docx?
When copying and pasting from .docx to .txt, how can I preserve the line breaks?
Is there a way, using Python, to open the .txt file, find every ?, !, . or ... that is followed by some blank spaces (four of them), and automatically insert a line break there?

Edit 1

I went the para_block_reader=read_line_block route, but it always reports one paragraph too many.

import nltk
from nltk.corpus.reader.util import read_line_block

# Treat every line of d.txt as its own paragraph.
corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "d.txt", para_block_reader=read_line_block)
print("Paragraphs =", len(corpusReader.paras()))

Recommended answer

The plaintext corpus reader can only read plain-text files. There are Python libraries that can read .docx, but using one will not address your problem: Word delimits paragraphs with a single line break, whereas plain-text documents traditionally treat a blank line (that is, two successive newlines) as the paragraph boundary. In other words, your export method does preserve the newlines; there just aren't enough of them.
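The difference between the two conventions can be illustrated without NLTK. The following is a minimal sketch that approximates the blank-line convention with a regular expression; it is not NLTK's own block reader, and the function name and sample strings are made up for illustration.

```python
import re

def count_blank_line_paragraphs(text):
    """Count blocks of text separated by one or more blank lines.

    Hypothetical helper that only approximates the blank-line convention.
    """
    blocks = re.split(r"\n\s*\n", text.strip())
    return len([block for block in blocks if block.strip()])

# Text as pasted from Word: one newline after each paragraph.
single_newlines = "First paragraph.\nSecond paragraph.\nThird paragraph.\n"

# The same text with a blank line between paragraphs.
blank_lines = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.\n"

print(count_blank_line_paragraphs(single_newlines))  # 1
print(count_blank_line_paragraphs(blank_lines))      # 3
```

Under the blank-line convention the pasted text is a single block, which matches the behaviour described in the question.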

So there is an easy way to fix up your texts so that paragraphs are recognized without any extra work: once you've written out your plain-text file (which you can do from Word's Save As... menu or by cutting and pasting), post-process it like this (add encoding= arguments as necessary):

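import re

# Read the single-newline text exported from Word.
with open("my_plaintext.txt") as oldfile:
    content = oldfile.read()

# Double every newline so each paragraph ends with a blank line.
content = re.sub("\n", "\n\n", content)

with open("my_plaintext_fixed.txt", "w") as newfile:
    newfile.write(content)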

You can now read my_plaintext_fixed.txt with the PlaintextCorpusReader, and everything will work as expected.
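The third question (inserting a break after sentence-ending punctuation followed by four spaces) is not needed for the fix above. If a text really does mark paragraphs that way, it could be handled with a regular expression. This is a rough sketch, assuming that four or more spaces after ?, !, . or ... always signal a paragraph boundary; the helper name and sample text are hypothetical.

```python
import re

# Sentence-ending punctuation (captured) followed by four or more spaces.
# The four-space rule is an assumption taken from the question.
WIDE_GAP = re.compile(r"([.?!]+) {4,}")

def break_on_wide_gaps(text):
    """Replace each wide gap after end punctuation with a blank line."""
    return WIDE_GAP.sub(r"\1\n\n", text)

sample = "First paragraph. Still first.    Second one?    Third..."
fixed = break_on_wide_gaps(sample)
print(fixed)
```

A single space after a full stop is left alone, so ordinary sentence boundaries inside a paragraph are not split.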
