NLTK corpus reader paragraph
Question
I copied and pasted content from a Word document (.docx) into a .txt file and had an NLTK corpus reader read it to count the paragraphs. It returns almost 30 paragraphs as a single paragraph. When I manually entered line breaks in the .txt file, it returned 30 paragraphs.
import nltk
corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "d.txt")
print("Paragraphs =", len(corpusReader.paras()))
- Is it possible for PlaintextCorpusReader to read .docx?
- When copying and pasting from .docx to .txt, how can line breaks be preserved?
- Is there a way in Python to open the .txt file, find ?!, . or ... followed by some blank spaces (4 in number), and insert a line break there automatically?
Edit 1.
I tried the para_block_reader=read_line_block route, but it always reports one paragraph too many.
import nltk
from nltk.corpus.reader.util import read_line_block
# Treat every line as its own block instead of splitting on blank lines.
corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "d.txt", para_block_reader=read_line_block)
print("Paragraphs =", len(corpusReader.paras()))
Recommended answer
The plaintext corpus reader can only read plain-text files. There are Python libraries that can read docx, but that will not address your problem: Word delimits paragraphs with a single line break, whereas plaintext documents traditionally treat a blank line (i.e., two successive newlines) as the paragraph boundary. In other words, your export method does preserve the newlines; there just aren't enough of them.
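To make the blank-line convention concrete, here is a minimal sketch in plain Python. It is not NLTK's own implementation; it only approximates the idea by splitting on blank lines. The function name count_paragraphs and the sample strings are made up for illustration.

```python
import re

def count_paragraphs(text):
    """Count blocks of text separated by one or more blank lines."""
    # A blank line is a newline, optional whitespace, then another newline.
    blocks = re.split(r"\n\s*\n", text.strip())
    return len([b for b in blocks if b.strip()])

# What a Word export typically looks like: one newline per paragraph.
word_style = "First paragraph.\nSecond paragraph.\nThird paragraph."
# What a blank-line-based reader expects: an empty line between paragraphs.
plaintext_style = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."

print(count_paragraphs(word_style))       # prints 1
print(count_paragraphs(plaintext_style))  # prints 3
```

The same three sentences count as one paragraph or three depending only on whether a blank line separates them, which matches the behaviour described in the question.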
So there is an easy way to fix up your texts so that paragraphs are recognized without further effort: once you've written out your plaintext file (which you can do from Word's Save As... menu or by cutting and pasting), post-process it like this (add encoding= arguments as necessary):
import re

with open("my_plaintext.txt") as oldfile:
    content = oldfile.read()
# Double every newline so each paragraph is followed by a blank line.
content = re.sub("\n", "\n\n", content)
with open("my_plaintext_fixed.txt", "w") as newfile:
    newfile.write(content)
You can now read my_plaintext_fixed.txt with the PlaintextCorpusReader, and everything will work as expected.
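If the exported file is less regular, a slightly more defensive variant of the newline substitution may help. The sketch below is only an illustration under stated assumptions: the helper names normalize_paragraph_breaks and break_after_sentence_gap are hypothetical, the first assumes every run of line breaks marks exactly one paragraph boundary, and the second assumes (as in the third bullet of the question) that sentence-ending punctuation followed by four or more spaces marks a boundary.

```python
import re

def normalize_paragraph_breaks(text):
    """Turn every run of line breaks into exactly one blank line."""
    # Normalize Windows (\r\n) and old Mac (\r) line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse each run of newlines (optionally separated by spaces or
    # tabs) into "\n\n", so existing blank lines are not doubled again.
    return re.sub(r"\n(?:[ \t]*\n)*", "\n\n", text.strip()) + "\n"

def break_after_sentence_gap(text, gap=4):
    """Insert a blank line where ., ! or ? is followed by `gap`+ spaces."""
    # Assumption taken from the question: a wide gap means a new paragraph.
    pattern = r"([.!?]+) {%d,}" % gap
    return re.sub(pattern, r"\1\n\n", text)

sample = "First.\nSecond.\n\n\nThird.\n"
print(repr(normalize_paragraph_breaks(sample)))
print(repr(break_after_sentence_gap("Done.    Next one?!    Last...")))
```

Either helper could be applied to the file content before writing the fixed file. Whether the four-space rule is reliable depends on how the text was exported, so it is worth checking the result on a sample first.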