NLTK corpus reader paragraph


Problem description

I copied and pasted content from a Word document (.docx) into a .txt file and read it with an NLTK corpus reader to count the paragraphs. The reader returns almost 30 paragraphs as a single paragraph. After I manually entered line breaks in the .txt file, it returned 30 paragraphs.

import nltk
corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "d.txt")
print("Paragraphs =", len(corpusReader.paras()))

  1. Is it possible for PlaintextCorpusReader to read .docx?
  2. When copying and pasting from .docx to .txt, how can the line breaks be preserved?
  3. Is there a way, using Python, to open the .txt file, find ?, !, ., or ... followed by some blank spaces (four of them), and insert a line break there automatically?

Edit 1.

I went down the para_block_reader=read_line_block route, but it always gives one extra paragraph in the count.

import nltk
from nltk.corpus.reader.util import read_line_block
corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "d.txt", para_block_reader=read_line_block)
print("Paragraphs =", len(corpusReader.paras()))

Recommended answer

The plaintext corpus reader can only read plain-text files. There are Python libraries that can read docx, but that will not address your problem, which is that Word delimits paragraphs with a single line break, whereas plain-text documents traditionally treat a blank line, i.e. two successive newlines, as the paragraph boundary. In other words, your export method does preserve the newlines; there just aren't enough of them.
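To make the blank-line convention concrete, here is a minimal standard-library sketch. It only imitates the "blank line separates paragraphs" rule; it is not NLTK's actual block reader, and the helper name count_blank_line_paragraphs is made up for this illustration.

```python
import re


def count_blank_line_paragraphs(text):
    """Count blocks of text separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return len([block for block in blocks if block.strip()])


# Three Word-style paragraphs, each ended by a single newline.
single = "First paragraph.\nSecond paragraph.\nThird paragraph.\n"
# The same text with a blank line between paragraphs.
double = single.replace("\n", "\n\n")

print(count_blank_line_paragraphs(single))  # 1
print(count_blank_line_paragraphs(double))  # 3
```

With single newlines the whole text counts as one block, which matches the behaviour described in the question.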

So there is an easy way to fix up your texts so that paragraphs are recognized without any extra work: once you've written out your plain-text file (which you can do from Word's Save As... menu or by cutting and pasting), post-process it like this (add encoding= arguments as necessary):

import re

with open("my_plaintext.txt") as oldfile:
    content = oldfile.read()

# Turn every single newline into a blank line (two newlines).
content = re.sub("\n", "\n\n", content)

with open("my_plaintext_fixed.txt", "w") as newfile:
    newfile.write(content)

You can now read my_plaintext_fixed.txt with the PlaintextCorpusReader, and everything will work as expected.
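If you need to do this for several files, the same post-processing can be wrapped in a small function. The following is only a sketch: the function name add_blank_lines and its parameters are hypothetical, UTF-8 is an assumed default encoding, and unlike the snippet above it collapses runs of newlines so that files which already contain some blank lines are not spaced out further.

```python
import re


def add_blank_lines(src_path, dst_path, encoding="utf-8"):
    """Write src_path to dst_path with a blank line between paragraphs."""
    with open(src_path, encoding=encoding) as oldfile:
        content = oldfile.read()

    # Replace every run of one or more newlines with exactly two newlines,
    # then end the file with a single trailing newline.
    content = re.sub(r"\n+", "\n\n", content.strip()) + "\n"

    with open(dst_path, "w", encoding=encoding) as newfile:
        newfile.write(content)
    return content
```

For example, add_blank_lines("my_plaintext.txt", "my_plaintext_fixed.txt") would produce the fixed file used above; pass a different encoding if your export is not UTF-8.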
