Split Documents into Paragraphs

This article describes an approach to splitting documents into paragraphs; it may be a useful reference for anyone facing the same problem.

Problem Description

I have a large stockpile of PDF documents. I use Apache Tika to convert them to text, and now I'd like to split them into paragraphs. I can't use regular expressions because the text conversion makes the distinction between paragraphs impossible: some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs (using Tika's conversion to HTML instead of text does not help).
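For context, a minimal sketch of the extraction step, assuming the tika-python wrapper package; the helper name pdf_to_text and the pdfs/ folder are placeholders for illustration, not part of the original question:

    # Minimal sketch of the extraction step, assuming the tika-python
    # wrapper (pip install tika) and a local Java runtime for Tika.
    from pathlib import Path
    from tika import parser

    def pdf_to_text(pdf_path):
        """Extract plain text from one PDF via Apache Tika."""
        parsed = parser.from_file(str(pdf_path))
        # 'content' can be None when Tika cannot extract any text.
        return parsed.get("content") or ""

    # Placeholder directory: convert every PDF in pdfs/ to text.
    texts = {p.name: pdf_to_text(p) for p in Path("pdfs").glob("*.pdf")}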

Python's NLTK book has a way of splitting sentences using machine learning, so I thought of trying something similar with paragraphs, but I couldn't find training data for that.
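For reference, the machine-learned sentence splitting the NLTK book describes is the pre-trained Punkt model behind nltk.sent_tokenize; a small example, assuming nltk is installed and the model resource has been downloaded:

    # The machine-learned sentence splitter the NLTK book covers:
    # the pre-trained Punkt model used by nltk.sent_tokenize.
    # Assumes `pip install nltk`; newer NLTK releases may name the
    # resource 'punkt_tab' instead of 'punkt'.
    import nltk

    nltk.download("punkt", quiet=True)

    text = "This is one sentence. Here is another! And a third?"
    print(nltk.sent_tokenize(text))
    # ['This is one sentence.', 'Here is another!', 'And a third?']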

Is there training data for that? Should I try some complex regular expression that might work?

Recommended Answer

You said:

some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs

So I would preprocess all the files to detect which ones use a double newline between paragraphs. The files with double \n need to be stripped of all single newline characters, and all double newlines reduced to single ones.
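A minimal sketch of that preprocessing step, under an assumed detection heuristic (not stated in the answer) that a file using double \n between paragraphs can be recognized by the presence of blank-line separators:

    # Sketch of the suggested normalization. The detection heuristic
    # (presence of "\n\n") is an assumption, not part of the answer.
    import re

    def normalize_newlines(text):
        """Rewrite text so that one newline separates paragraphs."""
        if "\n\n" in text:
            # Split on blank-line paragraph breaks, collapse the single
            # newlines inside each paragraph, then rejoin with one \n.
            paragraphs = re.split(r"\n\s*\n", text)
            paragraphs = [" ".join(p.split()) for p in paragraphs]
            return "\n".join(p for p in paragraphs if p)
        return text

The detection could be made stricter (for example, requiring that most line breaks come in pairs), but the blank-line check covers the case described in the question.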

You can then pass all the files to the next stage, where you detect paragraphs using a single \n character.
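And the second stage from the answer, assuming every file has already been normalized (for example with the normalize_newlines sketch above) so that a single \n marks a paragraph boundary:

    # Second stage: once a single \n reliably separates paragraphs,
    # splitting is a plain string operation.
    def split_paragraphs(normalized_text):
        """Return the non-empty paragraphs of newline-normalized text."""
        return [p.strip() for p in normalized_text.split("\n") if p.strip()]

    sample = "First paragraph.\nSecond paragraph.\nThird paragraph."
    print(split_paragraphs(sample))
    # ['First paragraph.', 'Second paragraph.', 'Third paragraph.']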

That concludes this article on splitting documents into paragraphs. Hopefully the recommended answer is helpful; thanks for supporting IT屋!
