Split Documents into Paragraphs


Problem Description


I have a large stockpile of PDF documents. I use Apache Tika to convert them to text, and now I'd like to split them into paragraphs. I can't use regular expressions because the text conversion makes the distinction between paragraphs ambiguous: some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs (using Tika's conversion to HTML instead of text does not help).


Python's NLTK book has a way of splitting sentences using machine learning, so I thought of trying something similar with paragraphs, but I couldn't find training data for that.


Is there training data for that? Should I try some complex regular expression that might work?

Recommended Answer

You said:


some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs


so I would preprocess all the files to detect which ones use a double newline between paragraphs. The files with a double \n need to be stripped of all single newline characters, with all double newlines reduced to single ones.
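A minimal sketch of that normalization step, assuming the text has already been extracted by Tika (the function name is illustrative, not from the answer):

```python
import re

def normalize_newlines(text):
    """Convert double-newline paragraph style to single-newline style.

    If the text contains blank lines (double \n) between paragraphs,
    join the wrapped lines inside each paragraph with a space and
    separate paragraphs with a single \n, matching the other documents.
    Otherwise the text is returned unchanged.
    """
    if "\n\n" in text:
        paragraphs = re.split(r"\n{2,}", text)
        # collapse line wraps inside each paragraph into spaces
        paragraphs = [" ".join(p.split("\n")) for p in paragraphs]
        return "\n".join(paragraphs)
    return text
```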


You can then pass all the files to the next stage where you detect paragraphs using a single \n character.
