Split Documents into Paragraphs


Problem Description


I have a large stockpile of PDF documents. I use Apache Tika to convert them to text, and now I'd like to split them into paragraphs. I can't use regular expressions because the text conversion makes the distinction between paragraphs ambiguous: some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs (using Tika's conversion to HTML instead of text does not help).


Python's NLTK book has a way of splitting sentences using machine learning, so I thought of trying something similar with paragraphs, but I couldn't find training data for that.


Is there training data for that? Should I try some complex regular expression that might work?

Recommended Answer

You said:


some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs


so I would preprocess all the files to detect which ones use a double newline between paragraphs. The files with a double \n need to be stripped of all single newline characters, with all double newlines reduced to single ones.
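A minimal sketch of that normalization step, assuming the text has already been extracted by Tika (the function name is illustrative, not from the answer):

```python
import re

def normalize_newlines(text):
    """Convert double-newline paragraph style to single-newline style.

    If the text contains blank lines (double \n) between paragraphs,
    join the wrapped lines inside each paragraph with a space and
    separate paragraphs with a single \n, matching the other documents.
    Otherwise the text is returned unchanged.
    """
    if "\n\n" in text:
        paragraphs = re.split(r"\n{2,}", text)
        # collapse line wraps inside each paragraph into spaces
        paragraphs = [" ".join(p.split("\n")) for p in paragraphs]
        return "\n".join(paragraphs)
    return text
```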


You can then pass all the files to the next stage where you detect paragraphs using a single \n character.
