Split Documents into Paragraphs

This article describes an approach to splitting documents into paragraphs; it may be a useful reference for anyone facing the same problem.

Problem Description

I have a large stockpile of PDF documents. I use Apache Tika to convert them to text, and now I'd like to split them into paragraphs. I can't use regular expressions because the text conversion makes the distinction between paragraphs impossible: some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs (using Tika's conversion to HTML instead of text does not help).
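For context, a minimal sketch of the extraction step, assuming the tika-python wrapper package; the helper name pdf_to_text and the pdfs/ folder are placeholders for illustration, not part of the original question:

    # Minimal sketch of the extraction step, assuming the tika-python
    # wrapper (pip install tika) and a local Java runtime for Tika.
    from pathlib import Path
    from tika import parser

    def pdf_to_text(pdf_path):
        """Extract plain text from one PDF via Apache Tika."""
        parsed = parser.from_file(str(pdf_path))
        # 'content' can be None when Tika cannot extract any text.
        return parsed.get("content") or ""

    # Placeholder directory: convert every PDF in pdfs/ to text.
    texts = {p.name: pdf_to_text(p) for p in Path("pdfs").glob("*.pdf")}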

Python's NLTK book has a way of splitting sentences using machine learning, so I thought of trying something similar with paragraphs, but I couldn't find training data for that.
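For reference, the machine-learned sentence splitting the NLTK book describes is the pre-trained Punkt model behind nltk.sent_tokenize; a small example, assuming nltk is installed and the model resource has been downloaded:

    # The machine-learned sentence splitter the NLTK book covers:
    # the pre-trained Punkt model used by nltk.sent_tokenize.
    # Assumes `pip install nltk`; newer NLTK releases may name the
    # resource 'punkt_tab' instead of 'punkt'.
    import nltk

    nltk.download("punkt", quiet=True)

    text = "This is one sentence. Here is another! And a third?"
    print(nltk.sent_tokenize(text))
    # ['This is one sentence.', 'Here is another!', 'And a third?']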

Is there training data for that? Should I try some complex regular expression that might work?

Recommended Answer

You said:

some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs

So I would preprocess all the files to detect which ones use a double newline between paragraphs. The files with double \n need to be stripped of all single newline characters, and all double newlines reduced to single ones.
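A minimal sketch of that preprocessing step, under an assumed detection heuristic (not stated in the answer) that a file using double \n between paragraphs can be recognized by the presence of blank-line separators:

    # Sketch of the suggested normalization. The detection heuristic
    # (presence of "\n\n") is an assumption, not part of the answer.
    import re

    def normalize_newlines(text):
        """Rewrite text so that one newline separates paragraphs."""
        if "\n\n" in text:
            # Split on blank-line paragraph breaks, collapse the single
            # newlines inside each paragraph, then rejoin with one \n.
            paragraphs = re.split(r"\n\s*\n", text)
            paragraphs = [" ".join(p.split()) for p in paragraphs]
            return "\n".join(p for p in paragraphs if p)
        return text

The detection could be made stricter (for example, requiring that most line breaks come in pairs), but the blank-line check covers the case described in the question.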

You can then pass all the files to the next stage, where you detect paragraphs using a single \n character.
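And the second stage from the answer, assuming every file has already been normalized (for example with the normalize_newlines sketch above) so that a single \n marks a paragraph boundary:

    # Second stage: once a single \n reliably separates paragraphs,
    # splitting is a plain string operation.
    def split_paragraphs(normalized_text):
        """Return the non-empty paragraphs of newline-normalized text."""
        return [p.strip() for p in normalized_text.split("\n") if p.strip()]

    sample = "First paragraph.\nSecond paragraph.\nThird paragraph."
    print(split_paragraphs(sample))
    # ['First paragraph.', 'Second paragraph.', 'Third paragraph.']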

That concludes this article on splitting documents into paragraphs. Hopefully the recommended answer is helpful; thanks for supporting IT屋!
