Rails gem to break a paragraph into a series of sentences


Question

I'm trying to split a paragraph into a series of sentence groups such that each group is at most N characters long. If a single sentence is longer than N, it should be split into chunks, using punctuation marks or spaces as separators.

For example, if N = 50, the following string

"Lorem ipsum, consectetur elit. Donec ut ligula. Sed acumsan posuere tristique. Sed et tristique sem. Aenean sollicitudin, sapien sodales elementum blandit. Fusce urna libero blandit eu aliquet ac rutrum vel tortor."

would become

["Lorem ipsum, consectetur elit. Donec ut ligula.", "Sed acumsan posuere tristique.", "Sed et tristique sem.", "Aenean sollicitudin,", "sapien sodales elementum blandit.", "Fusce urna libero blandit eu aliquet ac rutrum vel", "tortor."]

Are there any Rails gems that could help me achieve this? I looked at html_slicer, but I'm not sure it can handle the example above.

Recommended answer

There are two non-trivial tasks involved in what you are after:

  1. splitting a string into sentences, and
  2. wrapping each sentence to the length limit, with special attention to punctuation.

I don't think the first task is easy to implement from scratch, so your best bet is probably a natural language processing library, provided that your "third-party language processing service" doesn't already offer such a feature. I don't know of any Rails gem that meets your requirement.

Here is just an example using stanford-core-nlp:

require 'stanford-core-nlp'

text = "Lorem ipsum, consectetur elit. Donec ut ligula. Sed acumsan posuere tristique. Sed et tristique sem. Aenean sollicitudin, sapien sodales elementum blandit. Fusce urna libero blandit eu aliquet ac rutrum vel tortor."

# Load a pipeline with only the tokenizer and the sentence splitter.
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit)
annotation = StanfordCoreNLP::Annotation.new(text)
pipeline.annotate(annotation)

# Map with to_s if you want an array of sentence strings.
sentences = annotation.get(:sentences).to_a.map(&:to_s)
# => ["Lorem ipsum, consectetur elit.", "Donec ut ligula.", "Sed acumsan posuere tristique.", "Sed et tristique sem.", "Aenean sollicitudin, sapien sodales elementum blandit.", "Fusce urna libero blandit eu aliquet ac rutrum vel tortor."]
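Once the sentences are available as an array of strings, grouping them under the limit from the question could look roughly like the sketch below. The method name pack_sentences is hypothetical and not part of any gem. The sketch assumes groups are joined with a single space and that "under N" means at most N characters, which is what the expected output in the question implies. Sentences longer than the limit are passed through unchanged here and still need to be split separately.

```ruby
# Hypothetical helper (not from any gem): greedily pack whole sentences
# into groups of at most `limit` characters, joined by a single space.
# A sentence longer than `limit` is emitted unchanged as its own group.
def pack_sentences(sentences, limit)
  groups = []
  current = nil

  sentences.each do |sentence|
    if current.nil?
      current = sentence
    elsif current.length + 1 + sentence.length <= limit
      # The next sentence still fits, so extend the current group.
      current = "#{current} #{sentence}"
    else
      # It does not fit: close the current group and start a new one.
      groups << current
      current = sentence
    end
  end

  groups << current unless current.nil?
  groups
end

pack_sentences(["Lorem ipsum, consectetur elit.", "Donec ut ligula.", "Sed acumsan posuere tristique."], 50)
# => ["Lorem ipsum, consectetur elit. Donec ut ligula.", "Sed acumsan posuere tristique."]
```

The greedy strategy keeps the original sentence order and never looks ahead, so it does not try to balance group lengths.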

The second problem is similar to word wrapping. If it were exactly a word-wrapping problem, it could easily be solved with an existing implementation such as ActionView::Helpers::TextHelper#word_wrap. However, there is an extra requirement concerning punctuation. I don't know of any existing implementation that achieves exactly your goal, so you may have to come up with your own solution.

My only idea is to first word-wrap each sentence, then split each line at punctuation marks, and finally join the pieces again subject to the length limit. I'm not sure whether this would work, though.
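A variation on this idea is sketched below in plain Ruby, as one possible starting point rather than a tested solution. It reorders the steps: it splits at punctuation first, breaks any clause that is still too long at whitespace, and then greedily rejoins the pieces. The method name split_long_sentence and the set of punctuation marks are assumptions, and a single word longer than the limit is left as it is.

```ruby
# Hypothetical helper: split one over-long sentence into chunks of at
# most `limit` characters, preferring breaks after punctuation.
def split_long_sentence(sentence, limit)
  # Break after a punctuation mark that is followed by whitespace.
  clauses = sentence.split(/(?<=[,;:.!?])\s+/)

  # A clause that is still too long is broken into words.
  # Note: a single word longer than `limit` stays over the limit.
  pieces = clauses.flat_map do |clause|
    clause.length > limit ? clause.split(/\s+/) : [clause]
  end

  # Greedily rejoin neighbouring pieces while the chunk fits the limit.
  pieces.each_with_object([]) do |piece, chunks|
    if chunks.any? && chunks.last.length + 1 + piece.length <= limit
      chunks[-1] = "#{chunks.last} #{piece}"
    else
      chunks << piece
    end
  end
end

split_long_sentence("Aenean sollicitudin, sapien sodales elementum blandit.", 50)
# => ["Aenean sollicitudin,", "sapien sodales elementum blandit."]
split_long_sentence("Fusce urna libero blandit eu aliquet ac rutrum vel tortor.", 50)
# => ["Fusce urna libero blandit eu aliquet ac rutrum vel", "tortor."]
```

For the two long sentences in the question this reproduces the expected chunks. Because words from an over-long clause are rejoined greedily, they can end up in the same chunk as a neighbouring short clause, which may or may not be the behaviour you want.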
