Rails gem to break a paragraph into a series of sentences

Problem description

I'm trying to split a paragraph into a series of sentence groups such that each group stays under N characters. If a single sentence is longer than N, it should be split into chunks, using punctuation marks or spaces as separators.

E.g., if N = 50, then the following string

"Lorem ipsum, consectetur elit. Donec ut ligula. Sed acumsan posuere tristique. Sed et tristique sem. Aenean sollicitudin, sapien sodales elementum blandit. Fusce urna libero blandit eu aliquet ac rutrum vel tortor."

would become

["Lorem ipsum, consectetur elit. Donec ut ligula.", "Sed acumsan posuere tristique.", "Sed et tristique sem.", "Aenean sollicitudin,", "sapien sodales elementum blandit.", "Fusce urna libero blandit eu aliquet ac rutrum vel", "tortor."]

Are there any Rails gems that could help me achieve this? I looked at html_slicer, but I'm not sure it can handle the example above.

Solution

There are two non-trivial tasks in what you are after:

  1. splitting a string into sentences, and
  2. word-wrapping each sentence, with extra care for punctuation.


I think the first one is not easy to implement from scratch, so your best bet is probably a natural language processing library, provided that your "third-party language processing service" doesn't already offer such a feature. I don't know of any "Rails gem" that meets your requirement.

Here is just a toy example of splitting a string into sentences using stanford-core-nlp.

require 'stanford-core-nlp'

text = "Lorem ipsum, consectetur elit. Donec ut ligula. Sed acumsan posuere tristique. Sed et tristique sem. Aenean sollicitudin, sapien sodales elementum blandit. Fusce urna libero blandit eu aliquet ac rutrum vel tortor."

# Load only the tokenizer and the sentence splitter.
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit)
annotation = StanfordCoreNLP::Annotation.new(text)
pipeline.annotate(annotation)

# Map with to_s if you want an array of sentence strings.
sentences = annotation.get(:sentences).to_a.map(&:to_s)
# => ["Lorem ipsum, consectetur elit.", "Donec ut ligula.", "Sed acumsan posuere tristique.", "Sed et tristique sem.", "Aenean sollicitudin, sapien sodales elementum blandit.", "Fusce urna libero blandit eu aliquet ac rutrum vel tortor."]
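
For comparison, here is a minimal plain-Ruby sketch of why sentence splitting is harder than it looks. The helper name naive_sentences is made up for illustration and is not part of any gem. It simply splits after ., ! or ? followed by whitespace, which happens to work on the Lorem ipsum text but breaks on abbreviations.

```ruby
# Hypothetical helper: split after sentence-ending punctuation followed by
# whitespace. This is a naive heuristic, not a real sentence segmenter.
def naive_sentences(text)
  text.split(/(?<=[.!?])\s+/)
end

p naive_sentences("Donec ut ligula. Sed et tristique sem.")
# => ["Donec ut ligula.", "Sed et tristique sem."]

# Abbreviations are wrongly treated as sentence ends:
p naive_sentences("Dr. Smith arrived. He left.")
# => ["Dr.", "Smith arrived.", "He left."]
```

If your input is as clean as the example, something like this may be enough. Otherwise an NLP library is the safer choice.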


The second problem is similar to word-wrapping. If it were exactly a word-wrapping problem, it could be solved easily with an existing implementation such as ActionView::Helpers::TextHelper.word_wrap. However, there is an extra requirement concerning punctuation, and I don't know of any existing implementation that does exactly what you want. You may have to come up with your own solution.
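
To see where plain word-wrapping falls short, here is a small stand-in written in plain Ruby so it runs without Rails. simple_word_wrap is a hypothetical helper that only approximates what word_wrap does. It is not the Rails source, and the real helper may differ in details.

```ruby
# Hypothetical stand-in for a word-wrap helper: break lines at whitespace
# so that each line is at most line_width characters long.
def simple_word_wrap(text, line_width: 80)
  text.gsub(/(.{1,#{line_width}})(\s+|$)/, "\\1\n").rstrip
end

sentence = "Aenean sollicitudin, sapien sodales elementum blandit."
puts simple_word_wrap(sentence, line_width: 50)
# Aenean sollicitudin, sapien sodales elementum
# blandit.
```

The wrap fills the line as far as it can and ignores the comma, whereas the desired output breaks right after "Aenean sollicitudin,".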

My only idea is to first word-wrap each sentence, then split each line at punctuation, and finally join the pieces again subject to the length limit. I'm not sure whether this would work, though.
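
Below is a rough plain-Ruby sketch of one variation on that idea, not a tested solution. It reorders the steps: an over-long sentence is split at clause punctuation first and only wrapped at spaces if a clause is still too long. All method names (pack, split_long_sentence, group_sentences) are made up for illustration. It assumes the sentences have already been split, that "under N" means at most N characters, and that chunks of an over-long sentence are not merged with neighbouring sentences, as in the expected output of the question.

```ruby
# Greedily join pieces with single spaces so that no line exceeds limit.
# A single piece longer than limit is left as-is.
def pack(pieces, limit)
  pieces.each_with_object([]) do |piece, lines|
    if lines.any? && lines.last.length + 1 + piece.length <= limit
      lines[-1] = "#{lines.last} #{piece}"
    else
      lines << piece
    end
  end
end

# Break one over-long sentence: first after , ; or :, then at spaces
# for any clause that is still longer than limit.
def split_long_sentence(sentence, limit)
  clauses = sentence.split(/(?<=[,;:])\s+/)
  pieces = clauses.flat_map do |clause|
    clause.length <= limit ? [clause] : pack(clause.split, limit)
  end
  pack(pieces, limit)
end

# Group whole sentences up to limit characters; over-long sentences are
# chunked on their own and never merged with their neighbours.
def group_sentences(sentences, limit)
  groups = []
  current = nil
  sentences.each do |sentence|
    if sentence.length > limit
      groups << current if current
      current = nil
      groups.concat(split_long_sentence(sentence, limit))
    elsif current && current.length + 1 + sentence.length <= limit
      current = "#{current} #{sentence}"
    else
      groups << current if current
      current = sentence
    end
  end
  groups << current if current
  groups
end

sentences = [
  "Lorem ipsum, consectetur elit.", "Donec ut ligula.",
  "Sed acumsan posuere tristique.", "Sed et tristique sem.",
  "Aenean sollicitudin, sapien sodales elementum blandit.",
  "Fusce urna libero blandit eu aliquet ac rutrum vel tortor."
]
puts group_sentences(sentences, 50)
```

With N = 50 this prints the seven chunks listed in the question. A single word longer than N would still come out over-long, so real input may need an extra fallback.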
