Tools for text simplification (Java)

Question

What is the best tool for text simplification in Java?

Here is an example of text simplification:

John, who was the CEO of a company, played golf.
                       ↓
John played golf. John was the CEO of a company.


Recommended answer

I see your problem as the task of converting a complex or compound sentence into simple sentences. According to the literature on sentence types, a simple sentence is built from one independent clause, while a compound or complex sentence is built from at least two clauses. A clause must also have a subject and a verb.

So your task is to split the sentence into the clauses that form it.

Dependency parsing from Stanford CoreNLP is a well-suited tool for splitting compound and complex sentences into simple sentences. You can try the online demo.

For your sample sentence, the parse result in Stanford typed dependency (SD) notation is shown below:


nsubj(CEO-6, John-1)
nsubj(played-11, John-1)
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
det(company-9, a-8)
prep_of(CEO-6, company-9)
root(ROOT-0, played-11)
dobj(played-11, golf-12)

A clause can be identified from any relation (in SD) whose category is subject, e.g. nsubj or nsubjpass. See the Stanford typed dependencies manual.

A basic clause can be extracted by taking the head of such a relation as the verb part and the dependent as the subject part. From the SD above, there are two basic clauses:


  • John CEO

  • John played
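The identification step above can be sketched in plain Java without any NLP library, assuming the dependencies are already available as text lines in the notation shown earlier. The class name BasicClauses, the method basicClauses, and the regular expression are illustrative choices made for this sketch, not part of CoreNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a CoreNLP class.
class BasicClauses {
    // Matches lines such as "nsubj(CEO-6, John-1)":
    // relation(headWord-headIndex, dependentWord-dependentIndex)
    private static final Pattern DEP =
            Pattern.compile("(\\w+)\\((\\S+)-(\\d+),\\s*(\\S+)-(\\d+)\\)");

    /** Returns "subject head" for every subject relation (nsubj, nsubjpass). */
    static List<String> basicClauses(List<String> sdLines) {
        List<String> clauses = new ArrayList<>();
        for (String line : sdLines) {
            Matcher m = DEP.matcher(line.trim());
            if (!m.matches()) {
                continue; // skip anything that is not a dependency line
            }
            String relation = m.group(1);
            if (relation.equals("nsubj") || relation.equals("nsubjpass")) {
                // group(4) is the dependent (subject), group(2) is the head
                clauses.add(m.group(4) + " " + m.group(2));
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<String> sd = Arrays.asList(
                "nsubj(CEO-6, John-1)",
                "nsubj(played-11, John-1)",
                "cop(CEO-6, was-4)",
                "dobj(played-11, golf-12)");
        for (String clause : basicClauses(sd)) {
            System.out.println(clause);
        }
    }
}
```

Note that for a copular clause the head of nsubj is the predicate noun (CEO) rather than the verb, which is why the first basic clause reads "John CEO".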

After you get the basic clauses, you can add the other parts that make each clause a complete and meaningful sentence. To do so, please consult the Stanford typed dependencies manual.

By the way, your question might be related to "Finding meaningful sub-sentences from a sentence".

Once you have a subject-verb pair, e.g. nsubj(CEO-6, John-1), collect all dependencies that are linked to that dependency, except any dependency whose category is subject, then extract the unique words from these dependencies.

Taking nsubj(CEO-6, John-1) as the example, if you start traversing from John-1, you'll get nsubj(played-11, John-1), but you should ignore it since its category is subject.

The next step is to traverse from the CEO-6 part. You'll get:


cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)

From the result above, you have new dependencies to traverse (i.e. find other dependencies that have was-4, the-5, or company-9 as either head or dependent).

Now your dependencies are:


cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)
det(company-9, a-8)

At this step, you've finished traversing all dependencies linked to nsubj(CEO-6, John-1). Next, extract the words from all heads and dependents, then arrange them in ascending order by the number appended to each word. This number indicates the word's position in the original sentence.


John was the CEO a company

Our new sentence is missing one part, i.e. "of". This part is hidden in prep_of(CEO-6, company-9). The Stanford typed dependencies manual describes two kinds of SD, collapsed and non-collapsed. Please read about them to understand why this "of" is hidden and how to get the position of this hidden word.
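One possible way to recover the hidden word is sketched below. It is a heuristic assumed for illustration, not something the manual prescribes: it takes the preposition from the relation name (prep_of gives "of") and looks for that token between the head and the dependent in the original token list. The class HiddenPreposition and its methods are hypothetical names, the token list is written out by hand, and only single-word prepositions are handled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not a CoreNLP class.
class HiddenPreposition {
    /**
     * Returns the 1-based position of the word folded into a collapsed
     * prep_X relation, or -1 if it cannot be located between head and dependent.
     */
    static int findHiddenWord(String relation, int headIdx, int depIdx, List<String> tokens) {
        if (!relation.startsWith("prep_")) {
            return -1;
        }
        String prep = relation.substring("prep_".length());
        int lo = Math.min(headIdx, depIdx);
        int hi = Math.max(headIdx, depIdx);
        for (int pos = lo + 1; pos < hi; pos++) {
            if (tokens.get(pos - 1).equalsIgnoreCase(prep)) {
                return pos;
            }
        }
        return -1;
    }

    /** Joins the tokens at the given 1-based positions in ascending order. */
    static String join(TreeSet<Integer> positions, List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (int pos : positions) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(tokens.get(pos - 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Tokens of the sample sentence; punctuation counts as a token,
        // which is why CEO is word 6 and played is word 11.
        List<String> tokens = Arrays.asList("John", ",", "who", "was", "the", "CEO",
                "of", "a", "company", ",", "played", "golf", ".");
        // Positions already collected for the clause: John was the CEO a company
        TreeSet<Integer> clause = new TreeSet<>(Arrays.asList(1, 4, 5, 6, 8, 9));
        int hidden = findHiddenWord("prep_of", 6, 9, tokens);
        if (hidden > 0) {
            clause.add(hidden);
        }
        System.out.println(join(clause, tokens)); // John was the CEO of a company
    }
}
```

The alternative is to work from the non-collapsed representation, where the preposition appears as its own indexed word and needs no special handling.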

With the same approach, you'll get the second sentence:


John played golf
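The collection and ordering steps can be sketched in plain Java as follows, taking the dependencies as text lines rather than calling CoreNLP. The class ClauseBuilder and its members are hypothetical names. One simplification is assumed here and differs from the description above: the sketch starts at the head of the subject relation and follows links only from head to dependent, skipping subject relations, so that the relative clause is not pulled into the main clause through rcmod(John-1, CEO-6).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a CoreNLP class.
class ClauseBuilder {
    private static final Pattern DEP =
            Pattern.compile("(\\w+)\\((\\S+)-(\\d+),\\s*(\\S+)-(\\d+)\\)");

    /** One typed dependency: rel(head-headIdx, dep-depIdx). */
    static final class Dep {
        final String rel, head, dep;
        final int headIdx, depIdx;

        Dep(String rel, String head, int headIdx, String dep, int depIdx) {
            this.rel = rel;
            this.head = head;
            this.headIdx = headIdx;
            this.dep = dep;
            this.depIdx = depIdx;
        }
    }

    static List<Dep> parse(List<String> lines) {
        List<Dep> deps = new ArrayList<>();
        for (String line : lines) {
            Matcher m = DEP.matcher(line.trim());
            if (m.matches()) {
                deps.add(new Dep(m.group(1), m.group(2), Integer.parseInt(m.group(3)),
                        m.group(4), Integer.parseInt(m.group(5))));
            }
        }
        return deps;
    }

    static boolean isSubject(String rel) {
        return rel.equals("nsubj") || rel.equals("nsubjpass");
    }

    /** Collects the words reachable from the subject relation's head, in sentence order. */
    static String buildClause(Dep subject, List<Dep> all) {
        TreeMap<Integer, String> words = new TreeMap<>(); // keyed by word position
        words.put(subject.depIdx, subject.dep);
        words.put(subject.headIdx, subject.head);
        Deque<Integer> toVisit = new ArrayDeque<>();
        toVisit.push(subject.headIdx);
        while (!toVisit.isEmpty()) {
            int current = toVisit.pop();
            for (Dep d : all) {
                if (d.headIdx != current || isSubject(d.rel)) {
                    continue; // only follow non-subject links out of the current word
                }
                if (words.put(d.depIdx, d.dep) == null) {
                    toVisit.push(d.depIdx); // newly seen word, expand it later
                }
            }
        }
        return String.join(" ", words.values());
    }

    public static void main(String[] args) {
        List<Dep> deps = parse(Arrays.asList(
                "nsubj(CEO-6, John-1)", "nsubj(played-11, John-1)", "cop(CEO-6, was-4)",
                "det(CEO-6, the-5)", "rcmod(John-1, CEO-6)", "det(company-9, a-8)",
                "prep_of(CEO-6, company-9)", "root(ROOT-0, played-11)",
                "dobj(played-11, golf-12)"));
        for (Dep d : deps) {
            if (isSubject(d.rel)) {
                System.out.println(buildClause(d, deps));
            }
        }
    }
}
```

On the sample dependencies this prints "John was the CEO a company" and "John played golf". The first clause still lacks "of" because the collapsed prep_of relation does not carry the preposition as an indexed word.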
