How to parse text into sentences


Question

I'm trying to break up a paragraph into sentences. Here is my code so far:

public class StringSplit {
    public static void main(String[] args) {
        String testString = "The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31. Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1. That could affect economic growth and even holiday sales.";
        // Splits at every '.', '!' or '?', including the periods in "W.", "Dec." and "Jan."
        String[] sentences = testString.split("[\\.\\!\\?]");
        for (int i = 0; i < sentences.length; i++) {
            System.out.println(i);
            System.out.println(sentences[i]);
        }
    }
}

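I found two problems:

  1. The code splits at every period (.), even when the period does not end a sentence. How do I prevent this?

  2. Each split sentence starts with a space. How do I remove the redundant space?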


Recommended answer

The problem you describe is an NLP (Natural Language Processing) problem. Writing a crude rule engine is fine, but it might not scale up to support full English text.
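To make the "crude rule engine" idea concrete, here is a minimal sketch in Java. It is an illustration under stated assumptions, not a robust solution: the class name CrudeSentenceSplitter and its tiny abbreviation list are hypothetical, and the only rules are "a token ending in '.', '!' or '?' ends a sentence unless it is a listed abbreviation or a single-letter initial such as W.". Splitting on whitespace first also removes the leading space on each sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class CrudeSentenceSplitter {
    // Hypothetical, deliberately tiny abbreviation list; real text needs far more entries.
    private static final Set<String> ABBREVIATIONS = Set.of(
            "Jan.", "Feb.", "Mar.", "Apr.", "Jun.", "Jul.", "Aug.", "Sep.",
            "Sept.", "Oct.", "Nov.", "Dec.", "Mr.", "Mrs.", "Ms.", "Dr.");

    /** Splits text into sentences using whitespace tokens and two simple rules. */
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // only happens for empty input
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            if (endsSentence(token)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // trailing text without a terminator
        }
        return sentences;
    }

    private static boolean endsSentence(String token) {
        char last = token.charAt(token.length() - 1);
        if (last == '!' || last == '?') {
            return true;
        }
        if (last != '.') {
            return false;
        }
        // Treat "W." and similar single capital letters as initials, not sentence ends.
        boolean isInitial = token.length() == 2 && Character.isUpperCase(token.charAt(0));
        return !isInitial && !ABBREVIATIONS.contains(token);
    }

    public static void main(String[] args) {
        String text = "President George W. Bush spoke on Dec. 31. Rates rise on Jan. 1. That could affect sales.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

The limits show up quickly: a sentence that genuinely ends in "Dec." or in an initial is never split, and any abbreviation missing from the list causes a false split. That is the scaling problem the answer refers to.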

For deeper insight and a Java library, see http://nlp.stanford.edu/software/lex-parser.shtml and http://nlp.stanford.edu:8080/parser/index.jsp , and the similar question for Ruby, "How do you parse a paragraph of text into sentences? (preferably in Ruby)".

For example, the text:


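The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31. Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1. That could affect economic growth and even holiday sales.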


Tagged output:

The/DT outcome/NN of/IN the/DT negotiations/NNS is/VBZ vital/JJ ,/, because/IN the/DT current/JJ tax/NN levels/NNS signed/VBN into/IN law/NN by/IN President/NNP George/NNP W./NNP Bush/NNP expire/VBP on/RP Dec./NNP 31/CD ./. Unless/IN Congress/NNP acts/VBZ ,/, tax/NN rates/NNS on/IN virtually/RB all/RB Americans/NNPS who/WP pay/VBP income/NN taxes/NNS will/MD rise/VB on/IN Jan./NNP 1/CD ./. That/DT could/MD affect/VB economic/JJ growth/NN and/CC even/RB holiday/NN sales/NNS ./.

Note how it distinguishes the sentence-ending full stop, tagged ./. , from the period in an abbreviation such as Dec., which stays part of the token Dec./NNP ...
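Once text is in the word/TAG form shown above, sentence boundaries can be read straight off the tags: a token whose tag is "." ends a sentence, while "Dec./NNP" does not. The following is a minimal sketch of that step, assuming whitespace-separated word/TAG tokens. The class name TaggedSentenceReader is hypothetical, it does not call the Stanford library itself, and its spacing rule (no space before a token whose tag starts with a non-alphanumeric character) is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedSentenceReader {
    /** Rebuilds sentences from whitespace-separated word/TAG tokens. */
    public static List<String> sentences(String tagged) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash < 0) {
                continue; // skip anything that is not in word/TAG form
            }
            String word = token.substring(0, slash);
            String tag = token.substring(slash + 1);
            // Tags such as "," and "." mark punctuation; attach it to the previous word.
            boolean punctuation = !tag.isEmpty() && !Character.isLetterOrDigit(tag.charAt(0));
            if (current.length() > 0 && !punctuation) {
                current.append(' ');
            }
            current.append(word);
            if (tag.equals(".")) { // sentence-final punctuation tag
                result.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            result.add(current.toString()); // trailing tokens without a final "." tag
        }
        return result;
    }

    public static void main(String[] args) {
        String tagged = "Bush/NNP expire/VBP on/RP Dec./NNP 31/CD ./. Unless/IN Congress/NNP acts/VBZ ,/, "
                + "tax/NN rates/NNS will/MD rise/VB on/IN Jan./NNP 1/CD ./.";
        for (String sentence : sentences(tagged)) {
            System.out.println(sentence);
        }
    }
}
```

The tagger has already done the hard part here by deciding which periods end a sentence. This code only groups its output.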
