Java NLP: Extracting Indices When Tokenizing Text


Question

When tokenizing a string of text, I need to extract the indexes of the tokenized words. For example, given:

"Mary didn't kiss John"

I need something like:

[(Mary, 0), (did, 5), (n't, 8), (kiss, 12), (John, 17)]

Where 0, 5, 8, 12 and 17 correspond to the index (in the original string) where the token began. I cannot rely on just whitespace, since some words become 2 tokens. Further, I cannot just search for the token in the string, since the word likely will appear multiple times.

One giant obstacle is that I'm working with "dirty" text. Here is a real example from the corpus, and its tokenization:

String:

The child some how builds a boaty  c capable of getting scrtoacross the sea, even after findingovercoming many treachrous rous obsittalcles.

Tokens:

The, child, some, how, builds, a, boaty, , , c, , capable, of, getting, scrto, , across, the, sea, ,, even, after, finding, , , , , overcoming, many, treachrous, rous, obsittalcles, .

I'm currently using OpenNLP to tokenize the text, but am ambivalent about which API to utilize for tokenization. It does need to be Java, though, so (unfortunately) Python's NLTK is out of the picture.

Any ideas would be greatly appreciated! Thanks!

Recommended answer

OpenNLP will return the offsets using the method Tokenizer.tokenizePos(String s); see the OpenNLP API for TokenizerME as an example of one of the implemented tokenizers. Each Span returned contains the start and end positions of the token in the original string.
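To make the idea of offset-carrying tokens concrete, here is a minimal, library-free sketch. It is not OpenNLP's tokenizer: it swaps in a plain java.util.regex pattern and reads each token's position from Matcher.start() and Matcher.end(), which plays the same role as the start and end stored in an OpenNLP Span (available there through getStart() and getEnd()). The class name OffsetTokenizer, the nested Token type, and the regular expression are all assumptions made up for this illustration, and the pattern only handles the "n't" contraction from the example sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class OffsetTokenizer {

    /** A token together with its [start, end) offsets in the original string. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + text + ", " + start + ")";
        }
    }

    // Assumed toy pattern, tried left to right at each position:
    //   n't            the contraction suffix as its own token
    //   \w+?(?=n't\b)  the word stem directly in front of that suffix
    //   \w+            any other run of word characters
    //   [^\w\s]        a single punctuation character
    private static final Pattern TOKEN =
            Pattern.compile("n't\\b|\\w+?(?=n't\\b)|\\w+|[^\\w\\s]");

    /** Scans the input once, recording where each token begins and ends. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Offsets come from the matcher, so repeated words keep distinct positions.
            tokens.add(new Token(m.group(), m.start(), m.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [(Mary, 0), (did, 5), (n't, 8), (kiss, 12), (John, 17)]
        System.out.println(tokenize("Mary didn't kiss John"));
    }
}
```

Because the offsets are captured during the scan rather than recovered by searching for each token afterwards, a word that occurs several times still gets the correct position for each occurrence, and input.substring(token.start, token.end) always reproduces the token text. With OpenNLP, the same loop would iterate over the Span array returned by tokenizePos instead of over regex matches.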

Whether you decide to use UIMA is really a separate question, but OpenNLP does provide UIMA annotators for its tokenizers, and those annotators use tokenizePos(). However, if you just want to tokenize a string, UIMA is definitely overkill.
