Stanford CoreNLP - split words ignoring apostrophe


Question

I'm trying to split a sentence into words using Stanford CoreNLP, but I'm having a problem with words that contain an apostrophe.

For example, the sentence: I'm 24 years old.

Splits like this: [I] ['m] [24] [years] [old]

Is it possible to split it like this using Stanford CoreNLP? [I'm] [24] [years] [old]

I've tried using tokenize.whitespace, but it doesn't split on other punctuation marks such as '?' and ','.

Recommended answer

Currently, no. The subsequent Stanford CoreNLP processing tools all use Penn Treebank tokenization, which splits contractions into two tokens: it regards "I'm" as a reduced form of "I am" and makes it the two "words" [I] ['m]. It sounds like you want a different type of tokenization.

While there are some tokenization options, there isn't one that changes this, and subsequent tools (like the POS tagger or parser) would work badly without contractions being split. You could add such an option to the tokenizer by changing (deleting) the treatment of the REDAUX and SREDAUX trailing contexts.

You can also join contractions via post-processing, as @dhg suggests, but you'd want to write the "if" condition a little more carefully so that it doesn't join on quotes.
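
A minimal sketch of that post-processing idea, not the code from @dhg's answer: it assumes you have already extracted the token strings from CoreNLP into a plain `List<String>`, and that contractions appear as the ASCII-apostrophe tokens listed in `CLITICS`. The class name `ContractionJoiner`, the method `join`, and the clitic list are all hypothetical and chosen for illustration. A token is glued to its predecessor only if it is a known clitic and the predecessor ends in a letter or digit, so standalone quote marks are left alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: re-joins contractions that Penn Treebank-style
// tokenization has split off, working on plain token strings.
class ContractionJoiner {
    // Assumed set of clitic tokens; extend it if your data needs more.
    private static final Set<String> CLITICS =
            Set.of("'m", "'s", "'re", "'ve", "'d", "'ll", "n't");

    static List<String> join(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String tok : tokens) {
            boolean isClitic = CLITICS.contains(tok.toLowerCase(Locale.ROOT));
            if (isClitic && !out.isEmpty() && endsWithLetterOrDigit(out.get(out.size() - 1))) {
                // Attach the clitic to the previous token, e.g. "I" + "'m".
                int last = out.size() - 1;
                out.set(last, out.get(last) + tok);
            } else {
                // Anything else, including a bare quote token, stays separate.
                out.add(tok);
            }
        }
        return out;
    }

    private static boolean endsWithLetterOrDigit(String s) {
        return !s.isEmpty() && Character.isLetterOrDigit(s.charAt(s.length() - 1));
    }
}
```

For example, `ContractionJoiner.join(List.of("I", "'m", "24", "years", "old"))` returns `[I'm, 24, years, old]`, while a lone `'` token is passed through unchanged. Note that this sketch also joins possessive `'s`, and that tokens joined this way no longer match what the POS tagger or parser expect, so it is best applied only to the final output.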
