Stanford CoreNLP - split words ignoring apostrophe

Question

I'm trying to split a sentence into words using Stanford CoreNLP. I'm having a problem with words that contain an apostrophe.

For example, the sentence: I'm 24 years old.

It is split like this: [I] ['m] [24] [years] [old]

Is it possible to split it like this using Stanford CoreNLP? [I'm] [24] [years] [old]

I've tried using tokenize.whitespace, but it doesn't split on other punctuation marks such as '?' and ','.

Recommended answer

Currently, no. The subsequent Stanford CoreNLP processing tools all use Penn Treebank tokenization, which splits contractions into two tokens (treating "I'm" as a reduced form of "I am" and making it the two "words" [I] ['m]). It sounds like you want a different type of tokenization.

While there are some tokenization options, none of them changes this behavior, and subsequent tools (like the POS tagger or parser) would work badly without contractions being split. You could add such an option to the tokenizer by changing (deleting) the treatment of the REDAUX and SREDAUX trailing contexts.

You can also join contractions via post-processing, as @dhg suggests, but you'd want to write the "if" condition a little more carefully so that it doesn't join on quotes.
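The post-processing idea could look roughly like the sketch below. It is not @dhg's code and does not call CoreNLP: it assumes you already have the tokens as plain strings, and both the class name ContractionJoiner and the list of clitic tokens are assumptions made for illustration. A token is glued onto the previous one only if it is one of the listed contraction pieces, so a bare quote token such as ' is left alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Stanford CoreNLP.
final class ContractionJoiner {

    // Assumed set of contraction pieces that Penn Treebank style
    // tokenization splits off; extend it if your data needs more.
    private static final Set<String> CLITICS = new HashSet<>(
            Arrays.asList("'m", "'s", "'re", "'ve", "'ll", "'d", "n't"));

    // Re-attaches contraction pieces to the token before them.
    static List<String> join(List<String> tokens) {
        List<String> joined = new ArrayList<>();
        for (String token : tokens) {
            boolean isClitic = CLITICS.contains(token.toLowerCase(Locale.ROOT));
            if (isClitic && !joined.isEmpty()) {
                // Glue the piece onto the previous token.
                int last = joined.size() - 1;
                joined.set(last, joined.get(last) + token);
            } else {
                // Ordinary tokens, including bare quote marks, pass through.
                joined.add(token);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // prints [I'm, 24, years, old]
        System.out.println(join(Arrays.asList("I", "'m", "24", "years", "old")));
    }
}
```

Matching against an explicit list, instead of testing whether a token merely starts with an apostrophe, is what keeps closing quotes from being joined. Note that tools run after this step would see the rejoined tokens, which the answer warns the POS tagger and parser handle badly, so it is probably best applied only to the final output.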
