Figuring out where to add punctuation in bad user generated content?

Question

Is there a way to use NLP or an existing library to add missing punctuation to bad user generated content?

For example, this string:

today is tuesday i went to work monday closed friday

would become:

Today is Tuesday. I went to work Monday. Closed Friday.

Answer

I've played briefly with this problem (with only partial success).

Your example text is missing only periods; if that's the only punctuation you're interested in restoring, @Rahul's suggestion of looking at sentence boundary disambiguation techniques is probably appropriate. If you're hoping to restore other punctuation as well, you might need something a little different. For example, you might want to transform:

Im still busy but ill call you when I can Feeling any better than yesterday

I'm still busy but I'll call you when I can. Feeling any better than yesterday?

Note that both sentences are relatively grammatical (which might greatly affect the accuracy of your punctuation restoration system).

My recommendation is to train a character-based n-gram model, and use it to score punctuation additions in a Levenshtein distance calculation. LingPipe's spelling-correction tutorial is a good place to start. Their edit-distance calculator is easy to customize to allow only insertions and, in your case, only insertions of the specific punctuation characters you're interested in. I'd estimate that a language model of 8-12 characters would probably be appropriate here; you could go a little larger, but my guess is you're not likely to see huge improvements beyond that range.
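
To make the idea concrete, here is a minimal from-scratch sketch in Python (it does not use the LingPipe API): a character n-gram language model with add-one smoothing scores each candidate period insertion, and a greedy left-to-right pass keeps an insertion only when it improves the score by more than a fixed insertion cost, which stands in for the edit-distance insertion weight. The toy corpus, the n-gram order, and the cost constant are illustrative assumptions, not tuned values.

```python
# Minimal sketch (assumed setup, not the LingPipe API): a character n-gram
# language model scores candidate period insertions at word boundaries.
from collections import defaultdict
import math

N = 6                   # character n-gram order; the answer suggests 8-12 for real data
PAD = "\x02" * (N - 1)  # padding so the first characters also get full n-grams

def train_char_lm(corpus):
    """Count character n-grams and their (N-1)-character contexts."""
    counts, contexts = defaultdict(int), defaultdict(int)
    text = PAD + corpus
    for i in range(len(text) - N + 1):
        gram = text[i:i + N]
        counts[gram] += 1
        contexts[gram[:-1]] += 1
    return counts, contexts

def log_prob(text, counts, contexts, vocab=256):
    """Add-one-smoothed log probability of a string under the model."""
    text = PAD + text
    total = 0.0
    for i in range(len(text) - N + 1):
        gram = text[i:i + N]
        total += math.log((counts[gram] + 1) / (contexts[gram[:-1]] + vocab))
    return total

def restore_periods(raw, counts, contexts, insertion_cost=2.0):
    """Greedy pass: insert '. ' at a word boundary only when the model's
    score improves by more than the insertion cost (capitalization ignored)."""
    words = raw.split()
    out = words[0]
    for word in words[1:]:
        plain = out + " " + word
        dotted = out + ". " + word
        if (log_prob(dotted, counts, contexts)
                > log_prob(plain, counts, contexts) + insertion_cost):
            out = dotted
        else:
            out = plain
    return out + "."  # always close the final sentence

if __name__ == "__main__":
    # Toy "in-domain" corpus; a real model needs far more text.
    corpus = "today is tuesday. i went to work monday. closed friday. " * 50
    counts, contexts = train_char_lm(corpus)
    print(restore_periods("today is tuesday i went to work monday closed friday",
                          counts, contexts))
    # -> today is tuesday. i went to work monday. closed friday.
```

A real system would train on much more text, consider other punctuation characters and capitalization, and search over hypotheses with a beam (or a proper edit-distance lattice, as in LingPipe) rather than greedily, but the scoring idea is the same.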

As always when training any NLP model, your performance will improve if you can train your model on text that matches your target domain fairly closely. If you don't have enough in-domain data, you could combine a large standard corpus (e.g. newswire text) with a smaller in-domain set, and upweight your in-domain data somewhat (just replicating it n times and shuffling randomly with the out-of-domain text often works pretty well).
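
As a concrete illustration of that upweighting trick, here is a small sketch, assuming each corpus is just a list of sentences; the variable names, the example sentences, and the replication factor of 5 are placeholders, not recommendations.

```python
# Sketch of the upweighting described above: replicate the in-domain
# sentences n times, mix them with the out-of-domain corpus, and shuffle.
import random

def build_training_corpus(in_domain, out_of_domain, replicate=5, seed=0):
    """Return a shuffled training set with in-domain text upweighted."""
    combined = list(out_of_domain) + list(in_domain) * replicate
    random.Random(seed).shuffle(combined)
    return combined

# Example: a few punctuated user messages (in-domain) vs. newswire text (out-of-domain).
user_text = ["I'm still busy, but I'll call you when I can.",
             "Today is Tuesday. I went to work Monday."]
newswire = ["The company reported quarterly earnings on Tuesday.",
            "Markets closed slightly higher on Friday."]
training_corpus = build_training_corpus(user_text, newswire)
```

The resulting list can then be joined into one string and fed to whatever language-model trainer you use (e.g. the character n-gram model sketched above).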
