Justadistraction: tokenizing English without whitespaces. Murakami SheepMan

Question

How would you go about tokenizing strings in English (or other Western languages) if the whitespace were removed?

The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'.

In the novel, the Sheep Man is translated as saying things like:

"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."

So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary.

What would be your strategy for building a parser for this? Common combinations of letters, syllable counts, conditional grammars, look-ahead/behind regexps etc.?

Specifically, Python-wise, how would you structure a (forgiving) translation flow? I'm not asking for a completed answer, just for how your thought process would go about breaking the problem down.

I ask this in a frivolous manner, but I think it's a question that might get some interesting (nlp/crypto/frequency/social) answers. Thanks!

Recommended answer

I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times). I'd go letter by letter matching whole words. It works well, but there are numerous ambiguities. (asshit can be ass hit or as shit). To resolve those ambiguities would require much more sophisticated grammar analysis.
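
As a rough illustration of the dictionary approach described above, here is a minimal sketch, not the answerer's actual code. It keeps a word list in a hashed set for O(1) membership tests, grows a candidate one letter at a time, and returns every full segmentation so the ambiguities are visible. The names WORDS and segmentations and the tiny vocabulary are assumptions made for this example; a real run would load a full English word list, and punctuation and apostrophes would need separate handling.

```python
from functools import lru_cache

# Hypothetical toy vocabulary; a real run would load a full word list.
WORDS = frozenset({
    "like", "we", "said", "you", "got", "gotta", "work", "to", "too",
    "no", "now", "here", "where", "nowhere",
})


def segmentations(text, words=WORDS):
    """Return every way to split `text` into words drawn from `words`."""
    max_len = max(map(len, words))  # no candidate can be longer than this

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return [[]]  # exactly one way to segment the empty remainder
        results = []
        # Grow the candidate letter by letter and test it against the set.
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            candidate = text[start:end]
            if candidate in words:
                # Keep this word only if the rest can also be segmented.
                for rest in split_from(end):
                    results.append([candidate] + rest)
        return results

    return split_from(0)


print(segmentations("likewesaid"))
print(segmentations("nowhere"))
```

With this vocabulary, "likewesaid" has a single reading, while "nowhere" comes back as three (no where, now here, nowhere), which is the same kind of ambiguity the answer mentions. Choosing among such readings is the part that needs something beyond dictionary lookup.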
