Arranging letters in the most pronounceable way?

Question

I have a string with some characters, and I'm looking for the organization of those characters such that it's the most pronounceable possible.

For example, if I have the letters "ascrlyo", there are some arrangements that would be more pronounceable than others. The following may get a "high score":


scaroly
crasoly

Whereas the following may get a low score:


oascrly
yrlcsoa

Is there a simple algorithm I can use? Or better yet, a Python functionality that achieves this?

Thanks!

Recommended answer

Start by solving a simpler problem: is a given word pronounceable?

Supervised machine learning could be effective here. Train a binary classifier on a training set of dictionary words and scrambled words (assume the scrambled words are all unpronounceable). For features, I suggest counting character bigrams and trigrams. My reasoning: unpronounceable trigrams such as 'tns' and 'srh' are rare in dictionary words, even though the individual letters are each common.

The idea is that the trained algorithm will learn to classify words with any rare trigrams as unpronounceable, and words with only common trigrams as pronounceable.
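
To make the feature idea concrete, here is a minimal sketch in plain Python of the character n-gram counts the classifier would see for one word. The helper name char_ngram_counts and its parameters are hypothetical and only for illustration; in practice a vectorizer from a library would compute comparable counts for a whole corpus.

```python
from collections import Counter


def char_ngram_counts(word, n_min=1, n_max=3):
    """Count every character n-gram of length n_min..n_max in word.

    Hypothetical helper for illustration; not part of any library.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the word.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


counts = char_ngram_counts("scaroly")
# 'sca' and 'ol' each occur once; 'tns' never occurs, so its count is 0.
print(counts["sca"], counts["ol"], counts["tns"])  # prints: 1 1 0
```

A word whose trigrams are all common in dictionary words would, under this reasoning, look pronounceable to the classifier, while a single trigram like 'tns' would count against it.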

Here's an implementation with scikit-learn (http://scikit-learn.org/):

import random

from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def scramble(s):
    """Return the letters of s in random order."""
    return "".join(random.sample(s, len(s)))


# Keep only the all-lowercase entries of the system word list.
with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f if w == w.lower()]
scrambled = [scramble(w) for w in words]

# Assume every scrambled word is unpronounceable.
X = words + scrambled
y = ['word'] * len(words) + ['unpronounceable'] * len(scrambled)

X_train, X_test, y_train, y_test = train_test_split(X, y)

# Features: counts of character 1-, 2- and 3-grams.
text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', MultinomialNB()),
])

text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)

print(metrics.classification_report(y_test, predicted))

It scores 92% accuracy. Given that pronounceability is subjective anyway, this might be as good as it gets.

                 precision    recall  f1-score   support

      scrambled       0.93      0.91      0.92     52409
           word       0.92      0.93      0.93     52934

    avg / total       0.92      0.92      0.92    105343

It agrees with your examples:

>>> text_clf.predict("scaroly crasoly oascrly yrlcsoa".split())
['word', 'word', 'unpronounceable', 'unpronounceable']
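
To get back to the original question, one possible way to use a pronounceability score is to score every distinct arrangement of the letters and keep the best ones. The sketch below is an assumption about how that could look, not part of the answer's code: rank_arrangements and toy_score are hypothetical names, and toy_score is a crude stand-in (it only rewards alternating vowels and consonants) so that the example runs without a trained model.

```python
from itertools import permutations

VOWELS = set("aeiouy")  # treating 'y' as a vowel is a simplifying assumption


def toy_score(word):
    """Fraction of adjacent letter pairs that alternate vowel/consonant.

    Hypothetical stand-in for a real pronounceability score.
    """
    pairs = list(zip(word, word[1:]))
    if not pairs:
        return 0.0
    return sum((a in VOWELS) != (b in VOWELS) for a, b in pairs) / len(pairs)


def rank_arrangements(letters, score, top=5):
    """Return the `top` distinct arrangements of `letters` with the highest score."""
    # A set removes duplicates that arise when a letter is repeated.
    candidates = {"".join(p) for p in permutations(letters)}
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(candidates, key=lambda w: (-score(w), w))[:top]


print(rank_arrangements("ascrlyo", toy_score, top=3))
```

With the trained pipeline, the score could instead be the predicted probability of the 'word' class, for example from text_clf.predict_proba, looking up the column via text_clf.classes_. Note that the number of arrangements grows factorially with the number of letters, so this brute-force approach is only practical for short strings.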

For the curious, here are 10 scrambled words it classifies as pronounceable:


  • moro garapm ocenfir onerixoatteme arckinbo raetomoporyo bheral accrene cchmanie
    suroatipsheq

And finally, 10 dictionary words misclassified as unpronounceable:


  • ilch tohubohu […] lynnhaven […]
