Arranging letters in the most pronounceable way?
Question
I have a string of characters, and I'm looking for the arrangement of those characters that is as pronounceable as possible.
For example, if I have the letters "ascrlyo", some arrangements would be more pronounceable than others. The following may get a "high score":
scaroly
crasoly
Whereas the following may get a low score:
oascrly
yrlcsoa
Is there a simple algorithm I can use? Or better yet, a Python function or library that achieves this?
Thanks!
Recommended answer
Start by solving a simpler problem: is a given word pronounceable?
Supervised machine learning could be effective here. Train a binary classifier on a training set of dictionary words and scrambled words (assume the scrambled words are all unpronounceable). For features, I suggest counting bigrams and trigrams. My reasoning: unpronounceable trigrams such as 'tns' and 'srh' are rare in dictionary words, even though the individual letters are each common.
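To make the feature idea concrete, here is a minimal pure-Python sketch of counting character n-grams in a single word. The helper name `char_ngrams` and its defaults are illustrative assumptions, not part of any library; it only approximates the kind of counts a character-level vectorizer would produce.

```python
from collections import Counter

def char_ngrams(word, n_min=1, n_max=3):
    """Count every character n-gram of length n_min..n_max in `word`.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the word.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts

# Bigrams and trigrams of a short word.
print(char_ngrams("banana", n_min=2, n_max=3))
```

Each distinct n-gram becomes one feature, and its count in the word is the feature value.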
The idea is that the trained algorithm will learn to classify words containing any rare trigrams as unpronounceable, and words with only common trigrams as pronounceable.
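The intuition can be illustrated without any machine learning, as a rough sketch: collect the trigrams seen in a reference word list and report which trigrams of a candidate never occur there. The tiny `WORDS` list and the function names are assumptions made up for this example. A trained classifier weighs n-gram evidence rather than applying a hard cutoff like this.

```python
from collections import Counter

# Hypothetical miniature "dictionary"; a real run would use a full word list.
WORDS = ["scar", "carol", "royal", "coral", "oral", "crass", "solar", "early"]

def trigrams(word):
    """Return the list of consecutive 3-letter windows in `word`."""
    return [word[i:i + 3] for i in range(len(word) - 2)]

# How often each trigram occurs across the reference words.
trigram_counts = Counter(t for w in WORDS for t in trigrams(w))

def unseen_trigrams(word, counts=trigram_counts):
    """Return the trigrams of `word` that never occur in the reference words."""
    return [t for t in trigrams(word) if counts[t] == 0]

print(unseen_trigrams("coral"))    # a reference word: nothing unseen
print(unseen_trigrams("yrlcsoa"))  # a scramble: all of its trigrams are unseen here
```

With a full dictionary the counts would be far less sparse, so a rarity threshold or a probabilistic model would be more appropriate than a strict "never seen" test.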
Here's an implementation with scikit-learn (http://scikit-learn.org/):
import random

from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def scramble(s):
    # Return the letters of s in random order.
    return "".join(random.sample(s, len(s)))

# Keep only all-lowercase entries (skips proper nouns and acronyms).
with open('/usr/share/dict/words') as f:
    words = [w.strip() for w in f if w == w.lower()]
scrambled = [scramble(w) for w in words]

# Label dictionary words 'word' and their scrambles 'unpronounceable'.
X = words + scrambled
y = ['word'] * len(words) + ['unpronounceable'] * len(scrambled)

X_train, X_test, y_train, y_test = train_test_split(X, y)

# Character 1- to 3-gram counts fed into a multinomial naive Bayes classifier.
text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(X_train, y_train)

# Evaluate on the held-out split.
predicted = text_clf.predict(X_test)
print(metrics.classification_report(y_test, predicted))
It scores 92% accuracy. Given that pronounceability is subjective anyway, this might be as good as it gets.
             precision    recall  f1-score   support

  scrambled       0.93      0.91      0.92     52409
       word       0.92      0.93      0.93     52934

avg / total       0.92      0.92      0.92    105343
It agrees with your examples:
>>> text_clf.predict("scaroly crasoly oascrly yrlcsoa".split())
['word', 'word', 'unpronounceable', 'unpronounceable']
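To get from a yes/no classifier back to the original question, one option is to score every arrangement of the letters and keep the best few. The sketch below is only an illustration under stated assumptions: `rank_arrangements`, `toy_score`, and the tiny `REFERENCE` list are hypothetical, and the stand-in scorer simply measures what fraction of a word's bigrams appear in the reference words. With the trained pipeline, the score could instead be the predicted probability of the 'word' class, for example via `text_clf.predict_proba`.

```python
from itertools import permutations

# Hypothetical reference words standing in for a real dictionary.
REFERENCE = ["scar", "carol", "royal", "coral", "oral",
             "crass", "solar", "early", "scary", "lyric"]
KNOWN_BIGRAMS = {w[i:i + 2] for w in REFERENCE for i in range(len(w) - 1)}

def toy_score(word):
    """Fraction of the word's bigrams that occur in the reference words."""
    bigrams = [word[i:i + 2] for i in range(len(word) - 1)]
    if not bigrams:
        return 0.0
    return sum(b in KNOWN_BIGRAMS for b in bigrams) / len(bigrams)

def rank_arrangements(letters, score, top=5):
    """Score every distinct permutation of `letters`; return the best `top`."""
    candidates = {"".join(p) for p in permutations(letters)}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(candidates, key=lambda w: (-score(w), w))[:top]

best = rank_arrangements("ascrlyo", toy_score)
print(best)
```

Exhaustive search is fine for 7 letters (5040 permutations) but grows factorially, so longer inputs would need sampling or a greedy or beam search over n-grams instead.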
For the curious, here are 10 scrambled words it classifies as pronounceable:
- moro garapm ocenfir onerixoatteme arckinbo raetomoporyo bheral accrene cchmanie suroatipsheq
And finally, 10 dictionary words misclassified as unpronounceable:
- ilch tohubohu角膜半起搏药锰铁矿lynnhaven残酷确保零食