Pronounceability algorithm

Problem description

I am struggling to find/create an algorithm that can determine the pronounceability of random 5-letter combinations.

The closest thing I've found so far is from this 3-year-old StackOverflow thread:

Measuring the pronounceability of a word?

<?php
// Score: 1
echo pronounceability('namelet') . "\n";

// Score: 0.71428571428571
echo pronounceability('nameoic') . "\n";

function pronounceability($word) {
    static $vowels = array
        (
        'a',
        'e',
        'i',
        'o',
        'u',
        'y'
        );

    static $composites = array
        (
        'mm',
        'll',
        'th',
        'ing'
        );

    if (!is_string($word)) return false;

    // Remove non letters and put in lowercase
    $word = preg_replace('/[^a-z]/i', '', $word);
    $word = strtolower($word);

    // Special case
    if ($word == 'a') return 1;

    $len = strlen($word);

    // Let's not parse an empty string
    if ($len == 0) return 0;

    $score = 0;
    $pos = 0;

    while ($pos < $len) {
        // Check if is allowed composites
        foreach ($composites as $comp) {
                $complen = strlen($comp);

                // <= (not <) so that a composite ending the word, e.g. a trailing 'ing', can still match
                if (($pos + $complen) <= $len) {
                        $check = substr($word, $pos, $complen);

                        if ($check == $comp) {
                                $score += $complen;
                                $pos += $complen;
                                continue 2;
                        }
                }
        }

        // Is it a vowel? If so, check if previous wasn't a vowel too.
        if (in_array($word[$pos], $vowels)) {
                if (($pos - 1) >= 0 && !in_array($word[$pos - 1], $vowels)) {
                        $score += 1;
                        $pos += 1;
                        continue;
                }
        } else { // Not a vowel, check if next one is, or if is end of word
                if (($pos + 1) < $len && in_array($word[$pos + 1], $vowels)) {
                        $score += 2;
                        $pos += 2;
                        continue;
                } elseif (($pos + 1) == $len) {
                        $score += 1;
                        break;
                }
        }

        $pos += 1;
    }

    return $score / $len;
}
?>

... but it is far from perfect, giving some rather strange false positives:

Using this function, all of the following rate as pronounceable (above 7/10):

  • ZTEDA
  • LLFDA
  • MMGDA
  • THHDA
  • RTHDA
  • XYHDA
  • VQIDA

Can someone smarter than me tweak this algorithm, perhaps so that:

  • 'MM', 'LL', and 'TH' are only valid when followed or preceded by a vowel?
  • 3 or more consonants in a row is a no-no (except when the first or last is an 'R' or 'L')
  • any other refinements you can think of...
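
The first two rules in the list above could be checked roughly as follows. This is only a sketch in Python (the thread itself uses PHP and C#), and the function names are made up for illustration. It assumes 'Y' counts as a vowel, as in the PHP function, and it reads "3 or more consonants in a row" as a maximal run of consonants.

```python
VOWELS = set("aeiouy")           # same vowel set as the PHP function (assumption: 'y' is a vowel)
COMPOSITES = ("mm", "ll", "th")  # the pairs named in the first rule


def composites_touch_vowel(word: str) -> bool:
    """Rule 1: every 'mm', 'll' or 'th' must be preceded or followed by a vowel."""
    w = word.lower()
    for comp in COMPOSITES:
        start = w.find(comp)
        while start != -1:
            end = start + len(comp)
            before_ok = start > 0 and w[start - 1] in VOWELS
            after_ok = end < len(w) and w[end] in VOWELS
            if not (before_ok or after_ok):
                return False
            start = w.find(comp, start + 1)
    return True


def consonant_runs_ok(word: str) -> bool:
    """Rule 2: reject runs of 3+ consonants unless the run starts or ends with 'r' or 'l'."""
    run = ""
    for ch in word.lower() + "a":  # the sentinel vowel flushes a run at the end of the word
        if ch in VOWELS:
            if len(run) >= 3 and run[0] not in "rl" and run[-1] not in "rl":
                return False
            run = ""
        else:
            run += ch
    return True


def passes_rules(word: str) -> bool:
    return composites_touch_vowel(word) and consonant_runs_ok(word)


for candidate in ["ZTEDA", "LLFDA", "MMGDA", "THHDA", "RTHDA", "XYHDA", "VQIDA"]:
    print(candidate, passes_rules(candidate))
```

On the seven false positives listed earlier, these two rules reject LLFDA, MMGDA, THHDA and RTHDA, while ZTEDA, XYHDA and VQIDA still pass, so the rules alone would not be enough.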

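(I have done a fair amount of research/googling, and this seems to be the main pronounceability function that everyone has been referencing/using for the past 3 years, so I'm sure an updated, more refined version would be appreciated by the wider community, not just me!)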

Recommended answer

Based on a suggestion on the linked question to "Use a Markov model on letters":

Use a Markov model (on letters, not words, of course). The probability of a word is a pretty good proxy for ease of pronunciation.

I thought I would try it out and had some success.

I copied a list of real 5-letter words into a file to serve as my dataset (here...um, actually here).

Then I use an n-gram model in the spirit of a Markov model (based on one-grams, bigrams, and trigrams) to predict how likely a target word is to appear in that dataset.

(Better results could be achieved with some sort of phonetic transcription as one of the steps.)

First, I calculate the probabilities of character sequences in the dataset.

For example, if 'A' occurs 50 times, and there are only 250 characters in the dataset, then 'A' has a 50/250 or 0.2 probability.

Do the same for the bigrams 'AB', 'AC', ...

Do the same for the trigrams 'ABC', 'ABD', ...
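
The counting step described above might look something like this. It is a minimal Python sketch, not the code from the answer: `ngram_probabilities` is a hypothetical helper, and the three-word list is toy data standing in for the real word list.

```python
from collections import Counter


def ngram_probabilities(words, n):
    """Relative frequency of every length-n character sequence found in `words`."""
    counts = Counter(
        word[i:i + n]
        for word in words
        for i in range(len(word) - n + 1)
    )
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


# Toy stand-in for the real 5-letter word list (hypothetical data)
words = ["APPLE", "ANGLE", "BAGEL"]

unigrams = ngram_probabilities(words, 1)
bigrams = ngram_probabilities(words, 2)
trigrams = ngram_probabilities(words, 3)

print(unigrams["A"])  # 'A' is 3 of the 15 letters -> 0.2
```

Sequences that never occur in the dataset are simply absent from these tables, which is why some form of smoothing is needed before scoring arbitrary words.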

Basically, my score for the word "ABCDE" is composed of:

  • prob( 'A' )
  • prob( 'B' )
  • prob( 'C' )
  • prob( 'D' )
  • prob( 'E' )
  • prob( 'AB' )
  • prob( 'BC' )
  • prob( 'CD' )
  • prob( 'DE' )
  • prob( 'ABC' )
  • prob( 'BCD' )
  • prob( 'CDE' )

You could multiply all of these together to get the estimated probability of the target word appearing in the dataset (but that is very small).

So instead, we take the log of each and add them together.
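
Written out for a 5-letter word, and using the fact that the log of a product is the sum of the logs, the score looks like this. The notation is an assumption for illustration: $c_i$ is the $i$-th letter and $P(\cdot)$ is the relative frequency called prob( ) in the list above.

```latex
\[
\text{score}(c_1 c_2 c_3 c_4 c_5)
  = \sum_{i=1}^{5} \log P(c_i)
  + \sum_{i=1}^{4} \log P(c_i c_{i+1})
  + \sum_{i=1}^{3} \log P(c_i c_{i+1} c_{i+2})
\]
```

Every probability is at most 1, so every term is zero or negative. The score is therefore never positive, and values closer to zero mean a more likely word.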

Now we have a score which estimates how likely our target word is to appear in the dataset.

I have coded this in C#, and find that a score greater than negative 160 is pretty good.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace Pronouncability
{

class Program
{
    public static char[] alphabet = new char[]{ 'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z' };

    public static List<string> wordList = loadWordList(); //Dataset of 5-letter words

    public static Random rand = new Random();

    public const double SCORE_LIMIT = -160.00;

    /// <summary>
    /// Generates random words, until 100 of them are better than
    /// the SCORE_LIMIT based on a statistical score. 
    /// </summary>
    public static void Main(string[] args)
    {
        Dictionary<Tuple<char, char, char>, int> trigramCounts = new Dictionary<Tuple<char, char, char>, int>();

        Dictionary<Tuple<char, char>, int> bigramCounts = new Dictionary<Tuple<char, char>, int>();

        Dictionary<char, int> onegramCounts = new Dictionary<char, int>();

        calculateProbabilities(onegramCounts, bigramCounts, trigramCounts);

        double totalTrigrams = (double)trigramCounts.Values.Sum();
        double totalBigrams = (double)bigramCounts.Values.Sum();
        double totalOnegrams = (double)onegramCounts.Values.Sum();

        SortedList<double, string> randomWordsScores = new SortedList<double, string>();

        while( randomWordsScores.Count < 100 )
        {
            string randStr = getRandomWord();

            if (!randomWordsScores.ContainsValue(randStr))
            {
                double score = getLikelyhood(randStr,trigramCounts, bigramCounts, onegramCounts, totalTrigrams, totalBigrams, totalOnegrams);

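                // SortedList.Add throws on a duplicate key, so skip a word whose score is already present.
                if (score > SCORE_LIMIT && !randomWordsScores.ContainsKey(score))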
                {
                    randomWordsScores.Add(score, randStr);
                }
            }
        }


        //Right now randomWordsScores contains 100 random words which have 
        //a better score than the SCORE_LIMIT, sorted from worst to best.
    }


    /// <summary>
    /// Generates a random 5-letter word
    /// </summary>
    public static string getRandomWord()
    {
        // Random.Next's upper bound is exclusive, so 91 is needed to include 'Z' (90).
        char c0 = (char)rand.Next(65, 91);
        char c1 = (char)rand.Next(65, 91);
        char c2 = (char)rand.Next(65, 91);
        char c3 = (char)rand.Next(65, 91);
        char c4 = (char)rand.Next(65, 91);

        return "" + c0 + c1 + c2 + c3 + c4;
    }

    /// <summary>
    /// Returns a score for how likely a given word is, based on given trigrams, bigrams, and one-grams
    /// </summary>
    public static double getLikelyhood(string wordToScore, Dictionary<Tuple<char, char,char>, int> trigramCounts, Dictionary<Tuple<char, char>, int> bigramCounts, Dictionary<char, int> onegramCounts, double totalTrigrams, double totalBigrams, double totalOnegrams)
    {
        wordToScore = wordToScore.ToUpper();

        char[] letters = wordToScore.ToCharArray();

        Tuple<char, char>[] bigrams = new Tuple<char, char>[]{ 

            new Tuple<char,char>( wordToScore[0], wordToScore[1] ),
            new Tuple<char,char>( wordToScore[1], wordToScore[2] ),
            new Tuple<char,char>( wordToScore[2], wordToScore[3] ),
            new Tuple<char,char>( wordToScore[3], wordToScore[4] )

        };

        Tuple<char, char, char>[] trigrams = new Tuple<char, char, char>[]{ 

            new Tuple<char,char,char>( wordToScore[0], wordToScore[1], wordToScore[2] ),
            new Tuple<char,char,char>( wordToScore[1], wordToScore[2], wordToScore[3] ),
            new Tuple<char,char,char>( wordToScore[2], wordToScore[3], wordToScore[4] ),


        };

        double score = 0;

        foreach (char c in letters)
        {
            score += Math.Log((((double)onegramCounts[c]) / totalOnegrams));
        }

        foreach (Tuple<char, char> pair in bigrams)
        {
            score += Math.Log((((double)bigramCounts[pair]) / totalBigrams));
        }

        foreach (Tuple<char, char, char> trio in trigrams)
        {
            score += 5.0*Math.Log((((double)trigramCounts[trio]) / totalTrigrams));
        }


        return score;
    }

    /// <summary>
    /// Build the probability tables based on the dataset (WordList)
    /// </summary>
    public static void calculateProbabilities(Dictionary<char, int> onegramCounts, Dictionary<Tuple<char, char>, int> bigramCounts, Dictionary<Tuple<char, char, char>, int> trigramCounts)
    {
        foreach (char c1 in alphabet)
        {
            foreach (char c2 in alphabet)
            {
                foreach( char c3 in alphabet)
                {
                    trigramCounts[new Tuple<char, char, char>(c1, c2, c3)] = 1;
                }
            }
        }

        foreach( char c1 in alphabet)
        {
            foreach( char c2 in alphabet)
            {
                bigramCounts[ new Tuple<char,char>(c1,c2) ] = 1;
            }
        }

        foreach (char c1 in alphabet)
        {
            onegramCounts[c1] = 1;
        }


        foreach (string word in wordList)
        {
            for (int pos = 0; pos < 3; pos++)
            {
                trigramCounts[new Tuple<char, char, char>(word[pos], word[pos + 1], word[pos + 2])]++;
            }

            for (int pos = 0; pos < 4; pos++)
            {
                bigramCounts[new Tuple<char, char>(word[pos], word[pos + 1])]++;
            }

            for (int pos = 0; pos < 5; pos++)
            {
                onegramCounts[word[pos]]++;
            }
        }
    }

    /// <summary>
    /// Get the dataset (WordList) from file.
    /// </summary>
    public static List<string> loadWordList()
    {
        string filePath = "WordList.txt";

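        string text = File.ReadAllText(filePath);

        // Upper-case the words so they match the 'A'-'Z' keys of the count tables,
        // split on any whitespace, and skip anything that is not exactly five letters A-Z
        // (assumes the file is a whitespace-separated list of words).
        List<string> result = text.ToUpper()
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => w.Length == 5 && w.All(c => c >= 'A' && c <= 'Z'))
            .ToList();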

        return result;
    }
}

}

In my example, I scale the trigram log-probabilities by 5.

I also add one to all of the counts, so we never multiply by zero (or take the log of zero).
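
For anyone who wants to experiment with the weighting and the add-one counts outside C#, here is a compact sketch in Python. It is not a line-for-line port: the helper names, the `weights` parameter and the toy word list are assumptions for illustration, and words are assumed to be upper-case A-Z only.

```python
import math
from itertools import product
from string import ascii_uppercase


def smoothed_counts(words, n):
    """Count length-n sequences, starting every possible sequence at 1 (add-one)."""
    counts = {"".join(p): 1 for p in product(ascii_uppercase, repeat=n)}
    for word in words:
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def score(word, tables, weights=(1.0, 1.0, 5.0)):
    """Weighted sum of log relative frequencies over the 1-, 2- and 3-grams of `word`."""
    total = 0.0
    for n, (counts, weight) in enumerate(zip(tables, weights), start=1):
        denom = sum(counts.values())
        for i in range(len(word) - n + 1):
            total += weight * math.log(counts[word[i:i + n]] / denom)
    return total


words = ["APPLE", "ANGLE", "BAGEL"]  # hypothetical toy dataset
tables = [smoothed_counts(words, n) for n in (1, 2, 3)]

# 'ZXQ' never occurs in the data, but its count is 1 rather than 0, so the log is finite.
print(tables[2]["ZXQ"])
print(score("ANGEL", tables) > score("ZXQJK", tables))  # True
```

The default weights mirror the factor of 5 on trigrams mentioned above. The absolute scores depend heavily on the size of the dataset, so a threshold such as -160 would need re-tuning for a different word list.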

I'm not a PHP programmer, but the technique is pretty easy to implement.

Play around with some scaling factors, try different datasets, or add in some other checks like what you suggested above.
