寻找一种优化此算法以解析非常大的字符串的方法 [英] Looking for a way to optimize this algorithm for parsing a very large string

查看:69
本文介绍了寻找一种优化此算法以解析非常大的字符串的方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

以下类分析一个非常大的字符串(整本小说),并将其分解为连续的4个字符的字符串,这些字符串存储为元组.然后,可以基于计算为每个元​​组分配一个概率.我将其用作蒙特卡洛/遗传算法的一部分,以训练程序识别仅基于语法的语言(只是字符转换).

The following class parses through a very large string (an entire novel of text) and breaks it into consecutive 4-character strings that are stored as a Tuple. Then each tuple can be assigned a probability based on a calculation. I am using this as part of a monte carlo/ genetic algorithm to train the program to recognize a language based on syntax only (just the character transitions).

我想知道是否有更快的方法.查找任何给定的4个字符的元组的概率大约需要400毫秒.相关方法_Probablity()在类的末尾.

I am wondering if there is a faster way of doing this. It takes about 400ms to look up the probability of any given 4-character tuple. The relevant method _Probablity() is at the end of the class.

这是与我的另一篇文章有​​关的计算密集型问题:

This is a computationally intensive problem related to another post of mine: Algorithm for computing the plausibility of a function / Monte Carlo Method

最终,我想将这些值存储在4d矩阵中.但是考虑到字母表中有26个字母,这将是一项巨大的任务. (26x26x26x26).如果我只看小说的前15000个字符,那么性能会提高很多,但是我的数据没有那么有用.

Ultimately I'd like to store these values in a 4d-matrix. But given that there are 26 letters in the alphabet that would be a HUGE task. (26x26x26x26). If I take only the first 15000 characters of the novel then performance improves a ton, but my data isn't as useful.

以下是解析文本源"的方法:

Here is the method that parses the text 'source':

    private List<Tuple<char, char, char, char>> _Parse(string src)
    {
        var _map = new List<Tuple<char, char, char, char>>(); 

        for (int i = 0; i < src.Length - 3; i++)
        {
          int j = i + 1;
          int k = i + 2;
          int l = i + 3;

          _map.Add
            (new Tuple<char, char, char, char>(src[i], src[j], src[k], src[l])); 
        }

        return _map; 
    }

这是_Probability方法:

And here is the _Probability method:

    private double _Probability(char x0, char x1, char x2, char x3)
    {
        var subset_x0 = map.Where(x => x.Item1 == x0);
        var subset_x0_x1_following = subset_x0.Where(x => x.Item2 == x1);
        var subset_x0_x2_following = subset_x0_x1_following.Where(x => x.Item3 == x2);
        var subset_x0_x3_following = subset_x0_x2_following.Where(x => x.Item4 == x3);

        int count_of_x0 = subset_x0.Count();
        int count_of_x1_following = subset_x0_x1_following.Count();
        int count_of_x2_following = subset_x0_x2_following.Count();
        int count_of_x3_following = subset_x0_x3_following.Count(); 

        decimal p1;
        decimal p2;
        decimal p3;

        if (count_of_x0 <= 0 || count_of_x1_following <= 0 || count_of_x2_following <= 0 || count_of_x3_following <= 0)
        {
            p1 = e;
            p2 = e;
            p3 = e;
        }
        else
        {
            p1 = (decimal)count_of_x1_following / (decimal)count_of_x0;
            p2 = (decimal)count_of_x2_following / (decimal)count_of_x1_following;
            p3 = (decimal)count_of_x3_following / (decimal)count_of_x2_following;

            p1 = (p1 * 100) + e; 
            p2 = (p2 * 100) + e;
            p3 = (p3 * 100) + e; 
        }

        //more calculations omitted

        return _final; 
    }
}

编辑-我将提供更多详细信息以清除问题,

EDIT - I'm providing more details to clear things up,

1)严格来说,到目前为止,我只使用英语,但是确实必须考虑不同的字母.目前,我只希望程序能够识别英语,类似于本文中描述的内容:

1) Strictly speaking I've only worked with English so far, but its true that different alphabets will have to be considered. Currently I only want the program to recognize English, similar to whats described in this paper: http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf

2)我正在计算n个字符的n个元组的概率,其中n< =4.例如,如果我正在计算字符串"that"的总概率,则将其分解为这些独立的元组并计算每个人首先出现的概率:

2) I am calculating the probabilities of n-tuples of characters where n <= 4. For instance if I am calculating the total probability of the string "that", I would break it down into these independent tuples and calculate the probability of each individually first:

[t] [h]

[t] [h] [a]

[t][h][a]

[t] [h] [a] [t]

[t][h][a][t]

[t] [h]被赋予最大的权重,然后是[t] [h] [a],然后是[t] [h] [a] [t].因为我不仅将4个字符的元组视为一个单元,所以我无法仅将文本中[t] [h] [a] [t]的实例除以总数.接下来的4元组.

[t][h] is given the most weight, then [t][h][a], then [t][h][a][t]. Since I am not just looking at the 4-character tuple as a single unit, I wouldn't be able to just divide the instances of [t][h][a][t] in the text by the total no. of 4-tuples in the next.

分配给每个4元组的值不能适合文本,因为偶然有很多真实的英语单词可能永远不会出现在文本中,并且它们的分数不应成比例地降低.强调一阶字符过渡(2元组)可以改善此问题.移至3元组,然后4元组将细化计算.

The value assigned to each 4-tuple can't overfit to the text, because by chance many real English words may never appear in the text and they shouldn't get disproportionally low scores. Emphasing first-order character transitions (2-tuples) ameliorates this issue. Moving to the 3-tuple then the 4-tuple just refines the calculation.

我想出了一个Dictionary,它简单地计算出元组在文本中出现的频率(类似于Vilx的建议),而不是重复相同的元组,这会浪费内存.这使我从每次查询约400ms增至每次约40ms,这是一个很大的进步.但是,我仍然需要研究其他一些建议.

I came up with a Dictionary that simply tallies the count of how often the tuple occurs in the text (similar to what Vilx suggested), rather than repeating identical tuples which is a waste of memory. That got me from about ~400ms per lookup to about ~40ms per, which is a pretty great improvement. I still have to look into some of the other suggestions, however.

推荐答案

在yoiu概率方法中,您将地图重复了8次.您的每个位置都会遍历整个列表,因此计数也是如此.最后添加.ToList()广告将(可能)加快速度.那就是说,我认为您的主要问题是您选择用来存储数据的结构不适合概率方法的目的.您可以创建一个通过版本,在其中存储数据的结构将计算插入时的暂定分布.这样,当您完成插入操作(不应该放慢速度)时,您就可以完成操作,或者您可以这样做,因为下面的代码可以在需要时廉价地计算出概率.

In yoiu probability method you are iterating the map 8 times. Each of your wheres iterates the entire list and so does the count. Adding a .ToList() ad the end would (potentially) speed things. That said I think your main problem is that the structure you've chossen to store the data in is not suited for the purpose of the probability method. You could create a one pass version where the structure you store you're data in calculates the tentative distribution on insert. That way when you're done with the insert (which shouldn't be slowed down too much) you're done or you could do as the code below have a cheap calculation of the probability when you need it.

顺便说一句,您可能需要考虑标点和空格.句子的第一个字母/单词和一个单词的第一个字母通过将标点符号和空格作为分发的一部分,可以清楚地指示给定文本使用的是哪种语言,其中包括示例数据的那些特征.几年前,我们做到了.这样做表明,仅使用三个字符几乎是完全正确的(我们在测试数据上没有出现三个字符的错误,并且几乎完全是一个假设,因为大多数情况下都会出现一些奇怪的文本,而缺少信息会导致错误的结果) .使用更多(我们最多测试7个),但是三个字母的速度使其成为最好的情况.

As an aside you might want to take puntuation and whitespace into account. The first letter/word of a sentence and the first letter of a word gives clear indication on what language a given text is written in by taking punctuaion charaters and whitespace as part of you distribution you include those characteristics of the sample data. We did that some years back. Doing that we shown that using just three characters was almost as exact (we had no failures with three on our test data and almost as exact is an assumption given that there most be some weird text where the lack of information would yield an incorrect result). as using more (we test up till 7) but the speed of three letters made that the best case.

编辑

这是我认为我将在C#中做到这一点的一个例子

Here's an example of how I think I would do it in C#

class TextParser{
        private Node Parse(string src){
            var top = new Node(null);

            for (int i = 0; i < src.Length - 3; i++){
                var first = src[i];
                var second = src[i+1];
                var third = src[i+2];
                var fourth = src[i+3];

                var firstLevelNode = top.AddChild(first);
                var secondLevelNode = firstLevelNode.AddChild(second);
                var thirdLevelNode = secondLevelNode.AddChild(third);
                thirdLevelNode.AddChild(fourth);
            }

            return top;
        }
    }

    public class Node{
        private readonly Node _parent;
        private readonly Dictionary<char,Node> _children 
                         = new Dictionary<char, Node>();
        private int _count;

        public Node(Node parent){
            _parent = parent;
        }

        public Node AddChild(char value){
            if (!_children.ContainsKey(value))
            {
                _children.Add(value, new Node(this));
            }
            var levelNode = _children[value];
            levelNode._count++;
            return levelNode;
        }
        public decimal Probability(string substring){
            var node = this;
            foreach (var c in substring){
                if(!node.Contains(c))
                    return 0m;
                node = node[c];
            }
            return ((decimal) node._count)/node._parent._children.Count;
        }

        public Node this[char value]{
            get { return _children[value]; }
        }
        private bool Contains(char c){
            return _children.ContainsKey(c);
        }
    }

用法将是:

var top = Parse(src);
top.Probability("test");

这篇关于寻找一种优化此算法以解析非常大的字符串的方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆