How to find recurring word groups in text with C#?

Problem description

I'm counting recurring words in a StringBuilder (sb) with the code below, which I found on the internet. According to its author, the results are consistent with Word's word counter.

StringBuilder wordBuffer = new StringBuilder();
int wordCount = 0;
// 1. Build the list of words used. Consider '\'' (apostrophe) and '-' (hyphen) word continuation characters.
Dictionary<string, int> wordList = new Dictionary<string, int>();
// A trailing space is appended so that the last word is flushed too.
foreach (char c in sb.ToString() + " ")
{
    if (char.IsLetter(c) || c == '\'' || c == '-')
    {
        wordBuffer.Append(char.ToLower(c));
    }
    else
    {
        // Only words longer than three characters are counted.
        if (wordBuffer.Length > 3)
        {
            int count = 0;
            string word = wordBuffer.ToString();
            wordList.TryGetValue(word, out count);
            wordList[word] = ++count;
            wordCount++;
        }
        // Always reset the buffer; otherwise a short word leaks into the next one.
        wordBuffer.Clear();
    }
}

This is my sample text:

The green algae (singular: green alga) are a large, informal grouping of algae consisting of the Chlorophyte and Charophyte algae, which are now placed in separate Divisions. The land plants or Embryophytes (higher plants) are thought to have emerged from the Charophytes.[1] As the embryophytes are not algae, and are therefore excluded, green algae are a paraphyletic group. However, the clade that includes both green algae and embryophytes is monophyletic and is referred to as the clade Viridiplantae and as the kingdom Plantae. The green algae include unicellular and colonial flagellates, most with two flagella per cell, as well as various colonial, coccoid and filamentous forms, and macroscopic, multicellular seaweeds. In the Charales, the closest relatives of higher plants, full cellular differentiation of tissues occurs. There are about 8,000 species of green algae.[2] Many species live most of their lives as single cells, while other species form coenobia (colonies), long filaments, or highly differentiated macroscopic seaweeds. A few other organisms rely on green algae to conduct photosynthesis for them. The chloroplasts in euglenids and chlorarachniophytes were acquired from ingested green algae,[1] and in the latter retain a nucleomorph (vestigial nucleus). Green algae are also found symbiotically in the ciliate Paramecium, and in Hydra viridissima and in flatworms. Some species of green algae, particularly of genera Trebouxia of the class Trebouxiophyceae and Trentepohlia (class Ulvophyceae), can be found in symbiotic associations with fungi to form lichens. In general the fungal species that partner in lichens cannot live on their own, while the algal species is often found living in nature without the fungus. Trentepohlia is a filamentous green alga that can live independently on humid soil, rocks or tree bark or form the photosymbiont in lichens of the family Graphidaceae.

With my sample text, I get the words "green" and "algae" in the first lines, as expected.

The problem is that I don't need only single words; I need word groups too. With this example text, I want the group "green algae" as well, together with the single words "green" and "algae".

My optional problem is performance: I need this to be fast, because the texts can be very long. From what I have read, using Regex is not fast in this case, but I'm not sure whether there is another way to do it.

Thanks in advance.

UPDATE: If you already understand what I'm asking, you don't need to read these lines.
Since many comments say my definition of "group" is unclear, I need to state my point in more detail. I would have written this in the comments section, but it is too narrow for this update. First, I know Stack Overflow is not a coding service. I'm trying to find the most used word groups in an article in order to decide what the article is about; you could call it a tag generator. For this purpose I tried finding the most used words, and that was fine at first. Then I realized it's not a good way to decide on the topic, because I can't assume the article is only about the first or second most used word. In my example I can't say the article is only about "green" or "algae", because here they mean something together, not alone. If I try this with an article about a celebrity with a three-word name like "Helena Bonham Carter" (assuming the full name is used throughout the article, not only the surname), I want to take these words together, not one by one. I'm trying to build a smarter algorithm that guesses the topic as accurately as possible and in one shot. I don't want to limit the word count, because the article may be about the "United Nations Industrial Development Organization" (again assuming it is not written as "UNIDO" in the article). I could achieve this by taking every word group of any length starting at any index in the text. That's not really a good approach, especially for long texts, but it's not impossible, right? I was looking for a better way to do this, so I'm asking for a better algorithm idea and the best tool to use; I can write the code myself. I hope my goal is finally clear.

Recommended answer

I think that this works fairly well.

// Needs: using System; using System.Linq; using System.Text.RegularExpressions;
var text = @"The green algae (singular: green alga) are ..."; // include all your text

var remove = "().,:[]0123456789".Select(x => x.ToString()).ToArray();

var words =
    Regex
        .Matches(text, @"(\S+)")
        .Cast<Match>()
        .SelectMany(x => x.Captures.Cast<Capture>())
        .Select(x => remove.Aggregate(x.Value, (t, r) => t.Replace(r, "")))
        .Select(x => x.Trim().ToLowerInvariant())
        .Where(x => !String.IsNullOrWhiteSpace(x))
        .ToArray();

var groups =
    from n1 in Enumerable.Range(0, words.Length)
    from n2 in Enumerable.Range(1, words.Length - n1)
    select String.Join(" ", words.Skip(n1).Take(n2));

var frequencies =
    groups
        .GroupBy(x => x)
        .Select(x => new { wordgroup = x.Key, count = x.Count() })
        .OrderByDescending(x => x.count)
        .ThenBy(x => x.wordgroup.Count(y => y == ' '))
        .ThenBy(x => x.wordgroup)
        .ToArray();

This gives me the frequency of every contiguous sequence of words, from single words up to the one group that contains all the words.

The number of words is 288. The total number of word groups is 288 x (288 + 1) / 2 = 41,616. The final number of word groups (after grouping duplicate word groups and removing empty/whitespace strings) is 41,449.
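
To check that arithmetic on a smaller input, here is a minimal sketch that enumerates every contiguous group with plain loops and compares the count with the closed form n x (n + 1) / 2. The class and method names (WordGroupCount, AllGroups, ExpectedGroupCount) are made up for illustration and are not part of the answer's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordGroupCount
{
    // Enumerates every contiguous run of words: each start index is paired
    // with every length that still fits inside the array.
    public static IEnumerable<string> AllGroups(string[] words)
    {
        for (int start = 0; start < words.Length; start++)
        {
            for (int length = 1; length <= words.Length - start; length++)
            {
                // string.Join(separator, array, startIndex, count) joins a slice.
                yield return string.Join(" ", words, start, length);
            }
        }
    }

    // Closed form for the number of groups produced by AllGroups.
    public static int ExpectedGroupCount(int wordCount)
    {
        return wordCount * (wordCount + 1) / 2;
    }
}
```

For the three words "the green algae" this yields six groups: "the", "the green", "the green algae", "green", "green algae" and "algae", which matches 3 x (3 + 1) / 2 = 6.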

Here are the first 100 of these 41,449:

20 x "the", 13 x "and", 12 x "algae", 12 x "in", 11 x "green", 10 x "of", 9 x "green algae", 8 x "are", 6 x "as", 6 x "species", 5 x "a", 4 x "is", 4 x "or", 4 x "to", 3 x "embryophytes", 3 x "form", 3 x "found", 3 x "lichens", 3 x "live", 3 x "on", 3 x "plants", 3 x "that", 3 x "algae and", 3 x "and in", 3 x "as the", 3 x "in the", 3 x "of the", 2 x "alga", 2 x "can", 2 x "clade", 2 x "class", 2 x "colonial", 2 x "filamentous", 2 x "from", 2 x "higher", 2 x "macroscopic", 2 x "most", 2 x "other", 2 x "seaweeds", 2 x "their", 2 x "trentepohlia", 2 x "while", 2 x "with", 2 x "algae are", 2 x "are a", 2 x "green alga", 2 x "higher plants", 2 x "in lichens", 2 x "of green", 2 x "species of", 2 x "the clade", 2 x "the green", 2 x "green algae and", 2 x "green algae are", 2 x "of green algae", 2 x "species of green", 2 x "the green algae", 2 x "species of green algae", 1 x "about", 1 x "acquired", 1 x "algal", 1 x "also", 1 x "associations", 1 x "bark", 1 x "be", 1 x "both", 1 x "cannot", 1 x "cell", 1 x "cells", 1 x "cellular", 1 x "charales", 1 x "charophyte", 1 x "charophytes", 1 x "chlorarachniophytes", 1 x "chlorophyte", 1 x "chloroplasts", 1 x "ciliate", 1 x "closest", 1 x "coccoid", 1 x "coenobia", 1 x "colonies", 1 x "conduct", 1 x "consisting", 1 x "differentiated", 1 x "differentiation", 1 x "divisions", 1 x "emerged", 1 x "euglenids", 1 x "excluded", 1 x "family", 1 x "few", 1 x "filaments", 1 x "flagella", 1 x "flagellates", 1 x "flatworms", 1 x "for", 1 x "forms", 1 x "full", 1 x "fungal", 1 x "fungi"
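
Almost all of the groups occur only once, and the number of groups grows quadratically with the number of words, so for long texts one possible variation is to cap the group length and count with a dictionary instead of materializing every group. The sketch below is not the answer's code: the names BoundedWordGroups and Count and the parameters maxGroupLength and minCount are assumptions for illustration, and capping the length departs from the question's wish for unlimited group sizes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BoundedWordGroups
{
    // Counts contiguous word groups of at most maxGroupLength words and
    // returns those seen at least minCount times, ordered like the answer:
    // by count (descending), then by number of words, then alphabetically.
    public static List<KeyValuePair<string, int>> Count(
        IReadOnlyList<string> words, int maxGroupLength, int minCount)
    {
        var counts = new Dictionary<string, int>();
        for (int start = 0; start < words.Count; start++)
        {
            string group = "";
            int limit = Math.Min(maxGroupLength, words.Count - start);
            for (int length = 1; length <= limit; length++)
            {
                // Extend the previous group by one word instead of re-joining.
                string word = words[start + length - 1];
                group = length == 1 ? word : group + " " + word;
                counts.TryGetValue(group, out int current);
                counts[group] = current + 1;
            }
        }
        return counts
            .Where(pair => pair.Value >= minCount)
            .OrderByDescending(pair => pair.Value)
            .ThenBy(pair => pair.Key.Count(ch => ch == ' '))
            .ThenBy(pair => pair.Key, StringComparer.Ordinal)
            .ToList();
    }
}
```

Called as BoundedWordGroups.Count(words, 4, 2) on the words array from the answer, this would keep only groups of up to four words that occur at least twice. The work is then proportional to the number of words times maxGroupLength rather than to the square of the number of words.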
