C语言中的词频统计(不是C ++) [英] Word Frequency Statistics in C (not C++)

查看:103
本文介绍了C语言中的词频统计(不是C ++)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给出一个字符串,该单词由单个空格分隔的单词组成,按降序打印出单词,这些单词按其出现在字符串中的次数排序.

Given a string consists of words separated by a single white space, print out the words in descending order sorted by the number of times they appear in the string.

例如,输入字符串"ab bc bc"将生成以下输出:

For example an input string of "ab bc bc" would generate the following output:

bc : 2
ab : 1

如果使用C ++数据结构(如地图),则可以轻松解决该问题.但是,如果只能在普通的旧C语言中解决问题,那么看起来就困难得多.

The problem would be easily resolved if C++ data structures, like a map, is used. But if the problem could only be solved in plain old C, it looks much harder.

在这里我应该使用哪种数据结构和算法?请尽可能详细.我在DS和Algo方面很弱.:-(

What kind of data structures and algorithms shall I use here? Please be as detailed as possible. I am weak in DS and Algo. :-(

推荐答案

以下是我如何执行此操作的示例.findWord()中的搜索可以进行优化.也可以通过一次分配单词块而不是一次来减少分配数量.一个人也可以为这种情况实现一个链表.它缺少内存释放.希望这可以帮助您前进.

Here's a sample of how I'd do it. The search in findWord() could be optimized. The number of allocations can also be reduced by allocating blocks of words instead of one at a time. One could implement a linked list for this case as well. It is lacking memory deallocation. This should hopefully get you going.

    #include <stdio.h>
    #include <assert.h>
    #include <stdlib.h>

    #define MAXWORDLEN 128

    const char* findWhitespace(const char* text)
    {
        while (*text && !isspace(*text))
            text++;
        return text;
    }

    const char* findNonWhitespace(const char* text)
    {
        while (*text && isspace(*text))
            text++;
        return text;
    }

    typedef struct tagWord
    {
        char word[MAXWORDLEN + 1];
        int count;
    } Word;

    typedef struct tagWordList
    {
        Word* words;
        int count;
    } WordList;

    WordList* createWordList(unsigned int count);

    void extendWordList(WordList* wordList, const int count)
    {
        Word* newWords = (Word*)malloc(sizeof(Word) * (wordList->count + count));
        if (wordList->words != NULL) {
            memcpy(newWords, wordList->words, sizeof(Word)* wordList->count);
            free(wordList->words);
        }
        for (int i = wordList->count; i < wordList->count + count; i++) {
            newWords[i].word[0] = '\0';
            newWords[i].count = 0;
        }
        wordList->words = newWords;
        wordList->count += count;
    }

    void addWord(WordList* wordList, const char* word)
    {
        assert(strlen(word) <= MAXWORDLEN);
        extendWordList(wordList, 1);
        Word* wordNode = &wordList->words[wordList->count - 1];
        strcpy(wordNode->word, word);
        wordNode->count++;  
    }

    Word* findWord(WordList* wordList, const char* word)
    {
        for(int i = 0; i < wordList->count; i++) {
            if (stricmp(word, wordList->words[i].word) == 0) {
                return &wordList->words[i];
            }
        }
        return NULL;
    }

    void updateWordList(WordList* wordList, const char* word)
    {
        Word* foundWord = findWord(wordList, word);
        if (foundWord == NULL) {
            addWord(wordList, word);
        } else {
            foundWord->count++;
        }
    }

    WordList* createWordList(unsigned int count)
    {
        WordList* wordList = (WordList*)malloc(sizeof(WordList));
        if (count > 0) {
            wordList->words = (Word*)malloc(sizeof(Word) * count);
            for(unsigned int i = 0; i < count; i++) {
                wordList->words[i].count = 0;
                wordList->words[i].word[0] = '\0';
            }
        }
        else {
            wordList->words = NULL;
        }
        wordList->count = count;    
        return wordList;
    }

    void printWords(WordList* wordList)
    {
        for (int i = 0; i < wordList->count; i++) {
            printf("%s: %d\n", wordList->words[i].word, wordList->words[i].count);
        }
    }

    int compareWord(const void* vword1, const void* vword2)
    {
        Word* word1 = (Word*)vword1;
        Word* word2 = (Word*)vword2;
        return strcmp(word1->word, word2->word);
    }

    void sortWordList(WordList* wordList)
    {
        qsort(wordList->words, wordList->count, sizeof(Word), compareWord);
    }

    void countWords(const char* text)
    {
        WordList   *wordList = createWordList(0);
        Word       *foundWord = NULL;
        const char *beg = findNonWhitespace(text);
        const char *end;
        char       word[MAXWORDLEN];

        while (beg && *beg) {
            end = findWhitespace(beg);
            if (*end) {
                assert(end - beg <= MAXWORDLEN);
                strncpy(word, beg, end - beg);
                word[end - beg] = '\0';
                updateWordList(wordList, word);
                beg = findNonWhitespace(end);
            }
            else {
                beg = NULL;
            }
        }

        sortWordList(wordList);
        printWords(wordList);
    }

int main(int argc, char* argv[])
{
    char* text = "abc 123 abc 456 def 789 \tyup this \r\ncan work yup 456 it can";
    countWords(text);
}

这篇关于C语言中的词频统计(不是C ++)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆