计算文本文件中单词的重复出现次数 [英] Count the reocurrence of words in text file

查看:108
本文介绍了计算文本文件中单词的重复出现次数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在以前的练习中,我扩展了一个文本文件,每行填充一个单词.

Expanding on my a previous exercise, I have a text file that is filled with one word per line.

hello
hi
hello
bonjour
bonjour
hello

当我从文件中读取这些单词时,我想将它们与结构指针数组(从文本文件创建)进行比较.如果单词在数组中不存在,则应将其存储到计数为1的结构指针中.如果单词已存在于数组中,则计数应增加1.我将结果写入新文件(已经存在).

As I read these words from the file I would like to compare them to an array of struct pointers (created from the text file). If the word does not exist within the array, the word should be stored into a struct pointer with a count of 1. If the word already exist in the array the count should increase by 1. I will write the outcome into a new file (that already exist).

hello = 3
hi = 1
bonjour = 2 

这是我的代码

#include <stdio.h>
#include <stdlib.h>
struct wordfreq{
    int count;
    char *word;
};
int main(int argc, char * argv[]) {
    struct wordfreq *words[1000] = {NULL};
    int i, j, f = 0;
    for(i=0; i <1000; i++)
        words[i] = (struct wordfreq*)malloc(sizeof(struct wordfreq));

    FILE *input = fopen(argv[1], "r");
    FILE *output = fopen(argv[2], "w");
    if(input == NULL){
        printf("Error! Can't open file.\n");
        exit(0);
    }

    char str[20];
    i=0;
    while(fscanf(input, "%s[^\n]", &str) ==1){
        //fprintf(output, "%s:\n", str);
        for(j=0; j<i; j++){
            //fprintf(output, "\t%s  == %s\n", str, words[j] -> word);
            if(str == words[j]->word){
                words[j] ->count ++;
                f = 1;
            }
        }
        if(f==0){
            words[i]->word = str;
            words[i]->count = 1;
        }
        //fprintf(output, "\t%s = %d\n", words[i]->word, words[i]->count);
        i++;
    }

    for(j=0; j< i; j++)
        fprintf(output, "%s = %d\n", words[j]->word, words[j]->count);
    for(i=0; i<1000; i++){
        free(words[i]);
    }
    return 0;
}

我使用了几个fprintf语句查看我的值,并且可以看到,当str正确时,当我到达该行以将str与其他数组结构指针(str == words[I]->word)进行比较时,横向words[0] -> word始终相同因为str和words[i]->words的其余部分均为(null).我仍在尝试完全理解混合点和结构,并附带任何想法,评论和抱怨?

I used several fprintf statements to look at my values and I can see that while str is right, when I reach the line to compare str to the other array struct pointers (str == words[I]->word) during the transversal words[0] -> word is always the same as str and the rest of the words[i]->words are (null). I am still trying to completely understand mixing pointes and structs, with that said any thoughts, comments, complains?

推荐答案

您可能会使事情变得比必要的难,并且在输入文件的情况下,您肯定会分配比必需的997更多的结构.无需预先分配所有1000结构. (您可以这样做,这只是内存管理问题).关键是,每次遇到唯一单词时,您只需要分配一个新结构即可. (对于您的数据文件,为3次).对于所有其他情况,您只需更新count,以添加已存储单词的出现次数即可.

You may be making things a bit harder than necessary, and you are certainly allocating 997 more structures than necessary in the case of your input file. There is no need to allocate all 1000 structs up front. (you are free to do so, it's just a memory management issue). The key is that you only need allocate a new struct each time a unique word is encountered. (in the case of your data file, 3-times). For all other cases, you are simply updating count to add the occurrence for a word you have already stored.

此外,如果没有令人信服的理由使用struct,则使用 pointers-to-char 数组就像指向每个word的指针一样容易,并且然后是int [1000]的简单数组作为您的count(或频率)数组.你的选择.对于两个数组,只需要为每个唯一的word分配,而无需为每个struct单独分配.

Also, if there is no compelling reason to use a struct, it is just as easy to use an array of pointers-to-char as your pointers to each word, and then a simple array of int [1000] as your count (or frequency) array. Your choice. In the case of two arrays, you only need to allocate for each unique word and never need a separate allocation for each struct.

将这些部分放在一起,可以将代码(不包括文件,可以通过简单的重定向处理)减少到以下内容:

Putting those pieces together, you could reduce your code (not including the file -- which can be handled by simple redirection) to the following:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { MAXC = 128, MAXW = 1000 };

struct wordfreq{
    int count;
    char *word;
};

int main (void) {

    struct wordfreq *words[MAXW] = {0};
    char tmp[MAXC] = "";
    int n = 0;

    /* while < MAXW unique words, read each word in file */
    while (n < MAXW && fscanf (stdin, " %s", tmp) == 1) {
        int i;

        for (i = 0; i < n; i++)         /* check against exising words */
            if (strcmp (words[i]->word, tmp) == 0)  /* if exists, break */
                break;

        if (i < n) {            /* if exists */
            words[i]->count++;  /* update frequency */
            continue;           /* get next word */
        }

        /* new word found, allocate struct and
         * allocate storage for word (+ space for nul-byte) 
         */
        words[n] = malloc (sizeof *words[n]);
        words[n]->word = malloc (strlen (tmp) + 1);
        if (!words[n] || !words[n]->word) { /* validate ALL allocations */
            fprintf (stderr, "error: memory exhausted, words[%d].\n", n);
            break;
        }
        words[n]->count = 0;    /* initialize count */

        strcpy (words[n]->word, tmp); /* copy new word to words[n] */
        words[n]->count++;            /* update frequency to 1 */
        n++;                          /* increment word count */
    }

    for (int i = 0; i < n; i++) { /* for each word */
        printf ("%s = %d\n", words[i]->word, words[i]->count);
        free (words[i]->word);    /* free memory when no longer needed */
        free (words[i]);
    }

    return 0;
}

示例输入文件

$ cat dat/wordfile.txt
hello
hi
hello
bonjour
bonjour
hello

使用/输出示例

$ ./bin/filewordfreq <dat/wordfile.txt
hello = 3
hi = 1
bonjour = 2

与任何动态分配内存的代码一样,您将要验证对内存的使用,以确保您没有写超出范围或没有条件地移动或跳转到未初始化的值.在Linux中,valgrind是很自然的选择(每个OS都有类似的程序).只需通过它运行程序,例如:

As with any code that dynamically allocates memory, you will want to validate your use of the memory to insure you have not written beyond the bounds or based a conditional move or jump on an uninitialized value. In Linux, valgrind is the natural choice (there are similar programs for each OS). Just run you program through it, e.g.:

$ valgrind ./bin/filewordfreqstruct <dat/wordfile.txt
==2000== Memcheck, a memory error detector
==2000== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==2000== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==2000== Command: ./bin/filewordfreqstruct
==2000==
hello = 3
hi = 1
bonjour = 2
==2000==
==2000== HEAP SUMMARY:
==2000==     in use at exit: 0 bytes in 0 blocks
==2000==   total heap usage: 6 allocs, 6 frees, 65 bytes allocated
==2000==
==2000== All heap blocks were freed -- no leaks are possible
==2000==
==2000== For counts of detected and suppressed errors, rerun with: -v
==2000== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

验证您free您分配的所有内存,并且没有内存错误.

Verify that you free all memory you allocate and that there are no memory errors.

仔细检查一下,如果还有其他问题,请告诉我.

Look things over and let me know if you have any further questions.

使用2数组代替 struct

如上所述,有时使用存储阵列频率阵列可以简化完成同一件事的过程.每当您需要任何集合"的频率时,您首先想到的应该是频率阵列.它不过是一个与您的集合"中的项目数相同大小的数组(在开始时初始化为0).使用相同的方法,当在存储数组中添加(或查找现有副本)元素时,将频率数组中的相应元素增加1 .完成后,您的频率数组元素会保持存储数组中相应元素出现的频率.

As mentioned above, sometimes using a storage array and a frequency array can simplify accomplishing the same thing. Whenever you are faced with needing the frequency of any "set", your first thought should be a frequency array. It is nothing more than an array of the same size as the number of items in your "set", (initialized to 0 at the beginning). The same approach applies, when you add (or find a duplicate of an existing) element in your storage array, you increment the corresponding element in your frequency array by 1. When you are done, your frequency array elements hold the frequency the corresponding elements in your storage array appear.

这里等同于上面的程序.

Here is an equivalent to the program above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { MAXC = 128, MAXW = 1000 };

int main (void) {

    char *words[MAXW] = {NULL},   /* storage array of pointers to char* */
        tmp[MAXC] = "";
    int freq[MAXW] = {0}, n = 0;  /* simple integer frequency array */

    /* while < MAXW unique words, read each word in file */
    while (n < MAXW && fscanf (stdin, " %s", tmp) == 1) {
        int i;

        for (i = 0; words[i]; i++)   /* check against exising words */
            if (strcmp (words[i], tmp) == 0)    /* if exists, break */
                break;

        if (words[i]) {     /* if exists */
            freq[i]++;      /* update frequency */
            continue;       /* get next word */
        }

        /* new word found, allocate storage (+ space for nul-byte) */
        words[n] = malloc (strlen (tmp) + 1);
        if (!words[n]) {    /* validate ALL allocations */
            fprintf (stderr, "error: memory exhausted, words[%d].\n", n);
            break;
        }

        strcpy (words[n], tmp); /* copy new word to words[n] */
        freq[n]++;              /* update frequency to 1 */
        n++;                    /* increment word count */
    }

    for (int i = 0; i < n; i++) {                 /* for each word */
        printf ("%s = %d\n", words[i], freq[i]);  /* output word + freq */
        free (words[i]);           /* free memory when no longer needed */
    }

    return 0;
}

使用这种方法,通过为count使用静态声明的频率数组,可以消除1/2的内存分配.两种方法都可以,取决于您自己.

Using this approach, you eliminate 1/2 of your memory allocations by using a statically declared frequency array for your count. Either way is fine, it is largely up to you.

这篇关于计算文本文件中单词的重复出现次数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆