Removing all duplicate lines from a file using C


Problem Description


In this question: Detecting duplicate lines on file using c I can detect duplicate lines, but how can we remove those lines from the file?

Thanks.

Edit : To add my code :

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct somehash {
    struct somehash *next;
    unsigned hash;
    char *mem;
};

#define THE_SIZE 100000

struct somehash *table[THE_SIZE] = { NULL,};

struct somehash **some_find(char *str, unsigned len);
static unsigned some_hash(char *str, unsigned len);

int main (void)
{
    char buffer[100];
    struct somehash **pp;
    size_t len;
    FILE * pFileIn;
    FILE * pFileOut;

    pFileIn  = fopen("in.csv", "r");
    pFileOut  = fopen("out.csv", "w+");

    if (pFileIn==NULL) perror ("Error opening input file");
    if (pFileOut==NULL) perror ("Error opening output file");

    while (fgets(buffer, sizeof buffer, pFileIn)) {
        len = strlen(buffer);
        pp = some_find(buffer, len);
        if (*pp) { /* found */
            fprintf(stderr, "Duplicate:%s\n", buffer);
        }
        else {     /* not found: create one */
            fprintf(stdout, "%s", buffer);
            fprintf(pFileOut, "%s", buffer);
            *pp = malloc(sizeof **pp);
            (*pp)->next = NULL;
            (*pp)->hash = some_hash(buffer, len);
            (*pp)->mem = malloc(1 + len);
            memcpy((*pp)->mem, buffer, 1 + len);
        }
    }

return 0;
}

struct somehash **some_find(char *str, unsigned len)
{
    unsigned hash;
    unsigned short slot;
    struct somehash **hnd;

    hash = some_hash(str,len);
    slot = hash % THE_SIZE;
    for (hnd = &table[slot]; *hnd; hnd = &(*hnd)->next) {
        if ((*hnd)->hash != hash) continue;
        if (strcmp((*hnd)->mem, str)) continue;
        break;
    }

    return hnd;
}

static unsigned some_hash(char *str, unsigned len)
{
    unsigned val;
    unsigned idx;

    if (!len) len = strlen(str);

    val = 0;
    for(idx=0; idx < len; idx++ )   {
            val ^= (val >> 2) ^ (val << 5) ^ (val << 13) ^ str[idx] ^ 0x80001801;
    }

    return val;
}

But in the output file we always get the first occurrence!

Edit 2: To clarify: the intent is to find all duplicates in an input file. When there is more than one instance of a line in the input, that line should not appear in the output at all. The intent is not just to remove duplicates of that line so each occurs only once, but to remove all instances of a line if that line is duplicated in the input.

Solution

Essentially the only way to remove lines from a text file is to copy the file without those lines in the copy. The usual would be something on this order:

while (fgets(buffer, size, infile))
    if (search(your_hashtable, buffer) == NOT_FOUND) {
        fputs(buffer, outfile);
        insert(your_hashtable, buffer);
    }
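
For instance, the same loop written against C++'s standard containers, with std::unordered_set standing in for the hypothetical search/insert table (a minimal sketch, not the poster's code; the in.csv/out.csv names are taken from the question):

#include <fstream>
#include <string>
#include <unordered_set>

int main()
{
    std::ifstream infile("in.csv");
    std::ofstream outfile("out.csv");
    std::unordered_set<std::string> seen;   // plays the role of your_hashtable
    std::string line;

    // insert() returns {iterator, bool}; the bool is true only for a new line,
    // so each distinct line is written exactly once (first occurrence kept).
    while (std::getline(infile, line))
        if (seen.insert(line).second)
            outfile << line << "\n";

    return 0;
}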

If you want to save some storage space, you might store hashes instead of complete lines. In theory that could fail due to a hash collision, but if you use a cryptographic hash like SHA-256, the chances of a collision are probably smaller than the chances of a string comparison coming out wrong due to a CPU error. Besides: if you find a collision with SHA-256, you can probably get at least a little fame (if not fortune) from that alone.
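
To illustrate the hash-only variant, here is a sketch that stores 32-byte SHA-256 digests instead of the lines themselves. It assumes OpenSSL's SHA256() from <openssl/sha.h> (link with -lcrypto); any other hashing library would be used the same way:

#include <fstream>
#include <string>
#include <unordered_set>
#include <openssl/sha.h>

int main()
{
    std::ifstream infile("in.csv");
    std::ofstream outfile("out.csv");
    std::unordered_set<std::string> seen;   // holds 32-byte digests, not lines
    std::string line;

    while (std::getline(infile, line)) {
        unsigned char digest[SHA256_DIGEST_LENGTH];
        SHA256(reinterpret_cast<const unsigned char *>(line.data()),
               line.size(), digest);
        // A line is written only if its digest has not been seen before.
        if (seen.insert(std::string(reinterpret_cast<char *>(digest),
                                    sizeof digest)).second)
            outfile << line << "\n";
    }

    return 0;
}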

Edit: As @Zack alluded to, the situation with hash size is basically a matter of deciding what chance of a collision you're willing to accept. With a cryptographic 256-bit hash, the chances are so remote it's hardly worth considering. If you reduce that to, say, a 128-bit hash, the chances go up quite a bit, but they're still small enough for most practical purposes. On the other hand, if you were to reduce it to, say, a 32-bit CRC, chances of a collision are probably higher than I'd be happy accepting if the data mattered much.
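
Roughly, by the birthday bound, hashing n distinct lines into b bits gives a collision probability of about n^2 / 2^(b+1). For a million lines that works out to roughly 10^12 / 2^129, on the order of 10^-27, with a 128-bit hash; with a 32-bit CRC the same estimate exceeds 1, i.e. a collision is essentially certain.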

I should probably mention one more possibility: a bit of a hybrid -- store something like a 32-bit CRC (which is really fast to compute) along with the offset in the file where that line starts. If your file never exceeds 4 GB, you can store both in only 8 bytes.

In this case, you work just a little differently: you start by computing the CRC, and the vast majority of the time, when it's not in the file, you copy the file to the output and insert those values in the hash table. When it is already in the table, you seek back to the possibly-identical line, read it back in, and compare to the current line. If they match, you go back to where you were and advance to the next line. If they don't match, you copy the current line to the output, and add its offset to the hash table.
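
A sketch of that hybrid approach, assuming zlib's crc32() for the checksum (link with -lz) and the stream position as the offset; the small line_at helper is introduced here purely for illustration. It keeps the first occurrence of each line and only re-reads the file when two CRCs collide:

#include <fstream>
#include <string>
#include <unordered_map>
#include <zlib.h>

// Re-read the line starting at 'off' without losing the current read position.
static std::string line_at(std::ifstream &in, std::streamoff off)
{
    in.clear();                           // clear eof so tellg/seekg work
    std::streamoff here = in.tellg();
    in.seekg(off);
    std::string s;
    std::getline(in, s);
    in.clear();
    in.seekg(here);
    return s;
}

int main()
{
    std::ifstream infile("in.csv", std::ios::binary);  // binary keeps offsets exact
    std::ofstream outfile("out.csv");
    std::unordered_multimap<uLong, std::streamoff> seen;  // CRC -> offset of line start
    std::string line;

    for (;;) {
        std::streamoff start = infile.tellg();
        if (!std::getline(infile, line))
            break;

        uLong crc = crc32(0L, Z_NULL, 0);
        crc = crc32(crc, reinterpret_cast<const Bytef *>(line.data()), line.size());

        bool duplicate = false;
        auto range = seen.equal_range(crc);
        for (auto it = range.first; it != range.second; ++it)
            if (line_at(infile, it->second) == line) {  // real match, not just a CRC match
                duplicate = true;
                break;
            }

        if (!duplicate) {                 // unseen line (or a mere CRC collision)
            outfile << line << "\n";
            seen.insert(std::make_pair(crc, start));
        }
    }

    return 0;
}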

Edit 2: Let's assume for the moment that the file is small enough that you can reasonably fit the whole thing in memory. In that case, you can store a line, and a line number where it occurred. If a line is already stored, you can change its line number to -1, to indicate that it was duplicated and shouldn't appear in the output.

In C++ (since it defines the relevant data structures), it could look something like this:

std::string line;

typedef std::map<std::string, int> line_record;

line_record lines;
int line_number = 1;

while (std::getline(infile, line)) {
    line_record::iterator existing = lines.find(line);
    if (existing != lines.end()) // if it was already in the map
        existing->second = -1;    // indicate that it's duplicated
    else
        lines.insert(std::make_pair(line, line_number)); // otherwise, add it to map
    ++line_number;
}

Okay, that reads in the lines, and for each line, it checks whether it's already in the map. If it is, it sets its line number to -1, to indicate that it won't appear in the output. If it wasn't, it inserts it into the map along with its line number.

line_record::iterator pos;

std::vector<line_record::iterator> sortable_lines;

for (pos=lines.begin(); pos != lines.end(); ++pos)
    if (pos->second != -1)
        sortable_lines.push_back(pos);

This sets up sortable_lines as a vector of iterators into the map, so instead of copying entire lines, we'll just copy iterators (essentially like pointers) to those lines. It then copies the iterators into the vector, but only for lines whose line number isn't -1.

std::sort(sortable_lines.begin(), sortable_lines.end(), by_line_number());

struct by_line_number {
     bool operator()(line_record::iterator a, line_record::iterator b) { 
         return a->second < b->second;
     }
};

Then we sort those iterators by the line number.

for (int i=0; i<sortable_lines.size(); i++)
     outfile << sortable_lines[i]->first << "\n";

Finally, we copy each line to the output file, in order by their original line numbers.
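
Putting those pieces together (with the std::getline argument order fixed, the missing parenthesis added, and the comparison functor declared before std::sort uses it), a minimal self-contained version might look like this; in.csv and out.csv are the file names from the question:

#include <algorithm>
#include <fstream>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, int> line_record;

// Order map iterators by the line number stored as the mapped value.
struct by_line_number {
    bool operator()(line_record::iterator a, line_record::iterator b) const {
        return a->second < b->second;
    }
};

int main()
{
    std::ifstream infile("in.csv");
    std::ofstream outfile("out.csv");

    line_record lines;
    std::string line;
    int line_number = 1;

    // Pass 1: remember the line number of each line's first occurrence;
    // a repeated line gets its number overwritten with -1 (suppress it).
    while (std::getline(infile, line)) {
        line_record::iterator existing = lines.find(line);
        if (existing != lines.end())
            existing->second = -1;
        else
            lines.insert(std::make_pair(line, line_number));
        ++line_number;
    }

    // Pass 2: collect iterators to the surviving lines and restore input order.
    std::vector<line_record::iterator> sortable_lines;
    for (line_record::iterator pos = lines.begin(); pos != lines.end(); ++pos)
        if (pos->second != -1)
            sortable_lines.push_back(pos);

    std::sort(sortable_lines.begin(), sortable_lines.end(), by_line_number());

    for (std::size_t i = 0; i < sortable_lines.size(); ++i)
        outfile << sortable_lines[i]->first << "\n";

    return 0;
}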
