最近在C语言中进行重复数据删除的高效集合成员身份测试? [英] Efficient recent set-membership tests in C for deduplication?

查看:70
本文介绍了最近在C语言中进行重复数据删除的高效集合成员身份测试?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有无限数量的12字节消息到达.内容可以视为随机且无结构. (长度很重要,因为它比大多数哈希短).

I have an unlimited number of 12 byte messages arriving. The content can be treated as random and structureless. (The length is only important because it is shorter than most hashes.)

我想对它们进行重复数据删除.

I want to deduplicate them.

一种方法是将最后1,000条消息存储在循环缓冲区中,并在接受消息之前检查所有1,000条消息是否匹配(并将其插入循环缓冲区中以供将来检查).

One method is to store the last 1,000 messages in a circular buffer and check all 1,000 message for a match before accepting a message (and inserting it into the circular buffer for future checks).

还有哪些其他方法可以提高CPU和内存的使用效率?

What other methods are there, which could be more CPU and memory efficient?

推荐答案

12个字节的长度似乎很小.您可以通过利用strcmp()将字节数组转换为字符串,然后使用基于字符串的树结构.

12 Bytes seem quite small length. You can cast your byte array to string and then use a tree structure based on strings, via exploiting strcmp().

  1. 将字节数组转换为字符串的方法

基于字符串的树结构

除非您形成倾斜的树,否则O(logn)将是重复数据删除的最坏情况.在这种情况下,更改为自平衡树也不难.

Unless you form a skewed tree, O(logn) would be your worst case for deduplication. In such case, it is not hard to change to a self-balancing tree too.

这是我使用字符串类型键的BST实现:

Here my BST implementation using string type keys:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>


struct Node
{
  char *key;
  char *value;
  struct Node *left;
  struct Node *right;
};

struct Node* newNode(char *strKey,char *strValue)
{
  struct Node *tmp = (struct Node*) malloc(sizeof(struct Node));
  tmp->key = strdup(strKey);
  tmp->value = strdup(strValue);
  tmp->left = NULL;
  tmp->right = NULL;
  return tmp;
}

struct Node* insert(struct Node* node, char *newKey, char *newValue)
{
  if (node == NULL)
    return newNode(newKey,newValue);

  int comparison = strcmp(newKey,node->key); 
  if (comparison < 0)
    node->left  = insert(node->left, newKey, newValue);
  else if (comparison > 0)
    node->right = insert(node->right, newKey, newValue);  
  else
  {
    printf("Error occured while insert to BST\n");
    return NULL;
  }

  return node;
}

struct Node* deleteNode(struct Node* node, char *key2del)
{
  if (node == NULL)
    return node;

  int comparison = strcmp(key2del,node->key);
  if (comparison < 0)
    node->left = deleteNode(node->left, key2del);
  else if (comparison > 0)
    node->right = deleteNode(node->right, key2del);
  else // where deletion occurs
  {
    if (node->left == NULL)
    {
      struct Node *tmp = node->right;
      free(node);
      return tmp;
    }
    else if (node->right == NULL)
    {
      struct Node *tmp = node->left;
      free(node);
      return tmp;
    }

    struct Node *tmp = node->right;
    while(tmp->left != NULL)
      tmp = tmp->left;

    node->key = tmp->key;
    node->right = deleteNode(node->right, tmp->key);
  }

  return node;
}

这篇关于最近在C语言中进行重复数据删除的高效集合成员身份测试?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆