Deduplicate string instances


Question

I have an array of nearly 1,000,000 records, and each record has a "filename" field.

Many records have exactly the same filename.

My goal is to reduce the memory footprint by deduplicating string instances (filename instances, not records).

.NET Framework 2.0 is a constraint, so LINQ is not available.

I wrote a generic (and thread-safe) class for the deduplication:

using System.Collections.Generic;

public class Deduplication<T>
    where T : class
{
    private static Deduplication<T> _global = new Deduplication<T>();

    public static Deduplication<T> Global
    {
        get { return _global; }
    }

    private Dictionary<T, T> _dic; // created lazily in GetInstance, dropped in Clear
    private object _dicLocker = new object();

    public T GetInstance(T instance)
    {
        lock (_dicLocker)
        {
            if (_dic == null)
            {
                _dic = new Dictionary<T, T>();
            }

            T savedInstance;
            if (_dic.TryGetValue(instance, out savedInstance))
            {
                return savedInstance;
            }
            else
            {
                _dic.Add(instance, instance);
                return instance;
            }
        }
    }

    public void Clear()
    {
        lock (_dicLocker)
        {
            _dic = null;
        }
    }
}

The problem with this class is that it adds a lot of memory usage of its own, and that memory stays allocated until the next GC.

I am searching for a way to reduce the memory footprint without adding a lot of extra memory usage and without waiting for the next GC. I also do not want to call GC.Collect(), because it freezes the GUI for a couple of seconds.

Recommended answer

If you do not want to intern your strings, you could take an approach similar to Java 8's string deduplication (which is done by the GC on the heap).

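  1. Get the hash of each string as it is added.
  2. If the hash does not exist yet, associate it with the string.
  3. If the hash does exist, compare the strings that share that hash character by character.
  4. If the comparison matches, store a reference to the original string instead of the new copy.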
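
The four steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name HashBucketDeduplicator and its members are hypothetical, the code sticks to C# 2.0 features, and it is not thread-safe. Note that a Dictionary keyed by the string itself already performs the same hash-then-compare lookup internally, so this version mainly makes the steps explicit rather than saving memory over the class in the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that spells out the hash-then-compare steps.
// Not thread-safe; wrap calls in a lock if several threads share it.
public class HashBucketDeduplicator
{
    // Maps a hash code to the canonical strings that share that hash.
    private readonly Dictionary<int, List<string>> _buckets =
        new Dictionary<int, List<string>>();

    public string GetInstance(string value)
    {
        if (value == null)
        {
            return null;
        }

        // Step 1: get the hash of the string being added.
        int hash = value.GetHashCode();

        List<string> bucket;
        if (!_buckets.TryGetValue(hash, out bucket))
        {
            // Step 2: the hash is new, so associate it with this string.
            bucket = new List<string>(1);
            bucket.Add(value);
            _buckets.Add(hash, bucket);
            return value;
        }

        // Step 3: the hash exists, so compare contents (ordinal comparison).
        foreach (string candidate in bucket)
        {
            if (string.Equals(candidate, value))
            {
                // Step 4: match found, hand back the original instance.
                return candidate;
            }
        }

        // Same hash but different content (a collision): keep it as a new canonical string.
        bucket.Add(value);
        return value;
    }
}
```

Each record would then assign the returned value back to its filename field, so that the duplicate instance becomes unreachable and can be reclaimed by a later GC.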

Assuming you have a lot of duplicates, this would reduce your memory footprint, but interning would probably perform a lot better, as it is done at a lower level, right on the heap.
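
For comparison, a minimal sketch of the interning route, using the framework's own string.Intern (available in .NET Framework 2.0). The wrapper class InternExample and the method name Canonicalize are hypothetical names chosen for illustration.

```csharp
using System;

// Hypothetical wrapper around the CLR intern pool.
public static class InternExample
{
    // Returns the runtime's canonical instance for the given text.
    // string.Intern throws on null, so null is passed through unchanged.
    public static string Canonicalize(string filename)
    {
        return filename == null ? null : string.Intern(filename);
    }
}
```

One trade-off to keep in mind: interned strings are held by the runtime's intern pool and are unlikely to be released before the runtime shuts down, so this suits filenames that stay in use for the lifetime of the application better than short-lived data.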
