On string interning and alternatives


Question

I have a large file which, in essence, contains data like:

Netherlands,Noord-holland,Amsterdam,FooStreet,1,...,...
Netherlands,Noord-holland,Amsterdam,FooStreet,2,...,...
Netherlands,Noord-holland,Amsterdam,FooStreet,3,...,...
Netherlands,Noord-holland,Amsterdam,FooStreet,4,...,...
Netherlands,Noord-holland,Amsterdam,FooStreet,5,...,...
Netherlands,Noord-holland,Amsterdam,BarRoad,1,...,...
Netherlands,Noord-holland,Amsterdam,BarRoad,2,...,...
Netherlands,Noord-holland,Amsterdam,BarRoad,3,...,...
Netherlands,Noord-holland,Amsterdam,BarRoad,4,...,...
Netherlands,Noord-holland,Amstelveen,BazDrive,1,...,...
Netherlands,Noord-holland,Amstelveen,BazDrive,2,...,...
Netherlands,Noord-holland,Amstelveen,BazDrive,3,...,...
Netherlands,Zuid-holland,Rotterdam,LoremAve,1,...,...
Netherlands,Zuid-holland,Rotterdam,LoremAve,2,...,...
Netherlands,Zuid-holland,Rotterdam,LoremAve,3,...,...
...

This is a multi-gigabyte file. I have a class that reads this file and exposes these lines (records) as an IEnumerable<MyObject>. This MyObject has several properties (Country, Province, City, ...).

As you can see there is a LOT of duplication of data. I want to keep exposing the underlying data as an IEnumerable<MyObject>. However, some other class might (and probably will) make some hierarchical view/structure of this data like:

Netherlands
    Noord-holland
        Amsterdam
            FooStreet [1, 2, 3, 4, 5]
            BarRoad [1, 2, 3, 4]
            ...
        Amstelveen
            BazDrive [1, 2, 3]
            ...
        ...
    Zuid-holland
        Rotterdam
            LoremAve [1, 2, 3]
            ...
        ...
    ...
...

When reading this file, I do, essentially, this:

// myfile: the lines of the file, e.g. File.ReadLines(path)
foreach (var line in myfile) {
    var fields = line.Split(',');
    yield return new MyObject {
        Country = fields[0],
        Province = fields[1],
        City = fields[2],
        Street = fields[3],
        //...other fields
    };
}

Now, to the actual question at hand: I could use string.Intern() to intern the Country, Province, City, and Street strings (those are the main 'villains'; MyObject has several other properties not relevant to the question).

foreach (var line in myfile) {
    var fields = line.Split(',');
    yield return new MyObject {
        Country = string.Intern(fields[0]),
        Province = string.Intern(fields[1]),
        City = string.Intern(fields[2]),
        Street = string.Intern(fields[3]),
        //...other fields
    };
}

This will save about 42% of memory (tested and measured) when holding the entire dataset in memory, since all duplicate strings will be references to the same string. Also, when creating the hierarchical structure with a lot of LINQ .ToDictionary() calls, the keys (Country, Province etc.) of the respective dictionaries will be much more efficient.
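To make the "same reference" point concrete, here is a minimal sketch (the InternDemo class and its method names are illustrative and not part of the original code). Strings built at runtime, as line.Split() produces them, are separate objects even when their contents are equal; string.Intern() maps them onto one shared instance.

```csharp
using System;

public static class InternDemo
{
    // Builds a fresh string at runtime, like the fields produced by Split().
    public static string Fresh(string text)
    {
        return new string(text.ToCharArray());
    }

    // True only when both references point at the very same string object.
    public static bool SameInstance(string a, string b)
    {
        return object.ReferenceEquals(a, b);
    }

    public static void Demo()
    {
        string a = Fresh("Netherlands");
        string b = Fresh("Netherlands");

        Console.WriteLine(a == b);               // True: equal contents
        Console.WriteLine(SameInstance(a, b));   // False: two separate objects
        Console.WriteLine(SameInstance(string.Intern(a), string.Intern(b)));   // True: one pooled object
    }
}
```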

However, one of the drawbacks of using string.Intern() (aside from a slight loss of performance, which is not a problem) is that the strings won't be garbage collected anymore. But when I'm done with my data I do want all that stuff garbage collected (eventually).

I could use a Dictionary<string, string> to 'intern' this data, but I don't like the "overhead" of having a key and a value where I am, actually, only interested in the key. I could set the value to null or use the same string as the value (which will result in the same reference in key and value). It's only a small price of a few bytes to pay, but it's still a price.

Something like a HashSet<string> makes more sense to me. However, I cannot get a reference to a string in the HashSet; I can see if the HashSet contains a specific string, but not get a reference to that specific instance of the located string in the HashSet. I could implement my own HashSet for this, but I am wondering what other solutions you kind StackOverflowers may come up with.
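On newer runtimes this limitation no longer applies: HashSet<T>.TryGetValue (available from .NET Framework 4.7.2 and .NET Core 2.0, so not an option on older targets) hands back the instance actually stored in the set. A minimal sketch, with the hypothetical class name HashSetInterner:

```csharp
using System;
using System.Collections.Generic;

public class HashSetInterner
{
    private readonly HashSet<string> _items = new HashSet<string>();

    public int Count { get { return _items.Count; } }

    // Returns the pooled instance equal to value, adding value to the pool if it is new.
    public string Add(string value)
    {
        if (value == null) return null;

        string existing;
        if (_items.TryGetValue(value, out existing))
            return existing;   // the instance already stored in the set

        _items.Add(value);
        return value;
    }
}
```

Unlike the Dictionary<string, string> approach, each entry stores a single reference, which is the saving the question is after.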

Requirements:

  • My "FileReader" class needs to keep exposing an IEnumerable<MyObject>
  • My "FileReader" class may do stuff (like string.Intern()) to optimize memory usage
  • The MyObject class cannot change; I won't make a City class, Country class etc. and have MyObject expose those as properties instead of simple string properties
  • Goal is to be (more) memory efficient by de-duplicating most of the duplicate strings in Country, Province, City etc.; how this is achieved (e.g. string interning, internal hashset / collection / structure of something) is not important. However:
  • I know I can stuff the data in a database or use other solutions in that direction; I am not interested in these kinds of solutions.
  • Speed is only of secondary concern; the quicker the better, of course, but a (slight) loss in performance while reading/iterating the objects is no problem
  • Since this is a long-running process (as in: a Windows service running 24/7/365) that, occasionally, processes a bulk of this data, I want the data to be garbage-collected when I'm done with it; string interning works great but will, in the long run, result in a huge string pool with lots of unused data
  • I would like any solutions to be "simple"; adding 15 classes with P/Invokes and inline assembly (exaggerated) is not worth the effort. Code maintainability is high on my list.
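One way to reconcile the de-duplication and garbage-collection requirements is to give every read its own pool instead of a process-wide one. The sketch below is only an illustration under stated assumptions: MyObject is a trimmed-down stand-in, and FileReader.Read is a hypothetical method that takes the lines as input. The pool lives inside the iterator, so once the enumerator is no longer referenced only the MyObject instances keep the strings alive.

```csharp
using System;
using System.Collections.Generic;

// Trimmed-down stand-in for the MyObject class from the question.
public class MyObject
{
    public string Country { get; set; }
    public string Province { get; set; }
    public string City { get; set; }
    public string Street { get; set; }
}

public static class FileReader
{
    // Hypothetical reader: de-duplicates strings within a single enumeration only.
    public static IEnumerable<MyObject> Read(IEnumerable<string> lines)
    {
        // Per-enumeration pool; it is collectable once the enumerator is released.
        var pool = new Dictionary<string, string>();

        foreach (var line in lines)
        {
            var fields = line.Split(',');
            yield return new MyObject
            {
                Country = Pooled(pool, fields[0]),
                Province = Pooled(pool, fields[1]),
                City = Pooled(pool, fields[2]),
                Street = Pooled(pool, fields[3]),
            };
        }
    }

    // Returns the instance already in the pool, or stores and returns value.
    private static string Pooled(Dictionary<string, string> pool, string value)
    {
        string existing;
        if (pool.TryGetValue(value, out existing))
            return existing;
        pool.Add(value, value);
        return value;
    }
}
```

MyObject still exposes plain string properties and callers still see an IEnumerable<MyObject>, so the first and third requirements are untouched.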

This is more of a 'theoretical' question; it's purely out of curiosity / interest that I'm asking. There is no "real" problem, but I can see that in similar situations this might be a problem to someone.

For example: I could do something like this:

public class StringInterningObject
{
    private HashSet<string> _items;

    public StringInterningObject()
    {
        _items = new HashSet<string>();
    }

    public string Add(string value)
    {
        if (_items.Add(value))
            return value;  //New item added; return value since it wasn't in the HashSet
        //MEH... this will quickly go O(n)
        return _items.First(i => i.Equals(value)); //Find (and return) actual item from the HashSet and return it
    }
}

But with a large set of (to be de-duplicated) strings this will quickly bog down. I could have a peek at the reference source for HashSet or Dictionary or... and build a similar class that doesn't return bool for the Add() method but the actual string found in the internals/bucket.

The best I could come up with until now is something like:

public class StringInterningObject
{
    private ConcurrentDictionary<string, string> _items;

    public StringInterningObject()
    {
        _items = new ConcurrentDictionary<string, string>();
    }

    public string Add(string value)
    {
        return _items.AddOrUpdate(value, value, (v, i) => i);
    }
}

Which has the "penalty" of having a Key and a Value where I'm actually only interested in the Key. Just a few bytes though, a small price to pay. Coincidentally, this also yields 42% less memory usage; the same result as when using string.Intern().
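The same behaviour can also be expressed with GetOrAdd, which returns the stored instance when the key is already present and otherwise stores the supplied value, without passing through an update delegate. A sketch (the class name ConcurrentInterner is illustrative):

```csharp
using System.Collections.Concurrent;

public class ConcurrentInterner
{
    private readonly ConcurrentDictionary<string, string> _items =
        new ConcurrentDictionary<string, string>();

    // Returns the instance stored for this value; may be called from several threads.
    public string Add(string value)
    {
        return _items.GetOrAdd(value, value);
    }
}
```

Whether this changes load time measurably has not been tested here; the appeal is mainly that the intent is easier to read.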

tolanj came up with System.Xml.NameTable:

public class StringInterningObject
{
    private System.Xml.NameTable nt = new System.Xml.NameTable();

    public string Add(string value)
    {
        return nt.Add(value);
    }
}

(I removed the lock and string.Empty check (the latter since the NameTable already does that))

xanatos came up with a CachingEqualityComparer:

public class StringInterningObject
{
    private class CachingEqualityComparer<T> : IEqualityComparer<T> where T : class
    {
        public System.WeakReference X { get; private set; }
        public System.WeakReference Y { get; private set; }

        private readonly IEqualityComparer<T> Comparer;

        public CachingEqualityComparer()
        {
            Comparer = EqualityComparer<T>.Default;
        }

        public CachingEqualityComparer(IEqualityComparer<T> comparer)
        {
            Comparer = comparer;
        }

        public bool Equals(T x, T y)
        {
            bool result = Comparer.Equals(x, y);

            if (result)
            {
                X = new System.WeakReference(x);
                Y = new System.WeakReference(y);
            }

            return result;
        }

        public int GetHashCode(T obj)
        {
            return Comparer.GetHashCode(obj);
        }

        public T Other(T one)
        {
            if (object.ReferenceEquals(one, null))
            {
                return null;
            }

            object x = X.Target;
            object y = Y.Target;

            if (x != null && y != null)
            {
                if (object.ReferenceEquals(one, x))
                {
                    return (T)y;
                }
                else if (object.ReferenceEquals(one, y))
                {
                    return (T)x;
                }
            }

            return one;
        }
    }

    private CachingEqualityComparer<string> _cmp; 
    private HashSet<string> _hs;

    public StringInterningObject()
    {
        _cmp = new CachingEqualityComparer<string>();
        _hs = new HashSet<string>(_cmp);
    }

    public string Add(string item)
    {
        if (!_hs.Add(item))
            item = _cmp.Other(item);
        return item;
    }
}

(Modified slightly to "fit" my "Add() interface")

As per Henk Holterman's request (http://stackoverflow.com/questions/29984839/on-string-interning-and-alternatives?noredirect=1#comment48102485_29984839):

public class StringInterningObject
{
    private Dictionary<string, string> _items;

    public StringInterningObject()
    {
        _items = new Dictionary<string, string>();
    }

    public string Add(string value)
    {
        string result;
        if (!_items.TryGetValue(value, out result))
        {
            _items.Add(value, value);
            return value;
        }
        return result;
    }
}

I'm just wondering if there's maybe a neater/better/cooler way to 'solve' my (not so much of an actual) problem. By now I have enough options, I guess.

Here are some numbers I came up with for some simple, short, preliminary tests:


Non-optimized
Memory: ~4.5 GB
Load time: ~52s

StringInterningObject (see above, the ConcurrentDictionary variant)
Memory: ~2.6 GB
Load time: ~49s

string.Intern()
Memory: ~2.3 GB
Load time: ~45s

System.Xml.NameTable
Memory: ~2.3 GB
Load time: ~41s

CachingEqualityComparer
Memory: ~2.3 GB
Load time: ~58s

StringInterningObject (see above, the (non-concurrent) Dictionary variant) as per Henk Holterman's request:
Memory: ~2.3 GB
Load time: ~39s

Although the numbers aren't very definitive, it seems that the many memory allocations in the non-optimized version actually slow things down more than using either string.Intern() or the StringInterningObjects above, which results in (slightly) longer load times. Also, string.Intern() seems to 'win' over StringInterningObject, but not by a large margin; << See updates.
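For anyone repeating the comparison, figures like these can be gathered roughly as follows. This is a sketch and not the harness behind the numbers above: the LoadBenchmark.Measure helper is hypothetical, and GC.GetTotalMemory(true) only approximates the size of the managed heap after a forced collection.

```csharp
using System;
using System.Diagnostics;

public static class LoadBenchmark
{
    public struct Result
    {
        public TimeSpan Elapsed;
        public long ManagedBytes;
        public object Data;   // keeps the loaded data reachable while memory is measured
    }

    // Runs load(), then reports the elapsed time and the growth of the managed heap.
    public static Result Measure(Func<object> load)
    {
        long before = GC.GetTotalMemory(true);   // collect first to get a cleaner baseline
        var watch = Stopwatch.StartNew();

        object data = load();

        watch.Stop();
        long after = GC.GetTotalMemory(true);

        return new Result
        {
            Elapsed = watch.Elapsed,
            ManagedBytes = after - before,
            Data = data
        };
    }
}
```

A call would pass a delegate that materializes the whole dataset, for example one that copies the reader's IEnumerable<MyObject> into a List, so that the strings are still reachable when the second measurement is taken.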

Recommended answer

I've had exactly this requirement and indeed asked on SO but, with nothing like the detail of your question, got no useful responses. One built-in option is a (System.Xml).NameTable, which is basically a string atomization object, which is what you are looking for. We had the following (we've actually moved to Intern because we do keep these strings for the lifetime of the app):

if (name == null) return null;
if (name == "") return string.Empty;
lock (m_nameTable)
{
    return m_nameTable.Add(name);
}

on a private NameTable

http://referencesource.microsoft.com/#System.Xml/System/Xml/NameTable.cs,c71b9d3a7bc2d2af shows it's implemented as a simple hashtable, i.e. only storing one reference per string.
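Put together, the snippet above might sit in a small wrapper like the one below. This is a sketch: the class name NameTableInterner is an assumption, while the null and empty checks and the lock mirror the answer's code (the lock only matters if several threads add strings at the same time).

```csharp
using System.Xml;

public class NameTableInterner
{
    private readonly NameTable m_nameTable = new NameTable();

    // Returns the atomized instance for name; equal strings yield the same reference.
    public string Add(string name)
    {
        if (name == null) return null;
        if (name == "") return string.Empty;
        lock (m_nameTable)
        {
            return m_nameTable.Add(name);
        }
    }
}
```

Because the NameTable is a private field, dropping the wrapper makes the whole table eligible for garbage collection, which string.Intern() cannot offer.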

Downside? It's completely string-specific. If you do cross-test for memory / speed I'd be interested to see the results. We were already using System.Xml heavily; it might of course not seem so natural if you were not.
