Dealing with very large Lists on x86


Question

I need to work with large lists of floats, but I am hitting memory limits on x86 systems. I do not know the final length up front, so I need to use an expandable type. On x64 systems I could use <gcAllowVeryLargeObjects>.

My current data types:

List<RawData> param1 = new List<RawData>();
List<RawData> param2 = new List<RawData>();
List<RawData> param3 = new List<RawData>();

public class RawData
{
    public string name;
    public List<float> data;
}

The length of the paramN lists is low (currently 50 or lower), but data can be 10m+. When the length is 50, I hit memory limits (OutOfMemoryException) at just above 1m data points; when the length is 25, I hit the limit at just above 2m data points. (If my calculations are right, that is exactly 200MB, plus the size of name, plus overhead.) What can I use to increase this limit?

Edit: I tried using List<List<float>> with a max inner list size of 1 << 17 (131072), which raised the limit somewhat, but still not as far as I need.

Edit2: I tried reducing the chunk size in the List<List<float>> to 8192, and I got OOM at ~2.3m elements, with Task Manager reading ~1.4GB for the process. It looks like I need to reduce memory usage between the data source and the storage, or trigger GC more often - I was able to gather 10m data points in an x64 process on a PC with 4GB RAM, and IIRC the process never went over 3GB.

Edit3: I condensed my code down to just the parts that handle the data. http://pastebin.com/maYckk84

Edit4: I had a look in dotMemory, and found that my data structure really does take up ~1GB with the settings I was testing (50ch * 3 params * 2m events = 300,000,000 float elements). I guess I will need to cap it on x86, or figure out how to write the data to disk in this format as I acquire it.
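One way to realize the "write to disk as I get data" idea is to append each incoming float to a binary file so that memory stays bounded no matter how many points arrive. A minimal sketch (the class name `FloatDiskWriter` and the file layout are assumptions for illustration, not part of the original code):

```csharp
using System;
using System.IO;

// Hypothetical sketch: stream incoming floats straight to disk.
// The on-disk format is simply consecutive 4-byte floats.
public class FloatDiskWriter : IDisposable
{
    private readonly BinaryWriter _writer;
    public long Count { get; private set; }

    public FloatDiskWriter(string path)
    {
        _writer = new BinaryWriter(File.Open(path, FileMode.Create));
    }

    public void Add(float value)
    {
        _writer.Write(value); // 4 bytes per sample, sequential append
        Count++;
    }

    public void Dispose() => _writer.Dispose();
}
```

Reading the data back is symmetric: a BinaryReader calling ReadSingle in a loop yields the samples in insertion order.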

Answer

First of all, on x86 systems the memory limit is 2GB, not 200MB. I presume your problem is much trickier than that: you have aggressive LOH (large object heap) fragmentation.

The CLR uses different heaps for small and large objects. An object is large if its size exceeds 85,000 bytes. The LOH is a tricky thing: it is not eager to return unused memory to the OS, and it is very poor at defragmentation.

.NET's List<T> is an implementation of the ArrayList data structure: it stores elements in a fixed-size array, and when the array fills up, a new array of double the size is allocated. That continual regrowth of the backing array, at your data volumes, is a "starvation" scenario for the LOH.

So you have to use a data structure tailored to your needs, e.g. a list of chunks, with each chunk small enough not to land on the LOH. Here is a small prototype:

public class ChunkedList
{
    // 8000 floats = 32,000 bytes per chunk, safely below the 85,000-byte LOH threshold.
    private const int ChunkSize = 8000;
    private readonly List<float[]> _chunks = new List<float[]>();
    private int _count;

    public void Add(float item)
    {
        int chunk = _count / ChunkSize;
        int ind = _count % ChunkSize;
        if (ind == 0)
        {
            // Current chunk is full (or this is the first element): allocate the next one.
            _chunks.Add(new float[ChunkSize]);
        }
        _chunks[chunk][ind] = item;
        _count++;
    }

    public float this[int index]
    {
        get
        {
            if (index < 0 || index >= _count) throw new IndexOutOfRangeException();
            int chunk = index / ChunkSize;
            int ind = index % ChunkSize;
            return _chunks[chunk][ind];
        }
        set
        {
            if (index < 0 || index >= _count) throw new IndexOutOfRangeException();
            int chunk = index / ChunkSize;
            int ind = index % ChunkSize;
            _chunks[chunk][ind] = value;
        }
    }
    // other code you require
}

With ChunkSize = 8000, every chunk takes only 32,000 bytes, so it will not go on the LOH. _chunks itself will reach the LOH only once there are about 16,000 chunks in the collection, which is more than 128 million elements (about 500 MB).
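These thresholds can be checked with a little arithmetic (a sketch; the 85,000-byte figure is the CLR's documented large-object threshold, and the 4-byte reference size assumes an x86 process):

```csharp
using System;

const int LohThresholdBytes = 85_000; // CLR large-object threshold
const int ChunkSize = 8000;           // floats per chunk, as in the prototype

// Each chunk stays well below the LOH threshold.
int chunkBytes = ChunkSize * sizeof(float);              // 32,000 bytes

// The largest float[] that still stays off the LOH.
int maxFloatsOffLoh = LohThresholdBytes / sizeof(float); // 21,250 elements

// On x86 a reference is 4 bytes. List<float[]> doubles its backing array,
// so past 16,384 references it allocates a 32,768-reference array:
// 32,768 * 4 B = 131,072 B > 85,000 B, which lands on the LOH.
long elementsAtThatPoint = 16_384L * ChunkSize;          // ~131 million elements

Console.WriteLine($"{chunkBytes} {maxFloatsOffLoh} {elementsAtThatPoint}");
```

This matches the claim above: only past roughly 16,000 chunks, i.e. well over 128 million floats, does any allocation in this scheme touch the LOH.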

UPD: I have performed some stress tests on the sample above. The OS is x64, the solution platform is x86, and ChunkSize is 20000.

First:

var list = new ChunkedList();
for (int i = 0; ; i++)
{
    list.Add(0.1f);
}

OutOfMemoryException is raised at ~324,000,000 elements.

Second:

public class RawData
{
    public string Name;
    public ChunkedList Data = new ChunkedList();
}

var list = new List<RawData>();
for (int i = 0;; i++)
{
    var raw = new RawData { Name = "Test" + i };
    for (int j = 0; j < 20 * 1000 * 1000; j++)
    {
        raw.Data.Add(0.1f);
    }
    list.Add(raw);
}

OutOfMemoryException is raised at i = 17, j ~ 12,000,000. Seventeen RawData instances were created successfully, with 20 million data points each, for about 352 million data points in total.
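The total reported above follows directly from the loop counters (a quick sanity calculation):

```csharp
using System;

// 17 complete RawData instances of 20 million points each,
// plus ~12 million points in the 18th instance before the OOM.
long completed = 17L * 20_000_000L; // 340,000,000
long partial = 12_000_000L;
long total = completed + partial;   // 352,000,000

Console.WriteLine(total);
```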

