如何在多个文本文件中获取不同的值? [英] How to fetch distinct values in multiple text files ?

查看:93
本文介绍了如何在多个文本文件中获取不同的值?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有三个文件F1,F2和F3。

每个文件包含1000万条记录(所有数字)。

现在我必须从所有这些文件中删除重复项例如,数字N1在所有文件中应该是唯一的。

我已经通过加载不同的hashset中的每个文件来解决这个问题。如果我将所有记录存储在一个hashset中,则抛出异常内存。

有没有办法通过我可以检查所有三个hashSet中的唯一值吗?

或者有没有更好的方法来解决这个问题而不使用任何数据库?

解决方案

< blockquote>



不错的问题,



解决方案:

< a href =http://msdn.microsoft.com/en-us/library/dd997372.aspx> http://msdn.microsoft.com/en-us/library/dd997372.aspx [< a href =http://msdn.microsoft.com/en-us/library/dd997372.aspxtarget =_ blanktitle =New Window> ^ ]



更好的表现 [ ^ ]



Memor y地图文件。



如果您有兴趣我还有其他解决方案:)



[更新] < br $> b $ b

您关心的是保存内存吗?



您可以使用Split文件并搜索唯一的内容,但是问题是,当你搜索唯一的数字时,它可能会越过内存限制。



所以最好先排序再搜索,



为了获得最佳性能,



休息这些步骤:



1.将文件分成小部分。

2.编写一个函数,该函数将从每个文件中编号并从该文件中删除。

3 。启动单独的线程并将每个文件传递给该函数。

4.创建一些数据块文件,如1 - 1000,1001-2000,2001-3000 ...

5.将每个数字放入其块(但在放入之前只检查该数字是否已存在)如果不存在则插入。

6.当您的代码遍历所有时你将拥有该数据文件中唯一的文件。

7.现在将所有数据块文件放到一个文件中。



提示。

1.不要使用超过10个线程。

2.最初将每个文件分成10到20个KB大小。

3.不要创建超过3000的数据块(你可以使用序列化[二进制]但只是为了进程不要在内存中保存它。 )



如果您不理解任何部分,请随时询问。



对不起让回复


好的!

当时使用穿孔卡作为存储时,使用了一种技术。那些机器真的没有太大的内存:)

UNIQUE操作符可以被视为传递。我建议你拆分文件,把它们当成金字塔。制作文件只允许说100万条记录,但这取决于你。

 F1-0 ... F1-9< br / > 
F2-0 ... F2-9< br />
F3-0 ... F3-9



进行独特搜索 0U-10 = U (F1-0)(其中 U 是函数返回只有唯一记录的所有单独并保存它们比你将有:

0U-10 ... 0U-19,0U- 20 ... 0U-29,0U-30 ... 0U-39

所有人都只拥有独特的记录。

比合并一些保留考虑到限制,并进行搜索:

 1U10-11-12-13 =  U (0U1-10 + 0U1-11 + 0U1 -12 + 0U1-13)< br /> 
1U14-15-16 = U (0U-14 + 0U-15 + 0U-16)< br />
1U17-18-19 = U (0U-17 + 0U-18 + 0U-19)< br />
1U20-21-22-23 = U (0U-20 + 0U-21 + 0U-22 + 0U-23)< br />
...< br />



依此类推,直到你得到最终结果。这样你就不得不加载内存中的一小部分。


从你的帖子来看,目前还不清楚内存限制的确切位置。

如果你有文本文件,其中每个记录只包含一个数字,这不应该导致任何问题加载到 int <的列表数组 / code>。我估计每个列表的1000万个内存少于100 MB,这是可以管理的。

一旦你在内存中有这些数据, sort 它们并迭代三个数组以获得唯一值。

例如

 使用系统; 
使用 System.Collections.Generic;
使用 System.Linq;
使用 System.Text;
使用 System.IO;

命名空间 ConsoleApplication
{
public class NumData
{
public class 结果
{
public int 计数{获得; private set ; }
public int 值{ get ; private set ; }
private void 设置( int value int count){Value = value < /跨度>;计数=计数; }
public 结果( int value int count){Set( value ,count); }
/// < 摘要 >
/// 获取所有(有效)迭代器的当前位置的最小值,
/// 将min迭代器移动一个位置,
/// 并计算具有该最小值的迭代器数。
/// < / summary >
/// < 备注 < span class =code-summarycomment>> 迭代器必须指向具有* no *多个相同条目的排序数据。< /备注 >
/// < 返回 >
/// true =有效结果(值=当前值,Count =具有该值的迭代器数),
/// false =没有迭代器数据剩余
/// < / returns >
public bool 移动( params Iterator []游标)
{
if (游标== null || cursors.Length == 0 || !cursors.Any(c = > c.IsValid)) return ;
int min = cursors.Where(c = > c.IsValid).Min( c = > c.Curr);
int count = 0 ;
foreach var cursor in cursors.Where(c = > c.IsValid))
{
if (cursor.Curr == min)
{
count ++;
cursor.Next();
}
}
Set(min,count);
return true ;
}
}

public class Iterator
{
/// < 摘要 > Iterator Fasade正常处理数据结束(IsValid属性)。< / summary >
public Iterator(List< int> list)
{
_items = list.GetEnumerator();
IsValid = false ;
}
private IEnumerator< int> _items;
public bool IsValid { get ; private set ; }
public bool Next(){IsValid = _items.MoveNext(); return IsValid; }
public int Curr { get { if (!IsValid) throw new InvalidOperationException( 超越迭代结束); return _items.Current; }
}

private List< int> _list;
public NumData( string path)
this ()
{
Console.WriteLine( .. .Load ...);
使用(TextReader reader = File.OpenText(path))
{
while (Add(reader.ReadLine())){}
}
SortUnique();
}

public NumData()
{
_list = new List< int>();
}

public bool 添加( string s)
{
int n;
if string .IsNullOrEmpty(s)||!int.TryParse(s,< span class =code-keyword> out
n)) return false ;
Add(n);
return true ;
}

public void 添加( int n)
{
_list.Add(n);
}

public void SortUnique()
{
Console.Write( ... Sort-Unique {0} values ...,_ list.Count);
if (_ list.Count > 0
{
// sort ...
_list.Sort();

// ...并删除重复项
列表< INT> list = new List< int>();
int last = _list [ 0 ];
list.Add(last);
foreach int value _list)
{
if (last!= value
{
last = value ;
list.Add( value );
}
}
_list = list;
}
Console.WriteLine( 结果为{0}值, _list.Count);
}

public Iterator Cursor { get { return new Iterator(_list); }
}

class 计划
{
static void Main( string [] args)
{
// 获取迭代器并移动到第一个值
var pos1 = new NumData( .. \\F1.txt)光标;
var pos2 = new NumData( .. \\F2.txt)。光标;
var pos3 = new NumData( .. \\F3.txt)。光标;
pos1.Next();
pos2.Next();
pos3.Next();
// 准备结果
var result = new NumData();
var res = new NumData.Result( 0 0 );

// 在所有排序唯一列表上并发迭代,并报告每个值的计数出现
int i = 0 ;
while (res.Move(pos1,pos2,pos3))
{
if ((res.Count == 1 ))result.Add(res.Value);
if (++ i% 1000000 == 0 )Console.WriteLine( ... {0},i) ;
}

// 显示结果
< span class =code-keyword> var
pos = result.Cursor;
while (pos.Next())
{
Console.WriteLine( {0},pos.Curr);
}
Console.WriteLine( 完成(按任意键退出...) );
Console.ReadKey();
}

}
}





干杯

Andi


I have three files F1,F2 and F3.
Each file contains 10 million records (all numbers).
Now i have to remove duplicates from all these files eg number N1 should be unique in all the files .
I have approached this by loading each file in differcent hashset .Since memory out of exception is thrown if i store all records in a single hashset .
Is there any way by which i can check unique values in all the three hashSets ?
Or is there any better way to solve this problem without using any database ?

解决方案

Hi,

Nice question,

Solution :
http://msdn.microsoft.com/en-us/library/dd997372.aspx[^]

Better Performance[^]

Memory Map File.

I have some other solution if you interested :)

[Update]

You concern is to Save memory right ?

You can use Split file and search for unique, but the problem is when your searching for unique numbers comes to end it may cross the memory limit.

So its better to first sort then search,

For best performance,

Fallow these Steps :

1. Split File in small parts.
2. Write a function which will number from each file and delete it from that file.
3. Start separate threads and pass each file to that function.
4. Create some data block files like 1 - 1000,1001-2000,2001-3000...
5. Put each number to its block ( but before putting it just check is that number already exists or not) if not exists then inset.
6. When your code will traverse all the file you will already have the unique in that data files.
7. Now marge all data blocks files to a single file.

Tips.
1. Don''t use more that 10 threads.
2. Initially split each file''s in 10 to 20 KB size.
3. Don''t create data blocks more than 3000 ( You can use serialization [Binary] but just for process don''t hold it in memory. )

If you don''t understand any part then feel free to ask.

Sorry for let reply


Good one!
There was a technique used at the time, when punch-cards were used as storage. Those machines had really not much of memory :)
The UNIQUE "operator" can be treated as transitive. I suggest, you split the files, and treat them as a pyramid. Make files that are have let''s say only 1 million record, but it''s up to you.

F1-0...F1-9<br />
F2-0...F2-9<br />
F3-0...F3-9


Do an unique search 0U-10 = U(F1-0) (where U is the function returns that only unique records) on all separately and save them, than you will have:
0U-10...0U-19, 0U-20...0U-29, 0U-30...0U-39
all having only unique records.
Than merge some keeping the limit in mind, and do the search:

1U10-11-12-13 = U(0U1-10 + 0U1-11 + 0U1-12 + 0U1-13)<br />
1U14-15-16 = U(0U-14 + 0U-15 + 0U-16)<br />
1U17-18-19 = U(0U-17 + 0U-18 + 0U-19)<br />
1U20-21-22-23 = U(0U-20 + 0U-21 + 0U-22 + 0U-23)<br />
...<br />


And so on until you have the final result. This way you will have to load only small portions in the memory.


From your post it''s not clear exactly where the memory limit happens.
If you have text files where each record consists of one number only, than this should not cause any problem to load into a list or array of int. I estimate for 10 million numbers less than 100 MB memory per list, which is managable.
Once you have these data in memory, sort them and iterate over the three arrays to get the unique values.
E.g.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace ConsoleApplication
{
    public class NumData
    {
        public class Result
        {
            public int Count { get; private set; }
            public int Value { get; private set; }
            private void Set(int value, int count) { Value = value; Count = count; }
            public Result(int value, int count) { Set(value, count); }
            /// <summary>
            /// Get the min value of the current position of all (valid) iterators,
            /// move the min iterators by one position,
            /// and count the number of iterators with that min value.
            /// </summary>
            /// <remarks>The iterators must point to sorted data that have *no* multiple identical entries.</remarks>
            /// <returns>
            /// true = valid result (Value = current value, Count = number of iterators with that value),
            /// false = no iterator data left
            /// </returns>
            public bool Move(params Iterator[] cursors)
            {
                if (cursors == null || cursors.Length == 0 || !cursors.Any(c => c.IsValid)) return false;
                int min = cursors.Where(c => c.IsValid).Min(c => c.Curr);
                int count = 0;
                foreach (var cursor in cursors.Where(c => c.IsValid))
                {
                    if (cursor.Curr == min)
                    {
                        count++;
                        cursor.Next();
                    }
                }
                Set(min, count);
                return true;
            }
        }

        public class Iterator
        {
            /// <summary>Iterator Fasade to handle end of data gracefully (IsValid property).</summary>
            public Iterator(List<int> list)
	        {
                _items = list.GetEnumerator();
                IsValid = false;
	        }
            private IEnumerator<int> _items;
            public bool IsValid { get; private set; }
            public bool Next() { IsValid = _items.MoveNext(); return IsValid; }
            public int Curr { get { if (!IsValid) throw new InvalidOperationException("Beyond end of iteration"); return _items.Current; } }
        }

        private List<int> _list;
        public NumData(string path) 
            : this()
        {
            Console.WriteLine("...Load...");
            using (TextReader reader = File.OpenText(path))
            {
                while (Add(reader.ReadLine())) {}
            }
            SortUnique();
        }

        public NumData()
        {
            _list = new List<int>();
        }

        public bool Add(string s)
        {
            int n;
            if (string.IsNullOrEmpty(s) || !int.TryParse(s, out n)) return false;
            Add(n);
            return true;
        }

        public void Add(int n)
        {
            _list.Add(n);
        }

        public void SortUnique()
        {
            Console.Write("...Sort-Unique {0} values ...", _list.Count);
            if (_list.Count > 0)
            {
                // sort ...
                _list.Sort();

                // ... and remove duplicates
                List<int> list = new List<int>();
                int last = _list[0];
                list.Add(last);
                foreach (int value in _list)
                {
                    if (last != value)
                    {
                        last = value;
                        list.Add(value);
                    }
                }
                _list = list;
            }
            Console.WriteLine(" results in {0} values", _list.Count);
        }

        public Iterator Cursor { get { return new Iterator(_list); } }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // get iterators and get move to the first value each
            var pos1 = new NumData("..\\F1.txt").Cursor;
            var pos2 = new NumData("..\\F2.txt").Cursor;
            var pos3 = new NumData("..\\F3.txt").Cursor;
            pos1.Next();
            pos2.Next();
            pos3.Next();
            // prepare results
            var result = new NumData();
            var res = new NumData.Result(0, 0);

            // iterate concurrently over all sort-unique lists and report each value with its count of occurance
            int i = 0;
            while (res.Move(pos1, pos2, pos3))
            {
                if ((res.Count == 1)) result.Add(res.Value);
                if (++i % 1000000 == 0) Console.WriteLine("...{0}", i);
            }

            // show results
            var pos = result.Cursor;
            while (pos.Next())
            {
                Console.WriteLine("{0}", pos.Curr);
            }
            Console.WriteLine("Done (press any key to exit...)");
            Console.ReadKey();
        }

    }
}



Cheers
Andi


这篇关于如何在多个文本文件中获取不同的值?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆