How to load a DataTable very fast


Problem Description






Hi,

In my Windows Forms application I have a task to read 100,000 (1 lakh) records from a plain text file, load them into a DataTable, modify the data by adding some new columns and doing some manipulations, and finally, when the user clicks the export button, generate an Excel file.
Reading one whole text file (1 lakh records) into the DataTable takes 15 to 20 minutes, so if a folder contains 6 text files to be exported sheet-wise, the export takes almost 2 hours.
Could anyone please explain how to overcome this and export the data to an Excel file in less than 10 minutes?
This particular block of code runs for almost 15 minutes:

// In the loop below we read each line of the text file, split it on commas,
// and add the values to the DataTable.
for (int i = 1; i < lines.Count(); i++)
{
    if (i < lines.Count() - 1 && lines[i] == lines[i + 1]) continue; // skip duplicate entries 26-May-16
    dr = dt.NewRow();
    values = lines[i].Split(',');
    for (int j = 0; j < values.Length && j < columns.Count(); j++)
        dr[j] = values[j];

    dt.Rows.Add(dr);

    // Parse the row's timestamp back out of the Date column
    dateTime = Convert.ToDateTime(dt.Rows[dt.Rows.Count - 1][Constants.Date].ToString());
    // strDateTime = dateTime.ToString("dd/MM/yyyy").Split('/');
    tempDate = dateTime.ToShortDateString();
    tempTime = dateTime.ToLongTimeString();
    FunName = dt.Rows[dt.Rows.Count - 1][Constants.FuncName].ToString();
    dt.Rows[dt.Rows.Count - 1][Constants.Index] = dt.Rows.Count;
    dt.Rows[dt.Rows.Count - 1][Constants.Time] = tempTime;
    dt.Rows[dt.Rows.Count - 1][Constants.Date] = tempDate;
    dt.AcceptChanges();
}








What I have tried:

I am working in C#, reading the text file and then manipulating the data in a DataTable.

Solution


It is the DataTable.AcceptChanges call that kills you.

I made this quick and dirty code example, and if I put AcceptChanges inside the loop, it takes around 20 minutes to read 100 000 rows.
When it is placed outside the loop, where it belongs, it takes around 1 second.
(I tried a test with 1 000 000 rows: it took 2 minutes with AcceptChanges outside the loop. With AcceptChanges inside the loop, the program had been running for 2 hours and was still not done.)
That behaviour makes sense: AcceptChanges commits every row in the table, so calling it once per added row makes the total work grow roughly quadratically with the row count.

I have no idea how many columns you have, but given the time difference in my example I don't think it matters that much.

I didn't even try to optimize this code, either for speed or for memory.
using System;
using System.Data;
using System.Diagnostics;
using System.IO;

static void Main(string[] args)
{
    // Create a test CSV file with 100 000 rows
    using (TextWriter tw = new StreamWriter(@"C:\Temp\LargeCsvFile.csv"))
    {
        for (int i = 0; i < 100000; i++)
        {
            tw.WriteLine("Harry,Hacker,{0},2,3,4,5.0,{1}", i, DateTime.Now);
        }
    }

    DataTable dtRecords = new DataTable("Records");

    // Use an auto-increment column instead of assigning the row number yourself
    DataColumn dcIndex = dtRecords.Columns.Add("Index", typeof(int));
    dcIndex.AutoIncrement = true;
    dcIndex.AutoIncrementSeed = 1;
    dcIndex.AutoIncrementStep = 1;

    DataColumn dcFirstName = dtRecords.Columns.Add("FirstName", typeof(string));
    DataColumn dcLastName = dtRecords.Columns.Add("LastName", typeof(string));
    DataColumn dcCol1 = dtRecords.Columns.Add("Col1", typeof(int));
    DataColumn dcCol2 = dtRecords.Columns.Add("Col2", typeof(int));
    DataColumn dcCol3 = dtRecords.Columns.Add("Col3", typeof(int));
    DataColumn dcCol4 = dtRecords.Columns.Add("Col4", typeof(int));
    DataColumn dcCol5 = dtRecords.Columns.Add("Col5", typeof(double));
    DataColumn dcDate = dtRecords.Columns.Add("Date", typeof(DateTime));

    // Computed (expression) column
    DataColumn dcTime = dtRecords.Columns.Add("Time", typeof(DateTime));
    dcTime.Expression = "Date";

    Stopwatch sw = new Stopwatch();
    sw.Start();

    DataRow dr;
    DataColumn dc;
    string[] values;
    string previousLine = "";
    foreach (string line in File.ReadAllLines(@"C:\Temp\LargeCsvFile.csv"))
    {
        if (line.Equals(previousLine))
            continue; // skip duplicate entries

        previousLine = line;

        dr = dtRecords.NewRow();
        values = line.Split(',');
        // Column 0 (Index) auto-increments and the last column (Time) is an
        // expression column, so only the columns in between are filled.
        for (int i = 1; i < dtRecords.Columns.Count - 1; i++)
        {
            dc = dtRecords.Columns[i];
            dr[dc] = Convert.ChangeType(values[i - 1], dc.DataType);
        }
        dtRecords.Rows.Add(dr);

        // This line inside the loop increases the time tremendously
        // dtRecords.AcceptChanges();
    }
    dtRecords.AcceptChanges(); // commit once, after the loop
    sw.Stop();
    long time = sw.ElapsedMilliseconds;
    Console.WriteLine("Time {0} s", time / 1000);
}
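
A further refinement, not benchmarked here but standard ADO.NET, is to wrap the bulk load in DataTable.BeginLoadData/EndLoadData, which suspends notifications, index maintenance and constraint checking while rows are added. A minimal sketch against the dtRecords table above:

// Sketch only: suspend notifications and index maintenance during the bulk load
dtRecords.BeginLoadData();
try
{
    foreach (string line in File.ReadAllLines(@"C:\Temp\LargeCsvFile.csv"))
    {
        // ... same per-line parsing and dtRecords.Rows.Add(dr) as above ...
    }
}
finally
{
    dtRecords.EndLoadData();  // re-enables events and re-validates constraints
}
dtRecords.AcceptChanges();    // still called exactly once, after the load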






[UPDATE]
Just out of interest, I implemented a time print every 10 000 rows, and here is the result (a sketch of such instrumentation follows the table).
This is, of course, with AcceptChanges inside the loop.

Row No   hh:mm:ss
 10000 - 00:00:04
 20000 - 00:00:21
 30000 - 00:00:56
 40000 - 00:01:48
 50000 - 00:03:02
 60000 - 00:04:34
 70000 - 00:06:28
 80000 - 00:08:37
 90000 - 00:11:08
100000 - 00:13:53
110000 - 00:16:59
120000 - 00:20:32
130000 - 00:24:32
140000 - 00:28:44
150000 - 00:33:12
160000 - 00:37:57
170000 - 00:43:29
180000 - 00:49:27
190000 - 00:55:44
200000 - 01:02:12
210000 - 01:09:10
220000 - 01:16:32
230000 - 01:24:08
240000 - 01:31:57
250000 - 01:40:26




It is amazing how one misplaced line of code can cause such a huge time penalty.

[UPDATE2]
It took 10 hours to process 600 000 rows, then I gave up.


Try using a CSV reader instead: A Fast CSV Reader is pretty good. It's always been fast enough for me...
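
That reader implements IDataReader, so a DataTable can be filled in a few lines. A sketch, assuming the LumenWorks.Framework.IO.Csv namespace used by that article (names may differ between versions):

using System.Data;
using System.IO;
using LumenWorks.Framework.IO.Csv;

DataTable dt = new DataTable();
// The second constructor argument says whether the file has a header row
using (CsvReader csv = new CsvReader(new StreamReader(@"C:\Temp\LargeCsvFile.csv"), false))
{
    // CsvReader implements IDataReader, so DataTable.Load streams the whole file in
    dt.Load(csv);
}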

