What is the best way to load a huge result set in memory?


Question


I am trying to load two huge result sets (source and target) coming from different RDBMSs, but I am struggling to get both of them into memory.

The queries below pull the data from the source and the target:

SQL Server: select Id as LinkedColumn, CompareColumn from Source order by LinkedColumn

Oracle: select Id as LinkedColumn, CompareColumn from Target order by LinkedColumn

Records in Source : 12377200

Records in Target : 12266800

These are the approaches I have tried, with timings:

1) Two open data readers, one for the source and one for the target:

Total jobs running in parallel = 3

Time taken by Job 1 = 01:47:25

Time taken by Job 2 = 01:47:25

Time taken by Job 3 = 01:48:32

There is no index on the Id column.

Most of the time is spent here:

var dr = command.ExecuteReader();


Problems:
There are also timeout issues, so I have to set `CommandTimeout` to 0 (no limit), which is bad.

2) Chunk-by-chunk reading of source and target data:

   Total jobs = 1
   Chunk size : 100000
   Time taken : 02:02:48
   There is no index on the Id column.

3) Chunk-by-chunk reading of source and target data:

   Total jobs = 1
   Chunk size : 100000
   Time taken : 00:39:40
   There is an index on the Id column.

4) Two open data readers, one for the source and one for the target:

   Total jobs = 1
   Index : Yes
   Time: 00:01:43

5) Two open data readers, one for the source and one for the target:

   Total jobs running in parallel = 3
   Index : Yes
   Time: 00:25:12
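
The post does not show how the chunks in approaches 2 and 3 were read. As a minimal sketch, assuming each chunk is located by key range (for example `WHERE Id > @lastKey ORDER BY Id`, limited to the chunk size) rather than by row offset, the loop could look like the following. `ChunkReader`, `ReadInChunks` and the `fetchPage` delegate are hypothetical names; `fetchPage` stands in for the real database call and is backed here by an in-memory list so the example runs without a database. It also assumes keys are unique and returned in ascending order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChunkReader
{
    // Streams all rows by repeatedly asking for the next chunk after the last key seen.
    // fetchPage(lastKey, size) is a hypothetical stand-in for a query such as:
    //   SELECT TOP (@size) Id, CompareColumn FROM Source WHERE Id > @lastKey ORDER BY Id
    public static IEnumerable<KeyValuePair<long, string>> ReadInChunks(
        Func<long, int, IList<KeyValuePair<long, string>>> fetchPage,
        int chunkSize,
        long startAfter = long.MinValue)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        long lastKey = startAfter;
        while (true)
        {
            IList<KeyValuePair<long, string>> page = fetchPage(lastKey, chunkSize);
            foreach (var row in page)
            {
                lastKey = row.Key;   // remember where the next chunk starts
                yield return row;
            }
            // A short (or empty) page means there is nothing left to read.
            if (page.Count < chunkSize) yield break;
        }
    }
}

public static class ChunkReaderDemo
{
    public static void Main()
    {
        // In-memory "table" with keys 1..10, used instead of a real database.
        var table = Enumerable.Range(1, 10)
            .Select(i => new KeyValuePair<long, string>(i, "v" + i))
            .ToList();

        Func<long, int, IList<KeyValuePair<long, string>>> fetchPage =
            (lastKey, size) => table.Where(r => r.Key > lastKey)
                                    .OrderBy(r => r.Key)
                                    .Take(size)
                                    .ToList();

        foreach (var row in ChunkReader.ReadInChunks(fetchPage, 4))
            Console.WriteLine(row.Key + " " + row.Value);
    }
}
```

Only one chunk is held in memory at a time, and each database call is short, which avoids a single long-running command. A key-range query of this kind generally still benefits from an index on the key column.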

I do observe that an index on LinkedColumn improves performance, but we are dealing with third-party RDBMS tables that may or may not have an index. We would like to keep the database server as free as possible, so the data reader approach doesn't seem like a good idea: there will be lots of jobs running in parallel, which will put more pressure on the database server than we want.

Hence we want to fetch the source and target records into our own machine's memory and do a 1-to-1 record comparison there, keeping the database server free.

Note: I want to do this in my C# application and don't want to use SSIS or Linked Server.

Source SQL query execution time in SQL Server Management Studio: 00:01:41

Target SQL query execution time in SQL Server Management Studio: 00:01:40

What is the best way to read a huge result set into memory?
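
Because both queries order by LinkedColumn, the 1-to-1 comparison can be done as a single forward pass over the two sorted streams, without holding either result set fully in memory. The following is a minimal sketch under assumptions not stated in the post: the key is a `long`, the compared value is a `string`, and both databases return keys in the same ascending order. `SortedComparer` and `DiffKind` are hypothetical names, and the in-memory lists stand in for the two data readers.

```csharp
using System;
using System.Collections.Generic;

public enum DiffKind { MissingInTarget, MissingInSource, ValueMismatch }

public static class SortedComparer
{
    // Both inputs must be sorted ascending by key with no duplicate keys.
    public static IEnumerable<KeyValuePair<long, DiffKind>> Compare(
        IEnumerable<KeyValuePair<long, string>> source,
        IEnumerable<KeyValuePair<long, string>> target)
    {
        using (var s = source.GetEnumerator())
        using (var t = target.GetEnumerator())
        {
            bool hasS = s.MoveNext();
            bool hasT = t.MoveNext();

            while (hasS || hasT)
            {
                if (!hasT || (hasS && s.Current.Key < t.Current.Key))
                {
                    // Key exists only in the source.
                    yield return new KeyValuePair<long, DiffKind>(s.Current.Key, DiffKind.MissingInTarget);
                    hasS = s.MoveNext();
                }
                else if (!hasS || t.Current.Key < s.Current.Key)
                {
                    // Key exists only in the target.
                    yield return new KeyValuePair<long, DiffKind>(t.Current.Key, DiffKind.MissingInSource);
                    hasT = t.MoveNext();
                }
                else
                {
                    // Same key on both sides: compare the values.
                    if (!string.Equals(s.Current.Value, t.Current.Value, StringComparison.Ordinal))
                        yield return new KeyValuePair<long, DiffKind>(s.Current.Key, DiffKind.ValueMismatch);
                    hasS = s.MoveNext();
                    hasT = t.MoveNext();
                }
            }
        }
    }
}

public static class SortedComparerDemo
{
    public static void Main()
    {
        var source = new List<KeyValuePair<long, string>>
        {
            new KeyValuePair<long, string>(1, "a"),
            new KeyValuePair<long, string>(2, "b"),
            new KeyValuePair<long, string>(4, "d"),
        };
        var target = new List<KeyValuePair<long, string>>
        {
            new KeyValuePair<long, string>(2, "b"),
            new KeyValuePair<long, string>(3, "c"),
            new KeyValuePair<long, string>(4, "x"),
        };

        foreach (var diff in SortedComparer.Compare(source, target))
            Console.WriteLine(diff.Key + " " + diff.Value);
        // Prints:
        // 1 MissingInTarget
        // 3 MissingInSource
        // 4 ValueMismatch
    }
}
```

Memory use stays constant regardless of table size because only the current row from each side is kept. If the key were a string, the two servers' sort orders (collations) would have to agree for this pass to be valid.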

Code:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;
using System.Threading.Tasks; // only needed for the commented-out parallel run

class Program
{
    static void Main(string[] args)
    {
        // Running 3 jobs in parallel
        //Task<string>[] taskArray = { Task<string>.Factory.StartNew(() => Compare()),
        //    Task<string>.Factory.StartNew(() => Compare()),
        //    Task<string>.Factory.StartNew(() => Compare())
        //};
        Compare(); // Run a single job
        Console.ReadKey();
    }

    public static string Compare()
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();

        // Open the source connection and prepare the ordered query.
        var srcConnection = new SqlConnection("Source Connection String");
        srcConnection.Open();
        var command1 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn", srcConnection);

        // Open the target connection and prepare the ordered query.
        var tgtConnection = new SqlConnection("Target Connection String");
        tgtConnection.Open();
        var command2 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn", tgtConnection);

        // Only the time needed to open both readers is measured; no rows are read yet.
        var drA = GetReader(command1);
        var drB = GetReader(command2);
        stopwatch.Stop();

        string a = stopwatch.Elapsed.ToString(@"d\.hh\:mm\:ss");
        Console.WriteLine(a);
        return a;
    }

    private static IDataReader GetReader(SqlCommand command)
    {
        command.CommandTimeout = 0;     // 0 = wait indefinitely
        return command.ExecuteReader(); // Culprit: most of the time is spent here
    }
}


Solution

Can't you load the smaller result set first and then retrieve the data from the larger one with a "where" clause?

