How to insert 4 million records from Oracle to Elasticsearch table faster using C#?

Problem description

I have the following code written in C#, but by my estimate it would take 4-5 days to migrate the data from an Oracle database to Elasticsearch. I am inserting the records in batches of 100. Is there any way to migrate the 4 million records faster (ideally in less than a day)?

    public static void Selection()
    {
        for (int i = 1; i < 4000000; i += 1000)
        {
            for (int j = i; j < (i + 1000); j += 100)
            {
                OracleCommand cmd = new OracleCommand(BuildQuery(j), oracle_connection);
                OracleDataReader reader = cmd.ExecuteReader();
                List<Record> list = CreateRecordList(reader);
                insert(list);
            }
        }
    }

    private static List<Record> CreateRecordList(OracleDataReader reader)
    {
        List<Record> l = new List<Record>();
        string[] str = new string[7];
        try
        {
            while (reader.Read())
            {
                for (int i = 0; i < 7; i++)
                {
                    str[i] = reader[i].ToString();
                }

                Record r = new Record(str[0], str[1], str[2], str[3],
                                      str[4], str[5], str[6]);
                l.Add(r);
            }
        }
        catch (Exception er)
        {
            string msg = er.Message;
        }
        return l;
    }

    private static string BuildQuery(int from)
    {
        // 'change' is not shown in the question; presumably the page size (100)
        int to = from + change - 1;
        StringBuilder builder = new StringBuilder();
        builder.AppendLine(@"select * from");
        builder.AppendLine("(");
        builder.AppendLine("select FIELD_1, FIELD_2, FIELD_3, FIELD_4, FIELD_5, FIELD_6, FIELD_7, ");
        builder.Append(" row_number() over(order by FIELD_1) rn");
        builder.AppendLine("   from tablename");
        builder.AppendLine(")");
        builder.AppendLine(string.Format("where rn between {0} and {1}", from, to));
        builder.AppendLine("order by rn");
        return builder.ToString();
    }

    public static void insert(List<Record> l)
    {
        try
        {
            foreach (Record r in l)
                client.Index<Record>(r, "index", "type");
        }
        catch (Exception er)
        {
            string msg = er.Message;
        }
    }

Solution

The ROW_NUMBER() function is going to hurt performance, and you're running it roughly 40,000 times (4 million rows in pages of 100), with each query numbering the whole table again just to return 100 rows. You're already using an OracleDataReader, so there is no need to page: it will not pull all four million rows to your machine at once. It streams them one or a few at a time.

This has to be doable in minutes or hours, not days. We have several processes that move millions of records between Sybase and SQL Server in a similar manner, and they take less than five minutes.
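To put rough numbers on the difference, here is a small back-of-the-envelope sketch. The row count and page size come from the question and the batch size of 500 from the code further down; the class name `RoundTripEstimate` is only illustrative. It counts round trips and says nothing about how long each one takes.

```csharp
using System;

static class RoundTripEstimate
{
    static void Main()
    {
        const int totalRows = 4000000; // row count from the question
        const int pageSize = 100;      // page size used in the question
        const int batchSize = 500;     // batch size used in the streaming version

        // Paged approach: one ROW_NUMBER() query per page, one Index call per row.
        int pagedQueries = totalRows / pageSize;
        int singleIndexCalls = totalRows;

        // Streaming approach: a single SELECT, one bulk request per batch (rounded up).
        int bulkRequests = (totalRows + batchSize - 1) / batchSize;

        Console.WriteLine("Paged queries:      {0}", pagedQueries);
        Console.WriteLine("Single index calls: {0}", singleIndexCalls);
        Console.WriteLine("Streaming queries:  1");
        Console.WriteLine("Bulk requests:      {0}", bulkRequests);
    }
}
```

That is 40,000 Oracle queries and 4,000,000 Elasticsearch requests in the paged version, against one query and 8,000 bulk requests when streaming in batches of 500.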

Maybe give this a shot:

    OracleCommand cmd = new OracleCommand("SELECT ... FROM TableName", oracle_connection);
    int batchSize = 500;
    using (OracleDataReader reader = cmd.ExecuteReader())
    {
        List<Record> l = new List<Record>(batchSize);
        string[] str = new string[7];
        int currentRow = 0;

        while (reader.Read())
        {
            for (int i = 0; i < 7; i++)
            {
                str[i] = reader[i].ToString();
            }

            l.Add(new Record(str[0], str[1], str[2], str[3], str[4], str[5], str[6]));

            // Commit every time batchSize records have been read
            if (++currentRow == batchSize)
            {
                Commit(l);
                l.Clear();
                currentRow = 0;
            }
        }

        // Commit any remaining records (skip the call if the last batch was full)
        if (l.Count > 0)
        {
            Commit(l);
        }
    }

Here's what Commit might look like:

    public void Commit(IEnumerable<Record> records)
    {
        // TODO: Use ES's BULK features, I don't know the exact syntax
        client.IndexMany<Record>(records, "index", "type");
        // client.Bulk(b => b.IndexMany(records))... something like this
    }
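
The batching logic can also be pulled out of the read loop so it can be tested without Oracle or Elasticsearch. The following is a minimal sketch, not part of the original answer: `Batching` and `InBatches` are hypothetical names, and the integer range stands in for rows streamed from the data reader.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Batching
{
    // Groups a lazily enumerated source into lists of at most batchSize items.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int batchSize)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        return Iterate(source, batchSize);
    }

    private static IEnumerable<List<T>> Iterate<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                // Start a fresh list so the caller can keep the one just returned.
                batch = new List<T>(batchSize);
            }
        }

        // Emit the final, partially filled batch, if there is one.
        if (batch.Count > 0) yield return batch;
    }

    static void Main()
    {
        // Stand-in for rows streamed from the data reader (hypothetical data).
        IEnumerable<int> rows = Enumerable.Range(1, 1203);

        int batches = 0, total = 0;
        foreach (List<int> batch in InBatches(rows, 500))
        {
            // In the real migration this is where Commit(batch) would be called.
            batches++;
            total += batch.Count;
        }

        Console.WriteLine("{0} batches, {1} rows", batches, total); // prints "3 batches, 1203 rows"
    }
}
```

Because each batch is a new list, nothing needs to be cleared between commits, and the trailing partial batch is handled in one place. On .NET 6 or later, the built-in `Enumerable.Chunk` does much the same job.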
