How to Optimize Code Performance in .NET


Problem Description


I have an export job migrating data from an old database into a new database. The problem I'm having is that the user population is around 3 million and the job takes a very long time to complete (15+ hours). The machine I am using only has 1 processor so I'm not sure if threading is what I should be doing. Can someone help me optimize this code?

static void ExportFromLegacy()
{
    var exportStart = DateTime.Now;

    var usersQuery = _oldDb.users.Where(x =>
        x.status == 'active');

    int BatchSize = 1000;
    var errorCount = 0;
    var successCount = 0;
    var batchCount = 0;

    // Using MoreLinq's Batch for sequences
    // https://www.nuget.org/packages/MoreLinq.Source.MoreEnumerable.Batch
    foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
    {
        Console.WriteLine(String.Format("Batch count at {0}", batchCount));
        batchCount++;

        foreach (var user in batch)
        {
            try
            {
                var userData = _oldDb.userData.Where(x =>
                    x.user_id == user.user_id).ToList();

                if (userData.Count > 0)
                {
                    // Insert into table
                    var newData = new newData()
                    {
                        UserId = user.user_id // shortened code for brevity.
                    };

                    _db.newUserData.Add(newData);
                    _db.SaveChanges();

                    // Insert item(s) into table
                    foreach (var item in userData)
                    {
                        if (!_db.userDataItems.Any(x => x.id == item.id))
                        {
                            var newItem = new Item()
                            {
                                UserId = user.user_id, // shortened code for brevity.
                                DataId = newData.id // id from object created above
                            };

                            _db.userDataItems.Add(newItem);
                        }

                        _db.SaveChanges();
                        successCount++;
                    }
                }
            }
            catch (Exception ex)
            {
                errorCount++;
                Console.WriteLine(String.Format("Error saving changes for user_id: {0} at {1}.", user.user_id, DateTime.Now));
                Console.WriteLine("Message: " + ex.Message);
                Console.WriteLine("InnerException: " + ex.InnerException);
            }
        }
    }

    Console.WriteLine(String.Format("End at {0}...", DateTime.Now));
    Console.WriteLine(String.Format("Successful imports: {0} | Errors: {1}", successCount, errorCount));
    Console.WriteLine(String.Format("Total running time: {0}", (DateTime.Now - exportStart).ToString(@"hh\:mm\:ss")));
}

Recommended Answer

Unfortunately, the major issue is the number of database round-trips.

You make a round-trip:

  • For every user, to retrieve the user data by user ID from the old database
  • For every user, to save the user data in the new database
  • For every user, to save the user data items in the new database

So if you have 3 million users, and every user has an average of 5 user data items, you make at least 3M + 3M + 15M = 21 million database round-trips, which is insane.

The only way to dramatically improve the performance is by reducing the number of database round-trips.

Batching - Retrieve User Data by ID

You can quickly reduce the number of database round-trips by retrieving all the user data for a batch at once, and since you don't need to track these entities, use AsNoTracking() for an additional performance gain.

var list = batch.Select(x => x.user_id).ToList();
var userDatas = _oldDb.userData
                  .AsNoTracking()
                  .Where(x => list.Contains(x.user_id))
                  .ToList();

foreach(var userData in userDatas)
{
    ....
}

This change alone should already save you a few hours.
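Once the rows are pre-fetched, the per-user query inside the loop can also become an in-memory lookup. This is a minimal sketch of that idea; the ToLookup grouping is my addition rather than part of the original answer, and the entity and property names simply follow the question's code:

```csharp
// Index the pre-fetched rows by user_id so the inner loop
// does a dictionary lookup instead of a database query.
var userDataByUser = userDatas.ToLookup(x => x.user_id);

foreach (var user in batch)
{
    // ToLookup returns an empty sequence (not an exception)
    // for users that have no data rows.
    foreach (var row in userDataByUser[user.user_id])
    {
        // ... map row to the new schema as before ...
    }
}
```

This keeps the round-trip count at one query per batch regardless of how many users the batch contains.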

Batching - SaveChanges

Every time you save a user data row or item, you perform a database round-trip.

Disclaimer: I'm the owner of the project Entity Framework Extensions

This library lets you perform:

  • BulkSaveChanges
  • BulkInsert
  • BulkUpdate
  • BulkDelete
  • BulkMerge

You can either call BulkSaveChanges at the end of each batch, or build lists to insert and use BulkInsert directly for even better performance.

You will, however, have to set a relation to the newData instance instead of using its ID directly.

foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
    // Retrieve all user data for the batch at once.
    var list = batch.Select(x => x.user_id).ToList();
    var userDatas = _oldDb.userData
                          .AsNoTracking()
                          .Where(x => list.Contains(x.user_id))
                          .ToList();

    // Create lists used for BulkInsert
    var newDatas = new List<newData>();
    var newDataItems = new List<Item>();

    foreach (var userData in userDatas)
    {
        // newDatas.Add(newData);

        // newDataItem.OwnerData = newData;
        // newDataItems.Add(newDataItem);
    }

    _db.BulkInsert(newDatas);
    _db.BulkInsert(newDataItems);
}

EDIT: Answering sub-questions

One of the properties of a newDataItem is the ID of newData (e.g. newDataItem.newDataId), so newData would have to be saved first in order to generate its ID. How would I BulkInsert if there is a dependency on another object?

You must use navigation properties instead. By using a navigation property, you never have to specify the parent ID; you set the parent object instance instead.

public class UserData
{
    public int UserDataID { get; set; }
    // ... properties ...

    public List<UserDataItem> Items { get; set; }
}

public class UserDataItem
{
    public int UserDataItemID { get; set; }
    // ... properties ...

    public UserData OwnerData { get; set; }
}

var userData = new UserData();
var userDataItem = new UserDataItem();

// Use navigation property to set the parent.
userDataItem.OwnerData = userData;

Tutorial: Configuring a One-to-Many Relationship

Also, I don't see a BulkSaveChanges in your example code. Would that have to be called after all the BulkInserts?

BulkInsert inserts directly into the database. You don't have to call SaveChanges or BulkSaveChanges; once you invoke the method, it's done ;)

Here is an example using BulkSaveChanges:

foreach (IEnumerable<users> batch in usersQuery.Batch(BatchSize))
{
    // Retrieve all user data for the batch at once.
    var list = batch.Select(x => x.user_id).ToList();
    var userDatas = _oldDb.userData
                          .AsNoTracking()
                          .Where(x => list.Contains(x.user_id))
                          .ToList();

    // Create lists used for BulkSaveChanges
    var newDatas = new List<newData>();
    var newDataItems = new List<Item>();

    foreach (var userData in userDatas)
    {
        // newDatas.Add(newData);

        // newDataItem.OwnerData = newData;
        // newDataItems.Add(newDataItem);
    }

    // A fresh context per batch keeps the ChangeTracker small.
    using (var context = new UserContext())
    {
        context.userDatas.AddRange(newDatas);
        context.userDataItems.AddRange(newDataItems);
        context.BulkSaveChanges();
    }
}

BulkSaveChanges is slower than BulkInsert because it has to use some internal methods from Entity Framework, but it is still way faster than SaveChanges.

In the example, I create a new context for every batch to avoid memory issues and gain some performance. If you reuse the same context for all batches, you will end up with millions of tracked entities in the ChangeTracker, which is never a good idea.
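If you cannot take a dependency on the library, the same batching idea helps with plain Entity Framework too. A hedged sketch for EF6 (the `UserContext` type and set names are taken from the example above; `Configuration.AutoDetectChangesEnabled` is a standard EF6 DbContext setting):

```csharp
using (var context = new UserContext())
{
    // DetectChanges runs on every Add by default, which is
    // quadratic over a large batch; disable it and let
    // SaveChanges perform a single final detection.
    context.Configuration.AutoDetectChangesEnabled = false;

    context.userDatas.AddRange(newDatas);
    context.userDataItems.AddRange(newDataItems);

    // One SaveChanges call per batch instead of one per row.
    // Note: EF6 still issues one INSERT statement per entity,
    // so the bulk libraries above remain much faster.
    context.SaveChanges();
}
```

This won't match BulkInsert, but it removes the per-row SaveChanges overhead that dominates the original code.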
