Performance of bcp/BULK INSERT vs. Table-Valued Parameters


Problem Description

I'm about to have to rewrite some rather old code using SQL Server's BULK INSERT command because the schema has changed, and it occurred to me that maybe I should think about switching to a stored procedure with a TVP instead, but I'm wondering what effect it might have on performance.

Some background information that might help explain why I'm asking this question:

  • The data actually comes in via a web service. The web service writes a text file to a shared folder on the database server which in turn performs a BULK INSERT. This process was originally implemented on SQL Server 2000, and at the time there was really no alternative other than chucking a few hundred INSERT statements at the server, which actually was the original process and was a performance disaster.

  • The data is bulk inserted into a permanent staging table and then merged into a much larger table (after which it is deleted from the staging table).

  • The amount of data to insert is "large", but not "huge" - usually a few hundred rows, maybe 5-10k rows tops in rare instances. Therefore my gut feeling is that BULK INSERT being a non-logged operation won't make that big a difference (but of course I'm not sure, hence the question).

  • The insertion is actually part of a much larger pipelined batch process and needs to happen many times in succession; therefore performance is critical.

The reasons I would like to replace the BULK INSERT with a TVP are:

  • Writing the text file over NetBIOS is probably already costing some time, and it's pretty gruesome from an architectural perspective.

  • I believe that the staging table can (and should) be eliminated. The main reason it's there is that the inserted data needs to be used for a couple of other updates at the same time of insertion, and it's far costlier to attempt the update from the massive production table than it is to use an almost-empty staging table. With a TVP, the parameter basically is the staging table, and I can do anything I want with it before/after the main insert (see the sketch after this list).

  • I could pretty much do away with dupe-checking, cleanup code, and all of the overhead associated with bulk inserts.

  • No need to worry about lock contention on the staging table or tempdb if the server gets a few of these transactions at once (we try to avoid it, but it happens).
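To make the earlier point concrete that with a TVP the parameter effectively is the staging table, here is a minimal sketch of what the ADO.NET side of such a call could look like. The table type dbo.ImportRowType, the procedure dbo.usp_ImportBatch, and the column layout are illustrative placeholders rather than names from the actual system:

```csharp
using System.Data;
using System.Data.SqlClient;

static class TvpCallSketch
{
    static void ImportBatch(string connectionString, DataTable batch)
    {
        // The DataTable is expected to match a user-defined table type, e.g.:
        //   CREATE TYPE dbo.ImportRowType AS TABLE
        //       (SequenceId int, Ts datetime2, Value float);
        // Inside the procedure, @Rows (declared READONLY) can be joined against
        // and used as the source for UPDATE/MERGE statements on the production
        // tables, so no physical staging table is needed.
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("dbo.usp_ImportBatch", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;

            var rows = cmd.Parameters.AddWithValue("@Rows", batch);
            rows.SqlDbType = SqlDbType.Structured;  // marks the parameter as a TVP
            rows.TypeName = "dbo.ImportRowType";    // the table type defined on the server

            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```

On the T-SQL side the parameter is declared as @Rows dbo.ImportRowType READONLY, and the procedure can reference it like any other table expression.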

I'm obviously going to profile this before putting anything into production, but I thought it might be a good idea to ask around first before I spend all that time, see if anybody has any stern warnings to issue about using TVPs for this purpose.

So - for anyone who's cozy enough with SQL Server 2008 to have tried or at least investigated this, what's the verdict? For inserts of, let's say, a few hundred to a few thousand rows, happening on a fairly frequent basis, do TVPs cut the mustard? Is there a significant difference in performance compared to bulk inserts?

Follow-up (a.k.a.: test results)

The end result is now in production after what feels like a 36-stage deployment process. Both solutions were extensively tested:

  • Remove the shared-folder code and use the SqlBulkCopy class directly;
  • Switch to a stored procedure using TVPs.

Just so readers can get an idea of what exactly was tested, to allay any doubts as to the reliability of this data, here is a more detailed explanation of what this import process actually does:

  1. Start with a temporal data sequence that is ordinarily about 20-50 data points (although it can sometimes be up to a few hundred);

  2. Do a whole bunch of crazy processing on it that's mostly independent of the database. This process is parallelized, so about 8-10 of the sequences in (1) are being processed at the same time. Each parallel process generates 3 additional sequences.

  3. Take all 3 sequences and the original sequence and combine them into a batch.

  4. Combine the batches from all 8-10 now-finished processing tasks into one big super-batch.

  5. Import it using either the BULK INSERT strategy (see next step), or TVP strategy (skip to step 8).

  6. Use the SqlBulkCopy class to dump the entire super-batch into 4 permanent staging tables.

  7. Run a Stored Procedure that (a) performs a bunch of aggregation steps on 2 of the tables, including several JOIN conditions, and then (b) performs a MERGE on 6 production tables using both the aggregated and non-aggregated data. (Finished)

OR

  8. Generate 4 DataTable objects containing the data to be merged; 3 of them contain CLR types which unfortunately aren't properly supported by ADO.NET TVPs, so they have to be shoved in as string representations, which hurts performance a bit (see the sketch after this list).

  9. Feed the TVPs to a Stored Procedure, which does essentially the same processing as (7), but directly with the received tables. (Finished)
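As a rough illustration of steps 8 and 9, the sketch below shows how one of those DataTable batches could be assembled, with the CLR-typed column carried as a string. The column names and layout are invented for the example; the finished table would then be passed to the stored procedure in the same way as in the earlier TVP sketch:

```csharp
using System;
using System.Data;

static class TvpBatchSketch
{
    // Hypothetical column layout; the real process builds 4 such tables.
    static DataTable BuildBatchTable()
    {
        var t = new DataTable("Batch");
        t.Columns.Add("SequenceId", typeof(int));
        t.Columns.Add("Ts", typeof(DateTime));
        // The CLR-typed value is not passed through the TVP natively here,
        // so it travels as its string representation and the stored
        // procedure converts it back before the final MERGE.
        t.Columns.Add("Payload", typeof(string));
        return t;
    }

    static void AddRow(DataTable t, int sequenceId, DateTime ts, object clrValue)
    {
        // ToString() is the "shove it in as a string representation" step.
        t.Rows.Add(sequenceId, ts, clrValue.ToString());
    }
}
```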

The results were reasonably close, but the TVP approach ultimately performed better on average, even when the data slightly exceeded 1000 rows.

Note that this import process is run many thousands of times in succession, so it was very easy to get an average time simply by counting how many hours (yes, hours) it took to finish all of the merges.

Originally, an average merge took almost exactly 8 seconds to complete (under normal load). Removing the NetBIOS kludge and switching to SqlBulkCopy reduced the time to almost exactly 7 seconds. Switching to TVPs further reduced the time to 5.2 seconds per batch. That's a 35% improvement in throughput for a process whose running time is measured in hours - so not bad at all. It's also a ~25% improvement over SqlBulkCopy.

I am actually fairly confident that the true improvement was significantly more than this. During testing it became apparent that the final merge was no longer the critical path; instead, the Web Service that was doing all of the data processing was starting to buckle under the number of requests coming in. Neither the CPU nor the database I/O were really maxed out, and there was no significant locking activity. In some cases we were seeing a gap of a few idle seconds between successive merges. There was a slight gap, but much smaller (half a second or so) when using SqlBulkCopy. But I suppose that will become a tale for another day.

Conclusion: Table-Valued Parameters really do perform better than BULK INSERT operations for complex import+transform processes operating on mid-sized data sets.

I'd like to add one other point, just to assuage any apprehension on the part of the folks who are pro-staging-tables. In a way, this entire service is one giant staging process. Every step of the process is heavily audited, so we don't need a staging table to determine why some particular merge failed (although in practice it almost never happens). All we have to do is set a debug flag in the service and it will break to the debugger or dump its data to a file instead of the database.

In other words, we already have more than enough insight into the process and don't need the safety of a staging table; the only reason we had the staging table in the first place was to avoid thrashing on all of the INSERT and UPDATE statements that we would have had to use otherwise. In the original process, the staging data only lived in the staging table for fractions of a second anyway, so it added no value in maintenance/maintainability terms.

Also note that we have not replaced every single BULK INSERT operation with TVPs. Several operations that deal with larger amounts of data and/or don't need to do anything special with the data other than throw it at the DB still use SqlBulkCopy. I am not suggesting that TVPs are a performance panacea, only that they succeeded over SqlBulkCopy in this specific instance involving several transforms between the initial staging and the final merge.

So there you have it. Point goes to TToni for finding the most relevant link, but I appreciate the other responses as well. Thanks again!

Recommended Answer

I don't really have experience with TVPs yet; however, there is a nice performance comparison chart vs. BULK INSERT on MSDN here.

They say that BULK INSERT has a higher startup cost, but is faster thereafter. In a remote client scenario they draw the line at around 1000 rows (for "simple" server logic). Judging from their description I would say you should be fine with using TVPs. The performance hit - if any - is probably negligible and the architectural benefits seem very good.

On a side note, you can avoid the server-local file and still use bulk copy by using the SqlBulkCopy object. Just populate a DataTable and feed it to the WriteToServer method of a SqlBulkCopy instance. Easy to use, and very fast.
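For reference, a minimal version of that pattern might look like the following; the destination table name is a placeholder:

```csharp
using System.Data;
using System.Data.SqlClient;

static class BulkCopySketch
{
    static void BulkLoad(string connectionString, DataTable rows)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.StagingTable";  // placeholder name
            // If the DataTable's column names/order don't match the target,
            // add explicit entries to bulk.ColumnMappings first.
            bulk.WriteToServer(rows);  // streams the rows, no server-local file needed
        }
    }
}
```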
