SSIS Package runs for 500x longer on one server


Question

I have an SSIS package--two data flow tasks, 8 components each, reading from two flat files, nothing spectacular. If I run it in BIDS, it takes reliably about 60 seconds. I have a sandbox DB server with the package running in a job which also takes reliably 30-60 seconds. On my production server, the same job with the same package takes anywhere from 30 seconds to 12 hours.

With logging enabled on the package, it looks like it bogs down--initially at least--in the pre-execute phase of one or the other (or both) data flow tasks. But I can also see the data coming in--slowly, in chunks, so I think it does move on from there later. The IO subsystem gets pounded, and SSIS generates many large temp files (about 150MB worth--my input data files are only about 24MB put together) and is reading and writing vigorously from those files (thrashing?).

Of note, if I point my BIDS instance of the package at the production server, it still only takes about 60 seconds to run! So it must be something with running dtexec there, not the DB itself.

I've already tried to optimize my package, reducing input row byte size, and I made the two data flow tasks run in series rather than in parallel--to no avail.

Both DB servers are running MSSQL 2008 R2 64-bit, same patch level. Both servers are VMs on the same host, with the same resource allocation. Load on the production server should not be that much higher than on the sandbox server right now. The only difference I can see is that the production server is running Windows Server 2008, while the sandbox is on Windows Server 2008 R2.

Help!!! Any ideas to try are welcome; what could be causing this huge discrepancy?

Here's what my package looks like…

The Control Flow is extremely simple:

The Data Flow is as follows:

The second data flow task is exactly the same, just with a different source file and destination table.

The completion constraint in the Control Flow is only there to make the tasks run serially to try and cut down on resources needed concurrently (not that it helped solve the problem)…there is no actual dependency between the two tasks.

I'm aware of potential issues with blocking and partially-blocking transforms (can't say I understand them completely, but somewhat at least) and I know the aggregate and merge join are blocking and could cause problems. However, again, this all runs fine and quickly in every other environment except the production server…so what gives?

The reason for the Merge Join is to make the task wait for both branches of the Multicast to complete. The right branch finds the minimum datetime in the input and deletes all records in the table after that date, while the left branch carries the new input records for insertion--so if the right branch proceeds before the aggregate and deletion, the new records will get deleted (this happened). I'm unaware of a better way to manage this.

The error output from "Delete records" is always empty--this is deliberate, as I don't actually want any rows from that branch in the merge (the merge is only there to synchronize completion as explained above).
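The ordering hazard described above can be expressed as set-based SQL. A minimal sketch, assuming illustrative table and column names (dbo.DestinationTable, dbo.InputRows, RecordDate, Payload; none of these come from the actual package):

```sql
-- Right branch: find the earliest datetime in the incoming data and
-- remove every destination row on or after it.
DELETE FROM dbo.DestinationTable
WHERE RecordDate >= (SELECT MIN(RecordDate) FROM dbo.InputRows);

-- Left branch: must run only AFTER the delete completes, or the freshly
-- inserted rows fall inside the deleted range and are lost.
INSERT INTO dbo.DestinationTable (RecordDate, Payload)
SELECT RecordDate, Payload
FROM dbo.InputRows;
```

In the package, the Merge Join stands in for that "must run after" constraint, since a data flow offers no direct way to sequence two branches of a Multicast.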

See comment below about the warning icons.

Answer

If you have logging turned on, preferably to SQL Server, add the OnPipelineRowsSent event. You can then determine where the package is spending all of its time (see this post). Your IO subsystem is getting slammed and generating all these temp files because you are no longer able to keep all the information in memory (due to your asynchronous transformations).

The relevant query from the linked article follows. It looks at events in the sysdtslog90 table (SQL Server 2008+ users should substitute sysssislog) and performs some time analysis on them.

;
WITH PACKAGE_START AS
(
    SELECT DISTINCT
        Source
    ,   ExecutionID
    ,   Row_Number() Over (Order By StartTime) As RunNumber
    FROM
        dbo.sysdtslog90 AS L
    WHERE
        L.event = 'PackageStart'
)
, EVENTS AS
(
    SELECT
        SourceID
    ,   ExecutionID
    ,   StartTime
    ,   EndTime
    ,   Left(SubString(message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, 56) + 1) + 1) + 1) + 2, Len(message)), CharIndex(':', SubString(message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, 56) + 1) + 1) + 1) + 2, Len(message)) ) - 2) As DataFlowSource
    ,   Cast(Right(message, CharIndex(':', Reverse(message)) - 2) As int) As RecordCount
    FROM
        dbo.sysdtslog90 AS L
    WHERE
        L.event = 'OnPipelineRowsSent'
)
, FANCY_EVENTS AS
(
    SELECT
        SourceID
    ,   ExecutionID
    ,   DataFlowSource
    ,   Sum(RecordCount) RecordCount
    ,   Min(StartTime) StartTime
    ,   (
            Cast(Sum(RecordCount) as real) /
            Case
                When DateDiff(ms, Min(StartTime), Max(EndTime)) = 0
                    Then 1
                Else DateDiff(ms, Min(StartTime), Max(EndTime))
            End
        ) * 1000 As RecordsPerSec
    FROM
        EVENTS DF_Events
    GROUP BY
        SourceID
    ,   ExecutionID
    ,   DataFlowSource
)
SELECT
    'Run ' + Cast(RunNumber As varchar) As RunName
,   S.Source
,   DF.DataFlowSource
,   DF.RecordCount
,   DF.RecordsPerSec
,   Min(S.StartTime) StartTime
,   Max(S.EndTime) EndTime
,   DateDiff(ms, Min(S.StartTime), Max(S.EndTime)) Duration
FROM
    dbo.sysdtslog90 AS S
    INNER JOIN
        PACKAGE_START P
        ON S.ExecutionID = P.ExecutionID
    LEFT OUTER JOIN
        FANCY_EVENTS DF
        ON S.SourceID = DF.SourceID
        AND S.ExecutionID = DF.ExecutionID
WHERE
    S.message <> 'Validating'
GROUP BY
    RunNumber
,   S.Source
,   DataFlowSource
,   RecordCount
,   DF.StartTime
,   RecordsPerSec
,   Case When S.Source = P.Source Then 1 Else 0 End
ORDER BY
    RunNumber
,   Case When S.Source = P.Source Then 1 Else 0 End Desc
,   DF.StartTime
,   Min(S.StartTime);

You were able to use this query to discern that the Merge Join component was the lagging component. Why it performs differently between the two servers, I can't say at this point.

If you have the ability to create a table in your destination system, you could modify your process to use two data flows (and eliminate the costly asynchronous components).

  1. The first data flow would take the flat file, apply the derived columns, and land the rows in a staging table.
  2. An Execute SQL Task would then fire to handle the find-minimum-date-plus-delete logic.
  3. The second data flow would then query from the staging table and snap the rows into your destination.
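Steps 2 and 3 above could be sketched as follows; the staging table and column names (dbo.Staging_Input, RecordDate, Payload) are hypothetical stand-ins for whatever the real flat files contain:

```sql
-- Step 2 (Execute SQL Task): min-date + delete logic, now plain T-SQL
-- instead of an Aggregate + Merge Join inside the data flow.
DELETE FROM dbo.DestinationTable
WHERE RecordDate >= (SELECT MIN(RecordDate) FROM dbo.Staging_Input);

-- Step 3 (second data flow): the OLE DB Source simply reads the staged
-- rows, so no blocking or asynchronous components are needed.
SELECT RecordDate, Payload
FROM dbo.Staging_Input;
```

Because the Execute SQL Task sits between the two data flows in the Control Flow, the delete is guaranteed to finish before the insert starts, which is exactly the sequencing the Merge Join was emulating.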
