具有可变列号的 SSIS 平面文件 [英] SSIS Flat Files with Variable Column Numbers

查看:21
本文介绍了具有可变列号的 SSIS 平面文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

SSIS 在处理平面文件方面做了两件事,这特别令人沮丧,似乎应该有办法绕过它们,但我想不通.如果您定义一个包含 10 列的平面文件,以 CRLF 作为行尾标记的制表符分隔,这对于每行正好有 10 列的文件非常适用.两种痛苦的情况是:

SSIS does 2 things in relation to handling flat files which are particularly frustrating, and it seems there should be a way around them, but I can't figure it out. If you define a flat file with 10 columns, tab delimited with CRLF as the end of row marker this will work perfectly for files where there are exactly 10 columns in every row. The 2 painful scenarios are these:

  1. 如果有人在任何地方提供了第 11 列的文件,如果 SSIS 简单地忽略它会很好,因为您还没有定义它.它应该只读取您定义的 10 列,然后跳到行标记的末尾,但实际上是将任何其他数据与第 10 列中的数据连接起来,并将所有这些数据塞入第 10 列.真的有点没用.我意识到发生这种情况是因为第 10 列的分隔符不是像所有其他列一样的制表符,而是 CRLF,所以它只是将所有内容都抓取到 CRLF,并在这样做时用空替换多余的制表符.在我看来,这并不明智.

  1. If someone supplies a file with an 11th column anywhere, it would be nice if SSIS simply ignored it, since you haven't defined it. It should just read the 10 columns you have defined then skip to the end of row marker, but what is does instead is concatenate any additional data with the data in the 10th column and bung all that into the 10th column. Kind of useless really. I realise this happens because the delimiter for the 10th column is not tab like all the others, but CRLF, so it just grabs everything up to the CRLF, replacing extra tabs with nothing as it does so. This is not smart, in my opinion.

如果有人提供一个只有 9 列的文件,情况会更糟.它会暂时忽略意外发现的 CRLF,并用下一行开头的列填充任何缺失的列!不聪明在这里是轻描淡写.谁会希望这种情况发生?文件的其余部分此时是垃圾.

If someone supplies a file with only 9 columns something even worse happens. It will temporarily disregard the CRLF it has unexpectedly found and pad any missing columns with columns from the start of the next row! Not smart is an understatement here. Who would EVER want that to happen? The remainder of the file is garbage at that point.

无论出于何种原因,文件宽度的变化似乎都不是不合理的(当然,只有行尾的变化可以合理地处理(x 少或多列),但看起来这根本没有处理好,除非我遗漏了什么.

It doesn't seem unreasonable to have variations in file width for whatever reason (of course only variations at the end of a row can reaonably be handled (x fewer or extra columns) but it looks like this is simply not handled well, unless I'm missing something.

到目前为止,我们唯一的解决方案是将一行加载为一个巨大的列 (column0),然后使用脚本任务使用它找到的许多分隔符动态拆分它.这很有效,除了它将行宽限制为 4000 个字符(一个 unicode 列的最大宽度).如果您需要导入更宽的行(比如用于文本导入的多个 4000 宽列),那么您需要像上面一样定义多列,但随后您会被要求每行的列数严格限制.

So far our only solution to this is to load a row as one giant column (column0) and then use a script task to dynamically split it using however many delimiters it finds. This works well, except that it limits row widths to 4000 chars (the max width of one unicode column). If you need to import a wider row (say with multiple 4000 wide columns for text import) then you need to define multiple columns as above, but you are then stuck with requiring a strict number of columns per row.

有什么办法可以绕过这些限制吗?

Is there any way around these limitations?

推荐答案

Glenn,我感受到你的痛苦 :)SSIS 无法使列动态化,因为它需要在每列通过时存储元数据,并且由于我们正在处理可以包含任何类型数据的平面文件,因此它不能假设 '列中的 CRLF-that-is-not-that-last-column',确实是它应该读取的数据行的末尾.

Glenn, i feel your pain :) SSIS cannot make the columns dynamic, as it needs to store metadata of each column as it come through, and since we're working with flat files which can contain any kind of data, it can't assume that the CRLF in a 'column-that-is-not-that-last-column', is indeed the end of the data line its supposed to read.

与 SQL2000 中的 DTS 不同,您不能在运行时更改 SSIS 包的属性.

Unlike DTS in SQL2000, you can't change the properties of a SSIS package at runtime.

您可以做的是创建一个父包,它读取平面文件(脚本任务),并且只读取平面文件的第一行以获取列数和列名.此信息可以存储在变量中.

What you could do is create a parent package, that reads the flat file (script task), and only reads the first line of the flat file to get the number of columns, and the column names. This info can be stored in a variable.

然后,父包以编程方式加载子包(再次脚本任务),并更新子包的Source Connection的元数据.这是你要么1. 添加/删除列以匹配平面文件.2. 为列设置列分隔符,最后一列必须是 CRLF - 匹配 ROW 分隔符3. 重新初始化Dataflow 任务中Source Compoenent 的元数据(ComponentMetadata.ReinitializeMetadata())(以识别Source Connection 中最近的变化).4.保存子ssis包.

Then, the parent package loads the child package (script task again) programmatically, and updates the metadata of the Source Connection of the child package. This is where you would either 1. Add / remove columns to match the flat file. 2. Set the column delimiter for the columns, the last column has to be the CRLF - matching the ROW delimiter 3. Reinitialise the metadata (ComponentMetadata.ReinitializeMetadata()) of the Source Compoenent in the Dataflow task (to recognize the recent changes in the Source Connection). 4. Save the child ssis package.

有关以编程方式修改包的详细信息仅现成可用.

Details on programmatically modifying a package is readily available only.

然后,您的父包只执行子包(执行包任务),它将使用您的新映射执行.

Then, your parent package just executes the Child package (Execute Package Task), and it'll execute with your new mappings.

这篇关于具有可变列号的 SSIS 平面文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆