多行定界符 [英] Multiple row delimiters

查看:86
本文介绍了多行定界符的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何在SSIS中为平面文件连接定义多个行定界符?例如,对于包含以下字符串的文本文件,

How to define multiple row delimiters for a Flat File Connection in SSIS? for example for a text file containing this string:


在任何时候都需要保持文明;

Civility is required at all times; rudeness will not be tolerated.

我想在使用';''之后再使用这两行。 '作为行定界符:

I want to have this two rows after using ';' and '.' for row delimiter:


在任何时候都必须具有文明性

Civility is required at all times

不容忍粗鲁


推荐答案

对于源数据,我创建了3行文件

For source data, I created a 3 line file

Civility is required at all times; rudeness will not be tolerated.
The quick brown fox jumped over the lazy dogs.
I am but a single row with no delimiter beyond the carriage return

一般方法I下面采用的是使用格式为Ragged Right的平面文件连接管理器,我的标头行分隔符为{CR} {LF}。我定义了一列InputRow为String8000。YMMV

The general approach I have taken below is to use a flat file connection manager with a format of Ragged Right and my header row delimiter is {CR}{LF}. I defined one columns, InputRow as String 8000. YMMV

在我的数据流中,在平面文件源之后,我添加了一个脚本组件作为数据转换,称为拆分行。

In my data flow, after the flat file source, I add a script component as a data transformation called Split Rows.

在输入列选项卡上,检查InputRow并将其保留为ReadOnly,以便脚本可以访问该值。可以将其切换为ReadWrite并修改传出值,但这不适用于这种类型的操作。

On the Input Columns tab, check the InputRow and leave it as ReadOnly so the script can access the value. It'd be nice if you could switch it to ReadWrite and modify the outgoing values but that's not applicable for this type of operation.

默认情况下,脚本任务是同步组件,这意味着行内和行外之间存在1:1的关系。这将无法满足您的需求,因此您需要将其切换到异步模式。我将输出0重命名为OutputSplit,并将SynchronousInput的值从 Input 0(16)更改为None。您的16值可能会有所不同。

By default, a script task is a synchronous component, meaning there's a 1:1 relationship between rows in and rows out. This will not suit your needs so you will need to switch it over to Asynchronous mode. I renamed the Output 0 to OutputSplit and changed the value of SynchronousInput from "Input 0 (16)" to None. Your value for 16 may vary.

在OutputSplit的输出列上,添加一个名称为SplitRow DT_STR 8000的列。

On your Output Columns for OutputSplit, Add a Column with a name of SplitRow DT_STR 8000.

在脚本转换中,您只需要关心ProcessInputRow方法。字符串类提供了一个split方法,该方法采用字符值数组作为分隔符。当前,它在下面的数组初始化器中进行了硬编码,但是可以很容易地将其定义为变量并传递到脚本中。

Within your script transformation, you only need to be concerned with the ProcessInputRow method. The string class offers a split method that takes an array of character values that will work as the splitters. Currently, it is hard coded below in the array initializer but it could just as easily be defined as a variable and passed into the script. That is left as an exercise to the poster.

/// <summary>
/// we have to make this an async script as 1 input row can be many output rows
/// </summary>
/// <param name="Row"></param>
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string[] results = Row.InputRow.Split(new char[]{';', '.'});
    foreach (string line in results)
    {
        // Remove this line if it is desirable to have empty strings in the buffer
        if (!string.IsNullOrEmpty(line))
        {
            OutputSplitBuffer.AddRow();
            // You might want to call trim operations on the line
            OutputSplitBuffer.SplitRow = line;
        }
    }
}

我打了F5,瞧,

With all of this done, I hit F5 and voila,

这将是一个占用大量内存的程序包,具体取决于您运行的数据量。我敢肯定,可以进行一些优化,但这应该足以使您前进。

This is going to be a fairly memory intensive package depending on how much data you run through it. I am certain there are optimizations one could make but this should be sufficient to get you going.

这篇关于多行定界符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆