如何加载混合记录类型的定宽文件?并且文件包含两个标题 [英] How to load mixed record type fixed width file? And also file contain two header

查看:88
本文介绍了如何加载混合记录类型的定宽文件?并且文件包含两个标题的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个任务来加载格式异常的文本文件.该文件也包含不需要的数据.它包含两个背对背的标头,并且每个标头的数据在备用行上指定.标题行在------之后开始.我需要同时读取标头及其对应的数据,并使用将其转储到某些Excel/表目标中.让我知道如何使用SSIS中的任何转换或脚本来解决此问题. 不知道该如何使用脚本任务.

现在,我正在读取文件的一列,并使用派生的列手动尝试使用substring函数对其进行拆分.但这仅适用于一个标头,并且编码类型太硬.我需要一些动态方法来直接读取标题行和数据行.

输入文件:

A1234-012                                         I N F O R M A T I C S  C O M P A N Y                                      08/23/17
PAGE    2 BATCH ABC                                           PAYMENT DATE & DUE DATE                                 EDIT PAGE  481
------------------------------------------------------------------------------------------------------------------------------------
 SEO  XRAT CLT     LOAN      OPENING  PAYMENT MATURIUH LOAN NEXE ORIG-AMT   OFF TO CATE  CONTC MON NO.TO  TOL NEL   S CUP CO IND PAT
 NOM  CODE NOM    NOMTER      DATE    DUO DATE  DATE   TIME PT #  MONEY         AQ LOAN  NUMBER    BLOCK   PAYMENT  U TYP GH OMG IND
       1-3 4-6  7-13/90-102  14-19    20-25     26-31 32-34 35-37 38-46   47-48 49 50-51 52-61  62  63      64-72  73 4-5 76 77 8-80
------------------------------------------------------------------------------------------------------------------------------------
 SEO  XRAT CLT     LOAN     A/C   A/C     MIN     MAX    MAX   PENDI  LATE CCH  L/F  PARTLYS  CUR   L/F      L/F     L/F
 NOM  CODE NOM    NOMTER    CODE FACTOR   MON     MON    ROAD   DAYS  MONE POT  L/A  L/F JAC  INT  VAD CD  USED PI  VAD DT
       1-3 4-6  7-13/90-102  14  15      20-23   24-29   30-34 35-37   38-42    43     44     49     60     61-63    64-69
USED-ID:
------------------------------------------------------------------------------------------------------------------------------------
454542 070 567    2136547895 08-08-18 08-06-18 11-02-18 123 256      62,222  LK  5 55 5463218975 5 3       5,555.22    33        H55
025641 055 123    5144511352 B  .55321   2.55 6531.22                                                 H      #AS
454542 070 567    2136547895 08-08-18 08-06-18 11-02-18 123 256      62,222  LK  5 55 5463218975 5 3       5,555.22    33        H55
025641 055 123    5144511352 B  .55321   2.55 6531.22                                                 H      #AS
454542 070 567    2136547895 08-08-18 08-06-18 11-02-18 123 256      62,222  LK  5 55 5463218975 5 3       5,555.22    33        H55
025641 055 123    5144511352 B  .55321   2.55 6531.22                                                 H      #AS

预期输出应为:

文件1:

 SEO  XRAT CLT     LOAN      OPENING  PAYMENT MATURIUH LOAN NEXE ORIG-AMT   OFF TO CATE  CONTC MON NO.TO  TOL NEL   S CUP CO IND PAT
 NOM  CODE NOM    NOMTER      DATE    DUO DATE  DATE   TIME PT #  MONEY         AQ LOAN  NUMBER    BLOCK   PAYMENT  U TYP GH OMG IND
454542 070 567    2136547895 08-08-18 08-06-18 11-02-18 123 256      62,222  LK  5 55 5463218975 5 3       5,555.22    33        H55
454542 070 567    2136547895 08-08-18 08-06-18 11-02-18 123 256      62,222  LK  5 55 5463218975 5 3       5,555.22    33        H55
454542 070 567    2136547895 08-08-18 08-06-18 11-02-18 123 256      62,222  LK  5 55 5463218975 5 3       5,555.22    33        H55

文件2:

 SEO  XRAT CLT     LOAN     A/C   A/C     MIN     MAX    MAX   PENDI  LATE CCH  L/F  PARTLYS  CUR   L/F      L/F     L/F
 NOM  CODE NOM    NOMTER    CODE FACTOR   MON     MON    ROAD   DAYS  MONE POT  L/A  L/F JAC  INT  VAD CD  USED PI  VAD DT
025641 055 123    5144511352 B  .55321   2.55 6531.22                                                 H      #AS
025641 055 123    5144511352 B  .55321   2.55 6531.22                                                 H      #AS
025641 055 123    5144511352 B  .55321   2.55 6531.22                                                 H      #AS

解决方案

忽略前3行

要忽略前三行,您可以简单地将平面文件连接管理器配置为忽略它们,类似于:


拆分文件并删除不良行

1.配置连接管理器

此外,在平面文件连接管理器中,转到高级"选项卡,删除除一列以外的所有列,并将其数据类型更改为DT_STR,将MaxLength更改为4000.

添加两个连接管理器,每个目标文件一个,其中您必须仅定义一列,最大长度为4000:

2.配置数据流任务

添加一个数据流任务,并在其中添加一个平面文件源.选择源文件连接管理器.

使用以下表达式添加条件拆分:

File1

FINDSTRING([Column 0],"OPENING",1) > 1 || FINDSTRING([Column 0],"DATE",1) > 1 || TOKENCOUNT([Column 0]," ") == 19

File2

FINDSTRING([Column 0],"A/C",1) > 1 || FINDSTRING([Column 0],"FACTOR",1) > 1 || TOKENCOUNT([Column 0]," ") == 10

上面的表达式是根据您在问题中提到的预期输出创建的,我很累在每个标头中搜索唯一的关键字,并根据出现的空间数拆分了数据行.

最后将每个输出映射到目标平面文件组件:

实验

执行结果显示在以下屏幕截图中:


更新1-删除重复项

要删除重复项,您必须参考以下链接:


更新2-仅删除重复的标头+用Tab键替换空格

如果只需要删除重复的标题,则可以分两个步骤进行操作:

  1. 在每个条件拆分输出之后添加脚本组件以标记不需要的行
  2. 添加条件拆分以根据脚本组件的输出过滤行

此外,由于column值不包含空格,因此您可以使用正则表达式用单个Tab替换空格以使文件一致.

脚本组件

在脚本组件中添加类型为DT_BOOL的输出列并将其命名为outFlag,还添加类型为DT_STR且长度等于4000的输出列outColumn0,然后选择Column0作为输入列./p>

然后在脚本编辑器(C#)中编写以下脚本:

首先请确保您添加RegularExpressions名称空间

using System.Text.RegularExpressions;

脚本代码

int SEOCount = 0;
int NOMCount = 0;

Regex regex = new Regex("[ ]{2,}", RegexOptions.None);


public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    if (Row.Column0.Trim().StartsWith("SEO"))
    {


        if (SEOCount == 0)
        {

            SEOCount++;
            Row.outFlag = true;

        }
        else
        {

            Row.outFlag = false;

        }



    }
    else if (Row.Column0.Trim().StartsWith("NOM"))
    {

        if (NOMCount == 0)
        {

            NOMCount++;
            Row.outFlag = true;

        }
        else
        {

            Row.outFlag = false;

        }

    }
    else if (Row.Column0.Trim().StartsWith("PAGE"))
    {
        Row.outFlag = false;
    }
    else
    {

        Row.outFlag = true;

    }


    Row.outColumn0 = regex.Replace(Row.Column0.TrimStart(), "\t");
}

有条件拆分

在每个脚本组件之后添加条件拆分,并使用以下表达式过滤重复的标头:

[outFlag] == True

并将条件拆分连接到目标. 确保将outColumn0映射到目标列.

打包链接

I got a task to load a strangely formatted text file. The file contains unwanted data too. It contains two headers back to back and data for each header is specified on alternate lines. Header rows start after ------. I need to read both the header along with its corresponding data and dump it into some Excel/table destination using. Let me know how to solve this using any transformation in SSIS or maybe with a script. Don't know how to use script task for this.

Right now I am reading the file in one column and using a derived column manually trying to split it using substring function. But that works for only one header and it is too hard coded type. I need some dynamic approach to read header rows as well as data row directly.

Input file:

A1234-012                                         I N F O R M A T I C S  C O M P A N Y                                      08/23/17
PAGE    2 BATCH ABC                                           PAYMENT DATE & DUE DATE                                 EDIT PAGE  481
------------------------------------------------------------------------------------------------------------------------------------
 SEO  XRAT CLT     LOAN      OPENING  PAYMENT MATURIUH LOAN NEXE ORIG-AMT   OFF TO CATE  CONTC MON NO.TO  TOL NEL   S CUP CO IND PAT
 NOM  CODE NOM    NOMTER      DATE    DUO DATE  DATE   TIME PT #  MONEY         AQ LOAN  NUMBER    BLOCK   PAYMENT  U TYP GH OMG IND
       1-3 4-6  7-13/90-102  14-19    20-25     26-31 32-34 35-37 38-46   47-48 49 50-51 52-61  62  63      64-72  73 4-5 76 77 8-80
------------------------------------------------------------------------------------------------------------------------------------
 SEO  XRAT CLT     LOAN     A/C   A/C     MIN     MAX    MAX   PENDI  LATE CCH  L/F  PARTLYS  CUR   L/F      L/F     L/F
 NOM  CODE NOM    NOMTER    CODE FACTOR   MON     MON    ROAD   DAYS  MONE POT  L/A  L/F JAC  INT  VAD CD  USED PI  VAD DT
       1-3 4-6  7-13/90-102  14  15      20-23   24-29   30-34 35-37   38-42    43     44     49     60     61-63    64-69
USED-ID:
------------------------------------------------------------------------------------------------------------------------------------
454542 070 567    2136547895 08-08-18 08-06-18 11-02-18 123 256      62,222  LK  5 55 5463218975 5 3       5,555.22    33        H55
025641 055 123    5144511352 B  .55321   2.55 6531.22                                                 H      #AS
454542 070 567    2136547895 08-08-18 08-06-18 11-02-18 123 256      62,222  LK  5 55 5463218975 5 3       5,555.22    33        H55
025641 055 123    5144511352 B  .55321   2.55 6531.22                                                 H      #AS
454542 070 567    2136547895 08-08-18 08-06-18 11-02-18 123 256      62,222  LK  5 55 5463218975 5 3       5,555.22    33        H55
025641 055 123    5144511352 B  .55321   2.55 6531.22                                                 H      #AS

Expected output should be:

FILE 1:

 SEO  XRAT CLT     LOAN      OPENING  PAYMENT MATURIUH LOAN NEXE ORIG-AMT   OFF TO CATE  CONTC MON NO.TO  TOL NEL   S CUP CO IND PAT
 NOM  CODE NOM    NOMTER      DATE    DUO DATE  DATE   TIME PT #  MONEY         AQ LOAN  NUMBER    BLOCK   PAYMENT  U TYP GH OMG IND
454542 070 567    2136547895 08-08-18 08-06-18 11-02-18 123 256      62,222  LK  5 55 5463218975 5 3       5,555.22    33        H55
454542 070 567    2136547895 08-08-18 08-06-18 11-02-18 123 256      62,222  LK  5 55 5463218975 5 3       5,555.22    33        H55
454542 070 567    2136547895 08-08-18 08-06-18 11-02-18 123 256      62,222  LK  5 55 5463218975 5 3       5,555.22    33        H55

FILE 2:

 SEO  XRAT CLT     LOAN     A/C   A/C     MIN     MAX    MAX   PENDI  LATE CCH  L/F  PARTLYS  CUR   L/F      L/F     L/F
 NOM  CODE NOM    NOMTER    CODE FACTOR   MON     MON    ROAD   DAYS  MONE POT  L/A  L/F JAC  INT  VAD CD  USED PI  VAD DT
025641 055 123    5144511352 B  .55321   2.55 6531.22                                                 H      #AS
025641 055 123    5144511352 B  .55321   2.55 6531.22                                                 H      #AS
025641 055 123    5144511352 B  .55321   2.55 6531.22                                                 H      #AS

解决方案

Ignore first 3 rows

To ignore first 3 rows you can simply configure the flat file connection manager to ignore them, similar to:


Split file and remove bad rows

1. Configure connection managers

In addition, in the flat file connection manager, go to the advanced tab and delete all columns except one and change its data type to DT_STR and the MaxLength to 4000.

Add two connection managers , one for each destination file where you must define only one column with max length = 4000:

2. Configure Data flow task

Add a Data Flow Task, And add a Flat File Source inside. Select the Source File connection manager.

Add a conditional split with the following expressions:

File1

FINDSTRING([Column 0],"OPENING",1) > 1 || FINDSTRING([Column 0],"DATE",1) > 1 || TOKENCOUNT([Column 0]," ") == 19

File2

FINDSTRING([Column 0],"A/C",1) > 1 || FINDSTRING([Column 0],"FACTOR",1) > 1 || TOKENCOUNT([Column 0]," ") == 10

The expressions above are created based on the expected output you mentioned in the question, i tired to search for unique keywords inside each header and splitted the data rows based on the number of space occurrence.

Finally Map each output to a destination flat file component:

Experiments

The execution result is shown in the following screenshots:


Update 1 - Remove duplicates

To remove duplicates you must you can refer to the following link:


Update 2 - Remove only duplicates headers + Replace spaces with Tab

If you need only to remove duplicate headers then you can do this in two steps:

  1. Add a script component after each conditional split output to flag unwanted rows
  2. Add a conditional split to filter rows based on the script component output

In addition, because the columns values does not contains spaces you can use regular expression to replace spaces with single Tab to make the file consistent.

Script Component

In the Script Component add an output column of type DT_BOOL and name it outFlag also add a output column outColumn0 of type DT_STR and length equal to 4000 and select Column0 as Input Column.

Then write the following script in the Script Editor (C#):

First make sure that you add the RegularExpressions namespace

using System.Text.RegularExpressions;

Script Code

int SEOCount = 0;
int NOMCount = 0;

Regex regex = new Regex("[ ]{2,}", RegexOptions.None);


public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    if (Row.Column0.Trim().StartsWith("SEO"))
    {


        if (SEOCount == 0)
        {

            SEOCount++;
            Row.outFlag = true;

        }
        else
        {

            Row.outFlag = false;

        }



    }
    else if (Row.Column0.Trim().StartsWith("NOM"))
    {

        if (NOMCount == 0)
        {

            NOMCount++;
            Row.outFlag = true;

        }
        else
        {

            Row.outFlag = false;

        }

    }
    else if (Row.Column0.Trim().StartsWith("PAGE"))
    {
        Row.outFlag = false;
    }
    else
    {

        Row.outFlag = true;

    }


    Row.outColumn0 = regex.Replace(Row.Column0.TrimStart(), "\t");
}

Conditional Split

Add a conditional split after each Script Component and use the following expression to filter duplicate header:

[outFlag] == True

And connect the conditional split to the destination. Make Sure to map outColumn0 to the destination column.

Package link

这篇关于如何加载混合记录类型的定宽文件?并且文件包含两个标题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆