SSIS Derived Column - Parse Text between break returns

Question

I have a text field from a SQL Server Source. It is a phone number field that typically has this format:

Home: 555-555-1212
Work: 555-555-1212
Cell: 555-555-1212
Emergency: 555-555-1212

I'm trying to split this into separate fields so that each one holds only the number, e.g. 555-555-1212.

I am then taking this field and converting it to a string. There are literal line breaks (\r\n) between the labels. The goal is to split this data among multiple fields (Home, Work, Cell, Emergency, etc.). I was researching how to split text among fields and made some progress. For the home number, I used this logic:

SUBSTRING(Phone_converted,FINDSTRING(Phone_converted,"Home:",1) + 5,FINDSTRING(Phone_converted,"\n",1) - FINDSTRING(Phone_converted,"Home:",1) - 5)

This works great: it parses up to the line break and I get 555-555-1212.

Now I run into an issue when searching for text between line breaks. I tried the same logic for the Work number:

SUBSTRING(Phone_converted,FINDSTRING(Phone_converted,"Work:",1) + 5,FINDSTRING(Phone_converted,"\n",1) - FINDSTRING(Phone_converted,"Work:",1) - 5)

But that doesn't work, and the row ends up in my error redirection file. I then tried including the preceding line break in the search text:

SUBSTRING(Phone_converted,FINDSTRING(Phone_converted,"\nWork:",1) + 5,FINDSTRING(Phone_converted,"\n",1) - FINDSTRING(Phone_converted,"\nWork:",1) - 5)

No luck there either. Any ideas on how I can address this? Also, I would appreciate an idea of how to handle the Emergency label at the end. There won't be a line break after it, but I still want to parse the text.

Recommended answer

I look at your data and I see

Home:|555-555-1212|Work:|555-555-1212|Cell:|555-555-1212|Emergency:|555-555-1212

I'm using the pipe character, |, as a placeholder for where I would segment that string, which is basically wherever there is whitespace (space, tab, newline, etc.).

There are two approaches to this. I'll start with the easy one.

String.Split is your friend here. Called with no arguments, it splits on whitespace, which segments the source data at the placeholder positions shown above.

I added a new Script Component, acting as a Transformation, and created 4 output columns, all strings of length 12 with code page 1252: Home, Work, Cell, and Emergency. I populate them like so:

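public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Split() with no arguments splits on every whitespace character.
    // Each "\r\n" pair therefore yields an empty entry, which is why
    // the numbers sit at indexes 1, 4, 7 and 10.
    string[] split = Row.PhoneData.Split();

    Row.Home = split[1];
    Row.Work = split[4];
    Row.Cell = split[7];
    Row.Emergency = split[10];
}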
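
The fixed indexes above depend on every row having all four labels, in the same order, separated by \r\n. As a hedged variation, the sketch below keys each number by its label instead, so a missing or reordered line does not shift the others. `PhoneParser` and `Parse` are hypothetical names introduced for illustration; they are not part of SSIS or of the original answer.

```csharp
using System;
using System.Collections.Generic;

public static class PhoneParser
{
    // Hypothetical helper: maps each label ("Home", "Work", ...) to its number.
    public static Dictionary<string, string> Parse(string phoneData)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        if (string.IsNullOrEmpty(phoneData))
        {
            return result;
        }

        // Split on CR and LF, dropping the empty entries that "\r\n" would produce.
        string[] lines = phoneData.Split(
            new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string line in lines)
        {
            int colon = line.IndexOf(':');
            if (colon <= 0)
            {
                continue; // no label on this line, so skip it
            }

            string label = line.Substring(0, colon).Trim();
            string number = line.Substring(colon + 1).Trim();
            result[label] = number; // a repeated label keeps the last value seen
        }

        return result;
    }
}
```

Inside Input0_ProcessInputRow, each output column could then be filled from the dictionary with TryGetValue, falling back to an empty string (or a null flag) when a label is absent.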

Derived Column

I'm not going to build out a full-blown implementation of this. The script approach above is much simpler, but I run into situations where ETL devs say they aren't allowed to use Script tasks/components, usually because people reached for them first instead of last.

The approach here is to have lots of Derived Column components on your Data Flow. It won't hurt performance, and it will definitely make your debugging easier, as you'll have lots of it to do.

This would add 4 columns to the data flow: HomeColonPosition, WorkColonPosition, etc. You've already started down this path, but build it out into the actual data flow, because you'll need to reference these positions in later expressions, and it's easier to fix one calculation that populates a column than a wrong calculation that is repeated everywhere. You're likely to find that 4 derived columns are useful here. Be aware that the third argument to FINDSTRING is the occurrence to find, not a starting position, so the previous colon's position cannot be passed in as a starting point.

So instead of Work being

FINDSTRING(PhoneData, ":", FINDSTRING(PhoneData, ":", 1) + 1)

it would just be

FINDSTRING(PhoneData, ":", 2)

Just knowing the positions of the 4 colons in that string, I can figure out where the phone numbers are (maybe). The position of the colon + 2 (skipping the colon and the space) is the starting point; from there, take 12 characters.
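
The arithmetic can be checked outside SSIS with a rough C# model. This is only a sketch: `SsisModel`, `FindString`, `Substring`, and `NumberAfterColon` are hypothetical stand-ins written for illustration, and they assume that FINDSTRING takes an occurrence number as its third argument and that both functions use 1-based positions. In the package itself, the equivalent Derived Column expression would be along the lines of SUBSTRING(PhoneData, HomeColonPosition + 2, 12).

```csharp
using System;

public static class SsisModel
{
    // Rough stand-in for SSIS FINDSTRING: the 1-based position of the given
    // occurrence of 'search' in 'text', or 0 when there is no such occurrence.
    public static int FindString(string text, string search, int occurrence)
    {
        int index = -1;
        for (int i = 0; i < occurrence; i++)
        {
            index = text.IndexOf(search, index + 1, StringComparison.Ordinal);
            if (index < 0)
            {
                return 0;
            }
        }
        return index + 1;
    }

    // Rough stand-in for SSIS SUBSTRING with a 1-based start position.
    // Simplification: out-of-range requests return "" or a shorter string,
    // whereas a negative length in SSIS fails the row.
    public static string Substring(string text, int start, int length)
    {
        int begin = start - 1;
        if (begin < 0 || begin >= text.Length || length <= 0)
        {
            return "";
        }
        return text.Substring(begin, Math.Min(length, text.Length - begin));
    }

    // Colon position + 2 skips the colon and the space; then take 12 characters.
    public static string NumberAfterColon(string phoneData, int occurrence)
    {
        int colon = FindString(phoneData, ":", occurrence);
        return colon == 0 ? "" : Substring(phoneData, colon + 2, 12);
    }
}
```

For the four-line sample with \r\n line breaks, the colons land at positions 5, 25, 45 and 70, and NumberAfterColon with occurrences 1 through 4 yields the four numbers. The same model shows why the Work expression in the question fails: the first "\n" is at position 20 and "Work:" starts at 21, so the computed length is 20 - 21 - 5 = -6.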

Where this approach gets ugly, much as with the script approach, is when the data isn't consistent.
