Talend: newline character in the middle of a CSV column


Problem description

I am fetching data with the tSoap component. The response is XML that wraps comma-separated values: columns are separated by commas and rows by '\n'.

After that, I use the tExtractXMLField component to extract the data from the response.

But the data contains '\n' inside quoted strings, and each one is treated as the start of a new row. I tried using the tReplace component with a regex to remove the '\n' characters inside the quotes, but the data is too large and the job fails with a StackOverflowError.

I also tried the tNormalize component with the CSV option to separate the rows, but the problem persists.

Can you please help me with this? Thanks in advance.

The response I get from the SOAP request is:

  <env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">
<env:Header/>
<env:Body>
<ns2:getReportResultCsvResponse xmlns:ns2="http://service.admin.ws.five9.com/">
<return>TIMESTAMP,CALL ID,NOTES
"Mon, 17 Apr 2017 10:05:38",4223519,
"Mon, 17 Apr 2017 10:05:40",4223520,
"Mon, 17 Apr 2017 10:05:41",4223521,"Alexandria..
Monday -- 55 partial
Bal -- 224 May 1
Visa"
"Mon, 17 Apr 2017 10:05:42",4223522,
"Mon, 17 Apr 2017 10:05:43",4223523,
"Mon, 17 Apr 2017 10:11:04",4223524,
"Mon, 17 Apr 2017 10:05:43",4223524,
"Mon, 17 Apr 2017 10:05:45",4223525,</return>
</ns2:getReportResultCsvResponse>
</env:Body>
</env:Envelope>

As we can see, the "notes" column contains '\n' between the quotes, which breaks the data extraction. Can you please tell me how to resolve this?

Recommended answer

In fact, your file is a CSV file embedded in an XML file.
Because the "notes" field is enclosed in double quotes ("), one solution is to transform the file into pure CSV; then, with the appropriate "CSV options" setting, the "\n" problem disappears automagically.

Here is what the job looks like:

tFileInputFullRow reads the input file as is, into a single field named "line" by default. Just set Header to 4 and Footer to 3 to ignore most of the XML part (assuming the file structure is always the same).

Pass the result to a tMap just to remove the remaining XML "return" tag, which the previous step did not remove because it is not on a separate line.
The tMap uses replaceAll to remove this tag.
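
The original screenshot of the tMap is not available, so the following is only a sketch of what the two steps might amount to in plain Java. The pattern "</?return>", the class name and the helper names are assumptions, not the author's exact expression; inside a tMap the equivalent would be something like row1.line.replaceAll("</?return>", ""), where row1 is a hypothetical input flow name.

```java
import java.util.Arrays;
import java.util.List;

final class ReturnTagStripper {

    // Removes an opening or closing <return> tag from a single line.
    // Assumed pattern; the answer only says that replaceAll is used.
    static String stripReturnTag(String line) {
        return line.replaceAll("</?return>", "");
    }

    // Mimics the Header and Footer settings of tFileInputFullRow:
    // skip the first `header` lines and the last `footer` lines.
    static List<String> dropEnvelope(List<String> lines, int header, int footer) {
        return lines.subList(header, lines.size() - footer);
    }

    public static void main(String[] args) {
        // Shortened version of the SOAP response from the question.
        List<String> lines = Arrays.asList(
            "<env:Envelope xmlns:env=\"http://schemas.xmlsoap.org/soap/envelope/\">",
            "<env:Header/>",
            "<env:Body>",
            "<ns2:getReportResultCsvResponse xmlns:ns2=\"http://service.admin.ws.five9.com/\">",
            "<return>TIMESTAMP,CALL ID,NOTES",
            "\"Mon, 17 Apr 2017 10:05:38\",4223519,",
            "\"Mon, 17 Apr 2017 10:05:45\",4223525,</return>",
            "</ns2:getReportResultCsvResponse>",
            "</env:Body>",
            "</env:Envelope>");

        // Header = 4 and Footer = 3, as in the answer.
        for (String line : dropEnvelope(lines, 4, 3)) {
            System.out.println(stripReturnTag(line));
        }
    }
}
```

With Header = 4 and Footer = 3, only the CSV lines survive, and the tag removal cleans the first and last of them.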

After the tMap, write the flow to a pure CSV file using tFileOutputDelimited. Leave all options at their proposed default values.

Now start a second subjob with tFileInputDelimited to read the CSV file. Define the schema with the 3 columns "Timestamp", "CallId" and "Notes". Set the field separator to "," and then, for the magic, tick "CSV options", nothing else.
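
To make the "magic" less opaque, here is a minimal sketch of quote-aware CSV reading in plain Java. It is not Talend's implementation; it only illustrates the general idea, assuming RFC 4180 style quoting (fields optionally wrapped in double quotes, with "" as an escaped quote). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedCsvReader {

    // Splits CSV text into records, treating a newline as a record
    // separator only when it occurs outside double quotes.
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // keeps embedded ',' and '\n'
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                record.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else if (c != '\r') {
                field.append(c);
            }
        }
        // Flush the last record when the text does not end with a newline.
        if (field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "TIMESTAMP,CALL ID,NOTES\n"
            + "\"Mon, 17 Apr 2017 10:05:40\",4223520,\n"
            + "\"Mon, 17 Apr 2017 10:05:41\",4223521,\"Alexandria..\n"
            + "Monday -- 55 partial\nBal -- 224 May 1\nVisa\"\n"
            + "\"Mon, 17 Apr 2017 10:05:42\",4223522,";
        List<List<String>> records = parse(csv);
        System.out.println("records: " + records.size());
        System.out.println("notes of third record: " + records.get(2).get(2));
    }
}
```

The newlines inside the quoted "notes" value stay part of the field instead of starting new records, which is the behaviour the answer relies on.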

To display only the record with "\n" in the "notes" field, I set Header to 3 and Limit to 1 (which is why there is just 1 row after the tFileInputDelimited).
Here is the result:

As you can see, the "notes" field is spread over 4 lines as expected because of the "\n" characters.
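
If a downstream step needs each note on a single line, the newlines inside quotes can be replaced without any regex. The StackOverflowError in the question is commonly reported when java.util.regex matches a repeated group over a very long input, because the matcher recurses. A single pass over the characters avoids that. The following is a sketch only; the class and method names are hypothetical, it handles '\n' but not a lone '\r', and in a job it could presumably be called from a tJavaRow or a user routine.

```java
final class QuotedNewlineFlattener {

    // Replaces every '\n' found between double quotes with `replacement`,
    // leaving record-separating newlines untouched. Runs in one linear
    // pass with constant stack depth.
    static String flatten(String csv, char replacement) {
        StringBuilder out = new StringBuilder(csv.length());
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles the flag twice,
                // so the quoted state is preserved.
                inQuotes = !inQuotes;
                out.append(c);
            } else if (inQuotes && c == '\n') {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String csv = "\"Mon, 17 Apr 2017 10:05:41\",4223521,\"Alexandria..\n"
            + "Monday -- 55 partial\nBal -- 224 May 1\nVisa\"\n"
            + "\"Mon, 17 Apr 2017 10:05:42\",4223522,";
        System.out.println(flatten(csv, ' '));
    }
}
```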

Regards,
TRF
