FTP using Talend, get only most recent file?


Question

I have a Talend job that needs to pull an XML file down from an SFTP server and then load it into an Oracle database. The date of the XML extraction is in the file name, for example "FileNameHere_Outbound_201407092215.xml", which I believe is yyyyMMddhhmm formatting. The leading "FileNameHere" portion is the same for all the files. I need to read the date from the end of the file name and pull down only the most recent file for processing.

I am not sure how to do this with FTP. I've previously used tFileList to order the items by date descending, but that is not an option with FTP. I know it probably involves some Java to pull that portion out of the file name, but I'm not very Java-literate. I can manage with a bit of assistance, though.

Does anyone have any insight into how to download only the most recent file from an FTP server?

Solution

There's a tFTPFileList component on the palette. That should give you a list of all the files on the FTP location. From here you then want to parse out the time stamp which could be done with a regular expression or alternatively by substringing it depending on which you feel more comfortable with.

Then it's just a case of sorting by the extracted time stamp, which gives you the newest file name so you can go and fetch that specific file.
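For anyone who prefers the substring route, here is a minimal plain-Java sketch of the idea outside Talend. It assumes every name ends in a 12-digit stamp followed by ".xml", as in the question; the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class NewestBySubstring {
    // The stamp is assumed to be the 12 characters directly before ".xml".
    private static final int STAMP_LENGTH = 12;
    private static final String SUFFIX = ".xml";

    /** Returns the 12-character stamp that sits directly before ".xml". */
    static String stamp(String fileName) {
        int end = fileName.length() - SUFFIX.length();
        return fileName.substring(end - STAMP_LENGTH, end);
    }

    /** Returns the name with the greatest stamp, or null for an empty list. */
    static String newest(List<String> fileNames) {
        String best = null;
        for (String name : fileNames) {
            // Fixed-width year-to-minute stamps order correctly as plain strings.
            if (best == null || stamp(name).compareTo(stamp(best)) > 0) {
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "FileNameHere_Outbound_201407092215.xml",
            "FileNameHere_Outbound_201407102215.xml",
            "FileNameHere_Outbound_201407082215.xml");
        System.out.println(newest(names));
    }
}
```

Because the stamp runs from year down to minute at a fixed width, comparing the strings directly is enough here and no date parsing is needed.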

Here's an outline of an overly laborious way to get this done but it works. You should be able to tweak this easily yourself too:

In this job design I've used a tFileList rather than a tFTPFileList because I don't have an FTP location to test against. The premise stays the same, although the exercise would be pointless with a real tFileList, which can already sort by modified date (among other options).

We start off by running the tFileList/tFTPFileList component to iterate through all the files in the location (you can apply a file mask to limit what is returned). We then feed each iteration into a tFixedFlowInput component, which lets us retrieve values from the globalMap as the tFileList/tFTPFileList iterates through each file.

I've listed everything that the tFileList provides (you can see the options by pressing ctrl+space) but you only really need the file name and potentially the file path or file directory. From here we then throw everything into a buffer with a tBufferOutput component so that we can gather every iteration of the location.
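To make the iterate-then-buffer pattern concrete, here is a rough plain-Java simulation of it. The HashMap and ArrayList are stand-ins for Talend's globalMap and buffer, and the key "tFTPFileList_1_CURRENT_FILE" assumes the component is named tFTPFileList_1; check the exact key with ctrl+space in your own job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IterateToBuffer {
    /** Simulates list component -> tFixedFlowInput -> tBufferOutput. */
    static List<String> collect(List<String> remoteFiles) {
        Map<String, Object> globalMap = new HashMap<>(); // stand-in for Talend's globalMap
        List<String> buffer = new ArrayList<>();          // stand-in for the shared buffer
        for (String file : remoteFiles) {
            // The list component publishes the current file name on each iteration.
            globalMap.put("tFTPFileList_1_CURRENT_FILE", file);
            // A tFixedFlowInput column would read it back with an expression like this.
            String fileName = (String) globalMap.get("tFTPFileList_1_CURRENT_FILE");
            // tBufferOutput appends the one-row flow, so all iterations end up together.
            buffer.add(fileName);
        }
        return buffer;
    }

    public static void main(String[] args) {
        System.out.println(collect(Arrays.asList(
            "FileNameHere_Outbound_201407092215.xml",
            "FileNameHere_Outbound_201407102215.xml")));
    }
}
```

The point of the buffer is that each iteration only ever sees one file, so the rows have to be gathered somewhere before they can be sorted as a set.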

Once the tFileList/tFTPFileList has iterated through every file in the directory it then triggers the next sub job with an OnSubjobOk link where we start by reading the completed buffer back in with a tBufferInput component. At this point I've started scattering tLogRow components throughout the flow so I can better visualise the data at each step.

After this we then use a tExtractRegexFields component to extract the date time stamp from the file name:

Here, I am using the following regex "^.+?_Outbound_([0-9]{12})\\.xml$" to capture the date time stamp. It relies on the file name being a combination of any characters, followed by the string literal _Outbound_, then followed by the date time stamp that we want to capture (which is represented by 12 numeric characters) and then finished with .xml.
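If you want to try the pattern outside Talend first, a small sketch with java.util.regex behaves as follows. The class and method names are illustrative only; the pattern string is the one quoted above, with the doubled backslash because it is a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StampRegex {
    // Same pattern as in the answer, written as a Java string literal.
    static final Pattern STAMP =
        Pattern.compile("^.+?_Outbound_([0-9]{12})\\.xml$");

    /** Returns the captured stamp, or null when the name does not match. */
    static String extract(String fileName) {
        Matcher m = STAMP.matcher(fileName);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Matches: prints the 12 captured digits.
        System.out.println(extract("FileNameHere_Outbound_201407092215.xml"));
        // No "_Outbound_" literal: prints null.
        System.out.println(extract("FileNameHere_Inbound_201407092215.xml"));
        // Only 10 digits before ".xml": prints null.
        System.out.println(extract("FileNameHere_Outbound_2014070922.xml"));
    }
}
```

Names that do not fit the pattern simply fail to match, so stray files in the FTP directory will not produce a time stamp.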

We also add a column to our schema to hold the captured date time stamp.

As the extra column is a date time stamp of the form yyyyMMddhhmm, we can specify this pattern directly on the column and use it as a date object from then on. (Strictly, for a 24-hour value such as 2215 the Java date pattern letter is HH, giving yyyyMMddHHmm.)

From here we simply sort by date descending on the extracted date time stamp column and then use a tSampleRow to take only the first row of the flow, as per the guidelines on the component configuration.
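In plain Java terms, the parse, sort and take-first steps amount to something like the sketch below. It uses SimpleDateFormat with the 24-hour HH field, on the assumption that the sample hour 22 means the stamps are 24-hour; the class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Date;
import java.util.List;

class SortByStamp {
    /** Parses a 12-digit stamp; HH is the 24-hour field. */
    static Date parseStamp(String stamp) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmm");
        format.setLenient(false); // reject impossible values such as month 13
        try {
            return format.parse(stamp);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad stamp: " + stamp, e);
        }
    }

    /** Sorts stamps newest first and returns the first one (sort, then sample). */
    static String newestStamp(List<String> stamps) {
        List<String> sorted = new ArrayList<>(stamps);
        Comparator<String> byDate = Comparator.comparing(SortByStamp::parseStamp);
        sorted.sort(byDate.reversed()); // descending by parsed date
        return sorted.get(0);           // first row only
    }

    public static void main(String[] args) {
        System.out.println(newestStamp(Arrays.asList(
            "201407092215", "201412010000", "201401312359")));
    }
}
```

This mirrors a descending sort followed by keeping only row 1; the input list must not be empty.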

To finish the job, output the target file path to the globalMap (either in a tJavaRow, or with a tFlowToIterate, which does this for you automatically) and then use the stored file path in the tFTPFileGet's file mask setting.
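As a rough illustration of that last hand-off, the sketch below simulates it in plain Java. The HashMap stands in for Talend's globalMap, and the key "newestFile" is a made-up name; in a real tJavaRow you would put the value from your own input column, and with a tFlowToIterate the key would typically be the row name plus the column name instead.

```java
import java.util.HashMap;
import java.util.Map;

class GlobalMapSketch {
    // Stand-in for Talend's job-wide globalMap.
    static final Map<String, Object> globalMap = new HashMap<>();

    /** Roughly what a tJavaRow body might do with the single surviving row. */
    static void storeNewest(String fileName) {
        globalMap.put("newestFile", fileName); // "newestFile" is a hypothetical key
    }

    /** The kind of expression that could go in the tFTPFileGet file mask. */
    static String fileMask() {
        return (String) globalMap.get("newestFile");
    }

    public static void main(String[] args) {
        storeNewest("FileNameHere_Outbound_201407102215.xml");
        System.out.println(fileMask());
    }
}
```

In the job itself, only the expression `((String)globalMap.get("newestFile"))` would go into the file mask field, so the tFTPFileGet fetches exactly that one file.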
