How to extract only the last 2 days' files from tFTPFileList based on modified time, without storing them in a tBufferOutput component (Talend job)

Problem description

At the moment I iterate through all 5k files available in the folder, store them in a tBufferOutput, read them back with a tBufferInput, sort them by mtime (the modified time on the FTP site) in descending order, and extract only the top 10 files.

Since the job iterates through all 5k files in one go, it is time-consuming and causes unnecessary latency issues with the remote FTP site.

I was wondering whether there is a simpler way, without iterating, to get just the latest 10 files from the FTP site directly, sort them by mtime in descending order, and perform operations on them.

My Talend job flow looks like this at the moment. Please advise on any other method that could optimize the job's performance.

Basically, I don't want to iterate through all the files on the FTP site. Instead, I want to get the top 10 directly from the remote FTP site (tFTPFileList), run checks against the DB, and download them later.

In short, the question is: is there any way, without iterating, to get just the latest 10 files using only the modified timestamp in descending order? Alternatively, I want to extract the last 3 days' files from the remote FTP site.

File names are in this format: A_B_C_D_E_20200926053617.csv

Approach B: with Java. I tried the tJava code below for flow B:

// Parse the modified time reported for the current file
Date lastModifiedDate = TalendDate.parseDate("EEE MMM dd HH:mm:ss zzz yyyy", row2.mtime_string);
Date current_date = TalendDate.getCurrentDate();

System.out.println(lastModifiedDate);
System.out.println(current_date);
System.out.println(((String)globalMap.get("tFTPFileList_1_CURRENT_FILE")));

// abs_path is only set when the file was modified within the last day
if (TalendDate.diffDate(current_date, lastModifiedDate, "dd") <= 1) {
    output_row.abs_path = input_row.abs_path;
    System.out.println(output_row.abs_path);
}

Now tlogrow3 prints NULL values everywhere. Please advise.

Recommended answer

Define 3 context variables (mask1, mask2 and mask3, which are the names used in the code below).

In a tJava, compute the file masks (with wildcards) for the 3 days, starting at the current date:

Date currentDate = TalendDate.getCurrentDate();
Date currentDateMinus1 = TalendDate.addDate(currentDate, -1, "dd");
Date currentDateMinus2 = TalendDate.addDate(currentDate, -2, "dd");

context.mask1 ="*" + TalendDate.formatDate("yyyyMMdd", currentDate) + "*.csv";
context.mask2 ="*" + TalendDate.formatDate("yyyyMMdd", currentDateMinus1) + "*.csv";
context.mask3 ="*" + TalendDate.formatDate("yyyyMMdd", currentDateMinus2) + "*.csv";
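
To see what these masks look like, the same computation can be sketched in plain Java outside Talend. This is only an illustration: `java.time` stands in for the `TalendDate` routine, and the class name `FileMaskSketch`, the method `masksFor`, and the fixed example date are assumptions, not part of the original job.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; in the Talend job the masks go into context variables.
public class FileMaskSketch {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    /** Builds one wildcard mask per day, starting at 'today' and going back (days - 1) days. */
    static List<String> masksFor(LocalDate today, int days) {
        List<String> masks = new ArrayList<>();
        for (int i = 0; i < days; i++) {
            masks.add("*" + today.minusDays(i).format(DAY) + "*.csv");
        }
        return masks;
    }

    public static void main(String[] args) {
        // Fixed date so the example is reproducible; a real job would use LocalDate.now().
        for (String mask : masksFor(LocalDate.of(2020, 9, 27), 3)) {
            System.out.println(mask);
        }
    }
}
```

With the fixed date above, the second mask is `*20200926*.csv`, which matches the example file name A_B_C_D_E_20200926053617.csv from the question.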

Then, in the tFTPFileList, use the 3 context variables as file masks, so that only the files from today and the 2 previous days are retrieved.
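
If only the newest 10 of the retrieved files are needed, the timestamp embedded in each file name can be used for ordering instead of the FTP modified time. The sketch below is a hedged illustration in plain Java, not part of the original answer: it assumes every relevant name ends in `_yyyyMMddHHmmss.csv` (as in the example from the question) and that this stamp tracks the modification time closely enough. The class `LatestFilesSketch` and its methods are hypothetical names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper; the file names would come from the tFTPFileList iteration.
public class LatestFilesSketch {
    // Assumed layout: any prefix, "_", a 14-digit yyyyMMddHHmmss stamp, then ".csv".
    private static final Pattern STAMP = Pattern.compile("_(\\d{14})\\.csv$");

    /** Returns the 14-digit stamp from the file name, or null if the name does not fit. */
    static String stampOf(String fileName) {
        Matcher m = STAMP.matcher(fileName);
        return m.find() ? m.group(1) : null;
    }

    /** Drops names without a stamp and returns at most 'limit' names, newest first. */
    static List<String> newest(List<String> fileNames, int limit) {
        return fileNames.stream()
                .filter(n -> stampOf(n) != null)
                // Fixed-width digit strings sort chronologically when compared as text.
                .sorted(Comparator.comparing(LatestFilesSketch::stampOf, Comparator.reverseOrder()))
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = List.of(
                "A_B_C_D_E_20200925120000.csv",
                "A_B_C_D_E_20200926053617.csv",
                "readme.txt",
                "A_B_C_D_E_20200924235959.csv");
        System.out.println(newest(names, 2));
    }
}
```

Because the masks have already narrowed the listing to 3 days of files, this ordering step only has to handle a small list rather than all 5k entries.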
