Recursive list.files for FTP-Server

Question

Is there an FTP version of list.files(path, recursive=TRUE)?

I want to get the URLs of all the ZIP archives in the subdirectories of this FTP server:

url <- "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/"

So I want a list of all files in directories such as
ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/wind/recent/ as well as
ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/air_temperature/historical/ and so on.

With RCurl I managed to download the directory listing of this directory, but not to get a comprehensive list of all ZIP archives in all subdirectories. Any advice other than looping through the directories and getting the listings one by one?

RCurl code:

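library(RCurl)

## list the entries of one FTP directory (not recursive)
dwd_dirlist <- function(url, full = TRUE){
  dir <- unlist(
    strsplit(
      getURL(url,
             ftp.use.epsv = FALSE,
             dirlistonly = TRUE),
      "\r?\n") ## tolerate CRLF line endings in the listing
    )
  ## optionally prepend the directory URL to each entry name
  if(full) dir <- paste0(url, dir)
  return(dir)
}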

Recommended answer

If you have the lftp utility installed on your system, then you can use its find command to recursively list files underneath a specified directory. The description of find is near the top of the lftp documentation.

Unfortunately, as the documentation shows, and unlike the common Unix find utility, lftp's find command supports very few options: only --max-depth and --list (for a long listing). So you can't use the -name, -regex, etc. predicates that the find utility normally provides. On the other hand, lftp does support an unusual but powerful feature: it lets you pipe output to local tools, so you could, for example, pipe the find output to your local grep from inside the lftp command line. Of course, there's nothing stopping you from grepping in a shell pipeline, or filtering back in R. Here's an example using an lftp pipeline (as you can see, a disadvantage of this approach is that the multiple levels of escaping get pretty convoluted):

url <- 'ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/';
## note: <<< is a bash here-string, so system() must run a shell that supports it
zips <- system(paste0('lftp ',url,' <<<\'find| grep "\\\\.zip$"; exit;\';'),intern=TRUE);
zips;
##    [1] "./air_temperature/historical/stundenwerte_TU_00003_19500401_20110331_hist.zip"
##    [2] "./air_temperature/historical/stundenwerte_TU_00044_20070401_20141231_hist.zip"
##    [3] "./air_temperature/historical/stundenwerte_TU_00052_19760101_19880101_hist.zip"
##    [4] "./air_temperature/historical/stundenwerte_TU_00071_20091201_20141231_hist.zip"
##
## ... snip ...
##
## [6616] "./wind/recent/stundenwerte_FF_15207_akt.zip"
## [6617] "./wind/recent/stundenwerte_FF_15214_akt.zip"
## [6618] "./wind/recent/stundenwerte_FF_15444_akt.zip"
## [6619] "./wind/recent/stundenwerte_FF_15520_akt.zip"
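The filtering can also be done back in R, which avoids most of the nested escaping. Below is a minimal sketch, not a tested replacement for the command above. It assumes lftp's -c option (run the given commands, then exit) and uses shQuote for the single remaining level of quoting. The helper names lftp_find_cmd and zip_urls are hypothetical, and the sample vector is made up so that the filtering part runs without lftp or network access.

```r
## hypothetical helper: build a shell command that runs lftp's find without a here-string
## assumption: lftp's -c option, which runs the given commands and exits
lftp_find_cmd <- function(base_url) {
    paste('lftp -c', shQuote(paste0('open ', base_url, '; find'), type = 'sh'))
}

## hypothetical helper: keep only the .zip entries of a recursive listing
## and turn the relative paths into full URLs
zip_urls <- function(base_url, paths) {
    ## keep entries ending in .zip (directories and other files are dropped)
    zips <- grep('\\.zip$', paths, value = TRUE)
    ## strip the leading "./" that find prepends to every entry
    zips <- sub('^\\./', '', zips)
    ## make sure the base URL ends in exactly one slash before pasting
    paste0(sub('/*$', '/', base_url), zips)
}

url <- 'ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/'

## with lftp installed and network access, the listing would be fetched like this:
## found <- system(lftp_find_cmd(url), intern = TRUE)

## made-up sample standing in for the captured find output
found <- c(
    './',
    './wind/',
    './wind/recent/',
    './wind/recent/stundenwerte_FF_15207_akt.zip',
    './wind/recent/some_description.txt'
)

zip_urls(url, found)
```

Because the grep now happens in R, the only shell quoting left is the single shQuote call around the lftp commands.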

Also, just for the heck of it, if you want another approach, I've written a function that parses the output of an ls -l listing using regular expressions and returns all fields in a data.frame. A simple modification allows it to work over FTP using lftp:

longListing <- function(url='',recursive=F,all=F) {
    ## returns a data.frame of long-listing fields
    ## requires lftp for ftp support

    ## validate arguments
    url <- as.character(url);
    if (length(url) != 1L) stop('url argument must have length 1.');
    recursive <- as.logical(recursive);
    if (length(recursive) != 1L) stop('recursive argument must have length 1.');
    all <- as.logical(all);
    if (length(all) != 1L) stop('all argument must have length 1.');

    ## shell-quote url, or leave empty for pwd if empty
    urlEsc <- if (url == '') '' else shQuote(url,type='sh');

    ## construct ls command with options; identical between local ls and lftp ls
    ## technically lftp ls doesn't require -l to get a long listing, but it accepts it
    lsCmd <- paste0('ls -l',if (recursive) ' -R',if (all) ' -A');

    ## run system command to get long-listing output lines
    if (substr(url,0L,6L) == 'ftp://') { ## ftp
        output <- system(paste0('lftp ',urlEsc,' <<<\'',lsCmd,'; exit;\';'),intern=T);
    } else { ## local
        output <- system(paste0(lsCmd,' ',urlEsc,';'),intern=T);
    }; ## end if

    ## define regexes for parsing the output
    ## note: accept question marks for items whose metadata cannot be read
    sp0RE <- '\\s*';
    sp1RE <- '\\s+';
    typeRE <- '([?dlcbps-])';
    rRE <- '([?r-])';
    wRE <- '([?w-])';
    xRE <- '([?xsStT-])';
    aclRE <- '([?+@]*)';
    permRE <- paste0(typeRE,rRE,wRE,xRE,rRE,wRE,xRE,rRE,wRE,xRE,aclRE);
    linksRE <- '(\\?|[0-9]+)';
    ocRE <- '[a-zA-Z_0-9.$+-]';
    ocsRE <- '[a-zA-Z_0-9 .$+-]'; ## badly-behaving names can have spaces; non-greedy will prevent excessive gobbling
    ownerRE <- paste0('(\\?|',ocRE,'|',ocRE,ocsRE,'*?',ocRE,')');
    groupRE <- ownerRE; ## same compatibility rules as owner
    sizeRE <- '(?:\\?|(?:([0-9]+),\\s*)?([0-9]+))'; ## major, minor for special files, plain size for rest
    monthRE <- '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)';
    dayRE <- '([0-9]+)';
    timeRE <- '([0-9]{2}:[0-9]{2}|[0-9]+)'; ## could be year
    dtRE <- paste0('(?:\\?|',monthRE,sp1RE,dayRE,sp1RE,timeRE,')');
    nameRE <- '(.*?)'; ## make non-greedy to allow target to be captured, if present
    targetRE <- '(?:\\s+->\\s+(.*))?'; ## target is optional; shown on some platforms, e.g. Cygwin
    recordRE <- paste0(
        '^'
        ,permRE,sp1RE
        ,linksRE,sp1RE
        ,ownerRE,sp1RE
        ,groupRE,sp1RE
        ,sizeRE,sp1RE
        ,dtRE,sp1RE
        ,nameRE,targetRE ## target is optional; targetRE defines its own whitespace separation
        ,sp0RE,'$' ## ignore trailing whitespace
    );

    ## get indexes of listing records
    recordIndexes <- grep(recordRE,output);

    ## get indexes of blanks and directory headers for maximally robust matching
    blankIndexes <- grep('^\\s*$',output);
    headerIndexes <- grep(':$',output); ## questionable specificity

    ## pare headers down to those with preceding blank
    headerIndexes <- headerIndexes[(headerIndexes-1)%in%c(0L,blankIndexes)]; ## include zero for possible first-line header

    ## match recordIndexes into headerIndexes to look up parent path; direct children will be zero
    recordHeaderIndexes <- findInterval(recordIndexes,headerIndexes);

    ## derive parent paths with trailing slash, or empty string for direct children
    parentPaths <- c('',sub(':$','/',output[headerIndexes]))[recordHeaderIndexes+1L]; ## replace only the trailing colon
    parentPaths <- sub('^\\./','',parentPaths); ## for aesthetics

    ## match record lines and extract capture groups
    reg <- regmatches(output[recordIndexes],regexec(recordRE,output[recordIndexes]));

    ## build data.frame with reg fields
    ret <- data.frame(type=sapply(reg,`[`,2L),stringsAsFactors=F); ## start with type to set the row count
    i <- 3L;
    ## note: size is actually minor for character- and block-special files
    for (cn in c('ur','uw','ux','gr','gw','gx','or','ow','ox','acl','links','owner','group','major','size','month','day','time','path','target')) {
        ret[[cn]] <- sapply(reg,`[`,i);
        i <- i+1L;
    }; ## end for

    ## prepend parent paths to listing paths
    ret$path <- paste0(parentPaths,ret$path);

    ret;

}; ## end longListing()
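The least obvious step in the function is how each record finds its parent directory in a recursive listing. Here is a small self-contained sketch of just that step. The listing lines are made up, and the record test is a deliberately simplified stand-in for recordRE (assumption: a record is any line starting with a type character followed by nine permission characters).

```r
## made-up output in the shape of a recursive long listing
output <- c(
    'drwxr-xr-x 2 user None 0 Feb 27 08:21 sub',
    '-rw-r--r-- 1 user None 0 Feb 27 08:21 top.txt',
    '',
    './sub:',
    '-rw-r--r-- 1 user None 0 Feb 27 08:21 inner.zip'
)

## simplified stand-in for recordRE
recordIndexes <- grep('^[dlcbps-][rwxsStT-]{9}', output)
blankIndexes <- grep('^\\s*$', output)
headerIndexes <- grep(':$', output)

## a header must be the first line or directly follow a blank line
headerIndexes <- headerIndexes[(headerIndexes - 1L) %in% c(0L, blankIndexes)]

## for every record, count the headers at or before it; zero means top level
recordHeaderIndexes <- findInterval(recordIndexes, headerIndexes)

## slot 1 of the lookup vector is the empty parent of top-level records
parentPaths <- c('', sub(':$', '/', output[headerIndexes]))[recordHeaderIndexes + 1L]
parentPaths <- sub('^\\./', '', parentPaths)

## take the last whitespace-separated field as the name
## (sketch only: this breaks on names containing spaces, which the full regex handles)
leaf <- sub('^.*\\s', '', output[recordIndexes])
paths <- paste0(parentPaths, leaf)
paths
```

findInterval does the lookup in one vectorized call: because the headers are in ascending line order, the count of headers preceding a record is also the index of the nearest one.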

Here's a demo of it on a directory of special files I created on my system:

longListing();
##    type ur uw ux gr gw gx or ow ox acl links owner group major size month day  time                      path            target
## 1     d  r  w  x  r  -  -  r  -  -   +     1  user  None          0   Feb  27 08:21                       dir
## 2     d  r  w  x  r  w  x  r  w  x   +     1  user  None          0   Feb  27 08:21        dir-other-writable
## 3     d  r  w  x  r  -  -  r  -  T   +     1  user  None          0   Feb  27 08:21                dir-sticky
## 4     d  r  w  x  r  w  x  r  w  t   +     1  user  None          0   Feb  27 08:21 dir-sticky-other-writable
## 5     -  r  w  -  r  -  -  r  -  -         2  user  None          0   Feb  27 08:21                      file
## 6     -  r  w  -  r  -  -  r  -  -         1  user  None          0   Feb  27 08:21          file-archive.tar
## 7     -  r  w  -  r  -  -  r  -  -         1  user  None          0   Feb  27 08:21            file-audio.mp3
## 8     b  r  w  -  r  w  -  r  w  -         1  user  None     0    1   Feb  27 08:21        file-block-special
## 9     c  r  w  -  r  w  -  r  w  -         1  user  None     0    1   Feb  27 08:21    file-character-special
## 10    -  r  w  x  r  w  x  r  w  x         1  user  None         12   Feb  27 08:21                  file-exe
## 11    p  r  w  -  r  w  -  r  w  -         1  user  None          0   Feb  27 08:21                 file-fifo
## 12    -  r  w  -  r  -  -  r  -  -         1  user  None          0   Feb  27 08:21            file-image.bmp
## 13    -  r  w  -  r  w  S  r  -  -         1  user  None          0   Feb  27 08:21               file-setgid
## 14    -  r  w  x  r  w  s  r  -  x         1  user  None          0   Feb  27 08:21           file-setgid-exe
## 15    -  r  w  S  r  w  -  r  -  -         1  user  None          0   Feb  27 08:21               file-setuid
## 16    -  r  w  s  r  w  x  r  -  x         1  user  None          0   Feb  27 08:21           file-setuid-exe
## 17    s  r  w  -  r  w  -  r  -  -         1  user  None          0   Feb  27 08:21               file-socket
## 18    l  r  w  x  r  w  x  r  w  x         1  user  None          4   Feb  27 08:21               ln-existing              file
## 19    -  r  w  -  r  -  -  r  -  -         2  user  None          0   Feb  27 08:21                   ln-hard
## 20    l  r  w  x  r  w  x  r  w  x         1  user  None         17   Feb  27 08:21           ln-non-existing file-non-existing

Demo on your site:

url <- 'ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/';
ll <- longListing(url,T,T);
ll;
##      type ur uw ux gr gw gx or ow ox acl links owner   group major    size month day  time                                                                                                  path target
## 1       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Jun   5  2014                                                                                       air_temperature
## 2       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Sep  25  2014                                                                                            cloudiness
## 3       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Nov  13  2014                                                                                         precipitation
## 4       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Nov  13  2014                                                                                              pressure
## 5       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Jun   5  2014                                                                                      soil_temperature
## 6       d  r  w  x  r  w  x  -  -  x         2 32230 ftp-dwd         12288   Dec  15 11:52                                                                                                 solar
## 7       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Jun   5  2014                                                                                                   sun
## 8       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Apr  17  2015                                                                                                  wind
## 9       d  r  w  x  r  w  x  -  -  x         2 32230 ftp-dwd        114688   Oct  15 12:35                                                                            air_temperature/historical
## 10      d  r  w  x  r  w  x  -  -  x         2 32230 ftp-dwd        151552   Dec   4 10:28                                                                                air_temperature/recent
## 11      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         68727   Jan  26 09:55                air_temperature/historical/BESCHREIBUNG_obsgermany_climate_hourly_tu_historical_de.pdf
## 12      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         68600   Jan  26 09:55                 air_temperature/historical/DESCRIPTION_obsgermany_climate_hourly_tu_historical_en.pdf
## 13      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd        123634   Mar  27  2015                                 air_temperature/historical/TU_Stundenwerte_Beschreibung_Stationen.txt
## 14      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd       2847045   Mar  27  2015                           air_temperature/historical/stundenwerte_TU_00003_19500401_20110331_hist.zip
## 15      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd        359517   Mar  27  2015                           air_temperature/historical/stundenwerte_TU_00044_20070401_20141231_hist.zip
##
## ... snip ...
##
## 6683    -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         65633   Feb  27 10:26                                                             wind/recent/stundenwerte_FF_15207_akt.zip
## 6684    -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         66910   Feb  27 10:21                                                             wind/recent/stundenwerte_FF_15214_akt.zip
## 6685    -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         64525   Feb  27 10:19                                                             wind/recent/stundenwerte_FF_15444_akt.zip
## 6686    -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         23717   Feb  27 10:21                                                             wind/recent/stundenwerte_FF_15520_akt.zip

You could extract just the zip file names easily:

zips <- ll$path[ll$type=='-' & grepl('\\.zip$',ll$path)];
length(zips);
## [1] 6619
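Since the original goal was a list of URLs, the relative paths still need the base URL prepended. Below is a minimal sketch that uses a three-row stand-in for ll (only the columns used here, with values in the shape of the listing above) so that it runs without lftp. With the real listing, the data.frame returned by longListing would be used directly. The size column is character after the regex extraction, hence the conversion before summing.

```r
url <- 'ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/'

## stand-in for the data.frame returned by longListing(url,T,T)
ll <- data.frame(
    type = c('d', '-', '-'),
    size = c('4096', '65633', '66910'),
    path = c('wind',
             'wind/recent/stundenwerte_FF_15207_akt.zip',
             'wind/recent/stundenwerte_FF_15214_akt.zip'),
    stringsAsFactors = FALSE
)

## regular files whose name ends in .zip
isZip <- ll$type == '-' & grepl('\\.zip$', ll$path)

## full URLs, ready to be passed to a download function
zipUrls <- paste0(url, ll$path[isZip])

## total size of the selected archives in bytes
totalBytes <- sum(as.numeric(ll$size[isZip]))

zipUrls
totalBytes
```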
