Recursively download zip files from webpage (Windows)


Question


Is it possible to download all zip files from a webpage without specifying the individual links one at a time?

I would like to download all monthly account zip files from http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html.

I am using Windows 8.1 with R 3.1.1. I do not have wget on the PC, so I can't use a recursive wget call.

Alternative: As a workaround I have tried downloading the webpage text itself. I would then like to extract the name of each zip file, which I can then pass to download.file in a loop. However, I am struggling to extract the names.

pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"

temp <- tempfile()
download.file(pth,temp)
dat <- readLines(temp)
unlink(temp)

g <- dat[grepl("accounts_monthly", tolower(dat))]

g contains character strings with the file names, amongst other characters.

g
 [1] "                    <li><a href=\"Accounts_Monthly_Data-September2013.zip\">Accounts_Monthly_Data-September2013.zip  (775Mb)</a></li>"
 [2] "                    <li><a href=\"Accounts_Monthly_Data-October2013.zip\">Accounts_Monthly_Data-October2013.zip  (622Mb)</a></li>" 

I would like to extract the name of the files Accounts_Monthly_Data-September2013.zip and so on, but my regex is quite terrible (see for yourself)

    gsub(".*\\>(\\w+\\.zip)\\s+", "\\1", g)

data

g <- c("                    <li><a href=\"Accounts_Monthly_Data-September2013.zip\">Accounts_Monthly_Data-September2013.zip  (775Mb)</a></li>", 
"                    <li><a href=\"Accounts_Monthly_Data-October2013.zip\">Accounts_Monthly_Data-October2013.zip  (622Mb)</a></li>"
)
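As a side note (not part of the original question), one workable base-R alternative to the regex above is to capture whatever sits inside the quoted href attribute, since each line contains exactly one href. A minimal sketch using the sample g:

```r
# Sample data from the question
g <- c("                    <li><a href=\"Accounts_Monthly_Data-September2013.zip\">Accounts_Monthly_Data-September2013.zip  (775Mb)</a></li>",
       "                    <li><a href=\"Accounts_Monthly_Data-October2013.zip\">Accounts_Monthly_Data-October2013.zip  (622Mb)</a></li>")

# Capture the text between href=" and the next double quote
files <- sub('.*href="([^"]+)".*', "\\1", g)
files
# [1] "Accounts_Monthly_Data-September2013.zip" "Accounts_Monthly_Data-October2013.zip"
```

The `[^"]+` class stops the capture at the closing quote, which avoids the greedy-matching problem in the original attempt.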

Solution

Use the XML package:

pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
library(XML)
doc <- htmlParse(pth)
myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles)
mapply(download.file, url = fileURLS, destfile = myfiles)

"//a[contains(text(),'Accounts_Monthly_Data')]" is an XPATH expression. It instructs the XML package to select all nodes that are anchors( a ) containing text "Accounts_Monthly_Data". This results is a list of nodes. The fun = xmlAttrs argument then tells the XML package to pass these nodes to the xmlAttrs function. This function strips the attributes from xml nodes. The anchor only have one attribute in this case the href which is what we are looking for.
