Recursively download zip files from webpage (Windows)
Question
Is it possible to download all zip files from a webpage without specifying the individual links one at a time?
I would like to download all monthly account zip files from http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html.
I am using Windows 8.1 and R 3.1.1. I do not have wget on the PC, so I cannot use a recursive wget call.
Alternative:
As a workaround I have tried downloading the webpage text itself. I would then like to extract the name of each zip file, which I can then pass to download.file
in a loop. However, I am struggling to extract the names.
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
temp <- tempfile()
download.file(pth, temp)   # save the page source to a temporary file
dat <- readLines(temp)     # read it in, one line per element
unlink(temp)               # delete the temporary file
g <- dat[grepl("accounts_monthly", tolower(dat))]   # keep only lines mentioning the zips
g
contains character strings with the file names, embedded in other HTML markup:
g
[1] " <li><a href=\"Accounts_Monthly_Data-September2013.zip\">Accounts_Monthly_Data-September2013.zip (775Mb)</a></li>"
[2] " <li><a href=\"Accounts_Monthly_Data-October2013.zip\">Accounts_Monthly_Data-October2013.zip (622Mb)</a></li>"
I would like to extract the names of the files (Accounts_Monthly_Data-September2013.zip
and so on), but my regex is quite terrible (see for yourself):
gsub(".*\\>(\\w+\\.zip)\\s+", "\\1", g)
data
g <- c(" <li><a href=\"Accounts_Monthly_Data-September2013.zip\">Accounts_Monthly_Data-September2013.zip (775Mb)</a></li>",
" <li><a href=\"Accounts_Monthly_Data-October2013.zip\">Accounts_Monthly_Data-October2013.zip (622Mb)</a></li>"
)
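As a sketch, one base-R pattern that does extract the names from these lines (assuming, as in the sample data, that the file names contain no embedded quotes) uses regmatches() with regexpr() to pull out the match itself, rather than rewriting the whole line with gsub():

```r
g <- c(" <li><a href=\"Accounts_Monthly_Data-September2013.zip\">Accounts_Monthly_Data-September2013.zip (775Mb)</a></li>",
       " <li><a href=\"Accounts_Monthly_Data-October2013.zip\">Accounts_Monthly_Data-October2013.zip (622Mb)</a></li>")

# Match the href value: the text after href=" up to (and including) ".zip"
m <- regmatches(g, regexpr('(?<=href=")[^"]+\\.zip', g, perl = TRUE))
m
# [1] "Accounts_Monthly_Data-September2013.zip" "Accounts_Monthly_Data-October2013.zip"
```

The lookbehind `(?<=href=")` anchors the match without being part of it, and `[^"]+` cannot run past the closing quote of the attribute.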
Use the XML
package:
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
library(XML)
doc <- htmlParse(pth)
myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles)
mapply(download.file, url = fileURLS, destfile = myfiles)
"//a[contains(text(),'Accounts_Monthly_Data')]"
is an XPath expression. It instructs the XML package to select all anchor (a
) nodes whose text contains "Accounts_Monthly_Data". The result is a list of nodes. The fun = xmlAttrs
argument then tells the XML package to pass each of these nodes to the xmlAttrs
function, which strips the attributes from an XML node. In this case each anchor has only one attribute, href
, which is what we are looking for.
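A minimal offline sketch of that step, run against an inline snippet of the page rather than the live URL (so the shape of the result can be seen without downloading anything; requires the XML package):

```r
library(XML)

snippet <- '<ul>
<li><a href="Accounts_Monthly_Data-September2013.zip">Accounts_Monthly_Data-September2013.zip (775Mb)</a></li>
<li><a href="Accounts_Monthly_Data-October2013.zip">Accounts_Monthly_Data-October2013.zip (622Mb)</a></li>
</ul>'

doc <- htmlParse(snippet, asText = TRUE)

# Each selected anchor node is passed to xmlAttrs, yielding its attribute vector
myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs]
unlist(myfiles)   # the two href values, i.e. the zip file names
```

With the live page, these names are then joined onto the site's base URL with file.path() and handed to download.file(), exactly as in the answer's code.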