Windows Batch / parse data from HTML web page
Question
Is it possible to parse data from an HTML web page using Windows batch?
Let's say I have a web page, www.domain.com/data/page/1, with this page source HTML:
...
<div><a href="/post/view/664654"> ....
....
In this case I would need to get /post/view/664654 from the web page.
My idea is to loop through www.domain.com/data/page/1 ... # (up to some given number) and extract all the /post/view links. That would give me a list of links, and from each of those pages I would then extract the href values (either images or videos).
So far I have only succeeded in downloading an image or video with wget when I know the exact link. I don't know how (or whether it is possible at all) to parse the HTML data.
Edit
<body>
<nav>
<section>links I dont need</section>
</nav>
<article>
<section>links I need</section>
</article>
Recommended answer
It's better to parse structured markup as a hierarchical object than to scrape it as flat text. That way you aren't so dependent on the formatting of the data you're parsing (whether it's minified, whether the spacing has changed, and so on).
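To see why flat-text scraping is fragile, consider a regular expression tuned to the exact markup shown in the question. This is only an illustrative sketch: the pattern, the `extract` helper, and the second sample string are assumptions, not taken from a real page.

```javascript
// A pattern written against the exact formatting shown in the question.
var pattern = /<a href="(\/post\/view\/\d+)">/;

// The markup as it appears in the question.
var original = '<div><a href="/post/view/664654">';

// The same link after a harmless markup change
// (an extra attribute and single quotes instead of double quotes).
var changed = "<div><a class='thumb' href='/post/view/664654'>";

// Hypothetical helper: return the captured path, or null if nothing matched.
function extract(html) {
    var m = pattern.exec(html);
    return m ? m[1] : null;
}
```

Here `extract(original)` returns the path, while `extract(changed)` returns `null`, even though a browser treats both links the same. Reading the `href` property from a parsed DOM does not care about attribute order or quoting style.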
The batch language isn't well suited to parsing markup languages like HTML, XML, or JSON. In such cases it helps to use a hybrid script and borrow JScript or PowerShell methods to scrape the data you need. Here's an example of a batch + JScript hybrid script. Save it with a .bat extension and run it.
@if (@CodeSection == @Batch) @then

@echo off & setlocal

set "url=http://www.domain.com/data/page/1"

for /f "delims=" %%I in ('cscript /nologo /e:JScript "%~f0" "%url%"') do (
    rem // do something useful with %%I
    echo Link found: %%I
)

goto :EOF
@end // end batch / begin JScript hybrid code

// returns a DOM root object
function fetch(url) {
    var XHR = WSH.CreateObject("Microsoft.XMLHTTP"),
        DOM = WSH.CreateObject('htmlfile');

    XHR.open("GET",url,true);
    XHR.setRequestHeader('User-Agent','XMLHTTP/1.0');
    XHR.send('');
    while (XHR.readyState!=4) {WSH.Sleep(25)};
    DOM.write('<meta http-equiv="x-ua-compatible" content="IE=9" />');
    DOM.write(XHR.responseText);
    return DOM;
}

var DOM = fetch(WSH.Arguments(0)),
    links = DOM.getElementsByTagName('a');

for (var i in links)
    if (links[i].href && /\/post\/view\//i.test(links[i].href))
        WSH.Echo(links[i].href);
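
The question also asks about walking pages 1 through some given number and, in its edit, about keeping only the links inside `<article>`. A minimal sketch of those two pieces follows, written as plain functions that could be pasted into the JScript half of a script like the one above. The names `pageUrls` and `isInside` are hypothetical and not part of the original answer, and the ancestor check assumes the `htmlfile` document exposes `<article>` as an ordinary element with `parentNode` and `nodeName`.

```javascript
// Hypothetical helper: build the list of page URLs 1..lastPage.
// base is assumed to end with a slash, e.g. "http://www.domain.com/data/page/".
function pageUrls(base, lastPage) {
    var urls = [];
    for (var n = 1; n <= lastPage; n++) {
        urls.push(base + n);
    }
    return urls;
}

// Hypothetical helper: true if node has an ancestor element named tagName.
// Walks up the parentNode chain until it runs out of ancestors.
function isInside(node, tagName) {
    var wanted = tagName.toUpperCase();
    for (var p = node.parentNode; p; p = p.parentNode) {
        if (p.nodeName && p.nodeName.toUpperCase() === wanted) {
            return true;
        }
    }
    return false;
}
```

With these in place, the link test could become `links[i].href && isInside(links[i], 'article') && /\/post\/view\//i.test(links[i].href)`, and the script could call `fetch` once for each entry of `pageUrls(...)` instead of once for a single URL.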