Windows Batch / parse data from html web page

Question

Is it possible to parse data from an HTML web page using Windows batch?

Let's say I have a web page, www.domain.com/data/page/1, with this page source:

...
<div><a href="/post/view/664654"> ....
....

In this case I would need to get /post/view/664654 from the web page.

My idea is to loop through www.domain.com/data/page/1 ... # (up to some given number) and extract all the /post/view links. That would give me a list of links, and from each of those pages I would extract the href values (either images or videos).

So far I have only managed to download an image or video with wget when I already know the exact link. I don't know how (or whether it is possible at all) to parse the HTML data.

Edit

<body>
<nav>
    <section>links I dont need</section>
</nav>
<article>
    <section>links I need</section>
</article>

Recommended answer

It's better to parse structured markup as a hierarchical object than to scrape it as flat text. That way you aren't so dependent on the formatting of the data you're parsing (whether it's minified, whether the spacing has changed, and so on).

The batch language isn't well suited to parsing markup languages such as HTML, XML, or JSON. In such cases it can be extremely helpful to use a hybrid script and borrow JScript or PowerShell methods to scrape the data you need. Here's an example of a batch + JScript hybrid script. Save it with a .bat extension and give it a run.

@if (@CodeSection == @Batch) @then
@echo off & setlocal

set "url=http://www.domain.com/data/page/1"

for /f "delims=" %%I in ('cscript /nologo /e:JScript "%~f0" "%url%"') do (
    rem // do something useful with %%I
    echo Link found: %%I
)

goto :EOF
@end // end batch / begin JScript hybrid code

// returns a DOM root object
function fetch(url) {
    var XHR = WSH.CreateObject("Microsoft.XMLHTTP"),
        DOM = WSH.CreateObject('htmlfile');

    XHR.open("GET",url,true);
    XHR.setRequestHeader('User-Agent','XMLHTTP/1.0');
    XHR.send('');
    while (XHR.readyState!=4) {WSH.Sleep(25)};
    DOM.write('<meta http-equiv="x-ua-compatible" content="IE=9" />');
    DOM.write(XHR.responseText);
    return DOM;
}

var DOM = fetch(WSH.Arguments(0)),
    links = DOM.getElementsByTagName('a');

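// Index-based loop: for...in over a DOM collection also visits
// non-element keys such as "length".
for (var i = 0; i < links.length; i++)
    if (links[i].href && /\/post\/view\//i.test(links[i].href))
        WSH.Echo(links[i].href);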
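
The edit to the question asks for only the links inside <article>, not those in <nav>. Because the page is held as a DOM tree, one way to handle that is to look up the container element first and then search for anchors only beneath it. The following is a minimal sketch rather than part of the original answer: collectLinks is a hypothetical helper name, and it assumes the htmlfile document exposes <article> as a real element, which depends on the IE=9 document mode taking effect and was not verified here.

```javascript
// Hypothetical helper (not part of the original script).
// Returns the href of every <a> found inside any <containerTag> element
// under `root` whose href matches `pattern`.
// `pattern` should not carry the "g" flag, because test() would then
// keep state between calls.
function collectLinks(root, containerTag, pattern) {
    var found = [],
        containers = root.getElementsByTagName(containerTag);

    for (var c = 0; c < containers.length; c++) {
        // Search only beneath this container, so links in <nav> are never seen.
        var anchors = containers[c].getElementsByTagName('a');
        for (var i = 0; i < anchors.length; i++) {
            var href = anchors[i].href;
            if (href && pattern.test(href)) found.push(href);
        }
    }
    return found;
}
```

In the hybrid script, the final loop could then be swapped for something like `var hits = collectLinks(DOM, 'article', /\/post\/view\//i);` followed by a loop that passes each entry to WSH.Echo. The helper is written in plain ES3-style JavaScript so that it should run unchanged under cscript's JScript engine.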
