Using AWK/Grep/Bash to extract data from HTML


Problem description


I'm trying to write a Bash script to extract results from an HTML page. I managed to fetch the content of the page with curl, but the next step, parsing the output, is proving problematic.

The interesting content of the page looks like this:

<div class="result">
    ...
                <div class="item">
                    <div class="item_title">ITEM 1</div>
                </div>
                ...                                 
                <div class="item_desc">
                    ITEM DESCRIPTION 1
                </div>
...              
</div>
<div class="result">
    ...
                <div class="item">
                    <div class="item_title">ITEM 2</div>
                </div>
                ...                                 
                <div class="item_desc">
                    ITEM DESCRIPTION 2
                </div>
    ...              
</div>

I'd like to output something like:

ITEM1;ITEM DESCRIPTION 1
ITEM2;ITEM DESCRIPTION 2

I know a bit of grep, but I can't wrap my head around making it work here. Some people also told me to use awk, which seems best suited to this kind of task.

I'd appreciate any help.

Thank you very much.

Solution

A bare-bones program that handles the HTML loosely, with no validation, and that is easily confused by variations in the HTML, is:

sed.script

/ *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }
/ *<div class="item_desc">/,/<\/div>/ {
    /<div class="item_desc">/d
    /<\/div>/d
    s/^  *//
    G
    s/\(.*\)\n\(.*\)/\2;\1/p
}

The first line matches item title lines. The s/// command captures just the part between the <div …> and </div>; the h copies that into the hold space (memory).

The rest of the script matches lines between the item description <div> and its </div>. The first two lines delete (ignore) the <div> and </div> lines. The s/// removes leading spaces; the G appends the hold space to the pattern space after a newline; the s///p captures the part before the newline (the description) and the part after the newline (the title from the hold space), and replaces them with the title and description, separated by a semi-colon, and prints the result.
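If the hold-space dance is unfamiliar, it may help to see it in isolation. The following is only a toy sketch, not part of the script: the two input lines ("title" and "desc") are made-up stand-ins for a title line and a description line, and the variable name swapped is purely illustrative.

```shell
# Line 1 plays the role of the title: h copies it into the hold space.
# Line 2 plays the role of the description: G appends a newline plus the
# hold space, then s///p swaps the two halves around a semicolon and prints.
swapped=$(printf 'title\ndesc\n' |
    sed -n '1h; 2{G; s/\(.*\)\n\(.*\)/\2;\1/p;}')

printf '%s\n' "$swapped"
```

This prints title;desc, which is the same reordering the real script applies to each description line.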

Example

$ sed -n -f sed.script items.html
ITEM 1;ITEM DESCRIPTION 1
ITEM 2;ITEM DESCRIPTION 2
$

Note the -n; that means "don't print unless told to do so".

You can do it without a script file, but there's less to worry about if you use one. You can probably even squeeze it all onto one line if you're careful. Beware that the ; after the h is necessary with BSD sed and harmless but not crucial with GNU sed.
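As a rough sketch of the no-script-file variant, the commands can be passed with -e and separated by semicolons. The function name extract_items and the trimmed-down sample input are my own illustration, and the URL in the comment is a placeholder, not anything from the question.

```shell
# One-line form of sed.script. Each -e holds one complete { ... } block;
# the ";" before every closing brace is kept for the benefit of BSD sed.
extract_items() {
    sed -n \
        -e '/ *<div class="item_title">\(.*\)<\/div>/{s//\1/;h;}' \
        -e '/ *<div class="item_desc">/,/<\/div>/{/<div class="item_desc">/d;/<\/div>/d;s/^  *//;G;s/\(.*\)\n\(.*\)/\2;\1/p;}' \
        "$@"
}

# Hypothetical use on a live page (placeholder URL):
#   curl -s 'https://example.com/results' | extract_items

# Cut-down sample in the same shape as the question's HTML.
sample='<div class="result">
    <div class="item">
        <div class="item_title">ITEM 1</div>
    </div>
    <div class="item_desc">
        ITEM DESCRIPTION 1
    </div>
</div>'

printf '%s\n' "$sample" | extract_items
```

With that sample the function prints ITEM 1;ITEM DESCRIPTION 1. Because of the "$@", it reads standard input when called without arguments and reads files when given file names.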

Modification

There are all sorts of ways to make it more nearly bullet-proof (but it is debatable whether they're worthwhile). For example:

/ *<div class="item_title">\(.*\)<\/div>/

could be revised to:

/^[[:space:]]*<div class="item_title">[[:space:]]*\(.*\)[[:space:]]*<\/div>[[:space:]]*$/

to deal with arbitrary sequences of white space before, in the middle, and after the <div> components. Repeat ad nauseam for the other regexes. You could arrange to have single spaces between words. You could arrange for a multi-line description to be printed just once as a single line, rather than each line segment being printed separately as it would be now.
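One possible shape for those refinements is sketched below. It is an assumption-laden variant, not the original script: instead of a line range it gathers the whole description with an N loop, so it assumes the description contains no nested <div>, and the names extract_items_robust and messy are hypothetical.

```shell
extract_items_robust() {
    sed -n '
# Title: tolerate white space around the tags, then trim what .* captured.
/^[[:space:]]*<div class="item_title">[[:space:]]*\(.*\)<\/div>[[:space:]]*$/ {
    s//\1/
    s/[[:space:]]*$//
    h
}
# Description: pull in lines until the closing tag is in the pattern space.
/<div class="item_desc">/ {
    :collect
    /<\/div>/!{
        N
        b collect
    }
    # Drop the tags, then squeeze every run of white space (newlines
    # included) down to a single space and trim both ends.
    s/.*<div class="item_desc">//
    s/<\/div>.*//
    s/[[:space:]][[:space:]]*/ /g
    s/^ //
    s/ $//
    G
    s/\(.*\)\n\(.*\)/\2;\1/p
}
' "$@"
}

# Sample with padded title text and a two-line description.
messy='  <div class="item_title">  ITEM 1  </div>
  <div class="item_desc">
      First   line
      second line
  </div>'

printf '%s\n' "$messy" | extract_items_robust
```

For that input it prints ITEM 1;First line second line, that is, one output line per item with single spaces between words.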

You could also wrap the whole construct in the file inside:

/^<div class="result">$/,/^<\/div>$/ {
    …script as before…
}

And you could repeat that idea so that the item title is only picked inside <div class="item"> and </div>, etc.
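Spelled out in full, the wrapped script might look like the sketch below. It relies on the opening and closing tags of each result block starting in column 1, as they do in the question's sample. The function name extract_results_only and the "sidebar" block in the test input are made up for illustration.

```shell
extract_results_only() {
    sed -n '
# Only lines between <div class="result"> and an unindented </div> are examined.
/^<div class="result">$/,/^<\/div>$/ {
    / *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }
    / *<div class="item_desc">/,/<\/div>/ {
        /<div class="item_desc">/d
        /<\/div>/d
        s/^  *//
        G
        s/\(.*\)\n\(.*\)/\2;\1/p
    }
}
' "$@"
}

# The sidebar block has the same inner markup but is not a result block.
page='<div class="sidebar">
    <div class="item_title">NOT A RESULT</div>
    <div class="item_desc">
        ignored
    </div>
</div>
<div class="result">
    <div class="item">
        <div class="item_title">ITEM 1</div>
    </div>
    <div class="item_desc">
        ITEM DESCRIPTION 1
    </div>
</div>'

printf '%s\n' "$page" | extract_results_only
```

Only ITEM 1;ITEM DESCRIPTION 1 is printed; the sidebar item is skipped because it never falls inside the outer range.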
