Extract part of the code and parse HTML in bash
Problem description
I have an external HTML site and I need to extract data from a table on that site. However, the source of the page is badly formatted everywhere except in the table itself, so I cannot use
xmllint --html --xpath <xpath> <file>
because it does not work properly when the HTML formatting of the page is broken.
My idea was to use curl and delete the code above and below the table. Once the table is extracted, the code is clean and suitable for the xmllint tool (I can then use XPath). However, deleting everything above a match is challenging in the shell, as you can see here: sed doesn't backtrack; once it has processed a line, it's done. Is there a way to extract only the table's code from the HTML page in bash? Suppose the code has this structure:
<html>
<head>
</head>
<body>
<p>Lorem ipsum ....</p>
<table class="my-table">
<tr>
<th>Company</th>
<th>Contact</th>
</tr>
</table>
<p>... dolor.</p>
</body>
</html>
And I need output like this to parse the data properly:
<table class="my-table">
<tr>
<th>Company</th>
<th>Contact</th>
</tr>
</table>
Please do not downvote me for trying to use bash.
Recommended answer
I would use xmllint, which supports an --html flag for parsing HTML files.
First, you can sanity-check your HTML file by parsing it as shown below, which either confirms that the file conforms to the standard or prints errors if any are found:
$ xmllint --html YourHTML.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
</head>
<body>
<p>Lorem ipsum ....</p>
<table class="my-table">
<tr>
<th>Company</th>
<th>Contact</th>
</tr>
</table>
<p>... dolor.</p>
</body>
</html>
Here, my YourHTML.html file is simply the input HTML from your question.
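If you only want to know whether the parser complains, without reading the whole re-serialized page, a minimal sketch along these lines may help. It assumes xmllint's --noout flag (which suppresses the document output) and that parser diagnostics go to stderr; the temporary file names and the status variable are made up for illustration, and the sample page is just the HTML from the question.

```shell
#!/usr/bin/env bash
# Hypothetical sample page; substitute the file you saved with curl.
page=$(mktemp)
cat > "$page" <<'EOF'
<html>
<head>
</head>
<body>
<p>Lorem ipsum ....</p>
<table class="my-table">
<tr>
<th>Company</th>
<th>Contact</th>
</tr>
</table>
<p>... dolor.</p>
</body>
</html>
EOF

log=$(mktemp)
if command -v xmllint >/dev/null 2>&1; then
    # --noout drops the parsed document, so only the parser's
    # diagnostics (written to stderr) are collected in the log.
    xmllint --html --noout "$page" 2>"$log" || true
    if [ -s "$log" ]; then
        status="parser reported problems"
    else
        status="no parser messages"
    fi
else
    status="xmllint not installed"
fi
echo "$status"
rm -f "$page" "$log"
```

A non-empty log does not necessarily mean the table cannot be extracted, since the HTML parser tries to recover from errors; it only tells you that the page is not clean.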
Now for the value extraction part:
Start parsing the file from the root node down to the table node (//html/body/table), running xmllint with the HTML parser in interactive shell mode (xmllint --html --shell).
Running the command as-is produces this result:
$ echo "cat //html/body/table" | xmllint --html --shell YourHTML.html
/ > -------
<table class="my-table">
<tr>
<th>Company</th>
<th>Contact</th>
</tr>
</table>
/ >
Now remove the xmllint shell prompt lines (the ones starting with "/ >") using sed, i.e. sed '/^\/ >/d', which produces:
$ echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d'
<table class="my-table">
<tr>
<th>Company</th>
<th>Contact</th>
</tr>
</table>
This is the output structure you expected. Tested with xmllint: using libxml version 20900.
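If the rest of the page is broken badly enough that xmllint cannot get as far as the table, the idea from the question (cutting away everything above and below the table) can still be done with a sed address range, which needs no backtracking. This is only a sketch: it assumes the opening <table class="my-table"> tag and the closing </table> tag each sit on their own line and that the table is not nested, as in the sample page. The file and variable names are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical sample page recreated from the question.
page=$(mktemp)
cat > "$page" <<'EOF'
<html>
<head>
</head>
<body>
<p>Lorem ipsum ....</p>
<table class="my-table">
<tr>
<th>Company</th>
<th>Contact</th>
</tr>
</table>
<p>... dolor.</p>
</body>
</html>
EOF

# -n suppresses default printing; the range /start/,/end/ with p prints
# only the lines from the opening table tag through the next </table>.
table=$(sed -n '/<table class="my-table">/,/<\/table>/p' "$page")
printf '%s\n' "$table"
rm -f "$page"
```

The extracted fragment is clean markup on its own, so it can then be handed to xmllint for XPath queries.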
Going one step further: if you want to fetch the values inside the table tag, you can add another sed command that strips the tags:
$ echo "cat //html/body/table" | xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | xargs
Company Contact
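If you would rather have one value per line (for example, to loop over them), a variant like the following might be more convenient than joining everything with xargs. It is a sketch that assumes each <th> or <td> cell occupies a single line, as in the sample table; the table and cells variable names are invented for the example.

```shell
#!/usr/bin/env bash
# The extracted table fragment from above, inlined so the example is self-contained.
table=$(cat <<'EOF'
<table class="my-table">
<tr>
<th>Company</th>
<th>Contact</th>
</tr>
</table>
EOF
)

# For lines consisting of exactly one <th>...</th> or <td>...</td> cell,
# print only the text between the tags; every other line is dropped.
cells=$(printf '%s\n' "$table" |
    sed -n 's/^[[:space:]]*<t[hd][^>]*>\(.*\)<\/t[hd]>[[:space:]]*$/\1/p')
printf '%s\n' "$cells"
```

Cells that span several lines or contain nested markup would need a real parser (for instance the xmllint approach above) rather than a line-based sed expression.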