Extract part of the code and parse HTML in bash


Problem description

I need to extract data from a table on an external HTML site. However, apart from the table itself, the page's HTML source is malformed, so I cannot use

xmllint --html --xpath <xpath> <file>

because it does not work properly when the HTML on the page is broken.

My idea was to fetch the page with curl and delete the code above and below the table. Once the table is extracted, the code is clean and suitable for xmllint (so I can then use XPath). However, deleting everything above a match is challenging in the shell, as you can see here: "Sed doesn't backtrack: once it's processed a line, it's done." Is there a way to extract only the table's code from the HTML page in bash? Suppose the code has this structure:

<html>
<head>
</head>
<body>
<p>Lorem ipsum ....</p>
  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
<p>... dolor.</p>
</body>
</html>

And I need output like this to parse the data properly:

  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
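
The "cut above and below" idea does not actually need sed to backtrack: a range address can print only the lines between two matches and drop everything else. The following is a minimal sketch under two assumptions that hold for the sample page but may not hold for a real site: the opening and closing table tags each sit on their own line, and the table contains no nested table. The function name extract_table is made up for illustration.

```shell
#!/usr/bin/env bash
# Print only the lines from the opening <table class="my-table"> tag
# through the first following line that contains </table>.
# Assumption: both tags are on separate lines and the table is not nested.
extract_table() {
  sed -n '/<table class="my-table"/,/<\/table>/p'
}

# Demo on the sample page from the question, fed in via a here-document.
extract_table <<'EOF'
<html>
<head>
</head>
<body>
<p>Lorem ipsum ....</p>
  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
<p>... dolor.</p>
</body>
</html>
EOF
```

With a live page, the same function could sit behind curl, for example curl -s "$url" | extract_table > table.html, where $url is a placeholder for the real address. The resulting table.html can then be handed to xmllint.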

Please do not downvote me for trying to use bash.

Recommended answer

I would use xmllint, which supports the --html flag for parsing HTML files.

First, you can check the sanity of your HTML file by parsing it as shown below, which either confirms that the file conforms to the standard or prints errors if any are found:

$ xmllint --html YourHTML.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
</head>
<body>
<p>Lorem ipsum ....</p>
  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
<p>... dolor.</p>
</body>
</html>

Here, my YourHTML.html file is simply the input HTML file from your question.

Now for the value-extraction part:

Parse the file from the root node down to the table node (//html/body/table), running xmllint with the HTML parser in interactive shell mode (xmllint --html --shell).

Running the command as-is produces this result:

$ echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html
/ >  -------
<table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
/ > 

Now remove the shell prompt lines with sed, i.e. sed '/^\/ >/d', which produces

$ echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html | sed '/^\/ >/d'
<table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>

which is the output structure you expected. Tested with xmllint: using libxml version 20900.
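
The "/ >" prompt lines appear only because xmllint is running in shell mode. As an alternative sketch, the --xpath option mentioned in the question prints the selected node directly, so no prompt cleanup is needed. The helper name table_by_class is hypothetical, and selecting by the class attribute rather than by the //html/body/table path is an assumption that the class is unique on the page.

```shell
#!/usr/bin/env bash
# Print the <table> whose class attribute equals the given value.
#   $1 = HTML file, $2 = class name (assumed to contain no quote characters)
# Parser complaints about the surrounding markup go to stderr and are discarded.
table_by_class() {
  xmllint --html --xpath "//table[@class='$2']" "$1" 2>/dev/null
}

# Usage, with YourHTML.html being the sample file from the question:
#   table_by_class YourHTML.html my-table
```

Whether this copes with a particular broken page depends on how well the HTML parser recovers from that page's errors, so it is worth comparing the result against the browser's view of the table.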

Going one step further, if you want to fetch the values inside the table tag, you can apply another sed command to extract them:

$ echo "cat //html/body/table" |  xmllint --html --shell YourHTML.html | sed '/^\/ >/d' | sed 's/<[^>]*.//g' | xargs
Company Contact
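
Piping through xargs joins everything onto one line, which loses the cell boundaries as soon as a value contains a space. A possible variation, sketched here under the assumption that every cell opens and closes on a single line and holds plain text with no nested tags, prints one cell per line instead. The function name table_cells is made up for illustration, and grep -o is a widely available extension rather than a POSIX feature.

```shell
#!/usr/bin/env bash
# Print the text of every <th> or <td> cell on its own line.
# Assumption: each cell starts and ends on one line and has no nested tags.
table_cells() {
  grep -o '<t[hd][^>]*>[^<]*</t[hd]>' | sed 's/<[^>]*>//g'
}

# Demo on the extracted table from above.
table_cells <<'EOF'
  <table class="my-table">
    <tr>
      <th>Company</th>
      <th>Contact</th>
    </tr>
  </table>
EOF
```

For the sample table this prints Company and Contact on two separate lines, and the output can be read in a while read -r loop for further processing.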
