How to strip out all of the links of an HTML file in Bash or grep or batch and store them in a text file

Views: 130 Posted: 2016/7/28 14:52:25 Tags: bash shell awk grep cut

Question

I have an HTML file with about 150 anchor tags. I need only the links from these tags. For example, from <a href="http://www.google.com">, I want to get only the http://www.google.com part.

When I run a grep,

cat website.htm | grep -E '<a href=".*">' > links.txt

this returns the entire line on which the match was found, not just the link I want, so I tried using a cut command:

cat drawspace.txt | grep -E '<a href=".*">' | cut -d’"’ --output-delimiter=$'\n' > links.txt

Except that this is wrong: it doesn't work and gives me an error about wrong parameters. So I assume that the file was supposed to be passed along too, maybe like cut -d’"’ --output-delimiter=$'\n' grepedText.txt > links.txt.

But I wanted to do this in one command if possible... So I tried doing an AWK command.

cat drawspace.txt | grep '<a href=".*">' | awk '{print $2}’

But this wouldn't run either. It was asking me for more input, because I wasn't finished....

I tried writing a batch file, and it told me FINDSTR is not an internal or external command. So I assume my environment variables were messed up, and rather than fix that I tried installing grep on Windows, but that gave me the same error.

The question is, what is the right way to strip out the HTTP links from HTML? With that I will make it work for my situation.

P.S. I've read so many links and Stack Overflow posts that listing my references would take too long. If example HTML is needed to show the complexity of the process, I will add it.

I have both a Mac and a PC and switch back and forth between them, using their shell, batch, grep, and terminal commands, so a solution for either one will help me.

I also want to point out that I'm in the correct directory.

HTML:

<tr valign="top">
    <td class="beginner">
      B03&nbsp;&nbsp;
    </td>
    <td>
        <a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a>  </td>
</tr>

<tr valign="top">
  <td class="beginner">
    B04&nbsp;&nbsp;
  </td>
  <td>
      <a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>

<tr valign="top">
    <td class="beginner">
      B05&nbsp;&nbsp;
    </td>
    <td>
      <a href="http://www.drawspace.com/lessons/b05/blind-contour-drawing">Blind Contour Drawing</a> </td>
</tr>

<tr valign="top">
    <td class="beginner">
        B06&nbsp;&nbsp;
    </td>
    <td>
      <a href="http://www.drawspace.com/lessons/b06/seeing-values">Seeing Values</a> </td>
</tr>

Expected output:

http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
etc.

Solution

$ sed -n 's/.*href="\([^"]*\).*/\1/p' file
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
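
Since the question also asks for the links to end up in a text file, here is a minimal sketch of how that could look. It is not part of the original answer. The file names sample.htm and links_grep.txt are made up for illustration, and the sample input is a shortened copy of the HTML from the question. The second command is an alternative using grep -o, which may help if a line ever holds more than one link, because the greedy .* in the sed expression keeps only the last href on each line.

```shell
# Sample input: a shortened copy of the HTML from the question
# (sample.htm is a hypothetical file name).
cat > sample.htm <<'EOF'
<td>
    <a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a>  </td>
<td>
    <a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
EOF

# The sed command from the answer, with its output redirected to a file.
sed -n 's/.*href="\([^"]*\).*/\1/p' sample.htm > links.txt

# Alternative sketch: grep -o prints every match on its own line, and
# cut keeps the second double-quote-delimited field, which is the URL.
grep -o 'href="[^"]*"' sample.htm | cut -d'"' -f2 > links_grep.txt
```

Both commands assume the href values are wrapped in double quotes, as they are in the sample HTML. For HTML that is less regular, a real HTML parser is likely the safer choice.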
