How to strip out all of the links of an HTML file in Bash or grep or batch and store them in a text file

Question

I have an HTML file with about 150 anchor tags. I need only the links from these tags, i.e. the href values: for a tag like <a href="http://www.google.com">, I want to get only the http://www.google.com part.

When I run a grep,

cat website.htm | grep -E '<a href=".*">' > links.txt

this returns the entire matching line, not just the link I want, so I tried using a cut command:

cat drawspace.txt | grep -E '<a href=".*">' | cut -d’"’ --output-delimiter=$'\n' > links.txt

Except that it is wrong: it doesn't work and gives me an error about wrong parameters... So I assumed the file was supposed to be passed along too, maybe like cut -d’"’ --output-delimiter=$'\n' grepedText.txt > links.txt.

But I wanted to do this in one command if possible... So I tried doing an AWK command.

cat drawspace.txt | grep '<a href=".*">' | awk '{print $2}’

But this wouldn't run either. It was asking me for more input, because I wasn't finished....

I tried writing a batch file, and it told me FINDSTR is not an internal or external command... So I assumed my environment variables were messed up, and rather than fix that I tried installing grep on Windows, but that gave me the same error....

The question is, what is the right way to strip out the HTTP links from HTML? With that I will make it work for my situation.

P.S. I've read so many links/Stack Overflow posts that showing my references would take too long.... If example HTML is needed to show the complexity of the process then I will add it.

I have both a Mac and a PC and switch back and forth between them, using their shell/batch/grep/terminal commands, so a solution for either one will help me.

I also want to point out that I'm in the correct directory.

HTML:

<tr valign="top">
    <td class="beginner">
      B03&nbsp;&nbsp;
    </td>
    <td>
        <a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a>  </td>
</tr>

<tr valign="top">
  <td class="beginner">
    B04&nbsp;&nbsp;
  </td>
  <td>
      <a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>

<tr valign="top">
    <td class="beginner">
      B05&nbsp;&nbsp;
    </td>
    <td>
      <a href="http://www.drawspace.com/lessons/b05/blind-contour-drawing">Blind Contour Drawing</a> </td>
</tr>

<tr valign="top">
    <td class="beginner">
        B06&nbsp;&nbsp;
    </td>
    <td>
      <a href="http://www.drawspace.com/lessons/b06/seeing-values">Seeing Values</a> </td>
</tr>

Expected output:

http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
etc.

Solution

$ sed -n 's/.*href="\([^"]*\).*/\1/p' file
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
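
The question also asks for the links to end up in a text file, and its own attempts used grep and cut. As a hedged alternative sketch (not part of the answer above), grep -o can print only the matched href="..." text and cut can then keep the part between the double quotes. The file name sample.html and its two lines are made up for illustration; links.txt is the output name from the question. This assumes every href value is wrapped in double quotes, as in the sample HTML.

```shell
# Hypothetical sample input; in practice, point grep at the real HTML file.
cat > sample.html <<'EOF'
<td><a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a> </td>
<td><a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
EOF

# grep -o prints only the matching text (href="..."), one match per output
# line, so several links on the same input line are each captured.
# cut splits each match on the double quote and keeps field 2, the URL.
grep -o 'href="[^"]*"' sample.html | cut -d'"' -f2 > links.txt

# Show the result.
cat links.txt
```

Redirecting the sed command's output with > links.txt would store its results the same way; the grep -o variant mainly differs in that it does not stop at one link per line.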
