How to strip out all of the links of an HTML file in Bash or grep or batch and store them in a text file


Problem description

I have an HTML file with about 150 anchor tags. I need only the links from these tags. For example, from <a href="http://www.google.com"></a> I want to get only the http://www.google.com part.

When I run grep,

cat website.htm | grep -E '<a href=".*">' > links.txt

it returns the entire line on which the match was found, not just the link I want, so I tried using a cut command:

cat drawspace.txt | grep -E '<a href=".*">' | cut -d’"’ --output-delimiter=$'\n' > links.txt

Except that this is wrong: it doesn't work and gives me an error about wrong parameters... So I assume the file was supposed to be passed along too, maybe like cut -d’"’ --output-delimiter=$'\n' grepedText.txt > links.txt.

But I wanted to do this in one command if possible... So I tried an AWK command.

cat drawspace.txt | grep '<a href=".*">' | awk '{print $2}’

But this wouldn't run either. The shell kept asking me for more input, as if the command wasn't finished....

I tried writing a batch file, and it told me FINDSTR is not an internal or external command... So I assume my environment variables are messed up. Rather than fix that, I tried installing grep on Windows, but that gave me the same error....

The question is: what is the right way to strip out the HTTP links from HTML? With that, I can adapt it to my situation.

P.S. I've read so many links and Stack Overflow posts that listing my references would take too long.... If example HTML is needed to show the complexity of the process, I will add it.

I also have both a Mac and a PC, and I switch back and forth between them to use their shell/batch/grep/terminal commands, so a solution for either one will help me.

I also want to point out that I'm in the correct directory.

HTML:

<tr valign="top">
    <td class="beginner">
      B03&nbsp;&nbsp;
    </td>
    <td>
        <a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a>  </td>
</tr>

<tr valign="top">
  <td class="beginner">
    B04&nbsp;&nbsp;
  </td>
  <td>
      <a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> </td>
</tr>

<tr valign="top">
    <td class="beginner">
      B05&nbsp;&nbsp;
    </td>
    <td>
      <a href="http://www.drawspace.com/lessons/b05/blind-contour-drawing">Blind Contour Drawing</a> </td>
</tr>

<tr valign="top">
    <td class="beginner">
        B06&nbsp;&nbsp;
    </td>
    <td>
      <a href="http://www.drawspace.com/lessons/b06/seeing-values">Seeing Values</a> </td>
</tr>

Expected output:

http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
etc.

Recommended answer

$ sed -n 's/.*href="\([^"]*\).*/\1/p' file
http://www.drawspace.com/lessons/b03/simple-symmetry
http://www.drawspace.com/lessons/b04/faces-and-a-vase
http://www.drawspace.com/lessons/b05/blind-contour-drawing
http://www.drawspace.com/lessons/b06/seeing-values
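The sed command above prints at most one link per input line, because the leading .* is greedy and only the last href on a line is captured. If a line might hold several anchors, one possible alternative is to combine grep -o with cut. The following is only a sketch under assumptions: the file name sample.html and its contents are made up for illustration, and it relies on every href value being wrapped in double quotes, as in the HTML shown in the question.

```shell
# Hypothetical sample input (sample.html is an assumed name, not from the question).
cat > sample.html <<'EOF'
<td><a href="http://www.drawspace.com/lessons/b03/simple-symmetry">Simple Symmetry</a> </td>
<td><a href="http://www.drawspace.com/lessons/b04/faces-and-a-vase">Faces and a Vase</a> <a href="http://www.drawspace.com/lessons/b05/blind-contour-drawing">Blind Contour Drawing</a></td>
EOF

# grep -o prints every match on its own line, so two links on one line both survive.
# cut splits each match on the double quote and keeps field 2, which is the URL.
grep -o 'href="[^"]*"' sample.html | cut -d'"' -f2 > links.txt

# Show the collected links.
cat links.txt
```

With this sample, links.txt ends up with three lines, one per anchor. Note the plain ASCII quotes: curly quotes such as ’ are not recognized by the shell as quoting characters, which would explain both the cut error and the awk command waiting for more input. For HTML that uses single-quoted or unquoted attributes, a real HTML parser is the safer choice.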
