Spider a Website and Return URLs Only


Problem Description

I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through grep, I can't seem to find the right magic to make it work:

wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'

The grep filter seems to have absolutely no effect on the wget output. Have I got something wrong, or is there another tool I should try that's more geared towards providing this kind of limited result set?

Update

So I just found out, offline, that wget writes to stderr by default. I missed that in the man pages (in fact, I still haven't found it there). Once I redirected stderr to stdout, I got closer to what I need:

wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:'

I'd still be interested in other/better means for doing this kind of thing, if any exist.
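
A quick way to confirm where wget sends its log output, assuming a Bourne-style shell, is to silence one stream at a time and see which command goes quiet:

# wget's progress/log lines go to stderr, so discarding stderr hides them:
wget --spider http://somesite.com 2>/dev/null
# Discarding stdout changes nothing, because wget writes nothing there:
wget --spider http://somesite.com 1>/dev/null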

Solution

The absolute last thing I want to do is download and parse all of the content myself (i.e. create my own spider). Once I learned that Wget writes to stderr by default, I was able to redirect it to stdout and filter the output appropriately.

wget --spider --force-html -r -l2 $url 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
  > urls.m3u

This gives me a list of the URIs of the content resources (those that aren't images, CSS, or JS source files) that were spidered. From there, I can send the URIs off to a third-party tool for processing to meet my needs.
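
As a minimal sketch of that hand-off, assuming a hypothetical downstream command named process-uri, the list can be consumed one line at a time from the shell:

# "process-uri" is a placeholder; substitute whatever tool actually consumes the list.
while read -r uri; do
  process-uri "$uri"
done < urls.m3u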

The output still needs to be streamlined slightly (it produces duplicates when run as shown above), but it's almost there, and I haven't had to do any parsing myself.
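
If the duplicates are a nuisance, one way to thin them out, assuming the order of the list doesn't matter, is to add sort -u as a final stage of the same pipeline before writing the file:

wget --spider --force-html -r -l2 $url 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
  | sort -u \
  > urls.m3u

If the first-seen order matters, awk '!seen[$0]++' can be used in place of sort -u to drop repeats while keeping the original ordering.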

