Spider a Website and Return URLs Only

Published: 2022/1/6 13:21:44. Tags: grep, uri, wget, web-crawler


Problem description

I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, just a simple list of URIs. I can get reasonably close with Wget's --spider option, but when I pipe its output through grep, I can't seem to find the right magic to make it work:
wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'

The grep filter seems to have absolutely no effect on the wget output. Have I got something wrong, or is there another tool I should try that's better suited to producing this kind of limited result set?
Update
So I just found out offline that, by default, wget writes its log output to stderr. I missed that in the man pages (in fact, I still haven't found it, if it's in there). Once I redirected stderr to stdout, I got closer to what I need:
wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:'
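
The behavior described in this update can be reproduced without touching the network. The sketch below is only an illustration: `noisy` is a hypothetical function that stands in for wget by writing a log line to stderr, and the variable names are made up for the demo.

```shell
# noisy is a hypothetical stand-in for wget: it logs to stderr, not stdout.
noisy() {
  echo "Saving to: index.html" >&2
}

# A pipe carries only stdout, so grep receives nothing from noisy here.
# (stderr is sent to /dev/null just to keep the demo quiet.)
without_redirect=$(noisy 2>/dev/null | grep -c 'Saving to:' || true)

# With 2>&1, stderr is merged into stdout, so the line reaches grep.
with_redirect=$(noisy 2>&1 | grep -c 'Saving to:' || true)

echo "matches without 2>&1: $without_redirect"
echo "matches with 2>&1:    $with_redirect"
```

The first count is 0 and the second is 1, which matches what happened with the two wget commands above: the filter only starts working once stderr is part of the piped stream.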

I'd still be interested in other or better ways of doing this kind of thing, if any exist.
Recommended answer
The absolute last thing I want to do is download and parse all of the content myself (i.e. create my own spider). Once I learned that Wget writes to stderr by default, I was able to redirect it to stdout and filter the output appropriately.
wget --spider --force-html -r -l2 "$url" 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -Ev '\.(css|js|png|gif|jpg)$' \
  > urls.m3u

This gives me a list of the content resource URIs that were spidered (that is, resources that aren't images, CSS, or JS source files). From there, I can send the URIs off to a third-party tool for processing to meet my needs.
The output still needs to be streamlined slightly (as shown above, it produces duplicates), but it's almost there, and I haven't had to do any parsing myself.
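
The duplicates could be dropped with one more filter stage. This is a minimal sketch, not part of the original answer: `dedupe_sorted` and `dedupe_ordered` are hypothetical helper names, and the sample paths under somesite.com are made up for illustration.

```shell
# Keep one copy of each line; the output comes back in sorted order.
dedupe_sorted() {
  sort -u
}

# Keep one copy of each line, preserving the order of first appearance.
dedupe_ordered() {
  awk '!seen[$0]++'
}

# Hypothetical sample list containing one repeated URI.
sample_urls() {
  printf '%s\n' \
    'http://somesite.com/b' \
    'http://somesite.com/a' \
    'http://somesite.com/b'
}

sample_urls | dedupe_ordered
```

In the pipeline above, either filter would sit just before the redirect to urls.m3u. `sort -u` is the simpler choice when crawl order doesn't matter; the awk form is useful when it does.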
