bash script: word occurrences in web sites


Question

I need to make a script that counts word occurrences on web pages.

What I'm doing is generating random IPs (avoiding checking the same IP more than once), using nmap to see whether port 80 is open (to tell whether it's a web server), and then using w3m to dump the HTTP page to a file. After that it's easy to count the word occurrences.
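The steps above can be sketched roughly as follows. The IP address and the word are placeholders, and the network part (which needs nmap and w3m installed) is shown as a commented usage example:

```shell
#!/bin/sh
# Count case-insensitive whole-word occurrences in text read from stdin:
# split on whitespace so there is one token per line, then count exact
# (case-insensitive) matches of the word.
count_word() {
    tr -s '[:space:]' '\n' | grep -cix "$1"
}

# Full pipeline from the question (hypothetical IP and word; requires
# nmap and w3m):
#   ip=203.0.113.5; word=linux
#   nmap -p 80 "$ip" | grep -q '80/tcp open' &&
#       w3m -dump "http://$ip/" | count_word "$word"

printf 'Linux is everywhere; linux won\n' | count_word linux   # prints 2
```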

I have a few questions and problems:


  • This process takes a LOT of time, but I can't think of any way to make it quicker.
  • Many of the IPs with port 80 open aren't web sites; they aren't up, or they have some problem. Is there any way to check only the sites that are up?
  • This method only checks word occurrences on the index page of a site. Is there a way to also check the other public pages?
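For the second point, one common approach (not part of the original question) is to ask the host for an HTTP response before doing any heavier work; a minimal sketch assuming curl is installed:

```shell
#!/bin/sh
# Liveness check sketch: succeed only when the host answers an HTTP
# HEAD request within 5 seconds (curl exits non-zero on failure).
is_up() {
    curl -s -I --max-time 5 -o /dev/null "http://$1/"
}

# Usage (hypothetical address):
#   is_up 203.0.113.5 && echo "up, worth scanning"
```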

Thanks a lot.

Answer

I do similar stuff with awk. Awk is awesome for text parsing. What I do is analyze, from the Apache log, how many HTTP GETs each IP address has made. So bots like yours show up easily in my statistics :P With awk I've outperformed all the solutions my colleagues wrote in PHP, Ruby and bash.
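The per-IP GET counting described here can be sketched like this (the log path is an assumption; the script relies on field 1 of an Apache access log being the client IP and the request line starting with "GET):

```shell
#!/bin/sh
# Count how many HTTP GETs each client IP made in an Apache access log.
gets_per_ip() {
    awk '/"GET / { gets[$1]++ }
         END { for (ip in gets) print ip, gets[ip] }' "$1"
}

# Usage (log path is an assumption):
#   gets_per_ip /var/log/apache2/access.log
```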

The problem is that you are not generating statistics per file (or page); you are summing up all the results, right? So I would use SQLite to keep track of how many times a word has appeared across all scanned texts. It is easy (and fast) to add data to SQLite from a shell script.
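A minimal sketch of that accumulation, assuming the sqlite3 command-line tool is available (version 3.24+ for the upsert syntax); the database path and table/column names are made up for illustration:

```shell
#!/bin/sh
# Accumulate per-word totals in SQLite from a shell script.
command -v sqlite3 >/dev/null || exit 0

db="/tmp/word_counts_demo.db"
rm -f "$db"
sqlite3 "$db" 'CREATE TABLE IF NOT EXISTS words(word TEXT PRIMARY KEY, n INTEGER);'

add_count() {   # add_count <word> <occurrences-on-current-page>
    sqlite3 "$db" "INSERT INTO words(word, n) VALUES('$1', $2)
                   ON CONFLICT(word) DO UPDATE SET n = n + $2;"
}

add_count linux 3    # count from one page
add_count linux 2    # count from another page
sqlite3 "$db" "SELECT n FROM words WHERE word='linux';"   # prints 5
```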

Also, you should use wget --spider or another spidering HTTP client, because they will fetch not only the index page but also all pages the first page links to (HREFs). That way you can scan a website recursively.
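A recursive fetch one level deep could be sketched like this (the host and word are placeholders; note that --spider only checks links without saving, so for counting words a plain recursive wget is used, and it counts raw HTML rather than the rendered text w3m would produce):

```shell
#!/bin/sh
# Fetch the index page plus every page it links to (depth 1), save
# them flat in a temporary directory, then count occurrences of a
# word across all fetched files. Requires wget.
crawl_and_count() {   # crawl_and_count <url> <word>
    dir=$(mktemp -d)
    # -r -l 1: recurse one level; -nd: no directory tree; -P: target dir
    wget -q -r -l 1 -nd --timeout=10 --tries=1 -P "$dir" "$1"
    cat "$dir"/* 2>/dev/null | tr -s '[:space:]' '\n' | grep -cix "$2"
}

# Usage (hypothetical host):
#   crawl_and_count "http://203.0.113.5/" linux
```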
