Get random site names in bash


Question


I'm making a script that calculates the distribution of words on the web. What I have to do is check as many random web sites as I can, count the words on those sites, list them, and order them so that the word that occurs the most is at the top of the list. What I'm doing is generating random IP numbers:

a=$(( RANDOM % 255 + 1 ))   # first octet 1-255, never 0
b=$(( RANDOM % 256 ))
c=$(( RANDOM % 256 ))
d=$(( RANDOM % 256 ))
ip=$a.$b.$c.$d


After that, with nmap, I check whether port 80 or 8080 is open, so that there is a chance it's a web site.
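
A minimal sketch of that check, assuming nmap is installed. The `$ip` here is hardcoded to an address in the reserved TEST-NET-3 documentation range, so running the sketch never probes a real host; in the actual script it would come from the generator above.

```shell
# Sketch: probe ports 80/8080 on one address with nmap.
ip="203.0.113.7"                   # placeholder (TEST-NET-3); use the generated $ip in practice

if command -v nmap >/dev/null 2>&1; then
  # -p 80,8080 : scan only these two ports
  # --open     : report only ports that are open
  # -oG -      : grepable output on stdout ("80/open/tcp" and the like)
  if nmap -p 80,8080 --open --host-timeout 5s -oG - "$ip" 2>/dev/null | grep -q '/open/'; then
    echo "$ip may be serving HTTP"
  else
    echo "$ip" >> blacklist.txt    # remember not to probe it again
  fi
fi
```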


If I'm sure the IP doesn't belong to a web site, I add the address to a blacklist file so that it doesn't get checked again.


If port 80 or port 8080 is open, then I have to resolve the IP with a reverse lookup and get all the domain names that belong to that IP.


The problem is that if I run one of these commands, the output is only a single PTR record, while there can be multiple:

dig -x ipaddress +short
nslookup ipaddress
host ipaddress
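
As a hedged note on the multiple-PTR concern: `dig +short -x` does print every PTR record in the DNS answer, one per line; if only one name comes back, that is usually all the zone publishes, and reverse DNS generally cannot enumerate the virtual hosts sharing an IP. A small sketch (the `8.8.8.8` example is just for illustration):

```shell
# List every PTR record for an IP, one name per line.
# The sed strips the trailing root dot dig appends to each name.
reverse_names() {
  dig +short -x "$1" | sed 's/\.$//'
}

# Example call; requires dig and network access.
if command -v dig >/dev/null 2>&1; then
  reverse_names 8.8.8.8
fi
```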


I'd prefer this to be solved in bash, but if there is a solution in C, it could help as well.


After that, I copy the web site's page to a file using w3m and count the word occurrences.
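
A minimal sketch of that counting step, assuming the page text comes from `w3m -dump`. It is demonstrated on a literal string so it runs without a network; in practice you would pipe `w3m -dump "$url"` into `count_words` instead.

```shell
# Normalize text to one lowercase word per line, then tally occurrences.
count_words() {
  tr -cs '[:alpha:]' '\n' |        # replace runs of non-letters with newlines
  tr '[:upper:]' '[:lower:]' |     # fold case so "The" == "the"
  sort | uniq -c | sort -rn        # most frequent word first
}

# Offline demonstration; "the" (count 2) lands at the top.
printf 'the cat and the dog' | count_words
```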


Also, here I have another problem: is there a way to check all the available public pages that belong to the site, and not only the index one?
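
On that sub-question, one hedged observation: a site's public pages can generally only be discovered by following links from pages you already have, or via `/sitemap.xml` or `robots.txt` when the site publishes them. A sketch that pulls `href` targets out of raw HTML; the sample markup below is made up, and in real use the input would be `curl -s "$url"` or `w3m -dump_source "$url"`:

```shell
# Extract unique href targets from HTML on stdin.
extract_links() {
  grep -oE 'href="[^"]*"' |        # grab each href="..." attribute
  sed 's/^href="//; s/"$//' |      # strip the attribute wrapper
  sort -u                          # deduplicate
}

sample='<a href="/about.html">About</a> <a href="/blog/">Blog</a>'
printf '%s\n' "$sample" | extract_links
```

This only sees pages the site links to; unlinked URLs stay invisible to any crawler.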


Any help is appreciated

Answer


A lot of websites are not accessible purely by IP address, due to virtual hosts and such, so I'm not sure you'd get a uniform distribution of words on the web this way. Moreover, the IP addresses that host websites are not evenly distributed over the space of randomly generated 32-bit numbers: hosting companies serving the majority of real websites are concentrated in small ranges, and many other IPs are ISP endpoints with probably nothing hosted.


Given the above, and the problem you are trying to solve, I would actually recommend getting a distribution of URLs to crawl and computing the word frequency on those. A good tool for that would be something like WWW::Mechanize in Perl, or its equivalents in Python, Ruby, etc. Since your limiting factor is going to be your internet connection and not your processing speed, there's no advantage to doing this in a low-level language. This way, you'll also have a higher chance of hitting multiple sites at the same IP.
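
For what it's worth, the crawl-and-count idea can be sketched even in plain bash, though a Mechanize-style library makes the crawling side far easier. `urls.txt` and `frequencies.txt` are assumed file names for illustration, and the HTML stripping here is deliberately crude:

```shell
# Fetch every URL in a seed list and build one aggregate frequency table.
touch urls.txt                     # seed with one URL per line before a real run
while IFS= read -r url; do
  curl -s --max-time 10 "$url"     # failed fetches simply contribute no text
done < urls.txt |
  sed -e 's/<[^>]*>/ /g' |         # crude HTML tag stripping
  tr -cs '[:alpha:]' '\n' |        # one word per line
  tr '[:upper:]' '[:lower:]' |     # case-fold
  sort | uniq -c | sort -rn > frequencies.txt
```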
