Extract email addresses from a website using scripts
Question
Given a website, I wonder what is the best procedure, programmatically and/or using scripts, to extract all email addresses that are present in plain text on each page, in the form XXXX@YYYYY.ZZZZ, from that link and all sites underneath it, recursively or down to some fixed depth.
Answer
Using shell programming, you can achieve your goal with two programs piped together:
- wget: fetches all the pages
- grep: filters the output and gives you only the emails
An example:
wget -q -r -l 5 -O - http://somesite.com/ | grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b"
wget, in quiet mode (-q), gets all pages recursively (-r) with a maximum depth level of 5 (-l 5) from somesite.com and prints everything to stdout (-O -).
grep uses an extended regular expression (-E) and prints only the matching email addresses (-o).
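You can try the grep stage on its own, without any network access. This is a minimal sketch using made-up sample text; the addresses shown are placeholders, not from the original answer:

```shell
# Feed sample text to the same regex used in the pipeline above;
# -E enables extended regexes, -o prints only the matched parts.
printf 'Contact: alice@example.com or bob.smith@mail.example.org\nNo address here\n' \
  | grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b"
```

Only the two addresses are printed, one per line; the line without an address produces no output.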
All emails will be printed to standard output, and you can write them to a file by appending > somefile.txt to the command.
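Crawled pages often repeat the same address many times, so a practical variant (a sketch, not part of the original answer) pipes through sort -u to deduplicate before writing the file; somesite.com and emails.txt are placeholder names:

```shell
# Same pipeline as above, with duplicates removed by sort -u
# before the results are written to emails.txt.
wget -q -r -l 5 -O - http://somesite.com/ \
  | grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" \
  | sort -u > emails.txt
```

sort -u sorts the lines and keeps one copy of each, so emails.txt ends up with each address exactly once.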
Read the man pages of wget and grep for more documentation.
This example was tested with GNU bash version 4.2.37(1)-release, GNU grep 2.12 and GNU Wget 1.13.4.