Extracting TLDs from URLs and sorting domains and subdomains for each TLD file
Problem description
I have a list of millions of URLs. I need to extract the TLD from each URL and create one file per TLD. For example, collect all URLs with .com as the TLD and dump them into one file, those with .edu into another, and so on. Within each file, I have to sort the URLs alphabetically by domain, then by subdomain, and so on.
Can anyone give me a head start on implementing this in Perl?
Recommended answer
- Use URI to parse the URL.
- Use its host method to get the host.
- Use Domain::PublicSuffix's get_root_domain to parse the host name.
- Use the tld or suffix method to get the real TLD or the pseudo TLD.
use strict;
use warnings;
use feature qw( say );

use Domain::PublicSuffix qw( );
use URI qw( );

my $dps = Domain::PublicSuffix->new();

for (qw(
    http://www.google.com/
    http://www.google.co.uk/
)) {
    my $url = $_;

    # Treat relative URLs as absolute URLs with missing http://.
    $url = "http://$url" if $url !~ /^\w+:/;

    my $host = URI->new($url)->host();
    $host =~ s/\.\z//;  # D::PS doesn't handle "domain.com.".

    $dps->get_root_domain($host)
        or die $dps->error();

    say $dps->tld();     # com, then uk
    say $dps->suffix();  # com, then co.uk
}
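The answer above covers only the TLD extraction. The grouping and sorting asked about in the question could be sketched roughly as follows. This is a minimal sketch, not part of the original answer: the helper names sort_key, group_and_sort and write_files are hypothetical, the "<suffix>.txt" file naming is an assumption, and each record is assumed to be a [url, host, suffix] triple whose suffix came from $dps->suffix() (or $dps->tld()) as shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: build a sort key that puts the registered domain
# first, followed by the subdomain labels from right to left.
# Labels are joined with "\0" so "google" sorts before "google-x".
sub sort_key {
    my ($host, $suffix) = @_;
    my $rest = lc $host;
    $rest =~ s/\.\Q$suffix\E\z//;               # drop the trailing ".suffix"
    return join "\0", reverse split /\./, $rest; # "www.google" -> "google\0www"
}

# Hypothetical helper: group [url, host, suffix] records by suffix and
# sort each group by sort_key(), falling back to the URL itself on ties.
# Returns a hash ref mapping suffix => array ref of sorted URLs.
sub group_and_sort {
    my @records = @_;

    my %by_suffix;
    push @{ $by_suffix{ $_->[2] } }, $_ for @records;

    my %sorted;
    for my $suffix (keys %by_suffix) {
        $sorted{$suffix} = [
            map  { $_->[1] }
            sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] }
            map  { [ sort_key($_->[1], $_->[2]), $_->[0] ] }
            @{ $by_suffix{$suffix} }
        ];
    }
    return \%sorted;
}

# Hypothetical helper: write one file per suffix into $dir,
# e.g. "com.txt" and "co.uk.txt" (the naming scheme is an assumption).
sub write_files {
    my ($sorted, $dir) = @_;
    for my $suffix (keys %$sorted) {
        my $path = "$dir/$suffix.txt";
        open my $fh, '>', $path or die "Cannot open $path: $!";
        print {$fh} "$_\n" for @{ $sorted->{$suffix} };
        close $fh or die "Cannot close $path: $!";
    }
}
```

Inside the loop above, one would collect [$url, $host, $dps->suffix()] into an array, then call write_files(group_and_sort(@records), '.') once all URLs are processed. This keeps every URL in memory; for lists too large for that, appending to the per-TLD files first and sorting each file afterwards may be a better fit.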