Extracting TLDs from URLs and sorting domains and subdomains for each TLD file

Views: 174 | Published: 2017/6/9 20:16:45 | Tags: perl url dns tld


Question

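I have a list of a million URLs. I need to extract the TLD of each URL and create one file per TLD: for example, collect all URLs whose TLD is .com into one file, all .edu URLs into another, and so on. Within each file, the entries must be sorted alphabetically by domain and then by subdomain.

Can anyone give me a head start on implementing this in Perl?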

Solution

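Use URI to parse the URL,
use its host method to get the host,
use Domain::PublicSuffix's get_root_domain to parse the host name, and
use the tld or suffix method to get the real TLD (e.g. uk) or the pseudo-TLD (e.g. co.uk).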



use strict;
use warnings;
use feature qw( say );

use Domain::PublicSuffix qw( );
use URI                  qw( );

my $dps = Domain::PublicSuffix->new();

for (qw(
   http://www.google.com/
   http://www.google.co.uk/
)) {
   my $url = $_;

   # Treat relative URLs as absolute URLs with missing http://.
   $url = "http://$url" if $url !~ /^\w+:/;

   my $host = URI->new($url)->host();
   $host =~ s/\.\z//;  # D::PS doesn't handle "domain.com.".

   $dps->get_root_domain($host)
      or die $dps->error();

   say $dps->tld();     # "com" for the first URL, "uk" for the second
   say $dps->suffix();  # "com" for the first URL, "co.uk" for the second
}
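
The code above covers only the TLD extraction. For the grouping and sorting part of the question, one possible sketch follows; it is not part of the original answer. It assumes the host names have already been extracted as shown above, and for simplicity it takes the last label of the host as the TLD (the value returned by `$dps->tld()` or `$dps->suffix()` could be used instead). The helper names `sort_key`, `group_by_tld` and `write_tld_files`, and the `<tld>.txt` file naming, are hypothetical. Reversing the labels of each host yields a key that compares the domain first and the subdomains afterwards.

```perl
use strict;
use warnings;

# Build a sort key by reversing the labels of a host name, so that
# "mail.google.com" becomes "com\0google\0mail". The "\0" separator
# sorts below every character allowed in a label, so comparing the
# keys as strings compares the hosts label by label, domain first.
sub sort_key {
    my ($host) = @_;
    return join "\0", reverse split /\./, lc $host;
}

# Group hosts by TLD and sort every group by domain, then subdomain.
# Assumption: the TLD is simply the last label of the host name.
sub group_by_tld {
    my (@hosts) = @_;

    my %by_tld;
    for my $host (@hosts) {
        my ($tld) = lc($host) =~ /([^.]+)\z/ or next;  # skip empty hosts
        push @{ $by_tld{$tld} }, $host;
    }

    for my $tld (keys %by_tld) {
        # Schwartzian transform: compute each key once, sort, unwrap.
        $by_tld{$tld} = [
            map  { $_->[1] }
            sort { $a->[0] cmp $b->[0] }
            map  { [ sort_key($_), $_ ] }
            @{ $by_tld{$tld} }
        ];
    }

    return \%by_tld;
}

# Write one file per TLD (hypothetical naming: "<dir>/<tld>.txt").
sub write_tld_files {
    my ($by_tld, $dir) = @_;
    for my $tld (sort keys %$by_tld) {
        my $path = "$dir/$tld.txt";
        open my $fh, '>', $path or die "Cannot open $path: $!";
        print {$fh} "$_\n" for @{ $by_tld->{$tld} };
        close $fh or die "Cannot close $path: $!";
    }
}
```

Calling `write_tld_files(group_by_tld(@hosts), '.')` would then create files such as `com.txt` and `edu.txt` in the current directory. A million entries should fit in memory; for much larger inputs, appending to the per-TLD files first and sorting each file afterwards may be preferable.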
