Programmatically extract keywords from domain names

Problem description

Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords used in the domain. Yet I see it done on sites such as DomainTools.com, Estibot.com, etc. For example:

ilikecheese.com becomes "i like cheese"
sanfranciscohotels.com becomes "san francisco hotels"
...

Any suggestions for accomplishing this efficiently and effectively?

Edit: I'd like to write this in PHP.

Solution

Ok, I ran the script I wrote for this SO question, with a couple of minor changes -- using log probabilities to avoid underflow, and modifying it to read multiple files as the corpus.
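
To illustrate the log-probability point, here is a minimal sketch. It is not the script mentioned above; it is written in Python rather than PHP for brevity, and the word counts, the corpus size `total`, and the function name `log_score` are all made up for the example. It assumes a simple unigram model in which a segmentation's score is the sum of the logs of its word probabilities.

```python
import math

def log_score(words, counts, total):
    """Sum of log unigram probabilities; values closer to 0 are more likely.

    Assumes every word has a positive count (log of 0 would raise ValueError).
    """
    return sum(math.log(counts[w] / total) for w in words)

# Hypothetical toy counts standing in for a real corpus.
counts = {"pen": 50, "island": 20, "penis": 2, "land": 120}
total = 1_000_000  # hypothetical number of tokens in the corpus

print(round(log_score(["pen", "island"], counts, total), 2))
print(round(log_score(["penis", "land"], counts, total), 2))

# Why logs: multiplying many small probabilities underflows to 0.0,
# while the sum of their logs stays finite.
probs = [1e-5] * 80
print(math.prod(probs))                    # 0.0 (underflow)
print(sum(math.log10(p) for p in probs))   # roughly -400
```

The base of the logarithm only rescales the scores, so it does not change the ranking.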

For my corpus I downloaded a bunch of files from Project Gutenberg -- no real method to this, I just grabbed all the English-language files from etext00, etext01, and etext02.
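
Reading several files into one set of word counts could look roughly like the sketch below. Again this is Python, not the original script; the function name `load_counts` and the tokenization rule (lowercase runs of a-z) are assumptions made for the example.

```python
import re
from collections import Counter
from pathlib import Path

def load_counts(paths):
    """Count lowercase alphabetic words across several plain-text files."""
    counts = Counter()
    for path in paths:
        # errors="ignore" skips bytes that are not valid UTF-8 instead of failing.
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts
```

With the files in a local folder (the name `corpus` here is hypothetical), `load_counts(Path("corpus").glob("*.txt"))` would give the counts, and `sum(counts.values())` the total number of tokens.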

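Below are the results. For each domain I saved the three highest-scoring splits; the number after each split is its log probability, so values closer to zero are more likely.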

expertsexchange: 97 possibilities
 -  experts exchange -23.71
 -  expert sex change -31.46
 -  experts ex change -33.86

penisland: 11 possibilities
 -  pen island -20.54
 -  penis land -22.64
 -  pen is land -25.06

choosespain: 28 possibilities
 -  choose spain -21.17
 -  chooses pain -23.06
 -  choose spa in -29.41

kidsexpress: 15 possibilities
 -  kids express -23.56
 -  kid sex press -32.65
 -  kids ex press -34.98

childrenswear: 34 possibilities
 -  children swear -19.85
 -  childrens wear -25.26
 -  child ren swear -32.70

dicksonweb: 8 possibilities
 -  dickson web -27.09
 -  dick son web -30.51
 -  dicks on web -33.63
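
For anyone who wants to reproduce output of this shape, the following is one possible sketch, not the script used above. It is in Python, it enumerates every split whose pieces are all known words, scores each split with a unigram log probability, and keeps the best three. The names `segmentations` and `top_segmentations`, the toy counts, and the corpus size are all hypothetical.

```python
import heapq
import math
from functools import lru_cache

def segmentations(text, vocab):
    """Return every way to split text into words that all appear in vocab."""
    @lru_cache(maxsize=None)
    def split(start):
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in vocab:
                # Build new lists so cached results are never mutated.
                results.extend([word] + rest for rest in split(end))
        return results
    return split(0)

def top_segmentations(text, counts, total, n=3):
    """Return (number of possible splits, the n best as (score, words) pairs)."""
    def score(words):
        return sum(math.log(counts[w] / total) for w in words)
    candidates = segmentations(text, counts)
    return len(candidates), heapq.nlargest(n, ((score(c), c) for c in candidates))

# Hypothetical toy counts and corpus size; a real run would use corpus counts.
counts = {"pen": 50, "island": 20, "penis": 2, "land": 120, "is": 9000}
total = 1_000_000

n_possible, best = top_segmentations("penisland", counts, total)
print(f"penisland: {n_possible} possibilities")
for s, words in best:
    print(f" -  {' '.join(words)} {s:.2f}")
```

Listing every split can blow up for long strings with a large vocabulary, which is acceptable for short domain names; the scores also depend entirely on the corpus, so the toy numbers here will not match the ones above.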
