Programmatically extract keywords from domain names
Problem description
Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords used in the domain. Yet I see it done on sites such as DomainTools.com, Estibot.com, etc. For example:
ilikecheese.com becomes "i like cheese"
sanfranciscohotels.com becomes "san francisco hotels"
...
Any suggestions for accomplishing this efficiently and effectively?
Edit: I'd like to write this in PHP.
Ok, I ran the script I wrote for this SO question, with a couple of minor changes -- using log probabilities to avoid underflow, and modifying it to read multiple files as the corpus.
For my corpus I downloaded a bunch of files from Project Gutenberg -- no real method to this, I just grabbed all the English-language files from etext00, etext01, and etext02.
Below are the results; I saved the top three for each combination.
expertsexchange: 97 possibilities
- experts exchange -23.71
- expert sex change -31.46
- experts ex change -33.86
penisland: 11 possibilities
- pen island -20.54
- penis land -22.64
- pen is land -25.06
choosespain: 28 possibilities
- choose spain -21.17
- chooses pain -23.06
- choose spa in -29.41
kidsexpress: 15 possibilities
- kids express -23.56
- kid sex press -32.65
- kids ex press -34.98
childrenswear: 34 possibilities
- children swear -19.85
- childrens wear -25.26
- child ren swear -32.70
dicksonweb: 8 possibilities
- dickson web -27.09
- dick son web -30.51
- dicks on web -33.63
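For anyone who wants to try the idea, here is a minimal sketch in Python of the general approach described above. It is not the script that produced these results, and it is not PHP. It assumes a plain unigram model: every way of splitting the name into known words is enumerated, and each split is scored by the sum of the log probabilities of its words. Summing logs, rather than multiplying raw probabilities, is what avoids underflow. The `COUNTS` table and the function names are hypothetical; the counts are made-up toy numbers standing in for word frequencies from a real corpus, so the scores will not match the ones listed above.

```python
import math
from collections import Counter
from functools import lru_cache

# Toy word counts standing in for a real corpus (hypothetical numbers).
COUNTS = Counter({
    "i": 500, "like": 200, "cheese": 20,
    "choose": 40, "chooses": 5, "spain": 15,
    "pain": 30, "spa": 2, "in": 900,
})
TOTAL = sum(COUNTS.values())


def log_prob(word):
    """Log of the word's relative frequency in the toy counts."""
    return math.log(COUNTS[word] / TOTAL)


def segmentations(text):
    """Return every split of text into known words, as tuples of words."""
    @lru_cache(maxsize=None)
    def split(start):
        if start == len(text):
            return [()]  # one way to split the empty remainder
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in COUNTS:  # words not in the counts are never considered
                results.extend((word,) + rest for rest in split(end))
        return results

    return split(0)


def rank(text, top=3):
    """Return (number of splits, best `top` splits as (score, phrase))."""
    scored = [
        (sum(log_prob(w) for w in words), " ".join(words))
        for words in segmentations(text)
    ]
    scored.sort(reverse=True)  # highest (least negative) score first
    return len(scored), scored[:top]


if __name__ == "__main__":
    for name in ("ilikecheese", "choosespain"):
        count, best = rank(name)
        print(f"{name}: {count} possibilities")
        for score, phrase in best:
            print(f"- {phrase} {score:.2f}")
```

Enumerating every split is fine for short domain names but grows quickly with length. If only the single best split is needed, a dynamic-programming pass that keeps the best score per prefix would avoid building the full list.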