Pattern Matching for URL classification


Question

As part of a project, a few others and I are working on a URL classifier. What we are trying to implement is actually quite simple: we look at the URL, find relevant keywords occurring within it, and classify the page accordingly.

E.g., if the URL is http://cnnworld/sports/abcd, we would classify it under the category "sports".

To accomplish this, we have a database with mappings of the format: Keyword -> Category

What we currently do is, for each URL, read through all the data items in the database and use the String.find() method to see whether each keyword occurs within the URL. Once a match is found, we stop.

But this approach has a few problems, the main ones being:

(i) Our database is very big, and such repeated querying runs extremely slowly.

(ii) A page may belong to more than one category, and our approach does not handle such cases. Of course, one simple way to handle this would be to keep querying the database even after a category match is found, but that would only make things even slower.

I was thinking of alternatives and was wondering whether the reverse could be done: parse the URL, find the words occurring within it, and then query the database for those words only.

A naive algorithm for this would need O(n^2) queries, where n is the length of the URL: query the database for every substring that occurs within the URL.
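
The reverse lookup described above might be sketched as follows in Python. This is only an illustration under stated assumptions: the keyword table is small enough to load once into an in-memory dict (the `KEYWORD_TO_CATEGORY` contents are made up), and the function names are hypothetical. The first function assumes keywords appear as whole tokens between non-alphanumeric delimiters, which avoids the quadratic blow-up; the second is the naive all-substrings variant.

```python
import re

# Hypothetical keyword -> category mapping, loaded once from the database.
KEYWORD_TO_CATEGORY = {
    "sports": "sports",
    "football": "sports",
    "politics": "news",
    "world": "news",
}


def classify_by_tokens(url, mapping=KEYWORD_TO_CATEGORY):
    """Return every category whose keyword is a whole token of the URL."""
    # Split on anything that is not a lowercase letter or digit.
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return {mapping[t] for t in tokens if t in mapping}


def classify_by_substrings(url, mapping=KEYWORD_TO_CATEGORY):
    """Naive variant: look up all O(n^2) substrings of the URL."""
    text = url.lower()
    found = set()
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            category = mapping.get(text[start:end])
            if category is not None:
                found.add(category)
    return found
```

Both functions return a set, so a page can fall into more than one category. The token version misses keywords embedded inside a longer token (such as "world" in "cnnworld"), which the substring version does find at a higher cost.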

I was wondering if there is any better approach to accomplish this. Any ideas? Thank you in advance :)

Recommended answer

In our commercial classifier we have a database of 4m keywords :) and we also search the body of the HTML. There are a number of ways to solve this:

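  1. Use Aho-Corasick. We use a modified version of the algorithm tailored to web content: for example, tab, space, \r and \n are all treated as a space, and runs of them count as a single space (so two spaces are treated as one). It also ignores the difference between lower and upper case.
  2. Another option is to put all the keywords in a tree (std::map, for example) so that searching becomes very fast. The downside is that this takes memory, and a lot of it, but if it runs on a server you won't feel it.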
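
As a rough illustration of the Aho-Corasick suggestion in the list above, here is a minimal Python sketch. It is not the answerer's modified implementation: the `KeywordMatcher` class, the `normalize` helper and the sample keywords are all hypothetical, and the normalization step only approximates the whitespace and case handling the answer describes.

```python
from collections import deque


def normalize(text):
    """Lowercase, collapse whitespace runs (tab, \\r, \\n, space) to one space, trim ends."""
    return " ".join(text.lower().split())


class KeywordMatcher:
    """Aho-Corasick automaton mapping matched keywords to categories."""

    def __init__(self, keyword_to_category):
        self.goto = [{}]    # goto[state][char] -> next state (the keyword trie)
        self.fail = [0]     # failure link of each state
        self.out = [set()]  # categories of keywords that end at each state
        for keyword, category in keyword_to_category.items():
            self._add(normalize(keyword), category)
        self._build_failure_links()

    def _add(self, keyword, category):
        if not keyword:
            return  # ignore empty keywords
        state = 0
        for ch in keyword:
            nxt = self.goto[state].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[state][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append(set())
            state = nxt
        self.out[state].add(category)

    def _build_failure_links(self):
        # Breadth-first, so shallower states are finished before deeper ones.
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit keywords that end at the failure target (proper suffixes).
                self.out[nxt] |= self.out[self.fail[nxt]]

    def categories(self, text):
        """Return all categories whose keywords occur in text, in one pass."""
        state = 0
        found = set()
        for ch in normalize(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            found |= self.out[state]
        return found
```

A possible usage, with made-up keywords:

```python
matcher = KeywordMatcher({
    "sports": "sports",
    "world": "news",
    "port": "travel",
    "breaking  news": "headlines",
})
matcher.categories("http://cnnworld/sports/abcd")
matcher.categories("BREAKING\t\r\n NEWS today")
```

The automaton is built once from the whole keyword table, after which each URL or HTML body is scanned in a single pass and every matching category is collected. The cost is memory for the trie, which is the trade-off the answer mentions.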
