Pattern Matching for URL classification

Views: 107 Published: 2015/11/30 20:28:47 Tags: string algorithm pattern-matching

Problem description

As part of a project, a few others and I are currently working on a URL classifier. What we are trying to implement is actually quite simple: we look at the URL, find relevant keywords occurring within it, and classify the page accordingly.

For example, if the URL is http://cnnworld/sports/abcd, we would classify it under the category "sports".

To accomplish this, we have a database of mappings in the format Keyword -> Category.

What we currently do is, for each URL, read through every entry in the database and use the String.find() method to check whether the keyword occurs within the URL. As soon as a match is found, we stop.
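For reference, the current approach might look roughly like the sketch below. It is only an illustration: the in-memory dictionary `KEYWORD_TO_CATEGORY` and the function name `classify_first_match` are hypothetical stand-ins for the real database and code, and Python's `str.find` plays the role of String.find().

```python
from typing import Optional

# Hypothetical stand-in for the Keyword -> Category database table.
KEYWORD_TO_CATEGORY = {
    "sports": "sports",
    "football": "sports",
    "election": "politics",
}


def classify_first_match(url: str) -> Optional[str]:
    """Scan every keyword and return the category of the first one found."""
    lowered = url.lower()
    for keyword, category in KEYWORD_TO_CATEGORY.items():
        # str.find returns -1 when the keyword does not occur in the URL.
        if lowered.find(keyword) != -1:
            return category  # stop at the first match
    return None


print(classify_first_match("http://cnnworld/sports/abcd"))
```

Each call does one substring search per keyword, so the cost per URL grows with the size of the keyword table.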

But this approach has a few problems, the main ones being:

(i) Our database is very large, and such repeated querying runs extremely slowly.

(ii) A page may belong to more than one category, and our approach does not handle such cases. Of course, one simple way to handle this would be to continue querying the database even after a category match is found, but this would only make things even slower.

I was thinking of alternatives and was wondering whether the reverse could be done: parse the URL, find the words occurring within it, and then query the database for those words only.
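A minimal sketch of that reverse idea, assuming the keyword table fits in a dictionary and that "words" are runs of letters and digits separated by URL punctuation. Both assumptions, and the names `url_tokens` and `classify_by_tokens`, are mine rather than part of the project.

```python
import re
from typing import Dict, Set


def url_tokens(url: str) -> Set[str]:
    """Split a URL into lowercase alphanumeric tokens."""
    return {tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok}


def classify_by_tokens(url: str, keyword_to_category: Dict[str, str]) -> Set[str]:
    """Look up each token of the URL and collect every matching category."""
    return {
        keyword_to_category[tok]
        for tok in url_tokens(url)
        if tok in keyword_to_category
    }


mapping = {"sports": "sports", "election": "politics"}  # hypothetical data
print(classify_by_tokens("http://news/election/sports-roundup", mapping))
```

Because every token is looked up, a URL can land in several categories at once. The cost is one hash lookup per token, independent of the number of keywords. The catch is that a keyword glued inside a longer token (such as "world" inside "cnnworld") is not found this way.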

A naive algorithm for this would run in O(n^2), where n is the length of the URL: query the database for every substring that occurs within the URL.
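The naive substring version could be sketched as follows, again with a dictionary standing in for the database and a hypothetical function name `classify_by_substrings`. A URL of length n has n(n+1)/2 non-empty substrings, which is where the O(n^2) number of lookups comes from.

```python
from typing import Dict, Set


def classify_by_substrings(url: str, keyword_to_category: Dict[str, str]) -> Set[str]:
    """Look up every substring of the URL and collect all matching categories."""
    text = url.lower()
    n = len(text)
    categories: Set[str] = set()
    for start in range(n):
        for end in range(start + 1, n + 1):
            # One lookup per substring: n * (n + 1) / 2 lookups in total.
            category = keyword_to_category.get(text[start:end])
            if category is not None:
                categories.add(category)
    return categories


mapping = {"world": "news", "sports": "sports"}  # hypothetical data
print(classify_by_substrings("http://cnnworld/sports/abcd", mapping))
```

Unlike token lookup, this finds keywords embedded in longer words, and it naturally returns more than one category. Note that the O(n^2) figure counts lookups only; building and hashing each substring adds work proportional to its length.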

I was wondering if there is any better approach to accomplish this. Any ideas? Thank you in advance :)
