How to classify URLs? What are URL features? How to select and extract features from a URL


Question

I have just started working on a classification problem. It is a two-class problem: my trained machine-learning model will have to predict whether to allow a URL or block it.

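My questions are very specific.

  1. How should I classify URLs? Should I use ordinary text-analysis methods?
  2. What are the features of a URL?
  3. How do I select and extract features from a URL?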

Recommended answer

I assume you do not have access to the content behind the URL, so you can only extract features from the URL string itself. Otherwise, it would make more sense to use the content of the page.

Here are some features I would try. See this paper for more ideas:

  1. All URL components. For example, this page has the URL below:

http://stackoverflow.com/questions/26456904/how-to-classify-urls-what-are-urls-features-how-to-select-and-extract-features

Tokens that occur in different parts of a URL should each carry a different weight in the classification. In this case, the last part of the URL, once tokenized, contributes very good features for this page (e.g., classify, urls, select, extract, features).

 * stackoverflow
 * com
 * questions
 * 26456904
 * how to classify urls what are urls features how to select and extract features
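
The component split above could be reproduced with a minimal sketch like the one below. It uses only Python's standard library; the function name `url_component_tokens` and the choice to split on every non-alphanumeric character are assumptions made for illustration, not something the answer prescribes.

```python
import re
from urllib.parse import urlparse


def split_tokens(text):
    """Lowercase the text and split it on any run of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def url_component_tokens(url):
    """Return the tokens of a URL grouped by the component they came from."""
    parts = urlparse(url)
    return {
        "host": split_tokens(parts.netloc),
        "path": split_tokens(parts.path),
        "query": split_tokens(parts.query),
    }


url = ("http://stackoverflow.com/questions/26456904/"
       "how-to-classify-urls-what-are-urls-features-"
       "how-to-select-and-extract-features")
tokens = url_component_tokens(url)
print(tokens["host"])      # ['stackoverflow', 'com']
print(tokens["path"][:2])  # ['questions', '26456904']
```

Keeping the tokens grouped by component lets a model treat the same word differently depending on whether it appears in the host, the path, or the query string.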

  2. The length of the URL.
  3. n-grams of the tokens (2-grams shown as examples below):
    • stackoverflow-com
    • com-questions
    • questions-26456904
    • 26456904-how
    • how-to
    • ....
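
As a rough illustration of the length and n-gram features, the sketch below turns a URL string into a bag of features. The helper names (`url_tokens`, `ngrams`, `url_features`) and the `url_length` key are hypothetical, and dropping the scheme before tokenizing is an assumption chosen so that the 2-grams match the examples listed above.

```python
import re
from collections import Counter


def url_tokens(url):
    """Drop the scheme, lowercase, and split on non-alphanumeric characters."""
    without_scheme = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url.lower())
    return [t for t in re.split(r"[^a-z0-9]+", without_scheme) if t]


def ngrams(tokens, n=2):
    """Join every run of n consecutive tokens with a hyphen."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def url_features(url, n=2):
    """Count the token n-grams of a URL and record its length in characters."""
    features = Counter(ngrams(url_tokens(url), n))
    # "url_length" cannot collide with an n-gram key: n-grams never contain "_".
    features["url_length"] = len(url)
    return features


url = ("http://stackoverflow.com/questions/26456904/"
       "how-to-classify-urls-what-are-urls-features-"
       "how-to-select-and-extract-features")
features = url_features(url)
print(ngrams(url_tokens(url))[:3])
# ['stackoverflow-com', 'com-questions', 'questions-26456904']
```

A dictionary of counts like this can then be handed to whatever vectorizer and two-class classifier you choose; that choice is outside the scope of the answer.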
