How to normalize a URL in Java?

Question


URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.

Strategies include adding trailing slashes, mapping https to http, and so on. The Wikipedia page on URL normalization lists many more.

Got a favorite method of doing this in Java? Perhaps a library (Nutch?), but I'm open. Smaller, with fewer dependencies, is better.

I'll handcode something for now and keep an eye on this question.

EDIT: I want to normalize aggressively, so that URLs count as the same if they refer to the same content. For example, I ignore the parameters utm_source, utm_medium and utm_campaign. I also ignore the subdomain if the page title is the same.
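A rough sketch of what such hand-coded normalization might look like, using only `java.net.URI` from the JDK (Java 8 or later is assumed for `String.join`). The class name `AggressiveUrlNormalizer`, the choice to sort query parameters, and the choice to drop the fragment and user info are assumptions made for illustration, not something the question specifies. The subdomain/title rule is left out because it needs the page content, not just the URL.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: one possible take on "aggressive" normalization.
class AggressiveUrlNormalizer {
    // Tracking parameters named in the question; extend the set as needed.
    private static final Set<String> IGNORED_PARAMS =
            new HashSet<>(Arrays.asList("utm_source", "utm_medium", "utm_campaign"));

    static String normalize(String url) {
        // URI.normalize() collapses "." and ".." path segments.
        URI uri = URI.create(url).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            return uri.toString(); // not an absolute URL with a host; leave it alone
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // treat an empty path as the root
        }

        // Keep every query pair except the ignored tracking parameters.
        List<String> kept = new ArrayList<>();
        String rawQuery = uri.getRawQuery();
        if (rawQuery != null) {
            for (String pair : rawQuery.split("&")) {
                String name = pair.split("=", 2)[0];
                if (!pair.isEmpty() && !IGNORED_PARAMS.contains(name)) {
                    kept.add(pair);
                }
            }
        }
        Collections.sort(kept); // assumption: parameter order carries no meaning

        StringBuilder out = new StringBuilder(scheme).append("://").append(host);
        if (uri.getPort() != -1) {
            out.append(':').append(uri.getPort());
        }
        out.append(path);
        if (!kept.isEmpty()) {
            out.append('?').append(String.join("&", kept));
        }
        return out.toString(); // fragment and user info are dropped on purpose
    }

    public static void main(String[] args) {
        // prints http://example.com/a/c?a=1&b=2
        System.out.println(normalize("HTTP://Example.COM/a/./b/../c?utm_source=x&b=2&a=1#frag"));
    }
}
```

Two URLs can then be compared by calling `normalize` on each and checking the results with `equals`. Whether sorting parameters or dropping fragments is safe depends on the sites being crawled.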

Recommended answer

Have you taken a look at the URI class?

http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()
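As a small illustration of what `URI.normalize()` does and does not cover: it only removes `.` and `..` segments from the path. It does not lowercase the scheme or host, reorder or strip query parameters, or touch the fragment, so more aggressive rules would still need to be hand-coded on top of it. The class name `UriNormalizeDemo` and the example URLs are made up for this sketch.

```java
import java.net.URI;

// Hypothetical demo class showing the scope of URI.normalize().
class UriNormalizeDemo {
    static String normalizePath(String url) {
        // URI.create throws IllegalArgumentException on malformed input.
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        // "." and ".." segments are collapsed; prints http://example.com/a/c?x=1
        System.out.println(normalizePath("http://example.com/a/./b/../c?x=1"));
        // Case is left untouched; prints HTTP://Example.COM/a
        System.out.println(normalizePath("HTTP://Example.COM/a"));
    }
}
```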
