How to normalize a URL in Java?


Question

URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.

Strategies include adding trailing slashes, https => http, etc. The Wikipedia page lists many.

Got a favorite method of doing this in Java? Perhaps a library (Nutch?), but I'm open. Smaller libraries with fewer dependencies are better.

I'll handcode something for now and keep an eye on this question.

EDIT: I want to aggressively normalize to count URLs as the same if they refer to the same content. For example, I ignore the parameters utm_source, utm_medium, utm_campaign. For example, I ignore subdomain if the title is the same.
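The parameter-stripping part of that EDIT could be hand-coded roughly as follows. This is only a sketch under assumptions: the class name `TrackingParamStripper` and the method `strip` are hypothetical, parameter names are compared case-sensitively and without percent-decoding, and only the three parameters named above are dropped.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of any library.
final class TrackingParamStripper {
    // The three parameters named in the question; extend the set as needed.
    private static final Set<String> IGNORED = new HashSet<>(
            Arrays.asList("utm_source", "utm_medium", "utm_campaign"));

    /** Returns the URL with ignored query parameters removed; everything else is kept as written. */
    static String strip(String url) {
        URI u = URI.create(url);
        String rawQuery = u.getRawQuery();
        if (rawQuery == null) {
            return url; // no query (or an opaque URI such as mailto:), nothing to strip
        }
        List<String> kept = new ArrayList<>();
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            if (!pair.isEmpty() && !IGNORED.contains(name)) {
                kept.add(pair); // keep the raw (still percent-encoded) pair
            }
        }
        // Reassemble from the raw components so existing escapes are not re-encoded.
        StringBuilder sb = new StringBuilder();
        if (u.getScheme() != null) {
            sb.append(u.getScheme()).append(':');
        }
        if (u.getRawAuthority() != null) {
            sb.append("//").append(u.getRawAuthority());
        }
        sb.append(u.getRawPath());
        if (!kept.isEmpty()) {
            sb.append('?').append(String.join("&", kept));
        }
        if (u.getRawFragment() != null) {
            sb.append('#').append(u.getRawFragment());
        }
        return sb.toString();
    }
}
```

With this sketch, `strip("http://example.com/page?utm_source=x&id=7&utm_medium=y#top")` yields `http://example.com/page?id=7#top`. Comparing page titles across subdomains needs the fetched content and is outside what a URL-only helper can do.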

Recommended answer

Have you taken a look at the URI class?

http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()
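For reference, a minimal sketch of what that method does. Per its Javadoc, `URI.normalize()` normalizes only the path (it removes `.` segments and resolves `..` segments), so it covers the syntactic part of the problem and not the aggressive, content-based equivalence asked about in the EDIT. The class name `UriNormalizeSketch` and the example URL are made up for illustration.

```java
import java.net.URI;

// Hypothetical wrapper class for illustration only.
final class UriNormalizeSketch {
    /** Applies java.net.URI#normalize(), which cleans up "." and ".." path segments. */
    static String normalizePath(String url) {
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        // prints "http://example.com/a/c"
        System.out.println(normalizePath("http://example.com/a/./b/../c"));
    }
}
```

Query parameters and fragments pass through unchanged, so steps such as dropping tracking parameters or adding trailing slashes would still need to be hand-coded or taken from a library.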
