How to normalize a URL in Java?
Question
URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.
Strategies include adding trailing slashes, https => http, etc. The Wikipedia page lists many.
Got a favorite method of doing this in Java? Perhaps a library (Nutch?), but I'm open. Smaller, with fewer dependencies, is better.
I'll handcode something for now and keep an eye on this question.
EDIT: I want to aggressively normalize to count URLs as the same if they refer to the same content. For example, I ignore the parameters utm_source, utm_medium, utm_campaign. For example, I ignore subdomain if the title is the same.
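The tracking-parameter rule could be sketched roughly as follows. This is only an illustration using `java.net.URI` from the JDK: the class name `TrackingParamSketch` and the method `stripTrackingParams` are made up for this example, `Set.of` assumes Java 9 or later, and the subdomain/title rule is left out because it needs the page content.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class TrackingParamSketch {

    // Parameters treated as irrelevant to page identity (the three named in the question).
    private static final Set<String> IGNORED =
            Set.of("utm_source", "utm_medium", "utm_campaign");

    // Returns the URL with the ignored query parameters removed; everything else is kept verbatim.
    static String stripTrackingParams(String raw) {
        URI uri = URI.create(raw);
        String query = uri.getRawQuery();
        if (query == null) {
            return raw; // no query component, nothing to strip
        }
        List<String> kept = new ArrayList<>();
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue; // skip empty pieces such as the one in "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            if (!IGNORED.contains(name)) {
                kept.add(pair);
            }
        }
        // When a query exists, the first '?' in the string is where it starts.
        StringBuilder sb = new StringBuilder(raw.substring(0, raw.indexOf('?')));
        if (!kept.isEmpty()) {
            sb.append('?').append(String.join("&", kept));
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }
}
```

For example, `stripTrackingParams("http://example.com/page?utm_source=x&id=7&utm_medium=y#top")` yields `http://example.com/page?id=7#top`. Parameter names are compared exactly as written, so a percent-encoded or differently cased name would not be matched.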
Recommended answer
Have you taken a look at the URI class?
http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()
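Note that `URI.normalize()` only cleans up the path (it removes `.` segments and resolves `..` where it can); it does not change the case of the scheme or host, and it leaves ports and the query string alone. A minimal sketch of using it, plus a few extra steps layered on top, might look like this. The class `UriNormalizeSketch` and its `normalize` helper are hypothetical names for this example, and the extra steps (lowercasing scheme and host, dropping default ports, using `/` for an empty path) are assumptions about what "normalized" should mean here, not something the URI class does by itself.

```java
import java.net.URI;
import java.util.Locale;

final class UriNormalizeSketch {

    // Path cleanup via URI.normalize(), then a few hand-written steps on the raw components.
    static String normalize(String raw) {
        URI uri = URI.create(raw).normalize(); // removes "." and resolves ".." in the path only
        if (uri.getScheme() == null || uri.getHost() == null) {
            return uri.toString(); // relative or opaque URI: leave the rest untouched
        }
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort();
        boolean defaultPort = (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);

        StringBuilder sb = new StringBuilder(scheme).append("://");
        if (uri.getRawUserInfo() != null) {
            sb.append(uri.getRawUserInfo()).append('@');
        }
        sb.append(host);
        if (port != -1 && !defaultPort) {
            sb.append(':').append(port);
        }
        String path = uri.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path); // empty path becomes "/"
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Plain URI.normalize(): only the path changes.
        System.out.println(URI.create("http://example.com/a/./b/../c").normalize());
        // prints http://example.com/a/c

        // With the extra steps from the helper above.
        System.out.println(normalize("HTTP://Example.COM:80/a/./b/../c?x=1"));
        // prints http://example.com/a/c?x=1
    }
}
```

The raw (still percent-encoded) components are reused on purpose, so that escaped characters such as `%2F` or `%26` are not decoded and re-encoded along the way.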