Java code/library for generating slugs (for use in pretty URLs)
Question
Web frameworks such as Rails and Django have built-in support for "slugs", which are used to generate readable and SEO-friendly URLs:
- Slugs in Rails
- Slugs in Django
A slug string typically contains only the characters a-z, 0-9 and -, and can hence be written without URL-escaping (think "foo%20bar").
I'm looking for a Java slug function that, given any valid Unicode string, will return a slug representation (a-z, 0-9 and -).
A trivial slug function would be something along the lines of:
return input.toLowerCase().replaceAll("[^a-z0-9-]", "");
However, this implementation would not handle internationalization and accents (ë > e). One way around this would be to enumerate all special cases, but that would not be very elegant. I'm looking for something more general and better thought out.
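To make the problem concrete, here is a small sketch that wraps the trivial expression from above in a hypothetical class (the names `TrivialSlug` and `trivialSlug` are made up for illustration). It shows that accented letters are dropped entirely rather than mapped to their base letters. The input is written with Unicode escapes only so that the source file encoding does not matter.

```java
final class TrivialSlug {
    // The one-liner from the question, unchanged.
    static String trivialSlug(String input) {
        return input.toLowerCase().replaceAll("[^a-z0-9-]", "");
    }

    public static void main(String[] args) {
        // "Crème brûlée": the accented letters and the space are simply removed.
        System.out.println(trivialSlug("Cr\u00e8me br\u00fbl\u00e9e")); // crmebrle
    }
}
```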
My question:
- What is the most general/practical way to generate Django/Rails type slugs in Java?
Recommended answer
Normalize your string using canonical decomposition:
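import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Locale;
import java.util.regex.Pattern;

// Matches anything that is not an ASCII word character (a-z, A-Z, 0-9, _) or a hyphen.
private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");
private static final Pattern WHITESPACE = Pattern.compile("[\\s]");

public static String toSlug(String input) {
    String nowhitespace = WHITESPACE.matcher(input).replaceAll("-");
    // NFD splits an accented letter into its base letter plus a combining mark.
    String normalized = Normalizer.normalize(nowhitespace, Form.NFD);
    String slug = NONLATIN.matcher(normalized).replaceAll("");
    return slug.toLowerCase(Locale.ENGLISH);
}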
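As a usage sketch, the method can be dropped into a class and called as follows. The class name `SlugExample` and the sample inputs are assumptions for illustration; the method body is the one from this answer. Inputs use Unicode escapes only so that the source file encoding does not matter.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Locale;
import java.util.regex.Pattern;

final class SlugExample {
    private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");
    private static final Pattern WHITESPACE = Pattern.compile("[\\s]");

    static String toSlug(String input) {
        // Each whitespace character becomes one hyphen.
        String nowhitespace = WHITESPACE.matcher(input).replaceAll("-");
        // Decompose accented letters into base letter + combining mark.
        String normalized = Normalizer.normalize(nowhitespace, Form.NFD);
        // Drop the combining marks, punctuation and any other non-word character.
        String slug = NONLATIN.matcher(normalized).replaceAll("");
        return slug.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        System.out.println(toSlug("Hello, World!"));               // hello-world
        // "Crème brûlée"
        System.out.println(toSlug("Cr\u00e8me br\u00fbl\u00e9e")); // creme-brulee
    }
}
```

Note that `\w` also matches the underscore, so underscores survive, and that consecutive whitespace characters produce consecutive hyphens. Collapsing those would need an extra step.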
This is still a fairly naive process, though. It isn't going to do anything for s-sharp (ß - used in German), or any non-Latin-based alphabet (Greek, Cyrillic, CJK, etc).
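One possible way to cover a few of these cases is to apply explicit replacements before normalizing. The sketch below is not part of the original answer: the class name `SlugWithReplacements` and the tiny replacement table are hypothetical, and a real table would have to be much larger. It also shows that text in a non-Latin script still collapses to an empty slug. `Map.of` requires Java 9 or later.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

final class SlugWithReplacements {
    private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");
    private static final Pattern WHITESPACE = Pattern.compile("[\\s]");

    // Hypothetical, deliberately tiny table: sharp s, ae ligature, o with stroke.
    private static final Map<String, String> REPLACEMENTS =
            Map.of("\u00df", "ss", "\u00e6", "ae", "\u00f8", "o");

    static String toSlug(String input) {
        // Lowercase first, with a fixed locale, so the table only needs lowercase keys.
        String s = input.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        s = WHITESPACE.matcher(s).replaceAll("-");
        s = Normalizer.normalize(s, Form.NFD);
        return NONLATIN.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(toSlug("Stra\u00dfe"));                     // strasse
        // Greek "αβγ" has no entry in the table, so nothing is left.
        System.out.println(toSlug("\u03b1\u03b2\u03b3").isEmpty());    // true
    }
}
```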
Be careful when changing the case of a string. Upper and lower case forms depend on the alphabet and, in Java, on the locale. In Turkish, the uppercase of U+0069 (i) is U+0130 (İ), not U+0049 (I), and the lowercase of U+0049 (I) is U+0131 (ı), so you risk introducing a non-Latin-1 character back into your string if you use String.toLowerCase() under a Turkish locale. Passing an explicit locale, as toSlug does with Locale.ENGLISH, avoids this.
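The locale effect is easy to reproduce. This is a minimal sketch with a hypothetical class name, `TurkishCaseDemo`, comparing `toLowerCase` with a Turkish locale against a fixed, locale-neutral one:

```java
import java.util.Locale;

final class TurkishCaseDemo {
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Under Turkish rules, 'I' (U+0049) lowercases to dotless 'ı' (U+0131).
        System.out.println(lowerTurkish("TITLE").equals("t\u0131tle")); // true
        // With a fixed locale the result is plain ASCII.
        System.out.println(lowerRoot("TITLE"));                         // title
    }
}
```

The no-argument `toLowerCase()` uses the JVM's default locale, so the first behavior can appear on a machine configured for Turkish even though the code never mentions a locale.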