URL decoding in Java for non-ASCII characters


Question

I'm trying to decode, in Java, a URL containing percent-encoded characters.

I've tried using the java.net.URI class to do the job, but it doesn't always work correctly.

String test = "https://fr.wikipedia.org/wiki/Fondation_Alliance_fran%C3%A7aise";
URI uri = new URI(test);
System.out.println(uri.getPath());

For the test string "https://fr.wikipedia.org/wiki/Fondation_Alliance_fran%C3%A7aise", the result is correct: "/wiki/Fondation_Alliance_française" (%C3%A7 is correctly replaced by ç).

But for some other test strings, such as "http://sv.wikipedia.org/wiki/Anv%E4ndare:Lsjbot/Statistik#Drosophilidae", it gives an incorrect result: "/wiki/Anv�ndare:Lsjbot/Statistik" (%E4 is replaced by � instead of ä).
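The replacement character appears because java.net.URI always decodes escaped octets as UTF-8, and a lone E4 byte is not a valid UTF-8 sequence. A minimal sketch reproducing both results follows; the class name UriPathDemo and the helper decodedPath are illustrative, not part of the original question:

```java
import java.net.URI;

public class UriPathDemo {
    // Returns the path with percent-escapes decoded by java.net.URI,
    // which interprets the escaped bytes as UTF-8.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        // Valid UTF-8 escape sequence: decodes to the expected character.
        System.out.println(decodedPath(
            "https://fr.wikipedia.org/wiki/Fondation_Alliance_fran%C3%A7aise"));
        // Lone E4 byte (ISO-8859-1 for 'ä'): not valid UTF-8, so a
        // replacement character U+FFFD is produced instead.
        System.out.println(decodedPath(
            "http://sv.wikipedia.org/wiki/Anv%E4ndare:Lsjbot/Statistik#Drosophilidae"));
    }
}
```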

I did some testing with getRawPath() and the URLDecoder class.

System.out.println(URLDecoder.decode(uri.getRawPath(), "UTF8"));
System.out.println(URLDecoder.decode(uri.getRawPath(), "ISO-8859-1"));
System.out.println(URLDecoder.decode(uri.getRawPath(), "WINDOWS-1252"));

Depending on the test string, I get correct results with different encodings:

  • For %C3%A7, I get the correct result with the "UTF-8" encoding, as expected, and incorrect results with the "ISO-8859-1" or "WINDOWS-1252" encodings.
  • For %E4, it's the opposite.
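The two observations can be reproduced with a short sketch. The class name DecoderComparison and the helper decode are assumptions made for illustration; only URLDecoder.decode(String, String) comes from the question. Note that URLDecoder targets form encoding and also turns '+' into a space, which does not matter for these inputs:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecoderComparison {
    // Wraps URLDecoder.decode so callers need not handle the checked exception.
    static String decode(String s, String charsetName) {
        try {
            return URLDecoder.decode(s, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String[] inputs = {"fran%C3%A7aise", "Anv%E4ndare"};
        String[] charsets = {"UTF-8", "ISO-8859-1"};
        for (String input : inputs) {
            for (String charset : charsets) {
                // Each input comes out right in exactly one of the two charsets.
                System.out.println(input + " as " + charset + " -> "
                    + decode(input, charset));
            }
        }
    }
}
```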

For both test URLs, I get the correct page if I put them in the Chrome address bar.

How can I correctly decode the URL in all situations? Thanks for any help.

==== Answer ====

Thanks to the suggestions in McDowell's answer below, it now seems to work. Here's the code I now have:

private static void appendBytes(ByteArrayOutputStream buf, String data) throws UnsupportedEncodingException {
  byte[] b = data.getBytes("UTF8");
  buf.write(b, 0, b.length);
}

private static byte[] parseEncodedString(String segment) throws UnsupportedEncodingException {
  ByteArrayOutputStream buf = new ByteArrayOutputStream(segment.length());
  int last = 0;
  int index = 0;
  while (index < segment.length()) {
    if (segment.charAt(index) == '%') {
      appendBytes(buf, segment.substring(last, index));
      // Bounds check: two more characters must exist after the '%'.
      if ((index + 2 < segment.length()) &&
          ("ABCDEFabcdef0123456789".indexOf(segment.charAt(index + 1)) >= 0) &&
          ("ABCDEFabcdef0123456789".indexOf(segment.charAt(index + 2)) >= 0)) {
        buf.write((byte) Integer.parseInt(segment.substring(index + 1, index + 3), 16));
        index += 3;
      // Bounds check: one more character must exist after the '%'.
      } else if ((index + 1 < segment.length()) &&
                 (segment.charAt(index + 1) == '%')) {
        buf.write((byte) '%');
        index += 2;
      } else {
        buf.write((byte) '%');
        index++;
      }
      last = index;
    } else {
      index++;
    }
  }
  appendBytes(buf, segment.substring(last));
  return buf.toByteArray();
}

private static String parseEncodedString(String segment, Charset... encodings) {
  if ((segment == null) || (segment.indexOf('%') < 0)) {
    return segment;
  }
  try {
    byte[] data = parseEncodedString(segment);
    for (Charset encoding : encodings) {
      try {
        if (encoding != null) {
          return encoding.newDecoder().
              onMalformedInput(CodingErrorAction.REPORT).
              decode(ByteBuffer.wrap(data)).toString();
        }
      } catch (CharacterCodingException e) {
        // Incorrect encoding, try next one
      }
    }
  } catch (UnsupportedEncodingException e) {
    // Nothing to do
  }
  return segment;
}

Recommended answer

Anv%E4ndare

As PopoFibo says, this is not a valid UTF-8 encoded sequence.

You can do some tolerant best-guess decoding:

public static String parse(String segment, Charset... encodings) {
  byte[] data = parse(segment);
  for (Charset encoding : encodings) {
    try {
      return encoding.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(data))
          .toString();
    } catch (CharacterCodingException notThisCharset_ignore) {}
  }
  return segment;
}

private static byte[] parse(String segment) {
  ByteArrayOutputStream buf = new ByteArrayOutputStream();
  Matcher matcher = Pattern.compile("%([A-Fa-f0-9][A-Fa-f0-9])")
                          .matcher(segment);
  int last = 0;
  while (matcher.find()) {
    appendAscii(buf, segment.substring(last, matcher.start()));
    byte hex = (byte) Integer.parseInt(matcher.group(1), 16);
    buf.write(hex);
    last = matcher.end();
  }
  appendAscii(buf, segment.substring(last));
  return buf.toByteArray();
}

private static void appendAscii(ByteArrayOutputStream buf, String data) {
  byte[] b = data.getBytes(StandardCharsets.US_ASCII);
  buf.write(b, 0, b.length);
}

This code will successfully decode the given strings:

for (String test : Arrays.asList("Fondation_Alliance_fran%C3%A7aise",
    "Anv%E4ndare")) {
  String result = parse(test, StandardCharsets.UTF_8,
      StandardCharsets.ISO_8859_1);
  System.out.println(result);
}

Note that this isn't a foolproof system that lets you ignore correct URL encoding. It works here because v%E4n (the byte sequence 76 E4 6E) is not a valid sequence under the UTF-8 scheme, and the decoder can detect this.
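To make the detection concrete, here is a small sketch that asks a strict decoder whether a byte sequence is valid in a given charset. The class StrictDecodeDemo and the helper decodesCleanly are hypothetical names introduced for this illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    // True if the bytes decode in the given charset without any
    // malformed or unmappable input being reported.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "v%E4n": E4 starts a three-byte UTF-8 sequence, but 6E is not
        // a continuation byte, so strict UTF-8 decoding fails.
        byte[] van = {0x76, (byte) 0xE4, 0x6E};
        System.out.println(decodesCleanly(van, StandardCharsets.UTF_8));
        System.out.println(decodesCleanly(van, StandardCharsets.ISO_8859_1));
    }
}
```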

If you reverse the order of the encodings, the first string can happily (but incorrectly) be decoded as ISO-8859-1.
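The order dependence is easy to see in isolation, because every byte sequence is valid ISO-8859-1 and so that charset never reports an error. The sketch below uses an assumed helper, firstCleanDecode, that mirrors the try-each-charset loop in simplified form; it is not the answer's own method:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetOrderDemo {
    // Returns the text from the first candidate charset that decodes the
    // bytes without error, or null if none of them does.
    static String firstCleanDecode(byte[] data, Charset... candidates) {
        for (Charset cs : candidates) {
            try {
                return cs.newDecoder()
                         .onMalformedInput(CodingErrorAction.REPORT)
                         .onUnmappableCharacter(CodingErrorAction.REPORT)
                         .decode(ByteBuffer.wrap(data))
                         .toString();
            } catch (CharacterCodingException e) {
                // Not this charset; try the next candidate.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The bytes that "fran%C3%A7aise" unescapes to.
        byte[] data = "fran\u00e7aise".getBytes(StandardCharsets.UTF_8);
        // UTF-8 first: the intended text.
        System.out.println(firstCleanDecode(data,
            StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
        // ISO-8859-1 first: accepted without error, but C3 A7 becomes
        // two separate characters.
        System.out.println(firstCleanDecode(data,
            StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
    }
}
```

This is why the strictest candidate should come first and a charset that accepts any byte sequence should only be used as the last fallback.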

Note: HTTP doesn't care about percent-encoding, and you can write a web server that accepts http://foo/%%%%% as a valid form. The URI spec mandates UTF-8, but this was done retroactively. It is really up to the server to define what form its URIs should take, and if you have to handle arbitrary URIs you need to be aware of this legacy.

I have written more about URLs and Java here.
