Decode only the non-ASCII characters in a URL

Question

I'm currently working on Wikipedia. In many articles I have noticed URLs such as https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99 that are very long. This example could be replaced with "https://www.google.com/search?q=%26ฉัน" (ฉัน is a Thai word), which is shorter and cleaner. However, when I decode the URL with urllib.unquote, it also decodes %26, and the result is "https://www.google.com/search?q=&ฉัน". As you may have noticed, this URL is useless: the decoded & is now read as a parameter separator, so the link no longer points to the same search.
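
The behavior can be reproduced with a short sketch. It assumes Python 3, where the function lives at urllib.parse.unquote (urllib.unquote is the Python 2 spelling); the variable names are only illustrative.

```python
from urllib.parse import parse_qs, unquote, urlsplit

url = "https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"

# unquote() decodes every escape, including the reserved character %26 ("&").
decoded = unquote(url)
print(decoded)

# In the original URL the q parameter carries the value "&ฉัน".
original_query = parse_qs(urlsplit(url).query)

# In the fully decoded URL the "&" acts as a separator, so q is left empty.
decoded_query = parse_qs(urlsplit(decoded).query)
print(original_query, decoded_query)
```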

So I would like to know how to decode a link while keeping it valid. I think decoding only the non-ASCII characters would give a valid URL. Is that correct, and how can I do it?

Thanks :)

Recommended answer

The easiest way is to replace every percent-encoded sequence below %80 (that is, %00-%7F) with a placeholder, URL-decode the result, and then put the original sequences back in place of the placeholders.
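
A minimal sketch of the placeholder approach, assuming Python 3 (urllib.parse.unquote) and assuming the input URL never contains a raw NUL character, which is used here as the placeholder. The names ASCII_ESCAPE, PLACEHOLDER and unquote_non_ascii are hypothetical, chosen for this illustration.

```python
import re
from urllib.parse import unquote

# Matches %00-%7F: a percent sign, a hex digit 0-7, then any hex digit.
ASCII_ESCAPE = re.compile(r"%[0-7][0-9A-Fa-f]")

# Assumption: a raw NUL never appears in a real URL, so it is a safe placeholder.
PLACEHOLDER = "\x00"


def unquote_non_ascii(url: str) -> str:
    """Decode only the escapes at or above %80, leaving %00-%7F untouched."""
    saved = []

    def stash(match):
        # Remember the original ASCII escape and mask it out.
        saved.append(match.group(0))
        return PLACEHOLDER

    masked = ASCII_ESCAPE.sub(stash, url)
    decoded = unquote(masked)  # only non-ASCII escapes are left to decode

    # Put each saved escape back at the position of its placeholder.
    parts = decoded.split(PLACEHOLDER)
    return parts[0] + "".join(esc + rest for esc, rest in zip(saved, parts[1:]))


print(unquote_non_ascii(
    "https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"
))
```

Because %25 (an escaped percent sign) is also masked, a sequence such as %25E0 comes back unchanged rather than being turned into a new escape.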

Another way is to look for UTF-8 sequences. Your URL appears to be encoded in UTF-8, and Wikipedia uses UTF-8. See the Wikipedia entry for UTF-8 for how UTF-8 characters are encoded.

When percent-encoded in a URL, each valid non-ASCII UTF-8 character follows one of these patterns:

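  • (%C2-%DF)(%80-%BF)
  • (%E0-%EF)(%80-%BF)(%80-%BF)
  • (%F0-%F4)(%80-%BF)(%80-%BF)(%80-%BF)

The original UTF-8 design also had five- and six-byte forms (lead bytes %F8-%FB and %FC-%FD), but RFC 3629 limits UTF-8 to four bytes, so these never appear in valid text. The lead bytes %C0, %C1 and %F5-%F7 are likewise never valid.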

So you can match these patterns in the URL and unquote each matched character separately.
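
The pattern-matching approach might look like the sketch below, assuming Python 3. The regular expression deliberately matches the lead-byte ranges loosely and covers only the two- to four-byte forms; each candidate is then decoded strictly, so anything Python's UTF-8 codec rejects is left as it was. The names CONT, UTF8_ESCAPED and unquote_utf8_only are hypothetical.

```python
import re
from urllib.parse import unquote_to_bytes

# One escaped continuation byte, %80-%BF.
CONT = r"(?:%[89AB][0-9A-F])"

# Candidate escaped UTF-8 characters: a lead byte plus 1, 2 or 3 continuations.
UTF8_ESCAPED = re.compile(
    r"%[CD][0-9A-F]" + CONT
    + r"|%E[0-9A-F]" + CONT + r"{2}"
    + r"|%F[0-7]" + CONT + r"{3}",
    re.IGNORECASE,
)


def _decode_match(match):
    raw = unquote_to_bytes(match.group(0))
    try:
        # Strict decoding rejects overlong forms and out-of-range code points.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return match.group(0)  # not valid UTF-8: keep the escape as it is


def unquote_utf8_only(url: str) -> str:
    """Decode escaped multi-byte UTF-8 characters; leave everything else."""
    return UTF8_ESCAPED.sub(_decode_match, url)


print(unquote_utf8_only(
    "https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"
))
```

Note that a sequence from a legacy encoding can still look like UTF-8 by accident: in "%A9%D1%B9" the pair %D1%B9 is a valid two-byte character, so this sketch would decode it.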

However, remember that not all URLs are encoded in UTF-8.

Some old websites still use other character sets, such as Windows-874 for Thai.

On such a website, "ฉัน" is encoded as "%A9%D1%B9" instead of "%E0%B8%89%E0%B8%B1%E0%B8%99". If you decode it with urllib.unquote, you get garbled text like "?ѹ" instead of "ฉัน", and that could break the link.
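
The difference can be seen in a small sketch, assuming Python 3, where urllib.parse.unquote decodes as UTF-8 by default and replaces undecodable bytes with U+FFFD. Python's name for the Windows-874 codec is "cp874".

```python
from urllib.parse import unquote

legacy = "%A9%D1%B9"  # "ฉัน" percent-encoded from Windows-874 bytes

# Default UTF-8 decoding: %A9 is not a valid lead byte and becomes U+FFFD,
# while %D1%B9 happens to form a valid two-byte character (U+0479).
as_utf8 = unquote(legacy)

# Decoding with the character set the site actually uses recovers the word.
as_cp874 = unquote(legacy, encoding="cp874")

print(repr(as_utf8), as_cp874)
```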

So be careful and check whether decoding the URL breaks the link. Make sure the URL you are decoding is in UTF-8.
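
One way to guard against this, sketched under the assumption of Python 3, is to attempt a strict UTF-8 decode first and leave the URL alone if it fails. The function name is_utf8_escaped is hypothetical.

```python
from urllib.parse import unquote


def is_utf8_escaped(url: str) -> bool:
    """Return True if every percent-encoded run in url is strictly valid UTF-8."""
    try:
        # errors="strict" raises instead of substituting U+FFFD.
        unquote(url, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_utf8_escaped("https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"))
print(is_utf8_escaped("https://example.com/?q=%A9%D1%B9"))
```

Passing this check does not prove that UTF-8 was intended, since some legacy byte sequences are also valid UTF-8 by coincidence, so it is worth confirming that the shortened link still resolves.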
