Decode only the non-ASCII characters in a URL

Views: 60  Published: 2021/6/26 19:09:57  Tags: python, python-2.7, urldecode


Question

I'm working on Wikipedia. In many articles I noticed that some URLs, for example https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99, are very long. This one could be replaced with "https://www.google.com/search?q=%26ฉัน" (ฉัน is a Thai word), which is shorter and cleaner. However, when I decode the URL with urllib.unquote, it decodes %26 as well and returns "https://www.google.com/search?q=&ฉัน". That URL is useless: the decoded & is now read as a query-parameter separator, so the link no longer points to the same search.
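
The behaviour described in the question can be reproduced with a short sketch. It assumes Python 3, where urllib.parse.unquote is the counterpart of the Python 2.7 urllib.unquote mentioned above; the variable names are only illustrative.

```python
from urllib.parse import parse_qs, unquote, urlsplit

url = "https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"

# unquote() decodes every percent-escape, including the reserved %26.
decoded = unquote(url)
print(decoded)

# Parsing the query string shows that the two URLs no longer mean the same thing:
# in the original, q carries "&ฉัน"; in the decoded URL the bare & splits the query.
original_query = parse_qs(urlsplit(url).query)
decoded_query = parse_qs(urlsplit(decoded).query)
print(original_query)
print(decoded_query)
```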

So I want to know how to decode a link while keeping it valid. I think that decoding only the non-ASCII characters would give a valid URL. Is that correct, and how can I do it?

Thanks :)

Recommended answer

The easiest way is to replace every percent-escape below %80 (that is, %00-%7F) with a placeholder, URL-decode the result, and then put the original escapes back in place of the placeholders.
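
A minimal sketch of the placeholder approach follows. It assumes Python 3 (urllib.parse.unquote rather than the Python 2.7 urllib.unquote), and the names unquote_non_ascii and PLACEHOLDER are hypothetical. As a simplification, only the % sign of each ASCII escape is swapped for a NUL character, which should never appear literally in a URL.

```python
import re
from urllib.parse import unquote

# Matches the "%" of any escape in the range %00-%7F (lookahead keeps the hex digits).
_ASCII_ESCAPE_PERCENT = re.compile(r"%(?=[0-7][0-9A-Fa-f])")

# Assumption: a raw NUL never occurs in the input URL, so it is a safe placeholder.
PLACEHOLDER = "\x00"


def unquote_non_ascii(url):
    """Decode only the escapes %80-%FF, leaving %00-%7F untouched."""
    if PLACEHOLDER in url:
        raise ValueError("URL already contains the placeholder character")
    # 1. Hide the ASCII escapes: "%26" becomes "\x0026".
    protected = _ASCII_ESCAPE_PERCENT.sub(PLACEHOLDER, url)
    # 2. Decode what is left (interpreted as UTF-8 by default).
    decoded = unquote(protected)
    # 3. Restore the hidden escapes.
    return decoded.replace(PLACEHOLDER, "%")


result = unquote_non_ascii(
    "https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"
)
print(result)
```

Note that unquote replaces bytes that are not valid UTF-8 with U+FFFD by default, so this sketch does not by itself detect URLs written in another character set.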

Another way is to look for UTF-8 sequences. Your URL appears to be encoded in UTF-8, and Wikipedia uses UTF-8. The Wikipedia entry for UTF-8 describes how UTF-8 characters are encoded.

So, when percent-encoded in a URL, each valid non-ASCII UTF-8 character follows one of these patterns:

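(%C2-%DF)(%80-%BF)
(%E0-%EF)(%80-%BF)(%80-%BF)
(%F0-%F4)(%80-%BF)(%80-%BF)(%80-%BF)

The original UTF-8 design also allowed lead bytes %F5-%FD and five- and six-byte sequences, but RFC 3629 removed them, and %C0 and %C1 can only begin overlong encodings, so none of those occur in valid UTF-8.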

So you can match these patterns in the URL and unquote each matched character separately.
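
The pattern-matching idea can be sketched as follows, assuming Python 3. The function name unquote_utf8_sequences is hypothetical. The regular expression is deliberately loose and limited to sequences of at most four bytes; a strict UTF-8 decode does the final validation, and anything that fails it is left percent-encoded.

```python
import re
from urllib.parse import unquote_to_bytes

# One escaped continuation byte, %80-%BF.
_CONT = r"%[89ABab][0-9A-Fa-f]"

# Candidate escaped UTF-8 sequences of two, three, or four bytes.
_UTF8_ESCAPED = re.compile(
    r"%[CDcd][0-9A-Fa-f]" + _CONT
    + r"|%[Ee][0-9A-Fa-f]" + _CONT * 2
    + r"|%[Ff][0-7]" + _CONT * 3
)


def _decode_match(match):
    """Decode one candidate sequence; keep it escaped if it is not valid UTF-8."""
    raw = unquote_to_bytes(match.group(0))
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return match.group(0)


def unquote_utf8_sequences(url):
    """Unquote only the escapes that form valid multi-byte UTF-8 characters."""
    return _UTF8_ESCAPED.sub(_decode_match, url)


result = unquote_utf8_sequences(
    "https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"
)
print(result)
```

Unlike a blanket unquote, this leaves stray high bytes such as a lone %A9 percent-encoded instead of turning them into replacement characters.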

However, remember that not all URLs are encoded in UTF-8.

Some old websites still use other character sets, such as Windows-874 for Thai.

On such a website, "ฉัน" is encoded as "%A9%D1%B9" instead of "%E0%B8%89%E0%B8%B1%E0%B8%99". If you decode it with urllib.unquote and treat the result as UTF-8, you get garbled text like "?ѹ" instead of "ฉัน", and that can break the link.
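
A small sketch of the mismatch, assuming Python 3 and its built-in cp874 codec for Windows-874. The variable names are illustrative.

```python
from urllib.parse import unquote

escaped = "%A9%D1%B9"  # "ฉัน" percent-encoded from Windows-874 bytes

# Decoded as UTF-8 (the default), the bytes do not form the intended text:
# 0xA9 is not a valid lead byte, and 0xD1 0xB9 happens to be a Cyrillic letter.
as_utf8 = unquote(escaped)

# Decoded with the character set the site actually uses, the text comes out right.
as_cp874 = unquote(escaped, encoding="cp874")

print(repr(as_utf8))
print(as_cp874)
```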

So you have to be careful and check whether decoding the URL breaks the link. Make sure that the URL you are decoding is encoded in UTF-8.
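
One possible check, sketched under the assumption of Python 3, is to attempt a strict UTF-8 decode before rewriting a link. The helper name looks_like_utf8 is hypothetical.

```python
from urllib.parse import unquote


def looks_like_utf8(url):
    """Return True if every percent-escaped byte run in url decodes as strict UTF-8."""
    try:
        unquote(url, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


utf8_ok = looks_like_utf8(
    "https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"
)
legacy_ok = looks_like_utf8("https://example.com/?q=%A9%D1%B9")
print(utf8_ok, legacy_ok)
```

A passing check is evidence rather than proof, because some byte sequences from legacy character sets also happen to be valid UTF-8. The example.com URL is a made-up test input.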
