Regex for matching Unicode pattern

Views: 90 Published: 2021/5/18 19:34:58 Tags: java regex


Problem description

I am trying to validate a file's content when it is uploaded, and I am stuck on Unicode encoding. I am not interested in finding special Unicode characters outside the ASCII range. I am trying to find out whether the content of the file contains at least one Unicode escape pattern, such as \u0046.

For example, I reject any file that contains the word 'script', but what if the file contains this word written with Unicode escapes? Sure, Java decodes it into a normal string when it reads the content, but what if I can't rely on this?

As far as I have found on the Internet, Unicode characters are written either like \u0046 or like U+0046. Based on this, I have written the following regex:

(\\u|U\+)....

This means \u or U+ followed by any four characters. This pattern does what I want, but I wonder whether there are other ways to write a Unicode character. Is it always \u or U+? Can there be more or fewer than four characters after \u or U+?

Thanks.

Recommended answer

The notation U+ followed by any number of hex digits belongs to the Unicode standard itself and is not functional anywhere in code. In Java source code and *.properties files, \u followed by four hex digits denotes a UTF-16 code unit and is parsed automatically.

The pattern to search for:

"\\\\u[0-9A-Fa-f]{4}"

Or a String.contains on:

"\\u"
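
As a minimal sketch of the two checks above, the following wraps them in a small helper. The class and method names (`UnicodeEscapeScan`, `hasEscape`, `hasEscapePrefix`) are hypothetical and chosen only for illustration; the sketch also assumes the uploaded content is already available as a `String` holding the raw, undecoded text.

```java
import java.util.regex.Pattern;

final class UnicodeEscapeScan {
    // A literal backslash, the letter 'u', then exactly four hex digits
    // (the pattern given in the answer).
    private static final Pattern ESCAPE = Pattern.compile("\\\\u[0-9A-Fa-f]{4}");

    // True if the raw text contains at least one well-formed escape.
    static boolean hasEscape(String content) {
        return ESCAPE.matcher(content).find();
    }

    // Looser check: flags any backslash followed by 'u',
    // even when no hex digits follow.
    static boolean hasEscapePrefix(String content) {
        return content.contains("\\u");
    }

    public static void main(String[] args) {
        String escaped = "\\u0073cript";     // raw escape text, not the decoded letter
        String path = "C:\\users\\me";       // backslash-u without hex digits

        System.out.println(hasEscape(escaped));       // true
        System.out.println(hasEscape(path));          // false
        System.out.println(hasEscapePrefix(path));    // true
    }
}
```

The regex is stricter than `String.contains`: a Windows-style path such as the one above trips the substring check but not the pattern, so which one to use depends on how many false positives the upload filter can tolerate.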

Languages other than Java also offer a longer escape, \U followed by more hex digits (eight in C, C++, and Python, for example), which covers the full UTF-32 range. Unfortunately, Java up to version 8 has no such escape.
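
Because the Java escape is limited to four hex digits, a code point above U+FFFF appears in Java-style text as two consecutive escapes (a surrogate pair). One way to handle the question's 'script' case is to decode the raw escapes first and then run the word check on the result. The sketch below is an assumption about how that could be done, not part of the original answer; `UnicodeEscapeDecoder` and `decode` are hypothetical names. It does not reproduce every rule of the Java compiler (for instance, it ignores repeated `u` letters and does not count preceding backslashes).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeEscapeDecoder {
    // Same pattern as in the answer, with the four hex digits captured.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Replaces each raw escape with the UTF-16 code unit it names.
    static String decode(String content) {
        Matcher m = ESCAPE.matcher(content);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char unit = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' and backslash in the result.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(unit)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The word check now sees the decoded text.
        System.out.println(decode("\\u0073cript").contains("script"));

        // Two escapes forming a surrogate pair decode to one code point.
        String pair = decode("\\uD83D\\uDE00");
        System.out.println(Integer.toHexString(pair.codePointAt(0)));
    }
}
```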
