What is the best way to handle/remove UTF-8's right-to-left override characters?


Question

There is a UTF-8 character, U+202E RIGHT-TO-LEFT OVERRIDE (hex bytes E2 80 AE), that, when handled correctly by a UTF-8-enabled system, makes the characters after it appear visually reversed when displayed to the user. It is commonly used by attackers to hide or disguise file extensions.

Here are examples of such filename strings:

an .EXE called: EvilFile‮.EXE

an .scr called: yo.na‮.scr
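
To make the byte sequence concrete, here is a minimal sketch that builds such a filename and checks it for the override character. The helper name `has_rlo` is hypothetical and not part of the original post. It assumes PHP 7 or later, for the `\u{...}` string escape.

```php
<?php
// Hypothetical helper (not from the original post): report whether a
// UTF-8 string contains U+202E RIGHT-TO-LEFT OVERRIDE.
function has_rlo(string $name): bool
{
    // A byte-wise search is safe here because UTF-8 is self-synchronizing.
    return strpos($name, "\u{202E}") !== false;
}

$rlo = "\u{202E}";
echo strtoupper(bin2hex($rlo)), "\n";            // E280AE

$name = "EvilFile" . $rlo . ".EXE";
var_dump(has_rlo($name));                        // bool(true)
var_dump(has_rlo("EvilFile.EXE"));               // bool(false)

// The real extension is unchanged; only the on-screen rendering differs.
echo pathinfo($name, PATHINFO_EXTENSION), "\n";  // EXE
```

As the last line shows, the stored extension is still `EXE`, which is why extension validation is unaffected and only the display is misleading.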

Validating the filename extension is not the problem, provided it is done. The problem is displaying such a string: htmlentities() causes the string to become: EvilFile�.EXE

So, what would be the best way to fix the filename back to EvilFile.EXE?

Tests I've done with iconv produce the same kind of encoding problems on output.

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8"> 
    <title></title>
</head>

<body>
<?php
$evilString = "EvilFile‮.EXE";
$ret = ''; // accumulates the HTML output

$ret .= '<h1>htmlentities/ENT_QUOTES | ENT_IGNORE</h1>';
$ret .= htmlentities($evilString, ENT_QUOTES | ENT_IGNORE, "UTF-8").'<br>';

//enc options
$enc = array(
    "UTF-8", 
    "ASCII", 
    "Windows-1252", 
    "ISO-8859-15", 
    "ISO-8859-1", 
    "ISO-8859-6", 
    "CP1256",
    "US-ASCII//TRANSLIT", 
    "UTF-8//IGNORE",
    "UTF-8//TRANSLIT"
 );

//iconv
foreach ($enc as $i) {
    $ret .= '<h1>iconv/'.$i.'</h1>';
    foreach ($enc as $j) {
        $ret .= " $i - $j: ".@iconv($i, $j, $evilString).'<br>';
    }
}

//mb_convert_encoding
$ret .= '<h1>mb_convert_encoding</h1>';
foreach (mb_list_encodings() as $chr) {
    $ret .= $chr.' - '.mb_convert_encoding($evilString, 'UTF-8', $chr)."<br>";   
} 

echo $ret;
?> 
</body>
</html>

Results

iconv/US-ASCII//TRANSLIT
------------------------
US-ASCII//TRANSLIT - UTF-8: EvilFile
US-ASCII//TRANSLIT - ASCII: EvilFile
US-ASCII//TRANSLIT - Windows-1252: EvilFile
US-ASCII//TRANSLIT - ISO-8859-15: EvilFile
US-ASCII//TRANSLIT - ISO-8859-1: EvilFile
US-ASCII//TRANSLIT - ISO-8859-6: EvilFile
US-ASCII//TRANSLIT - CP1256: EvilFile
US-ASCII//TRANSLIT - US-ASCII//TRANSLIT: EvilFile
US-ASCII//TRANSLIT - UTF-8//IGNORE: EvilFile.EXE <<< - See answer below
US-ASCII//TRANSLIT - UTF-8//TRANSLIT: EvilFile

iconv/UTF-8//IGNORE
-------------------
UTF-8//IGNORE - UTF-8: EvilFile‮.EXE
UTF-8//IGNORE - ASCII: EvilFile
UTF-8//IGNORE - Windows-1252: EvilFile
UTF-8//IGNORE - ISO-8859-15: EvilFile
UTF-8//IGNORE - ISO-8859-1: EvilFile
UTF-8//IGNORE - ISO-8859-6: EvilFile
UTF-8//IGNORE - CP1256: EvilFile
UTF-8//IGNORE - US-ASCII//TRANSLIT: EvilFile
UTF-8//IGNORE - UTF-8//IGNORE: EvilFile‮.EXE
UTF-8//IGNORE - UTF-8//TRANSLIT: EvilFile‮.EXE

iconv/UTF-8//TRANSLIT
---------------------
UTF-8//TRANSLIT - UTF-8: EvilFile‮.EXE
UTF-8//TRANSLIT - ASCII: EvilFile
UTF-8//TRANSLIT - Windows-1252: EvilFile
UTF-8//TRANSLIT - ISO-8859-15: EvilFile
UTF-8//TRANSLIT - ISO-8859-1: EvilFile
UTF-8//TRANSLIT - ISO-8859-6: EvilFile
UTF-8//TRANSLIT - CP1256: EvilFile
UTF-8//TRANSLIT - US-ASCII//TRANSLIT: EvilFile
UTF-8//TRANSLIT - UTF-8//IGNORE: EvilFile‮.EXE
UTF-8//TRANSLIT - UTF-8//TRANSLIT: EvilFile‮.EXE

mb_convert_encoding
-------------------
pass - EvilFileâ®.EXE
auto - EvilFile‮.EXE
wchar - EvilFileâ®.EXE
byte2be - 䕶楬䙩汥긮䕘
byte2le - 癅汩楆敬胢⺮塅
byte4be - ������������?
byte4le - ������������������
BASE64 - ��)^q
UUENCODE -
HTML-ENTITIES - EvilFileâ®.EXE
Quoted-Printable - EvilFile‮.EXE
7bit - EvilFileâ®.EXE
8bit - EvilFileâ®.EXE
UCS-4 - ������������?
UCS-4BE - ������������?
UCS-4LE - ������������������
UCS-2 - 䕶楬䙩汥긮䕘
UCS-2BE - 䕶楬䙩汥긮䕘
UCS-2LE - 癅汩楆敬胢⺮塅
UTF-32 - ?
UTF-32BE - ?
UTF-32LE -
UTF-16 - 䕶楬䙩汥긮䕘
UTF-16BE - 䕶楬䙩汥긮䕘
UTF-16LE - 癅汩楆敬胢⺮塅
UTF-8 - EvilFile‮.EXE
UTF-7 - EvilFile???.EXE
UTF7-IMAP - EvilFile???.EXE
ASCII - EvilFileâ®.EXE
EUC-JP - EvilFile??EXE
SJIS - EvilFile窶ョ.EXE
eucJP-win - EvilFile??EXE
SJIS-win - EvilFile窶ョ.EXE
CP932 - EvilFile窶ョ.EXE
CP51932 - EvilFile??EXE
JIS - EvilFile??ョ.EXE
ISO-2022-JP - EvilFile??ョ.EXE
ISO-2022-JP-MS - EvilFile??ョ.EXE
Windows-1252 - EvilFile‮.EXE
Windows-1254 - EvilFile‮.EXE
ISO-8859-1 - EvilFileâ®.EXE
ISO-8859-2 - EvilFileâŽ.EXE
ISO-8859-3 - EvilFileâ?.EXE
ISO-8859-4 - EvilFileâŽ.EXE
ISO-8859-5 - EvilFileтЎ.EXE
ISO-8859-6 - EvilFileق?.EXE
ISO-8859-7 - EvilFileβ?.EXE
ISO-8859-8 - EvilFileג®.EXE
ISO-8859-9 - EvilFileâ®.EXE
ISO-8859-10 - EvilFileâŪ.EXE
ISO-8859-13 - EvilFileā®.EXE
ISO-8859-14 - EvilFileâ®.EXE
ISO-8859-15 - EvilFileâ®.EXE
ISO-8859-16 - EvilFileâ®.EXE
EUC-CN - EvilFile??EXE
CP936 - EvilFile鈥?EXE
HZ - EvilFile???.EXE
EUC-TW - EvilFile??EXE
BIG-5 - EvilFile??EXE
EUC-KR - EvilFile??EXE
UHC - EvilFile巽?EXE
ISO-2022-KR - EvilFile???.EXE
Windows-1251 - EvilFile‮.EXE
CP866 - EvilFileтАо.EXE
KOI8-R - EvilFileБ─╝.EXE
KOI8-U - EvilFileБ─╝.EXE
ArmSCII-8 - EvilFileՉ….EXE
CP850 - EvilFileÔÇ«.EXE
JIS-ms - EvilFile??ョ.EXE
CP50220 - EvilFile??ョ.EXE
CP50220raw - EvilFile??ョ.EXE
CP50221 - EvilFile??ョ.EXE
CP50222 - EvilFile??ョ.EXE

I suppose there is one way (which I'm not keen on): pass the string through utf8_encode() and then through preg_replace() to remove the unwanted characters. But there must be a better/cleaner way.

echo preg_replace('/[^a-z0-9_ \[\]\.\(\)#%&-]/si', '', utf8_encode($evilString)).'<br>';

Recommended answer

After some further tests I added US-ASCII//TRANSLIT as the source encoding and UTF-8//IGNORE as the target encoding to the list. So, to fix these types of strings without using a regex, you can use:

echo iconv('US-ASCII//TRANSLIT', 'UTF-8//IGNORE', $evilString); //EvilFile.EXE
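
The iconv call above appears to work because the input is declared as US-ASCII, so every byte above 0x7F is invalid and gets dropped by //IGNORE. That would likely remove any legitimate non-ASCII characters as well, and the behaviour of //IGNORE can vary between iconv implementations. As an alternative, the following sketch uses a regex after all and removes only the Unicode bidirectional control characters. The function name `strip_bidi_controls` and the exact set of code points are assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): remove Unicode
// bidirectional control characters from a UTF-8 string and leave all
// other characters, including non-ASCII ones, untouched.
function strip_bidi_controls(string $s): string
{
    // U+061C Arabic letter mark, U+200E/U+200F directional marks,
    // U+202A-U+202E embeddings and overrides, U+2066-U+2069 isolates.
    $pattern = '/[\x{061C}\x{200E}\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}]/u';
    $clean = preg_replace($pattern, '', $s);

    // With the /u modifier, preg_replace() returns null for invalid UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

$evilString = "EvilFile\u{202E}.EXE";
echo strip_bidi_controls($evilString), "\n"; // EvilFile.EXE
echo strip_bidi_controls("Café\u{202E}.EXE"), "\n"; // Café.EXE
```

The second call shows the design choice: accented and other non-ASCII characters survive, and only the direction controls are removed.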

Hope this helps anyone in the future with this unique problem.
