Sanitizing strings to make them URL and filename safe?

Question

I am trying to come up with a function that does a good job of sanitizing certain strings so that they are safe to use in the URL (like a post slug) and also safe to use as file names. For example, when someone uploads a file I want to make sure that I remove all dangerous characters from the name.

So far I have come up with the following function which I hope solves this problem and allows foreign UTF-8 data also.

/**
 * Convert a string to the file/URL safe "slug" form
 *
 * @param string $string the string to clean
 * @param bool $is_filename TRUE will allow additional filename characters
 * @return string
 */
function sanitize($string = '', $is_filename = FALSE)
{
 // Replace all weird characters with dashes
 $string = preg_replace('/[^\w\-' . ($is_filename ? '~_.' : '') . ']+/u', '-', $string);

 // Only allow one dash separator at a time (and make string lowercase)
 return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
}

Does anyone have any tricky sample data I can run against this - or know of a better way to safeguard our apps from bad names?

$is_filename allows some additional characters, such as those found in temporary vim file names.

Update: removed the star character since I could not think of a valid use.

Answer

Some observations on your solution:

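  1. The 'u' modifier at the end of your pattern makes PCRE treat both the pattern and the subject string as UTF-8. It does not repair the input, though: if the subject is not valid UTF-8, preg_replace() fails and returns NULL, so that case needs its own check.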
  2. \w matches the underscore character. You explicitly add the underscore for filenames, which suggests you don't want it in URLs, but as the code stands URLs will be permitted to include an underscore too.
  3. The inclusion of "foreign UTF-8" seems to be locale-dependent. It's not clear whether this is the locale of the server or client. From the PHP docs:

A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
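A few quick checks can make these points concrete. This is only a sketch: it repeats the question's function with the \w escape restored, and the sample inputs are made up for illustration. It assumes the mbstring extension is loaded.

```php
<?php
// The question's function with the "\w" escape restored; behavior otherwise unchanged.
function sanitize($string = '', $is_filename = FALSE)
{
    $string = preg_replace('/[^\w\-' . ($is_filename ? '~_.' : '') . ']+/u', '-', $string);
    return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
}

// The underscore survives even in URL mode, because \w includes it.
var_dump(sanitize('my_post title'));       // string(13) "my_post-title"

// Filename mode keeps dots, so dot-files and ".." sequences pass through.
var_dump(sanitize('../.htaccess', TRUE));  // string(12) "..-.htaccess"

// A subject that is not valid UTF-8 makes a /u pattern fail: the result is NULL.
var_dump(preg_replace('/[^\w\-]+/u', '-', "bad\xFFname"));  // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);        // bool(true)
```

The last two lines suggest checking the return value for NULL before using it, since the function as written would pass that NULL straight on to the next call.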

Creating the slug

You probably shouldn't include accented characters and the like in your post slug: technically they have to be percent-encoded (per the URL encoding rules), so you'd end up with ugly-looking URLs.
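To see what percent-encoding does to an accented slug, here is a small illustration using PHP's built-in rawurlencode(). The sample slug is made up, and the source file is assumed to be saved as UTF-8.

```php
<?php
// rawurlencode() percent-encodes every byte outside the unreserved set of RFC 3986,
// so each accented letter (two bytes in UTF-8) becomes two %XX escapes.
$slug = 'crème-brûlée';
echo rawurlencode($slug), "\n";  // cr%C3%A8me-br%C3%BBl%C3%A9e
```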

So, if I were you, after lowercasing I'd convert any 'special' characters to their plain equivalents (e.g. é -> e), replace every non-[a-z] character with '-', and collapse runs of '-' into a single one, as you've already done. There's an implementation of the special-character conversion here: https://web.archive.org/web/20130208144021/http://neo22s.com/slug
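The steps above could be sketched roughly as follows. This is not the linked implementation: make_slug() is a hypothetical helper, its character map is deliberately tiny, and it makes two assumptions beyond the text, namely that digits are kept and that leading and trailing dashes are trimmed. It also assumes a UTF-8 source file and the mbstring extension.

```php
<?php
// Hypothetical helper illustrating the lowercase / transliterate / dash steps.
function make_slug(string $text): string
{
    // Illustrative map only; a real one would cover far more characters.
    $map = [
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
        'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'ö' => 'o',
        'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
        'ç' => 'c', 'ñ' => 'n', 'ß' => 'ss',
    ];

    // 1. Lowercase first, so the map only needs lowercase keys.
    $text = mb_strtolower($text, 'UTF-8');

    // 2. Swap accented letters for their plain equivalents.
    $text = strtr($text, $map);

    // 3. Replace each run of anything outside a-z and 0-9 with a single dash.
    $text = preg_replace('/[^a-z0-9]+/', '-', $text);

    // 4. Drop dashes left at either end (an assumption, not in the original text).
    return trim($text, '-');
}

echo make_slug('Crème Brûlée & Café!'), "\n";  // creme-brulee-cafe
```

Because step 3 replaces whole runs at once, no separate pass is needed to collapse repeated dashes. Any character missing from the map simply turns into a dash.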

Sanitization in general

OWASP have a PHP implementation of their Enterprise Security API which, among other things, includes methods for safely encoding and decoding input and output in your application.

The Encoder interface provides:

canonicalize (string $input, [bool $strict = true])
decodeFromBase64 (string $input)
decodeFromURL (string $input)
encodeForBase64 (string $input, [bool $wrap = false])
encodeForCSS (string $input)
encodeForHTML (string $input)
encodeForHTMLAttribute (string $input)
encodeForJavaScript (string $input)
encodeForOS (Codec $codec, string $input)
encodeForSQL (Codec $codec, string $input)
encodeForURL (string $input)
encodeForVBScript (string $input)
encodeForXML (string $input)
encodeForXMLAttribute (string $input)
encodeForXPath (string $input)

https://github.com/OWASP/PHP-ESAPI
https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API
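The idea behind an interface like this is that each output context gets its own encoder. The sketch below does not use ESAPI at all; it only illustrates the same principle with two PHP built-ins, htmlspecialchars() and rawurlencode(), on made-up sample values and a placeholder example.com URL.

```php
<?php
// Made-up sample values for illustration.
$comment = '<b>Tom & "Jerry"</b>';
$query   = 'a b&c=d/e';

// HTML context: escape markup characters and quotes.
echo htmlspecialchars($comment, ENT_QUOTES, 'UTF-8'), "\n";
// &lt;b&gt;Tom &amp; &quot;Jerry&quot;&lt;/b&gt;

// URL context: percent-encode the value before placing it in a query string.
echo 'https://example.com/search?q=' . rawurlencode($query), "\n";
// https://example.com/search?q=a%20b%26c%3Dd%2Fe
```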
