Sanitising strings in R

Question

This is related to a previous question, here: Converting a \u escaped Unicode string to ASCII

I proposed a solution involving eval(parse(text=x)), which for non-R users, means what it says: parsing the text string, then evaluating it. The aim was not to allow arbitrary code to be executed, but only to un-escape escaped Unicode text. Hence the solution:

eval(parse(text=paste0("'", x, "'")))

While this should be fairly safe given the restricted objective, I'd be interested to know: how much sanitisation is required to keep things safe?

At a minimum, I guess any embedded single and double quotes have to be escaped. For example, suppose we have

x <- "this is a '; print(dir()); 'string"

Then eval'ing this per the snippet above would execute the code in the middle. So we have to escape the quotes:

eval(parse(text=paste0("'",
                       gsub("'", "\\\\'", x),
                       "'")))

And similarly for double quotes. I don't think the unescaped Unicode equivalents \u0022 and \u0027 are a problem, since to the parser they'll be identical to plain " and '.
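
That claim about `\u0027` can be checked without running anything. The following is a minimal sketch that only parses the wrapped text and never calls eval; `injected_call` is a made-up name used purely for illustration.

```r
# Wrap a string containing escaped quotes (\u0027) the same way as the snippet above.
# "\\u0027" in this R source is the six characters \u0027 in the actual input.
code <- paste0("'", "a \\u0027; injected_call(); \\u0027 b", "'")

# Parse only: nothing is evaluated here.
exprs <- parse(text = code, keep.source = FALSE)

length(exprs)   # number of top-level expressions the parser found
exprs[[1]]      # the single string constant, with the quotes inside it
```

If the parser returns one expression and it is a character constant, the escaped quotes did not terminate the string literal.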

Are there any holes in this approach that I've missed?

Recommended answer

Backslashes are a hole. An input in which the quote is already preceded by a backslash:

this is a \'; print(dir()); 'string

is escaped to:

'this is a \\'; print(dir()); 'string'

The double backslash is evaluated as a literal backslash, so the quote that follows it is active again and the code is executed.
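
Here is a minimal sketch of that hole which only parses the result and never evaluates it. `wrap_naive` and `injected_call` are hypothetical names introduced for illustration. One assumption to flag: because the `gsub` call from the question escapes every quote, including a trailing one, the payload below ends with `#` so that the closing quote lands in a comment.

```r
# Hypothetical helper: the quote-only escaping from the question.
wrap_naive <- function(x) paste0("'", gsub("'", "\\\\'", x), "'")

# Payload: a backslash before the quote, then code, then '#' to comment out the rest.
x <- "a \\'; injected_call(); #"

code <- wrap_naive(x)
cat(code, "\n")   # shows the text handed to the parser

# Parse only: count the top-level expressions instead of running them.
exprs <- parse(text = code, keep.source = FALSE)
length(exprs)     # 2: a string constant followed by a call
exprs[[2]]        # the injected call, which eval() would have executed
```

Escaping backslashes before escaping quotes would close this particular hole, but it does nothing about the other problems with eval.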

Also, I don't know R, but you could probably at minimum cause a crash using raw control characters like newline, or invalid escapes.
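
For what it's worth, here is a rough sketch of how base R reacts to such inputs, again parsing only. `wrap_naive` and `parses_ok` are hypothetical helpers. The "crash" shows up as a parse error, which aborts the call unless it is caught.

```r
# Hypothetical helper: the quote-only escaping from the question.
wrap_naive <- function(x) paste0("'", gsub("'", "\\\\'", x), "'")

# TRUE if the wrapped text parses, FALSE if the parser raises an error.
parses_ok <- function(x) {
  tryCatch({
    parse(text = wrap_naive(x), keep.source = FALSE)
    TRUE
  }, error = function(e) FALSE)
}

parses_ok("plain text")               # TRUE
parses_ok("bad \\q escape")           # FALSE: \q is not a recognised escape
parses_ok("ends with a backslash \\") # FALSE: the closing quote gets escaped
parses_ok("two\nlines")               # TRUE: R string literals may span lines
```

So in R the raw newline is harmless, while an unknown escape or a trailing backslash is enough to make the parse fail.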

eval is a mug's game in general. Normal string handling (search string for the sequence you want, replacing it) is the better approach, and using an existing library for a particular properly-specified format is best of all. For example if you have JSON, use a JSON parser. There are many possible string literal formats that use \u escapes, all with slightly different rules, so you will want to choose the exact format correctly.
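
As a minimal sketch of the plain string-handling route in base R, the function below finds `\uXXXX` sequences with a regular expression and substitutes the corresponding character, with no parsing or evaluation involved. `unescape_unicode` is a hypothetical name, and the accepted format is an assumption: exactly four hex digits after `\u`, nothing else unescaped, and no handling of surrogate pairs or `\u0000`.

```r
# Hypothetical helper: replace \uXXXX (four hex digits) with the character it names.
unescape_unicode <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", x)   # regex: literal backslash, 'u', 4 hex digits
  hits <- regmatches(x, m)[[1]]
  if (length(hits) == 0L) return(x)
  # Drop the leading "\u" and read the rest as a base-16 code point.
  codes <- strtoi(substring(hits, 3L), base = 16L)
  regmatches(x, m) <- list(vapply(codes, intToUtf8, character(1)))
  x
}

unescape_unicode("\\u0041\\u0042c")                        # "ABc"
unescape_unicode("\\u0027; injected_call(); \\u0027")      # quotes appear, nothing runs
```

Because the input is only ever treated as data, quotes, backslashes and semicolons in it have no special meaning. If the source format is actually JSON or something similar, a parser for that format is still the better choice.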
