How to remove non-printable/invisible characters in Ruby?


Question

Sometimes I have evil non-printable characters in the middle of a string. These strings are user input, so my program has to handle them gracefully; I cannot fix the problem at its source.

For example, a string can contain a zero width no-break space (U+FEFF) in the middle. While parsing a .po file, one problematic part was the string "he is a man of god" in the middle of the file. Everything looks correct on screen, but inspecting it with irb shows:

 "he is a man of god".codepoints
 => [104, 101, 32, 105, 115, 32, 97, 32, 65279, 109, 97, 110, 32, 111, 102, 32, 103, 111, 100] 

I believe I know what a BOM is, and I even handle it nicely. However, sometimes I have such characters in the middle of the file, so they are not a BOM.

My current approach is to remove every character I consider evil, in a really smelly fashion:

text = (text.codepoints - CODEPOINTS_BlACKLIST).pack("U*")

The closest I got was following this post, which led me to the [[:print:]] character class in regexps. However, it did not help, because U+FEFF still matches:

"\uFEFFm".scan(/[[:print:]]/).join.codepoints
 => [65279, 109] 

So the question is: how do I remove all non-printable characters from a Ruby string?

Recommended answer

Ruby can help you convert from one multi-byte character set to another. Check these search results, and read up on Ruby String's encode method.
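As a minimal sketch of what encode can do here: the snippet below assumes the input is valid UTF-8 and that dropping every non-ASCII character is acceptable, which is a much blunter rule than the question asks for. The sample string is the one from the question, with the hidden U+FEFF written as an escape.

```ruby
# Assumption: losing ALL non-ASCII characters is acceptable for this text.
text = "he is a \uFEFFman of god"

# Transcode to US-ASCII. Characters with no ASCII equivalent (undef) and
# invalid byte sequences are replaced with the empty string, i.e. dropped.
ascii = text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")

p ascii                              # => "he is a man of god"
p ascii.codepoints.include?(0xFEFF)  # => false
```

Note that transcoding UTF-8 to UTF-8 would leave U+FEFF untouched, since it is a valid character; it only disappears here because the target encoding cannot represent it.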

Also, Ruby's Iconv is your friend on older Rubies; it was removed from the standard library in Ruby 2.0 and is now available as a separate gem.

Finally, James Gray wrote a series of articles that cover this in good detail.

One of the things you can do with those tools is tell them to transcode problem characters to a visually similar character, or to ignore them completely.
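The two options can be combined. The sketch below first maps a few characters to ASCII lookalikes with gsub, then lets encode drop whatever is left. The SIMILAR table and the to_plain_ascii helper are hypothetical names made up for this illustration, and the table's contents are only examples.

```ruby
# Hypothetical lookalike table; extend it for the characters you actually see.
SIMILAR = {
  "\u00A0" => " ",   # no-break space        -> plain space
  "\u201C" => '"',   # left double quote     -> straight quote
  "\u201D" => '"'    # right double quote    -> straight quote
}.freeze

# Hypothetical helper: substitute lookalikes, then drop anything still
# outside ASCII (including invisible characters such as U+FEFF).
def to_plain_ascii(text)
  text
    .gsub(Regexp.union(SIMILAR.keys), SIMILAR)
    .encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

p to_plain_ascii("\u201Che is a\u00A0\uFEFFman\u201D")  # => "\"he is a man\""
```

Doing the substitution before the transcoding step matters: once encode has run, the characters that had a sensible lookalike are already gone.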

Dealing with alternate character sets is one of the most... irritating things I've ever had to do, because files can contain anything and still be marked as text. You might not expect it, and then your code dies or starts throwing errors, because people are so ingenious at coming up with ways to insert alternate characters into content.
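When the text has to stay in UTF-8, transcoding to ASCII is too lossy. One alternative, which is not among the tools named above, is to match Unicode's "format" general category (Cf) with a regexp property class. This is only a sketch: it assumes "invisible" means category Cf, and strip_invisible and INVISIBLE are hypothetical names.

```ruby
# Assumption: "invisible" means Unicode general category Cf (format
# characters such as U+FEFF, U+200B and U+00AD). Adjust the class as needed.
INVISIBLE = /\p{Cf}/

# Hypothetical helper: remove every format character, keep everything else.
def strip_invisible(text)
  text.gsub(INVISIBLE, "")
end

p strip_invisible("he is a \uFEFFman of god")  # => "he is a man of god"
p strip_invisible("caf\u00E9 \u200Bau lait").codepoints.include?(0x200B)  # => false
```

Category Cf also contains U+200D (zero width joiner), which some scripts and emoji sequences rely on, so a narrower character class may be safer for multilingual input.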
