Remove weird invalid character in Ruby


Question

I have some XML content (UTF-8) that contains invalid characters. When I try to parse it with Nokogiri::XML(content), Nokogiri reports: Line 2190, SyntaxError: PCDATA invalid Char value 15.

The character is displayed in the Sublime Text editor as "SI".

When I try to copy the character, nothing gets copied, so I can't even look it up. When I open the file in another editor, for example Atom, the "SI" is not displayed. However, when I step through the characters with the right arrow key, I have to press it twice to get past the place where the "SI" character sits.

First, what kind of character is this? Second, is there a way in Ruby to remove such characters? I tried content.chars.select{|i| i.valid_encoding?}.join, but it doesn't remove the character.

Update

I found the character by reading the original file with Ruby. The character is \u000F, and "\u000F".ord returns the character code 15. According to http://www.fileformat.info/info/unicode/char/000f/index.htm, this is the SHIFT IN character. Are there other characters like that? I could remove them with str.split("\u000F").join, but if there are other characters like this, that doesn't seem like a good approach. Any ideas?

Recommended answer

If the problem were byte sequences that are actually invalid for the encoding (UTF-8), then in Ruby 2.1+ you could use the String#scrub method. By default it replaces invalid bytes with the Unicode replacement character (usually rendered as a question mark in a box), but you can also use it to remove them entirely.
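To make the String#scrub behaviour concrete, here is a minimal sketch. The sample string and variable names are made up for illustration; the byte 0xFF is used because it can never occur in valid UTF-8.

```ruby
# A string containing a byte (0xFF) that is never valid in UTF-8.
# The sample data and variable names are illustrative only.
broken = "abc\xFFdef".dup.force_encoding(Encoding::UTF_8)

is_valid = broken.valid_encoding?  # false, because of the stray byte

# With no argument, scrub substitutes U+FFFD for each invalid sequence.
replaced = broken.scrub

# With an empty replacement string, the invalid bytes are simply dropped.
removed = broken.scrub("")

puts is_valid
puts replaced
puts removed
```

Note that scrub only acts on bytes that are invalid for the string's encoding, so it helps with corrupted input but not with characters that are valid UTF-8 yet unwanted.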

However, as you note, your "weird byte" is actually valid UTF-8 representing the Unicode code point "\u000F", the SHIFT IN control character. (Good job figuring out the actual bytes/character involved; that's the hard part!)
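The following sketch shows why the valid_encoding? filter and scrub both leave the character in place, and one way to list control characters together with their positions. The sample string and variable names are assumptions for illustration.

```ruby
# Sample input containing U+000F (SHIFT IN); illustrative only.
content = "Some string \u000F more"

# U+000F is a legal code point, so the string is valid UTF-8 ...
valid = content.valid_encoding?

# ... and scrub finds nothing to remove.
unchanged = content.scrub("") == content

# Collect [index, code point] pairs for every control character.
suspects = content.each_char.with_index
                  .select { |ch, _index| ch.match?(/[[:cntrl:]]/) }
                  .map { |ch, index| [index, format("U+%04X", ch.ord)] }

p valid
p unchanged
p suspects
```

Printing the code points this way sidesteps the copy-and-paste problem, since editors often refuse to display or copy control characters.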

So if we want to remove "characters like that", we have to be clear about what we mean. Characters like what?

Nokogiri is complaining that the character is invalid in an XML "PCDATA" (Parsed Character Data) area. Why would it be legal Unicode/UTF-8 but invalid in XML PCDATA? What is legal in XML character data? I tried to figure it out, but it gets confusing: the spec apparently says that some characters are "discouraged" (what?), and makes what are to my eyes contradictory statements about other things.

I'm not sure exactly which characters Nokogiri disallows in PCDATA. We'd have to look at the Nokogiri source (or more likely the libxml source), or ask someone who knows more about the nokogiri/libxml source.
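One candidate answer comes from the XML 1.0 specification itself, whose Char production lists the code points a document may contain. The sketch below removes everything outside that set. It assumes libxml2 rejects exactly the characters outside this production, which the "invalid Char value 15" message suggests but which has not been checked against the libxml2 source. The constant and method names are hypothetical and not part of Nokogiri.

```ruby
# XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# This class matches any character that is NOT in that set.
INVALID_XML_CHARS =
  /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/

# Hypothetical helper (not a Nokogiri API): drop characters XML 1.0 forbids.
def strip_invalid_xml_chars(text)
  text.gsub(INVALID_XML_CHARS, "")
end

sample = "a\u000Fb\tc\n\u{1F600}"
cleaned = strip_invalid_xml_chars(sample)
p cleaned
```

Under this definition tab, line feed and carriage return survive, as do characters outside the Basic Multilingual Plane, while the remaining C0 control characters such as U+000F are removed.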

However, "\u000F" is a control character. It's unlikely you want control characters in your XML character data (unless you know you do), and the XML spec seems to discourage control characters (and apparently Nokogiri/libxml actually disallows them?). So one way to interpret "characters like this" is "control characters".

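You can remove control characters from a string with a regex. This one covers the whole range U+0000 to U+001F but leaves tab, line feed and carriage return alone, since those are ordinary whitespace in XML:

# remove control chars (U+0000 to U+001F), keeping tab (0009), LF (000A) and CR (000D)
"Some string \u000F more".gsub(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/, '')
   # => "Some string  more"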

We could also interpret "characters like this" as any character that doesn't print. That is a wider category than "control characters", and it will include some characters that Nokogiri has no problem with at all. We can remove a bit more than just control characters by using Ruby's support for Unicode character classes in regexes:

some_string.gsub(/[^[:print:]]/ , '')

[[:print:]] is documented rather vaguely as "excludes control characters, and similar", so that's kind of a match for our vague spec of what we want to do. :)
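One consequence worth checking before using this on multi-line XML: tab and newline are control characters, so they do not match [[:print:]] and are removed along with everything else. A small sketch with made-up sample text shows the effect and a variant that keeps the usual whitespace.

```ruby
# Sample text with a newline, a tab and U+000F; illustrative only.
text = "line one\n\tline two \u000F end"

# Tab and newline do not match [[:print:]], so they are stripped as well.
all_stripped = text.gsub(/[^[:print:]]/, "")

# Adding \t, \n and \r to the negated class keeps ordinary whitespace.
whitespace_kept = text.gsub(/[^[:print:]\t\n\r]/, "")

p all_stripped
p whitespace_kept
```

The second form is usually the safer choice for XML, where line breaks and indentation are legal.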

So it really depends on what we mean by "characters like this". For your case it probably means "any character that Nokogiri/libxml will refuse to allow", and I'm afraid I haven't actually answered that question, because I'm not sure and was not able to figure it out easily. But in many cases removing control characters, or even better removing characters that don't match [[:print:]], will probably do just fine, unless you have a reason to keep control characters and similar (if you knew you needed them as record separators, for instance).

If instead of removing them you want to replace them with the Unicode replacement character, which is commonly used to stand in for "byte sequence we couldn't handle":

"Shift in: \u000F".gsub(/[^[:print:]]/, "\uFFFD")
   # => "Shift in: �"

If instead of removing them you want to escape them in some way so that they can be reconstructed after XML parsing... ask again with that and I'll figure it out, but I haven't yet. :)
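One possible way to do that, offered only as a sketch and not something the answer above worked out, is to map each forbidden control character onto its visible counterpart in the Unicode Control Pictures block (U+2400 to U+241F) before parsing and map it back afterwards. The constant and method names are hypothetical, and the round trip is only lossless if the original text contains no Control Pictures of its own.

```ruby
# C0 controls that XML 1.0 forbids (tab, LF and CR are left alone).
CONTROL_CHARS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F]/
# Their visible stand-ins in the Control Pictures block.
CONTROL_PICTURES = /[\u2400-\u2408\u240B\u240C\u240E-\u241F]/

# Hypothetical helper: U+000F becomes U+240F, and so on.
def escape_controls(text)
  text.gsub(CONTROL_CHARS) { |ch| (ch.ord + 0x2400).chr(Encoding::UTF_8) }
end

# Hypothetical helper: reverse the mapping after XML parsing.
def unescape_controls(text)
  text.gsub(CONTROL_PICTURES) { |ch| (ch.ord - 0x2400).chr(Encoding::UTF_8) }
end

original = "Shift in: \u000F\n"
escaped = escape_controls(original)
restored = unescape_controls(escaped)
p escaped
p restored == original
```

The escaped text contains only characters XML allows, so it can pass through the parser, and unescape_controls can then be applied to the extracted text nodes.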

Welcome to dealing with character encoding issues; it sure does get confusing sometimes.
