Is there any way to clean a file of "invalid byte sequence in UTF-8" errors in Ruby?


Problem description

I have tried everything before posting to StackOverflow. I really hope someone can help, because I'm pretty desperate.

So, I have a service that uploads data to our database via XML feeds provided by customers. Often these XML files claim to be UTF-8 encoded but clearly have quite a few invalid byte sequences. I can clean up these files and import them perfectly into our database by simply running the following Linux command before importing:

tr -cd '^[:print:]' < original.xml > clean.xml

Simply running this one Linux command allows me to import all of the data into my database using Nokogiri in Rails.

The problem is that we're deploying on Heroku and I can't preprocess the file with a Linux command. I've spent the last week searching the Internet for native Rails based solutions to this problem, but none of them work. Before I run through all the suggestions I've tried, here is my original code:

require 'nokogiri'

data = []                      # collected rows, one hash per <job> element
data_source = ARGV[0]
data_file = open data_source
data_string = data_file.read
doc = Nokogiri::XML.parse(data_string)
doc.xpath(".//job").each do |node|
  # Build a hash keyed by each child element's normalized name
  newrow = node.element_children.each_with_object(Hash.new) do |e, h|
    h[e.name.gsub(/ /,"_").strip.downcase.to_sym] = e.content
  end
  data.push(newrow)
end

Running this on the raw file produces an error: "Invalid byte sequence in UTF-8"

Here are all the helpful suggestions I've tried but all have failed.

  1. Use Coder

    Coder.clean!(data_string, "UTF-8")

  2. Force Encoding

    data_string.force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')

  3. Convert to UTF-16 and back to UTF-8

    data_string.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
    data_string.encode!('UTF-8', 'UTF-16')

  4. Use valid_encoding?

    data_string.chars.select{|i| i.valid_encoding?}.join

    No characters are removed; generates "invalid byte sequence" errors.

  5. Specify encoding on opening the file

I actually wrote a function that tries every encoding possible until it can open the file without errors and convert to UTF-8 (@file_encodings is an array of every possible file encoding):

@file_encodings.each do |enc|
  print "#{enc}..."
  conv_str = "r:#{enc}:utf-8"
  begin
    data_file = File.open(fname, conv_str)
    data_string = data_file.read 
  rescue
    data_file = nil
    data_string = ""
  end
  data_string = data_string.encode(enc, :invalid => :replace, :undef => :replace, :replace => "")

  unless data_string.blank?
    print "\n#{enc} detected!\n"
    return data_string
  end
end

  6. Use Regexp to remove non-printables:

    data_string.gsub!(/[^[:print:]]/,"")
    data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")

(I also tried variants including /[^a-zA-Z0-9~`!@#$%^&*()-_=+[{]}\|;:'",<.>/\?]/)

For ALL of the above, the results are the same... either "invalid byte sequence" errors occur or the file is cut off halfway through after reading only 4400 rows.

So, why does the Linux "tr" command work perfectly and yet NONE of these suggestions can do the job in Rails?

What I ended up doing is extremely inelegant, but gets the job done. I inspected each row that stopped Nokogiri (row.last) and looked for strange characters. Each one I found I added to a character class and then gsub!ed it out, like this (the control characters won't print here, but you get the idea):

data_string.gsub!(/[Crazy Control Characters]/,"")

But the purist in me insists there should be a more elegant, general solution.

(I've indented all my code four spaces, but the editor doesn't seem to be picking that up.)

Solution

Ruby 2.1 introduced a method called String#scrub, which is exactly what you need.

If the string is invalid byte sequence then replace invalid bytes with given replacement character, else returns self. If block is given, replace invalid bytes with returned value of the block.
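As a rough illustration of the behavior described above, here is a minimal sketch of the three forms of `scrub` on Ruby 2.1 or later. The sample string and the variable names are made up for the example; a real feed would be read from a file instead.

```ruby
# A made-up sample: tagged as UTF-8 but containing one invalid byte (\xFF).
raw = "<job>caf\xFF</job>".dup.force_encoding("UTF-8")

raw.valid_encoding?   # false, so parsing or regexp matching may raise

# No argument: the invalid byte is replaced with U+FFFD.
replaced = raw.scrub

# String argument: the invalid byte is replaced with that string
# (an empty string simply drops it).
cleaned = raw.scrub("")

# Block form: the block receives the invalid bytes and returns the replacement.
hexed = raw.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

puts cleaned   # <job>caf</job>
puts hexed     # <job>caf<ff></job>
```

The scrubbed string could then be handed to `Nokogiri::XML.parse` as in the question. Whether dropping or replacing the bad bytes is appropriate depends on the feed.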

Check the docs for more info.

http://ruby-doc.org/core-2.1.0/String.html#method-i-scrub
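For comparison, the effect of the `tr` command from the question can be approximated by working on raw bytes rather than characters. This is only a sketch: it assumes `tr` treats its input as single bytes and keeps only printable ASCII (0x20 to 0x7E), and `keep_printable_ascii` is a hypothetical helper name, not part of any library.

```ruby
# Hypothetical helper: keep only printable ASCII bytes (0x20..0x7E),
# roughly what a byte-oriented `tr -cd '[:print:]'` leaves behind.
def keep_printable_ascii(str)
  # String#b returns a binary (ASCII-8BIT) copy, so the set below is
  # applied byte by byte and invalid UTF-8 cannot raise an error.
  # "^ -~" means "everything except the range from space to tilde".
  str.b.delete("^ -~").force_encoding("UTF-8")
end

# Made-up sample: a valid two-byte character, an invalid byte, and a newline.
sample = "<job>caf\xC3\xA9 \xFFok\n</job>"
clean = keep_printable_ascii(sample)

puts clean   # <job>caf ok</job>
```

Note that this is much more aggressive than `scrub`: it also removes newlines and every valid non-ASCII character, so it is only suitable when losing that data is acceptable.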
