How can I identify different encodings without the use of a BOM?

Problem Description

I have a file watcher that is grabbing content from a growing file encoded with utf-16LE. The first bit of data written to it has the BOM available -- I was using this to identify the encoding against UTF-8 (which MOST of my files coming in are encoded in). I catch the BOM and re-encode to UTF-8 so my parser doesn't freak out. The problem is that since it's a growing file not every bit of data has the BOM in it.

Here's my question -- without prepending the BOM bytes to each set of data I have (because I don't have control over the source), can I just look for the null bytes (\000) that are inherent in UTF-16, and then use that as my identifier instead of the BOM? Will this cause me headaches down the road?
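
For illustration, here is a minimal sketch of that null-byte check (the method name, prefix length, and threshold are illustrative assumptions, not part of the original code): UTF-16LE text that is mostly ASCII has a 0x00 high byte in every other position, while UTF-8 text normally contains no null bytes at all.

  static boolean looksLikeUtf16(byte[] data) {
    int limit = Math.min(data.length, 32);   // sample only a short prefix
    int nulls = 0;
    for (int i = 0; i < limit; i++) {
      if (data[i] == 0x00) nulls++;
    }
    // guess UTF-16 when at least a third of the sampled bytes are null
    return limit >= 2 && nulls * 3 >= limit;
  }

As the answer below points out, this is only a heuristic, so it can misfire on data that happens to contain (or lack) null bytes for other reasons.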

My architecture involves a Ruby web application logging the received data to a temporary file, which my parser (written in Java) then picks up.

Right now my identification/re-encoding code looks like this:

  // guess encoding if utf-16 then
  // convert to UTF-8 first
  try {
    FileInputStream fis = new FileInputStream(args[args.length-1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);

    if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
      String asString = new String(contents, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF8");
      FileOutputStream fos = new FileOutputStream(args[args.length-1]);
      fos.write(newBytes);
      fos.close();
    }

    fis.close();
  } catch(Exception e) {
    e.printStackTrace();
  }

Update

I want to support characters like euro signs, em-dashes, and the like. I modified the above code to look like this, and it seems to pass all my tests for those characters:

  // guess encoding if utf-16 then
  // convert to UTF-8 first
  try {
    FileInputStream fis = new FileInputStream(args[args.length-1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);
    byte[] real = null;

    int found = 0;

    // if found a BOM then skip out of here... we just need to convert it
    if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
      found = 3;
      real = contents;

    // no BOM detected but still could be UTF-16
    } else {

      // count null bytes in the first few bytes of the data
      for(int cnt=0; cnt<10; cnt++) {
        if(contents[cnt] == (byte)0x00) { found++; }
      }

      // tack on a UTF-16LE BOM and copy the data into a new array
      real = new byte[contents.length+2];
      real[0] = (byte)0xFF;
      real[1] = (byte)0xFE;
      for(int ib=2; ib < real.length; ib++) {
        real[ib] = contents[ib-2];
      }

    }

    if(found >= 2) {
      String asString = new String(real, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF8");
      FileOutputStream fos = new FileOutputStream(args[args.length-1]);
      fos.write(newBytes);
      fos.close();
    }

    fis.close();
  } catch(Exception e) {
    e.printStackTrace();
  }

What do you think?

Solution

In general, you cannot identify the character encoding of a data stream with 100% accuracy. The best you can do is try to decode using a limited set of expected encodings, and then apply some heuristics to the decoded result to see if it "looks like" text in the expected language. (But any heuristic will give false positives and false negatives for certain data streams.) Alternatively, put a human in the loop to decide which decoding makes the most sense.
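
As an illustration of that trial-decoding approach, here is a minimal sketch using java.nio's CharsetDecoder with strict error reporting; the class name and the candidate charsets passed in are assumptions for this example, not part of the original answer.

  import java.nio.ByteBuffer;
  import java.nio.charset.CharacterCodingException;
  import java.nio.charset.Charset;
  import java.nio.charset.CodingErrorAction;

  public class EncodingGuesser {
    // Try each candidate charset with strict (REPORT) error handling and
    // return the first one that decodes the bytes without error, or null.
    static Charset guess(byte[] data, String... candidates) {
      for (String name : candidates) {
        try {
          Charset cs = Charset.forName(name);
          cs.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(data));
          return cs;                 // decoded cleanly
        } catch (CharacterCodingException e) {
          // malformed input for this charset -- try the next candidate
        }
      }
      return null;                   // nothing decoded cleanly
    }
  }

For example, guess(contents, "UTF-8", "UTF-16LE") prefers UTF-8 when both decode cleanly; since many byte sequences decode without error under more than one charset, the result is still only a guess, which is the false-positive problem described above.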

A better solution is to redesign your protocol so that whatever is supplying the data also has to supply the encoding scheme used for the data. (And if you cannot, blame whoever is responsible for designing / implementing the system that cannot give you an encoding scheme!)

From your comments on the question, the data files are being delivered via HTTP. In that case, you should arrange for your HTTP server to snarf the "content-type" header of the POST requests delivering the data, extract the character set / encoding from the header, and save it in a way / place that your file parser can deal with.
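
As a sketch of that idea (the class, method, regex, and UTF-8 fallback below are illustrative assumptions, not code from the original answer), the charset parameter can be pulled out of a Content-Type value such as text/plain; charset=UTF-16LE:

  import java.nio.charset.Charset;
  import java.nio.charset.StandardCharsets;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class ContentTypeCharset {
    // Matches the charset parameter of a Content-Type header value,
    // e.g. "text/plain; charset=UTF-16LE" or charset="utf-8".
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, falling back to UTF-8 (an assumption for
    // this sketch) when the header is missing or names an unknown charset.
    static Charset fromContentType(String contentType) {
      if (contentType != null) {
        Matcher m = CHARSET.matcher(contentType);
        if (m.find()) {
          try {
            return Charset.forName(m.group(1));
          } catch (IllegalArgumentException e) {
            // illegal or unsupported charset name -- fall through to default
          }
        }
      }
      return StandardCharsets.UTF_8;
    }
  }

The Ruby side could then record this value alongside the temporary file so the Java parser decodes with the declared charset instead of guessing.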
