How can I identify different encodings without the use of a BOM?

Problem Description

I have a file watcher that is grabbing content from a growing file encoded with utf-16LE. The first bit of data written to it has the BOM available -- I was using this to identify the encoding against UTF-8 (which MOST of my files coming in are encoded in). I catch the BOM and re-encode to UTF-8 so my parser doesn't freak out. The problem is that since it's a growing file not every bit of data has the BOM in it.

Here's my question -- without prepending the BOM bytes to each set of data I have (because I don't have control over the source), can I just look for the null bytes (\000) that are inherent in UTF-16, and then use that as my identifier instead of the BOM? Will this cause me headaches down the road?
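
For illustration, here is a minimal sketch of that null-byte check (the method name, prefix length, and threshold are illustrative assumptions, not part of the original code): UTF-16LE text that is mostly ASCII has a 0x00 high byte in every other position, while UTF-8 text normally contains no null bytes at all.

  static boolean looksLikeUtf16(byte[] data) {
    int limit = Math.min(data.length, 32);   // sample only a short prefix
    int nulls = 0;
    for (int i = 0; i < limit; i++) {
      if (data[i] == 0x00) nulls++;
    }
    // guess UTF-16 when at least a third of the sampled bytes are null
    return limit >= 2 && nulls * 3 >= limit;
  }

As the answer below points out, this is only a heuristic, so it can misfire on data that happens to contain (or lack) null bytes for other reasons.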

My architecture involves a Ruby web application logging the received data to a temporary file, which my parser (written in Java) then picks up.

Right now my identification/re-encoding code looks like this:

  // guess encoding if utf-16 then
  // convert to UTF-8 first
  try {
    FileInputStream fis = new FileInputStream(args[args.length-1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);

    if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
      String asString = new String(contents, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF8");
      FileOutputStream fos = new FileOutputStream(args[args.length-1]);
      fos.write(newBytes);
      fos.close();
    }

    fis.close();
  } catch(Exception e) {
    e.printStackTrace();
  }

Update

I want to support characters like euro signs, em-dashes, and the like. I modified the above code to look like this, and it seems to pass all my tests for those characters:

  // guess encoding if utf-16 then
  // convert to UTF-8 first
  try {
    FileInputStream fis = new FileInputStream(args[args.length-1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);
    byte[] real = null;

    int found = 0;

    // if found a BOM then skip out of here... we just need to convert it
    if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
      found = 3;
      real = contents;

    // no BOM detected but still could be UTF-16
    } else {

      // count null bytes in the first few bytes of the data
      for(int cnt=0; cnt<10; cnt++) {
        if(contents[cnt] == (byte)0x00) { found++; }
      }

      // tack on a UTF-16LE BOM and copy the data into a new array
      real = new byte[contents.length+2];
      real[0] = (byte)0xFF;
      real[1] = (byte)0xFE;
      for(int ib=2; ib < real.length; ib++) {
        real[ib] = contents[ib-2];
      }

    }

    if(found >= 2) {
      String asString = new String(real, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF8");
      FileOutputStream fos = new FileOutputStream(args[args.length-1]);
      fos.write(newBytes);
      fos.close();
    }

    fis.close();
  } catch(Exception e) {
    e.printStackTrace();
  }

What do you think?

Solution

In general, you cannot identify the character encoding of a data stream with 100% accuracy. The best you can do is try to decode using a limited set of expected encodings, and then apply some heuristics to the decoded result to see if it "looks like" text in the expected language. (But any heuristic will give false positives and false negatives for certain data streams.) Alternatively, put a human in the loop to decide which decoding makes the most sense.
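
As an illustration of that trial-decoding approach, here is a minimal sketch using java.nio's CharsetDecoder with strict error reporting; the class name and the candidate charsets passed in are assumptions for this example, not part of the original answer.

  import java.nio.ByteBuffer;
  import java.nio.charset.CharacterCodingException;
  import java.nio.charset.Charset;
  import java.nio.charset.CodingErrorAction;

  public class EncodingGuesser {
    // Try each candidate charset with strict (REPORT) error handling and
    // return the first one that decodes the bytes without error, or null.
    static Charset guess(byte[] data, String... candidates) {
      for (String name : candidates) {
        try {
          Charset cs = Charset.forName(name);
          cs.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(data));
          return cs;                 // decoded cleanly
        } catch (CharacterCodingException e) {
          // malformed input for this charset -- try the next candidate
        }
      }
      return null;                   // nothing decoded cleanly
    }
  }

For example, guess(contents, "UTF-8", "UTF-16LE") prefers UTF-8 when both decode cleanly; since many byte sequences decode without error under more than one charset, the result is still only a guess, which is the false-positive problem described above.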

A better solution is to redesign your protocol so that whatever is supplying the data also has to supply the encoding scheme used for the data. (And if you cannot, blame whoever is responsible for designing / implementing the system that cannot give you an encoding scheme!)

From your comments on the question, the data files are being delivered via HTTP. In that case, you should arrange for your HTTP server to snarf the "content-type" header of the POST requests delivering the data, extract the character set / encoding from the header, and save it in a way / place that your file parser can deal with.
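
As a sketch of that idea (the class, method, regex, and UTF-8 fallback below are illustrative assumptions, not code from the original answer), the charset parameter can be pulled out of a Content-Type value such as text/plain; charset=UTF-16LE:

  import java.nio.charset.Charset;
  import java.nio.charset.StandardCharsets;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class ContentTypeCharset {
    // Matches the charset parameter of a Content-Type header value,
    // e.g. "text/plain; charset=UTF-16LE" or charset="utf-8".
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, falling back to UTF-8 (an assumption for
    // this sketch) when the header is missing or names an unknown charset.
    static Charset fromContentType(String contentType) {
      if (contentType != null) {
        Matcher m = CHARSET.matcher(contentType);
        if (m.find()) {
          try {
            return Charset.forName(m.group(1));
          } catch (IllegalArgumentException e) {
            // illegal or unsupported charset name -- fall through to default
          }
        }
      }
      return StandardCharsets.UTF_8;
    }
  }

The Ruby side could then record this value alongside the temporary file so the Java parser decodes with the declared charset instead of guessing.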
