How to Remove BOM from an XML file in Java

Problem description

I need suggestions on how to remove the BOM from a UTF-8 file and create a copy of the rest of the XML file.

Recommended answer

In my experience, having a tool break because of a BOM in a UTF-8 file is a very common thing. I don't know why there were so many downvotes (but then it gives me the chance to try to get enough votes to win a special SO badge ;)

More seriously: a UTF-8 BOM doesn't typically make much sense, but it is fully valid (although discouraged) according to the specs. The problem is that a lot of people aren't aware that a BOM is valid in UTF-8 and hence have written broken tools / APIs that do not process these files correctly.
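To make the problem concrete, here is a minimal sketch of what a consumer sees when BOM-prefixed bytes are decoded as UTF-8 in Java. The class name `BomDecodeDemo` and the sample bytes are made up for illustration; the point is only that the standard UTF-8 decoder keeps the BOM as a leading U+FEFF character instead of dropping it, which is the kind of thing that trips up code expecting the content to start right away.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    /** Decodes the bytes as UTF-8 without any BOM handling. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // EF BB BF followed by the ASCII text "Poke" (illustrative sample data).
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'P', 'o', 'k', 'e'};
        String text = decode(withBom);

        // The BOM survives decoding as one extra character at index 0.
        System.out.println(text.length());              // prints 5
        System.out.println(text.charAt(0) == '\uFEFF'); // prints true
    }
}
```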

Now you may have two different issues: you may want to process the file from Java, or you may need to use Java to programmatically create/fix files that other (broken) tools need.

I had a case in one consulting gig where the helpdesk kept getting messages from users who had problems with a text editor that would mess up perfectly valid UTF-8 files produced by Java. So I had to work around that issue by making sure to remove the BOM from every single UTF-8 file we were dealing with.

If you want to delete a BOM from a file, you can create a new file and skip the first three bytes. For example:

... $  file  /tmp/src.txt 
/tmp/src.txt: UTF-8 Unicode (with BOM) English text

... $  ls -l  /tmp/src.txt 
-rw-rw-r-- 1 tact tact 1733 2012-03-16 14:29 /tmp/src.txt

... $  hexdump  -C  /tmp/src.txt | head -n 1
00000000  ef bb bf 50 6f 6b 65 ...

As you can see, the file starts with "ef bb bf"; this is the (fully valid) UTF-8 BOM.

Here's a method that takes a file and makes a copy of it, skipping the first three bytes:

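import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

public static void workAroundbrokenToolsAndAPIs(File sourceFile, File destFile) throws IOException {
    // FileOutputStream creates the destination file if it does not exist yet.
    try (FileChannel source = new FileInputStream(sourceFile).getChannel();
         FileChannel destination = new FileOutputStream(destFile).getChannel()) {
        // Unconditionally skip the three-byte UTF-8 BOM (EF BB BF).
        source.position(3);
        long remaining = source.size() - 3;
        long copied = 0;
        // transferFrom may copy fewer bytes than requested, so loop until done.
        while (copied < remaining) {
            long n = destination.transferFrom(source, copied, remaining - copied);
            if (n <= 0) {
                break;
            }
            copied += n;
        }
    }
}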

Note that it's "raw": you'd typically want to first make sure the file actually has a BOM before calling this, or "Bad Things May Happen" [TM].
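One way to do that check, as a minimal sketch: read the first three bytes and compare them against EF BB BF. The class name `BomCheck` and the method `hasUtf8Bom` are hypothetical names introduced here, not part of the original answer, and the sketch uses `java.nio.file.Path` rather than `File` purely for brevity.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class BomCheck {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true if the file starts with the three bytes EF BB BF. */
    public static boolean hasUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[UTF8_BOM.length];
            int total = 0;
            // A single read() may return fewer bytes than requested, so loop.
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    return false; // file is shorter than a BOM
                }
                total += n;
            }
            return Arrays.equals(head, UTF8_BOM);
        }
    }
}
```

A caller would then invoke workAroundbrokenToolsAndAPIs only when `BomCheck.hasUtf8Bom(sourceFile.toPath())` returns true, and otherwise copy the file unchanged.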

You can look at your file afterwards:

... $  file  /tmp/dst.txt 
/tmp/dst.txt: UTF-8 Unicode English text

... $  ls -l  /tmp/dst.txt 
-rw-rw-r-- 1 tact tact 1730 2012-03-16 14:41 /tmp/dst.txt

... $  hexdump -C /tmp/dst.txt
00000000  50 6f 6b 65 ...

The BOM is gone...

Now, if you simply want to transparently remove the BOM for one of your broken Java APIs, you could use the PushbackInputStream approach described here: why org.apache.xerces.parsers.SAXParser does not skip BOM in utf8 encoded xml?

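import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

private static InputStream checkForUtf8BOMAndDiscardIfAny(InputStream inputStream) throws IOException {
    PushbackInputStream pushbackInputStream = new PushbackInputStream(new BufferedInputStream(inputStream), 3);
    byte[] bom = new byte[3];
    // read() may return fewer than three bytes, so remember how many were actually read.
    int read = pushbackInputStream.read(bom);
    boolean hasBom = read == 3
            && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF;
    if (read > 0 && !hasBom) {
        // Not a BOM: push back only the bytes that were read.
        pushbackInputStream.unread(bom, 0, read);
    }
    return pushbackInputStream;
}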
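As a usage sketch, the stream returned by such a method can be handed straight to an XML parser. The example below is an illustration under assumptions rather than the original author's code: the class `BomXmlDemo`, the helper `skipUtf8Bom` (a compact variant of the method above), `rootElementName`, and the in-memory `<a/>` document are all made up for the demo, and it uses the JDK's built-in `DocumentBuilderFactory`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class BomXmlDemo {
    /** Drops a leading EF BB BF if present; otherwise leaves the stream content untouched. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int read = pushback.read(head);
        boolean hasBom = read == 3
                && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF;
        if (read > 0 && !hasBom) {
            pushback.unread(head, 0, read); // not a BOM: give the bytes back
        }
        return pushback;
    }

    /** Parses the bytes as XML after BOM removal and returns the root element name. */
    static String rootElementName(byte[] xmlBytes) throws Exception {
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(xmlBytes));
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        return doc.getDocumentElement().getNodeName();
    }

    public static void main(String[] args) throws Exception {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        byte[] withoutBom = {'<', 'a', '/', '>'};
        System.out.println(rootElementName(withBom));    // prints a
        System.out.println(rootElementName(withoutBom)); // prints a
    }
}
```

Both inputs reach the parser as the same bytes, so the downstream API never sees the BOM.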

Note that this works, but it definitely does NOT fix the more serious issue, where other tools in the work chain may not work correctly with UTF-8 files that have a BOM.

And here's a link to a question with a more complete answer, covering other encodings as well:

Byte order mark screws up file reading in Java
