POI XSSF / XLSX hashing indeterminism with MessageDigest SHA-256


Problem description

There seems to be a problem with getting deterministic hash values for the POI XLSX format using the MessageDigest SHA-256 implementation, even for empty workbooks written to a ByteArrayOutputStream. The mismatch shows up at random, after several hundred or even a few thousand iterations.

The relevant code snippets used to reproduce the problem:

// TestNG FileTest:
@Test(enabled = true) // indeterminism at random iterations, such as 400 or 1290
public void emptyXLSXTest() throws IOException, NoSuchAlgorithmException {
    final Hasher hasher = new HasherImpl();
    boolean differentSHA256Hash = false;
    for (int i = 0; i < 10000; i++) {
        final ByteArrayOutputStream excelAdHoc1 = BusinessPlanInMemory.getEmptyExcel("xlsx");
        final ByteArrayOutputStream excelAdHoc2 = BusinessPlanInMemory.getEmptyExcel("xlsx");

        byte[] expectedByteArray = excelAdHoc1.toByteArray();
        String expectedSha256 = hasher.sha256(expectedByteArray);
        byte[] actualByteArray = excelAdHoc2.toByteArray();
        String actualSha256 = hasher.sha256(actualByteArray);

        if (!expectedSha256.equals(actualSha256)) {
            differentSHA256Hash = true;
            System.out.println("ITERATION: " + i);
            System.out.println("EXPECTED HASH: " + expectedSha256);
            System.out.println("ACTUAL HASH: " + actualSha256);
            break;
        }
    }
    Assert.assertTrue(differentSHA256Hash, "Indeterminism did not occur");
}

Referenced Hasher and POI code:

// HasherImpl class:
public String sha256(final InputStream stream) throws IOException, NoSuchAlgorithmException {
    final MessageDigest digest = MessageDigest.getInstance("SHA-256");
    final byte[] bytesBuffer = new byte[300000]; 
    int bytesRead = -1;
    while ((bytesRead = stream.read(bytesBuffer)) != -1) {
        digest.update(bytesBuffer, 0, bytesRead);
    }
    final byte[] hashedBytes = digest.digest();
    return bytesToHex(hashedBytes);
}
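
The test above calls `hasher.sha256(...)` with a `byte[]`, and the stream variant relies on a `bytesToHex` helper, neither of which is shown in the question. The following is only a sketch of what those two pieces might look like; the class name `HasherSketch` and both method bodies are assumptions, not the asker's actual code.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the parts of HasherImpl that the question does not show.
final class HasherSketch {

    // Hashes a complete byte array in a single call; no read loop is needed here.
    static String sha256(final byte[] data) throws NoSuchAlgorithmException {
        final MessageDigest digest = MessageDigest.getInstance("SHA-256");
        return bytesToHex(digest.digest(data));
    }

    // Renders every byte as two lowercase hex digits.
    static String bytesToHex(final byte[] bytes) {
        final StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (final byte b : bytes) {
            // Mask to 0..255 so negative bytes do not sign-extend.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}
```

Whatever the real implementation looks like, SHA-256 itself is deterministic, so two different digests always mean the two byte arrays differ.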

I tried to eliminate indeterminism caused by metadata such as the creation time, to no avail:

// POI BusinessPlanInMemory helper class:
public static ByteArrayOutputStream getEmptyExcel(final String fileextension) throws IOException {
    Workbook wb;

    if (fileextension.equals("xls")) {
        wb = new HSSFWorkbook();
    }
    else {
        wb = new XSSFWorkbook();
        final POIXMLProperties props = ((XSSFWorkbook) wb).getProperties();
        final POIXMLProperties.CoreProperties coreProp = props.getCoreProperties();
        coreProp.setCreated("");
        coreProp.setIdentifier("1");
        coreProp.setModified("");
    }

    wb.createSheet();

    final ByteArrayOutputStream excelStream = new ByteArrayOutputStream();
    wb.write(excelStream);
    wb.close();
    return excelStream;
}

The HSSF / XLS format does not seem to be affected by the problem described. Does anybody have a clue what could be causing this, if not a bug in POI itself? The code above is based on the BusinessPlan example at https://poi.apache.org/spreadsheet/examples.html

Thanks for your input!

Recommended answer

This is not a definitive answer, but here is what I suspect is happening:

The docx and xlsx file formats are basically a bunch of zipped-up XML files. This is easy to see when you rename such a file to .zip and open it with your favorite zip tool.

When examining a file created by Word, I noticed that the modification timestamp of every file contained in the archive is always 1980-01-01 00:00:00, while files created with POI show the actual time at which the file was created.
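
The same inspection can be done without a zip tool. The sketch below lists each archive entry together with its stored modification time using `java.util.zip.ZipInputStream`; the class and method names are made up for illustration, and the input is assumed to be the byte array of an xlsx (or any zip) file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for inspecting the timestamps stored inside a zip archive.
final class ZipEntryTimes {

    // Returns one line per entry: "<entry name> -> <modification time in epoch millis>".
    static List<String> describe(final byte[] zipBytes) throws IOException {
        final List<String> lines = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                // getTime() returns -1 if the entry carries no modification time.
                lines.add(entry.getName() + " -> " + entry.getTime());
            }
        }
        return lines;
    }
}
```

Calling `ZipEntryTimes.describe(excelAdHoc1.toByteArray())` and the same for `excelAdHoc2` on a failing iteration would show whether any entry's timestamp differs between the two workbooks.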

So I suspect that your problem occurs when the timestamp of one or more of the files inside excelAdHoc1 differs from that of its counterpart in excelAdHoc2. This can happen when the clock ticks over to the next second while one or the other workbook is being written.
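
The effect can be reproduced without POI. The sketch below writes the same payload into two zip archives with `java.util.zip` and differs only in the entry timestamp; all names in it are illustrative, and it says nothing about which zip implementation POI uses internally.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo: identical content, different entry timestamps.
final class ZipTimestampDemo {

    // Builds a one-entry archive whose entry carries the given modification time.
    static byte[] zipWithTime(final long entryTimeMillis) throws IOException {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            final ZipEntry entry = new ZipEntry("sheet1.xml");
            entry.setTime(entryTimeMillis);
            zip.putNextEntry(entry);
            zip.write("<worksheet/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // SHA-256 of the data as a 64-character lowercase hex string.
    static String sha256Hex(final byte[] data) throws NoSuchAlgorithmException {
        final byte[] hash = MessageDigest.getInstance("SHA-256").digest(data);
        return String.format("%064x", new BigInteger(1, hash));
    }

    public static void main(final String[] args) throws Exception {
        final long t = 1_000_000_000_000L; // arbitrary instant in 2001
        System.out.println(sha256Hex(zipWithTime(t)));
        System.out.println(sha256Hex(zipWithTime(t)));          // same timestamp
        System.out.println(sha256Hex(zipWithTime(t + 10_000L))); // ten seconds later
    }
}
```

Zip timestamps have a resolution of two seconds in the classic DOS field, which would explain why the mismatch appears only occasionally: most pairs of workbooks are written within the same tick.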

This would not affect XLS files, since the HSSF format is not of the "zipped XML" type and thus does not contain any nested files that could carry different timestamps.

To change the timestamps after writing the file, you could try the `java.util.zip` package. I haven't tested it, but something along these lines should do the trick:

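// Note: this only changes the in-memory ZipEntry objects; the archive on disk
// stays untouched until it is rewritten, e.g. through a ZipOutputStream.
try (ZipFile file = new ZipFile(pathToFile)) {
    Enumeration<? extends ZipEntry> e = file.entries();
    while (e.hasMoreElements()) {
        ZipEntry entry = e.nextElement();
        entry.setTime(0L);
    }
}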
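
For the timestamp change to reach the bytes that get hashed, the archive has to be written out again. The following is a sketch of one way to do that in memory, assuming it is acceptable to copy each entry into a fresh archive with a fixed timestamp. The class name, the constant, and the choice of instant are all assumptions; entry comments and extra fields are not carried over, and it has not been verified against every kind of POI output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: rewrites a zip archive so that every entry has the same timestamp.
final class ZipTimestampNormalizer {

    // Arbitrary fixed instant (2000-01-01T00:00:00Z); any constant works if every run uses it.
    static final long FIXED_TIME_MILLIS = 946_684_800_000L;

    static byte[] normalize(final byte[] zipBytes) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes));
             ZipOutputStream zipOut = new ZipOutputStream(out)) {
            final byte[] buffer = new byte[8192];
            ZipEntry source;
            while ((source = in.getNextEntry()) != null) {
                // Copy only the name; sizes and CRC are recomputed by ZipOutputStream.
                final ZipEntry target = new ZipEntry(source.getName());
                target.setTime(FIXED_TIME_MILLIS);
                zipOut.putNextEntry(target);
                int read;
                while ((read = in.read(buffer)) != -1) {
                    zipOut.write(buffer, 0, read);
                }
                zipOut.closeEntry();
            }
        }
        return out.toByteArray();
    }
}
```

In the test from the question, this would mean hashing `ZipTimestampNormalizer.normalize(excelAdHoc1.toByteArray())` instead of the raw bytes. Note that `ZipEntry.setTime` converts the instant using the JVM's default time zone, so the resulting bytes should be stable on one machine but may differ between machines in different zones.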
