POI XSSF / XLSX hashing indeterminism with MessageDigest SHA-256
Problem description
There seems to be a problem with getting deterministic hash values for the POI XLSX format using the MessageDigest SHA-256 implementation, even for empty workbooks written to ByteArray streams. The mismatch occurs randomly, after several hundred or even several thousand iterations.
The relevant code snippets used to reproduce the problem:
// TestNG FileTest:
@Test(enabled = true) // indeterminism at random iterations, such as 400 or 1290
public void emptyXLSXTest() throws IOException, NoSuchAlgorithmException {
    final Hasher hasher = new HasherImpl();
    boolean differentSHA256Hash = false;
    for (int i = 0; i < 10000; i++) {
        final ByteArrayOutputStream excelAdHoc1 = BusinessPlanInMemory.getEmptyExcel("xlsx");
        final ByteArrayOutputStream excelAdHoc2 = BusinessPlanInMemory.getEmptyExcel("xlsx");
        byte[] expectedByteArray = excelAdHoc1.toByteArray();
        String expectedSha256 = hasher.sha256(expectedByteArray);
        byte[] actualByteArray = excelAdHoc2.toByteArray();
        String actualSha256 = hasher.sha256(actualByteArray);
        if (!expectedSha256.equals(actualSha256)) {
            differentSHA256Hash = true;
            System.out.println("ITERATION: " + i);
            System.out.println("EXPECTED HASH: " + expectedSha256);
            System.out.println("ACTUAL HASH: " + actualSha256);
            break;
        }
    }
    Assert.assertTrue(differentSHA256Hash, "Indeterminism did not occur");
}
Referenced Hasher and POI code:
// HasherImpl class:
public String sha256(final InputStream stream) throws IOException, NoSuchAlgorithmException {
    final MessageDigest digest = MessageDigest.getInstance("SHA-256");
    final byte[] bytesBuffer = new byte[300000];
    int bytesRead = -1;
    while ((bytesRead = stream.read(bytesBuffer)) != -1) {
        digest.update(bytesBuffer, 0, bytesRead);
    }
    final byte[] hashedBytes = digest.digest();
    return bytesToHex(hashedBytes);
}
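The test above calls sha256 with a byte array and the method above calls bytesToHex, but neither is shown in the question. The following is a minimal sketch of what they might look like, so that the reproduction can be run; the class name HasherSketch and both method bodies are assumptions, not the asker's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the parts of HasherImpl that the question does not show.
final class HasherSketch {

    // Hashes a complete byte array in a single call.
    static String sha256(final byte[] data) throws NoSuchAlgorithmException {
        final MessageDigest digest = MessageDigest.getInstance("SHA-256");
        return bytesToHex(digest.digest(data));
    }

    // Converts raw digest bytes to a lowercase hex string, two characters per byte.
    static String bytesToHex(final byte[] bytes) {
        final StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (final byte b : bytes) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(final String[] args) throws NoSuchAlgorithmException {
        final byte[] input = "abc".getBytes(StandardCharsets.UTF_8);
        // Identical input bytes always produce the identical digest string.
        System.out.println(sha256(input).equals(sha256(input)));
    }
}
```

Since SHA-256 itself is deterministic, differing hashes mean that the two byte arrays produced by getEmptyExcel really do differ.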
I tried to eliminate indeterminism caused by metadata such as the creation time, to no avail:
// POI BusinessPlanInMemory helper class:
public static ByteArrayOutputStream getEmptyExcel(final String fileextension) throws IOException {
    Workbook wb;
    if (fileextension.equals("xls")) {
        wb = new HSSFWorkbook();
    }
    else {
        wb = new XSSFWorkbook();
        final POIXMLProperties props = ((XSSFWorkbook) wb).getProperties();
        final POIXMLProperties.CoreProperties coreProp = props.getCoreProperties();
        coreProp.setCreated("");
        coreProp.setIdentifier("1");
        coreProp.setModified("");
    }
    wb.createSheet();
    final ByteArrayOutputStream excelStream = new ByteArrayOutputStream();
    wb.write(excelStream);
    wb.close();
    return excelStream;
}
The HSSF / XLS format does not seem to be affected by the problem described. Does anybody have a clue what could be causing this, if not a bug in POI itself? The code above is based on the BusinessPlan example at https://poi.apache.org/spreadsheet/examples.html
Thanks for your input!
Recommended answer
This is not a definitive answer, but here is what I suspect is happening:
The docx and xlsx file formats are basically a bunch of zipped-up XML files. This is easy to see when you rename them to .zip and open them with your favorite zip tool.
When examining a file created by Word, I noticed that the modification timestamp of every file contained in the archive is always 1980-01-01 00:00:00, whereas in files created with POI it shows the actual time the file was created.
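One way to check this observation without a zip tool is to read the entry timestamps programmatically. The sketch below uses only java.util.zip; the class ZipEntryTimes and the helper singleEntryZip are hypothetical names introduced here so that the example runs without POI. With POI on the classpath, the byte array returned by excelStream.toByteArray() could presumably be passed to entryTimes instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipEntryTimes {

    // Returns entry name -> last-modified time (ms since epoch), in archive order.
    static Map<String, Long> entryTimes(final byte[] zipBytes) throws IOException {
        final Map<String, Long> times = new LinkedHashMap<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                times.put(entry.getName(), entry.getTime());
            }
        }
        return times;
    }

    // Hypothetical helper: builds a one-entry archive as a stand-in for an XLSX file.
    static byte[] singleEntryZip(final String name, final long timeMillis) throws IOException {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            final ZipEntry entry = new ZipEntry(name);
            entry.setTime(timeMillis);
            out.putNextEntry(entry);
            out.write("<xml/>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(final String[] args) throws IOException {
        final byte[] zip = singleEntryZip("xl/workbook.xml", System.currentTimeMillis());
        // Prints each entry name with the timestamp stored in the archive.
        entryTimes(zip).forEach((name, time) -> System.out.println(name + " -> " + new Date(time)));
    }
}
```

Comparing the maps for two generated workbooks at a failing iteration would show whether the timestamps are in fact what differs.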
So I suspect that your problem occurs when there is a timestamp difference between one or more of the entries in excelAdHoc1 and excelAdHoc2. This can happen when the clock ticks over to the next second between creating one file and the other. Since both workbooks are written within milliseconds of each other, the timestamps usually match, which would explain why the mismatch only shows up after many iterations.
This would not affect XLS files, since the HSSF format is not of the "zipped XML" type and thus does not contain any nested files that might have different timestamps.
To change the timestamps after writing the file, you could try the `java.util.zip` package. I haven't tested it, but this should do the trick:
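// Note: setTime only changes the in-memory ZipEntry objects. The archive on disk
// stays unchanged unless the entries are written out to a new archive.
try (ZipFile file = new ZipFile(pathToFile)) {
    Enumeration<? extends ZipEntry> e = file.entries();
    while (e.hasMoreElements()) {
        ZipEntry entry = e.nextElement();
        entry.setTime(0L);
    }
}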
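Because ZipFile is read-only, persisting a changed timestamp means writing the entries to a new archive. The following is a minimal sketch of that idea, assuming the workbook is available as a byte array: it copies every entry through ZipInputStream and ZipOutputStream and stamps each one with the same fixed time. The class ZipTimestampNormalizer, the helper zipOf, and the chosen constant are assumptions for illustration, and the result has not been verified against POI or Excel.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipTimestampNormalizer {

    // Copies every entry into a new archive, giving each the same fixed timestamp.
    static byte[] normalize(final byte[] zipBytes, final long fixedTimeMillis) throws IOException {
        final ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes));
             ZipOutputStream out = new ZipOutputStream(result)) {
            final byte[] buffer = new byte[8192];
            ZipEntry source;
            while ((source = in.getNextEntry()) != null) {
                // A fresh entry takes over only the name; size and CRC are recomputed on write.
                final ZipEntry target = new ZipEntry(source.getName());
                target.setTime(fixedTimeMillis);
                out.putNextEntry(target);
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                out.closeEntry();
            }
        }
        return result.toByteArray();
    }

    // Hypothetical helper: builds a one-entry archive as a stand-in for an XLSX file.
    static byte[] zipOf(final String name, final byte[] content, final long timeMillis) throws IOException {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            final ZipEntry entry = new ZipEntry(name);
            entry.setTime(timeMillis);
            out.putNextEntry(entry);
            out.write(content);
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(final String[] args) throws IOException {
        final byte[] content = "<xml/>".getBytes(StandardCharsets.UTF_8);
        final long now = System.currentTimeMillis();
        final long fixed = 946684800000L; // arbitrary constant, chosen for illustration
        final byte[] first = zipOf("xl/workbook.xml", content, now);
        final byte[] second = zipOf("xl/workbook.xml", content, now + 10_000L);
        System.out.println("raw archives equal: " + Arrays.equals(first, second));
        System.out.println("normalized archives equal: "
                + Arrays.equals(normalize(first, fixed), normalize(second, fixed)));
    }
}
```

In the test from the question, the hash would then be computed over normalize(excelAdHoc1.toByteArray(), fixed) rather than over the raw bytes. Hashing the normalized copy leaves the original workbook untouched, which may be preferable if the file is also handed to users.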