String Deduplication feature of Java 8


Question

Strings in Java (as in other languages) consume a lot of memory because each character takes two bytes. Java 8 introduced a feature called String Deduplication, which takes advantage of the fact that the char arrays are internal to strings and final, so the JVM can safely manipulate them.

I have read this example so far, but since I am not a professional Java programmer, I am having a hard time grasping the concept.

This is what it says:


Various strategies for String Duplication have been considered, but the one implemented now follows the following approach: Whenever the garbage collector visits String objects it takes note of the char arrays. It takes their hash value and stores it alongside with a weak reference to the array. As soon as it finds another String which has the same hash code it compares them char by char. If they match as well, one String will be modified and point to the char array of the second String. The first char array then is no longer referenced anymore and can be garbage collected.

This whole process of course brings some overhead, but is controlled by tight limits. For example if a string is not found to have duplicates for a while it will be no longer checked.
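The lookup described in the quote can be modelled in ordinary Java. The following is only a user-level sketch, not the actual HotSpot implementation: the class name `DedupTableSketch` and the method `canonicalize` are hypothetical, and the table layout is an assumption based on the description above (hash value stored alongside a weak reference, followed by a char-by-char comparison).

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical model of the deduplication table; the real logic lives inside the JVM.
class DedupTableSketch {
    // Hash of the array contents -> weak references to arrays already seen with that hash.
    private final Map<Integer, List<WeakReference<char[]>>> table = new HashMap<>();

    // Returns a previously seen array with identical contents,
    // or records the candidate and returns it unchanged.
    char[] canonicalize(char[] candidate) {
        int hash = Arrays.hashCode(candidate);
        List<WeakReference<char[]>> bucket =
                table.computeIfAbsent(hash, h -> new ArrayList<>());
        for (Iterator<WeakReference<char[]>> it = bucket.iterator(); it.hasNext();) {
            char[] seen = it.next().get();
            if (seen == null) {
                it.remove(); // the array was garbage collected; drop the stale entry
            } else if (Arrays.equals(seen, candidate)) {
                return seen; // same hash and same chars: reuse the existing array
            }
        }
        bucket.add(new WeakReference<>(candidate));
        return candidate;
    }

    public static void main(String[] args) {
        DedupTableSketch sketch = new DedupTableSketch();
        char[] first = sketch.canonicalize("John".toCharArray());
        char[] second = sketch.canonicalize("John".toCharArray());
        System.out.println(first == second); // prints "true": one shared array
    }
}
```

In the JVM the equivalent step rewrites the String's internal array reference, which application code cannot do; the sketch only shows how a hash lookup plus an element-wise comparison finds the array to share.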

My first question:

Resources on this topic are still scarce because the feature was only added recently, in Java 8 update 20. Could anyone share some practical examples of how it helps reduce the memory consumed by Strings in Java?

Edit:

The link above states:


As soon as it finds another String which has the same hash code it compares them char by char

My second question:

If the hash codes of two Strings are the same, then the Strings are already the same, so why compare them char by char once the two Strings are found to have the same hash code?
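A note on the premise of this question: equal hash codes do not guarantee equal strings, because `String.hashCode()` maps an unbounded set of strings onto 32-bit integers. A small sketch, in which the class name `HashCollisionDemo` and its helper method are made up for illustration, shows two different strings that share a hash code, which is why a char-by-char comparison is still needed after the hash matches:

```java
// Hypothetical demo class; "Aa" and "BB" are a well-known colliding pair.
class HashCollisionDemo {
    // True when the two strings share a hash code but differ in content.
    static boolean sameHashDifferentContent(String a, String b) {
        return a.hashCode() == b.hashCode() && !a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode()); // 2112 (65 * 31 + 97)
        System.out.println("BB".hashCode()); // 2112 (66 * 31 + 66)
        System.out.println(sameHashDifferentContent("Aa", "BB")); // true
    }
}
```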

Recommended answer

Imagine you have a phone book containing people, each with a String firstName and a String lastName. And it happens that in your phone book, 100,000 people have the same firstName = "John".

Because you get the data from a database or a file, those strings are not interned, so your JVM memory contains the char array {'J', 'o', 'h', 'n'} 100 thousand times, one per John string. Each of these arrays takes, say, 20 bytes of memory, so those 100k Johns take up 2 MB of memory.

With deduplication, the JVM will realise that "John" is duplicated many times and make all those John strings point to the same underlying char array, decreasing the memory used by those arrays from 2 MB to 20 bytes. The String objects themselves remain separate; only their backing arrays are shared.
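The phone book scenario can be reproduced with a short program. This is a sketch under stated assumptions: `PhoneBookDemo` and its methods are hypothetical names, `new String(char[])` stands in for values read from a database (it copies the array, so each string gets its own), and the 20 bytes per array is the rough figure used above, not a measurement.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical demo: many equal but separately allocated "John" strings.
class PhoneBookDemo {
    static List<String> loadFirstNames(int count) {
        char[] chars = {'J', 'o', 'h', 'n'};
        List<String> names = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            names.add(new String(chars)); // copies the array: a fresh String and char[] each time
        }
        return names;
    }

    // Counts String objects by identity (==), not by equals().
    static int distinctObjects(List<String> names) {
        Set<String> identities =
                Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        identities.addAll(names);
        return identities.size();
    }

    // Counts distinct values by equals().
    static int distinctValues(List<String> names) {
        return new HashSet<>(names).size();
    }

    public static void main(String[] args) {
        List<String> names = loadFirstNames(100_000);
        System.out.println(distinctObjects(names)); // 100000 separate objects
        System.out.println(distinctValues(names));  // 1 distinct value
        System.out.println(100_000L * 20);          // 2000000 bytes, about 2 MB, at the assumed 20 bytes per array
    }
}
```

On Java 8 update 20 or later, deduplication applies only with the G1 collector and has to be switched on, for example `java -XX:+UseG1GC -XX:+UseStringDeduplication PhoneBookDemo`; adding `-XX:+PrintStringDeduplicationStatistics` on Java 8 prints what the collector deduplicated. The program's own output stays the same either way, since sharing happens inside the String objects and is not visible through `==` or `equals`.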

You can find a more detailed explanation in the JEP. In particular:


Many large-scale Java applications are currently bottlenecked on memory. Measurements have shown that roughly 25% of the Java heap live data set in these types of applications is consumed by String objects. Further, roughly half of those String objects are duplicates, where duplicates means string1.equals(string2) is true. Having duplicate String objects on the heap is, essentially, just a waste of memory.

[...]

The actual expected benefit ends up at around 10% heap reduction. Note that this number is a calculated average based on a wide range of applications. The heap reduction for a specific application could vary significantly both up and down.
