String Deduplication feature of Java 8

Question

Since a String in Java (as in other languages) consumes a lot of memory because each character takes two bytes, Java 8 introduced a new feature called String Deduplication. It takes advantage of the fact that the char arrays are internal to strings and final, so the JVM can mess around with them.

I have read this example so far, but since I am not a pro Java coder, I am having a hard time grasping the concept.

Here is what it says:

Various strategies for String Duplication have been considered, but the one implemented now follows the following approach: Whenever the garbage collector visits String objects it takes note of the char arrays. It takes their hash value and stores it alongside with a weak reference to the array. As soon as it finds another String which has the same hash code it compares them char by char. If they match as well, one String will be modified and point to the char array of the second String. The first char array then is no longer referenced anymore and can be garbage collected.

This whole process of course brings some overhead, but is controlled by tight limits. For example if a string is not found to have duplicates for a while it will be no longer checked.

My first question:

There is still a lack of resources on this topic, since it was only recently added in Java 8 update 20. Could anyone here share some practical examples of how it helps reduce the memory consumed by Strings in Java?

The link above says:

As soon as it finds another String which has the same hash code it compares them char by char

My second question:

If the hash codes of two Strings are the same, then the Strings are already the same, so why compare them char by char once two Strings are found to have the same hash code?

Recommended answer

Imagine you have a phone book containing people, each with a String firstName and a String lastName. And it happens that in your phone book, 100,000 people have the same firstName = "John".

Because you get the data from a database or a file, those strings are not interned, so your JVM memory contains the char array {'J', 'o', 'h', 'n'} 100,000 times, one per John string. Each of these arrays takes, say, 20 bytes of memory, so those 100k Johns take up 2 MB of memory.

With deduplication, the JVM will realise that "John" is duplicated many times and make all those John strings point to the same underlying char array, decreasing the memory used by those arrays from 2 MB to 20 bytes. The 100,000 String objects themselves remain; only their backing arrays are shared.
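The phone book scenario can be reproduced with a small program. This is only a sketch: the class and method names (PhoneBookDemo, loadFirstNames, countDistinctObjects) are made up for illustration, and building the strings with new String(char[]) stands in for reading them from a database or file. It shows that the 100,000 strings are equal in value but are all separate objects, which is the situation deduplication targets.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class PhoneBookDemo {
    // Hypothetical loader: builds `count` equal strings at run time,
    // the way a database or file reader would (nothing is interned).
    static List<String> loadFirstNames(int count) {
        char[] raw = {'J', 'o', 'h', 'n'};
        List<String> names = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            names.add(new String(raw)); // copies the chars into a fresh String
        }
        return names;
    }

    // Counts distinct String objects by identity (==), not by equals().
    static int countDistinctObjects(List<String> names) {
        Map<String, Boolean> seen = new IdentityHashMap<>();
        for (String name : names) {
            seen.put(name, Boolean.TRUE);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> names = loadFirstNames(100_000);
        System.out.println("distinct objects: " + countDistinctObjects(names)); // 100000
        System.out.println("distinct values:  " + new HashSet<>(names).size()); // 1
    }
}
```

Deduplication is off by default and, in Java 8 update 20 and later, is tied to the G1 collector, so a run would need something like java -XX:+UseG1GC -XX:+UseStringDeduplication PhoneBookDemo. The program's output stays the same with the flag on, because the String objects remain distinct; only the internal arrays are shared, which ordinary Java code cannot observe. On Java 8, -XX:+PrintStringDeduplicationStatistics should report what the collector did, although a short-lived program like this one may exit before its strings become deduplication candidates.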

You can find a more detailed explanation in the JEP (JEP 192, String Deduplication in G1). In particular:

Many large-scale Java applications are currently bottlenecked on memory. Measurements have shown that roughly 25% of the Java heap live data set in these types of applications is consumed by String objects. Further, roughly half of those String objects are duplicates, where duplicates means string1.equals(string2) is true. Having duplicate String objects on the heap is, essentially, just a waste of memory.

[...]

The actual expected benefit ends up at around 10% heap reduction. Note that this number is a calculated average based on a wide range of applications. The heap reduction for a specific application could vary significantly both up and down.
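On the second question: equal hash codes do not imply equal strings. hashCode() returns a 32-bit int, so many different strings must share the same value, and the hash can only be used to narrow down candidates. The sketch below imitates the quoted approach in ordinary Java: a table keyed by hash code holding weak references to arrays, with a char-by-char comparison before two arrays are treated as duplicates. It is an illustration under assumptions, not the JVM's actual implementation; DedupTableSketch and canonical are hypothetical names.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class DedupTableSketch {
    // hash code -> weakly referenced arrays already seen with that hash
    private final Map<Integer, List<WeakReference<char[]>>> table = new HashMap<>();

    // Returns a previously seen array with identical contents,
    // or records `candidate` and returns it unchanged.
    char[] canonical(char[] candidate) {
        int hash = Arrays.hashCode(candidate);
        List<WeakReference<char[]>> bucket =
                table.computeIfAbsent(hash, h -> new ArrayList<>());
        for (Iterator<WeakReference<char[]>> it = bucket.iterator(); it.hasNext();) {
            char[] seen = it.next().get();
            if (seen == null) {
                it.remove(); // the array was garbage collected, drop the stale entry
                continue;
            }
            if (Arrays.equals(seen, candidate)) {
                return seen; // same hash AND same characters: a real duplicate
            }
        }
        bucket.add(new WeakReference<>(candidate));
        return candidate;
    }

    public static void main(String[] args) {
        // Two different strings with the same hash code: both are 2112.
        System.out.println("Aa".hashCode() + " " + "BB".hashCode());

        DedupTableSketch table = new DedupTableSketch();
        char[] first = "John".toCharArray();
        char[] second = "John".toCharArray();
        char[] aa = "Aa".toCharArray();
        char[] bb = "BB".toCharArray();

        System.out.println(table.canonical(first) == first);  // true: first sighting is recorded
        System.out.println(table.canonical(second) == first); // true: duplicate maps to the first array
        table.canonical(aa);
        System.out.println(table.canonical(bb) == bb);        // true: hash collides with "Aa", contents differ
    }
}
```

Without the Arrays.equals check, "BB" would be pointed at the array for "Aa" and silently change its value, which is why the comparison cannot be skipped.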
