Java 8u20 improves the memory efficiency of strings

A while ago I wrote the following article (though it was mostly just a loose translation).

String was actually a cause of memory leaks (note: before 1.7.0_06)

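The gist of that article is:

In versions before Java 7u06, cutting out part of a string with String.substring and similar methods reuses the char array that backs the original string, so even a small string can end up holding a reference to a large char array.
From Java 7u06 onward, this problem is avoided by simply copying only the needed part of the char array.

In other words, safety was improved at the cost of lower memory efficiency and an extra array copy.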
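
As a minimal sketch of the situation described above: on JDKs before 7u06, a commonly used workaround was to wrap the result of substring in new String(...), which makes a trimmed copy of the backing array. The class and method names below (SubstringCopyExample, prefixCopy) are hypothetical and only serve the illustration. On 7u06 and later the wrapper is redundant, because substring already copies.

```java
public class SubstringCopyExample {
    // Returns the first `length` characters as a String backed by its own array.
    // Before 7u06, substring() alone would keep sharing the large backing array;
    // new String(...) forces a trimmed copy there. On 7u06 and later
    // substring() already copies, so the wrapper is harmless but unnecessary.
    static String prefixCopy(String large, int length) {
        return new String(large.substring(0, length));
    }

    public static void main(String[] args) {
        // Build a large string so that the backing array is clearly oversized
        // compared with the small prefix we want to keep.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1_000_000; i++) {
            sb.append('x');
        }
        String large = sb.toString();

        String small = prefixCopy(large, 3);
        System.out.println(small);
    }
}
```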

Java 8u20 introduces a mechanism for improving memory efficiency

I just came across this article.

http://blog.codecentric.de/en/2014/08/string-deduplication-new-feature-java-8-update-20-2

What follows is, once more, almost entirely a loose translation. (Again?!)

Strings consume a lot of memory in any application. Especially the char[] containing the individual UTF-16 characters is contributing to most of the memory consumption of a JVM by each character eating up two bytes.
It is not uncommon to find 30% of the memory consumed by Strings, because not only are Strings the best format to interact with humans, but also popular HTTP APIs use lots of Strings. With Java 8 Update 20 we now have access to a new feature called String Deduplication, which requires the G1 Garbage Collector and is turned off by default.
String Deduplication takes advantage of the fact that the char arrays are internal to strings and final, so the JVM can mess around with them.

Various strategies for String Duplication have been considered, but the one implemented now follows the following approach:
Whenever the garbage collector visits String objects it takes note of the char arrays. It takes their hash value and stores it alongside with a weak reference to the array. As soon as it finds another String which has the same hash code it compares them char by char.
If they match as well, one String will be modified and point to the char array of the second String. The first char array then is no longer referenced anymore and can be garbage collected.

This whole process of course brings some overhead, but is controlled by tight limits. For example if a string is not found to have duplicates for a while it will be no longer checked.

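Strings consume a large amount of memory. In particular, the char array is encoded as UTF-16, so each char takes two bytes, and these arrays can account for a considerable share of the memory the VM uses.
It is not unusual for strings to make up 30% of the memory the VM consumes. Strings are not only the best format for interacting with humans; popular HTTP APIs also use them heavily.
Java 8u20 introduces a feature called String Deduplication. It requires the G1 garbage collector and is turned off by default. String deduplication works because the char array that makes up a string is private and final, internal to the String, so the JVM is free to manipulate it.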

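Various strategies for string deduplication were considered, but the one currently implemented works as follows.

When the garbage collector visits a String object, it takes note of the char array inside: it computes the array's hash value and stores it together with a weak reference to the array.

If it later finds another String object with the same hash value, it compares the two char arrays character by character, and if they are identical it modifies one of the String objects so that it points to the other's char array.

The char array that is no longer referenced then becomes eligible for garbage collection.

This procedure adds some overhead, but it is kept within tight limits. For example, a string for which no duplicate has been found for a while is dropped from further checks.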
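
The steps above can be imitated in plain Java as a rough sketch. This is not how HotSpot implements the feature; Str is a hypothetical stand-in for java.lang.String (whose char array cannot be swapped from application code), and DedupSketch and visit are names invented for the illustration. The table maps a content hash to weak references to arrays already seen.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {
    // Hypothetical stand-in for java.lang.String: just a holder of a char array.
    static final class Str {
        char[] value;
        Str(String s) { this.value = s.toCharArray(); }
    }

    // Hash of the array contents -> weak references to arrays seen so far.
    private final Map<Integer, List<WeakReference<char[]>>> table = new HashMap<>();

    // Simulates the collector visiting one string.
    // Returns true if the string was redirected to an already known array.
    boolean visit(Str s) {
        int hash = Arrays.hashCode(s.value);
        List<WeakReference<char[]>> bucket =
                table.computeIfAbsent(hash, h -> new ArrayList<>());
        for (WeakReference<char[]> ref : bucket) {
            char[] known = ref.get();
            if (known == null) continue;          // the array has been collected
            if (known == s.value) return false;   // already sharing this array
            if (Arrays.equals(known, s.value)) {  // same hash: compare char by char
                s.value = known;                  // old array becomes unreachable
                return true;
            }
        }
        bucket.add(new WeakReference<>(s.value)); // first time we see this content
        return false;
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        Str a = new Str("Japan");
        Str b = new Str("Japan");
        System.out.println(dedup.visit(a));     // false
        System.out.println(dedup.visit(b));     // true
        System.out.println(a.value == b.value); // true
    }
}
```

The weak reference matters here: the table itself must not keep an array alive once no string uses it any more.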

To enable this feature, turn on the G1 garbage collector and pass the following options:

-XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics

Apparently that is all that is needed.
The second argument means, as its name suggests, "print statistics":

[GC concurrent-string-deduplication, 4658.2K->0.0B(4658.2K), avg 99.6%, 0.0165023 secs]
   [Last Exec: 0.0165023 secs, Idle: 0.0953764 secs, Blocked: 0/0.0000000 secs]
      [Inspected:          119538]
         [Skipped:              0(  0.0%)]
         [Hashed:          119538(100.0%)]
         [Known:                0(  0.0%)]
         [New:             119538(100.0%)   4658.2K]
      [Deduplicated:       119538(100.0%)   4658.2K(100.0%)]
         [Young:              372(  0.3%)     14.5K(  0.3%)]
         [Old:             119166( 99.7%)   4643.8K( 99.7%)]
   [Total Exec: 4/0.0802259 secs, Idle: 4/0.6491928 secs, Blocked: 0/0.0000000 secs]
      [Inspected:          557503]
         [Skipped:              0(  0.0%)]
         [Hashed:          556191( 99.8%)]
         [Known:              903(  0.2%)]
         [New:             556600( 99.8%)     21.2M]
      [Deduplicated:       554727( 99.7%)     21.1M( 99.6%)]
         [Young:             1101(  0.2%)     43.0K(  0.2%)]
         [Old:             553626( 99.8%)     21.1M( 99.8%)]
   [Table]
      [Memory Usage: 81.1K]
      [Size: 2048, Min: 1024, Max: 16777216]
      [Entries: 2776, Load: 135.5%, Cached: 0, Added: 2776, Removed: 0]
      [Resize Count: 1, Shrink Threshold: 1365(66.7%), Grow Threshold: 4096(200.0%)]
      [Rehash Count: 0, Rehash Threshold: 120, Hash Seed: 0x0]
      [Age Threshold: 3]
   [Queue]
      [Dropped: 0]

This is the kind of output it produces.
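
To double-check that the options really reached the JVM, one possibility is to inspect the input arguments through the standard RuntimeMXBean. This is only a sketch: DedupFlagCheck and hasFlag are hypothetical names, and the check just looks for the literal flag strings on the command line (it does not see flags set through other means such as environment variables). -XX:+UseG1GC is the usual flag for turning on G1.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class DedupFlagCheck {
    // True if the exact flag string was passed on the JVM command line.
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        // Arguments given to the JVM itself, not to the main method.
        List<String> jvmArgs =
                ManagementFactory.getRuntimeMXBean().getInputArguments();

        System.out.println("G1 enabled:            "
                + hasFlag(jvmArgs, "-XX:+UseG1GC"));
        System.out.println("Deduplication enabled: "
                + hasFlag(jvmArgs, "-XX:+UseStringDeduplication"));
        System.out.println("Statistics enabled:    "
                + hasFlag(jvmArgs, "-XX:+PrintStringDeduplicationStatistics"));
    }
}
```

On Java 8u20 it could be started as: java -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics DedupFlagCheck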

For our convenience we do not need to add up all data ourselves but can use the handy totals calculation.
The above snippet is the forth execution of String Deduplication, it took 16ms and looked at about 120k Strings.
All of them are new, meaning not yet looked at. These numbers look different in real apps, where strings are passed multiple times, thus some might be skipped or have a hashcode already (as you might know the hash code of a String is computed lazy).
In above case all strings could be deduplicated, removing 4.5MB of data from memory.
The Table section gives statistics about the internal tracking table, and the Queue one lists how many requests for deduplication have been dropped due to load, which is one part of the overhead reduction mechanism.

Conveniently, we do not have to add up all the figures ourselves; the totals are already calculated for us.

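The output above is from the fourth run of string deduplication. It took 16 ms and inspected roughly 120,000 String objects.
All of them are new, that is, not yet inspected. (Translator's note: this refers to Known being 0.) In a real application, where strings are passed through multiple times, the numbers look different, because some strings are skipped and some already have a hash code (as you may know, a String's hash code is computed lazily).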

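In the case above, every string could be deduplicated, freeing 4.5 MB of memory.
The Table section gives statistics about the internal tracking table, and the Queue section shows how many deduplication requests were dropped due to load. This is part of the mechanism for keeping the overhead down.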

So how does this compare to String Interning? I blogged about how great String Interning is for memory efficiency. In fact the String Deduplication is almost like interning with the exception that interning reuses the whole String instance, not just the char array.

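So how does this differ from string interning (String#intern)? I have blogged before about how much string interning helps memory efficiency. In fact, string deduplication is almost the same as interning; the only difference is that interning reuses the whole String instance, whereas deduplication shares just the char array.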
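
For comparison, here is a small sketch of what interning looks like from application code. InternExample and sameInstanceAfterIntern are hypothetical names; String#intern itself is the standard API. With deduplication, by contrast, two strings stay separate objects and only their internal arrays are shared, which is not observable with ==.

```java
public class InternExample {
    // intern() returns the canonical instance from the string pool,
    // so two strings with equal content yield the very same object.
    static boolean sameInstanceAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        // new String(...) guarantees two distinct objects with equal content.
        String a = new String("Japan");
        String b = new String("Japan");

        System.out.println(a == b);                        // false
        System.out.println(a.equals(b));                   // true
        System.out.println(sameInstanceAfterIntern(a, b)); // true
    }
}
```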

The argument the creators of JDK Enhancement Proposal 192 make is that often developers do not know where the right place to intern strings would be, or that this place is hidden behind frameworks. As I wrote, you need some knowledge where you typically encounter duplicates (like country names).
String Deduplication also benefits duplicate Strings across applications inside the same JVM and thus also includes stuff like XML Schemas, urls, jar names etc which one normally would assume not appear multiple times.
It also adds no runtime overhead as it is performed asynchronously and concurrent during garbage collection, while String Interning happens in the application thread. This now also explains the reason we find that Thread.sleep() in above code. Without the sleep there would be too much pressure on GC, so String Deduplication would not run at all. But this is a problem only for such sample code. Real applications usually find a few ms spare time to run String Deduplication.

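The authors of JDK Enhancement Proposal 192 argue that developers often do not know where the right place to intern strings would be, or that this place is hidden behind frameworks.
As I wrote before, you need some knowledge of where duplicates typically occur (country names, for example).
String deduplication also benefits strings duplicated across applications running in the same JVM, so it covers things like XML schemas, URLs and jar file names, which one would not normally expect to appear more than once.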

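String interning runs on the application thread, whereas string deduplication is performed asynchronously and concurrently during garbage collection, so it adds no runtime overhead.
This is also why the sample code above calls Thread.sleep. (Translator's note: the original article includes sample code.)
Without the sleep there would be too much pressure on the GC, and string deduplication would not run at all.
That is a problem only for sample code like this, though; real applications usually have a few milliseconds to spare.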
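
Since the sample code is not reproduced here, the following is a rough sketch of what such a test program might look like. It is not the code from the original article: DedupDemo, makeDuplicates, and the round counts and sleep interval are all assumptions chosen for illustration. It keeps many equal but distinct strings alive and pauses between rounds so the collector has idle time.

```java
import java.util.ArrayList;
import java.util.List;

public class DedupDemo {
    // Creates `count` strings drawn from only `distinct` different contents.
    // Each one is a separate object, so there are many duplicates to find.
    static List<String> makeDuplicates(int count, int distinct) {
        List<String> out = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            out.add(new String("value-" + (i % distinct)));
        }
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        // Keep the strings reachable so they survive collections
        // and can be inspected by the deduplication pass.
        List<String> retained = new ArrayList<>();
        for (int round = 0; round < 20; round++) {
            retained.addAll(makeDuplicates(50_000, 10));
            // Pause so the GC is not under constant pressure
            // (assumed interval; tune as needed).
            Thread.sleep(100);
        }
        System.out.println("Retained strings: " + retained.size());
    }
}
```

Run with the options shown earlier, the statistics output would indicate whether any of these strings were actually deduplicated.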