Turbocharge your string keyed hashmaps
Posted on April 17th, 2008 in java, ruby, software |
This post gives you a small tip which just might make a world of difference to your java hashmap’s performance. This trick has been inspired by the “symbol” construct in Ruby language.
I have often considered using hash maps using Strings as keys as quite expensive indeed. And in many ways they often are. However if the keys used in your hashmap are either a well known set at the time of either writing the code or at least when the program starts up, the following is likely to help you make your map performance much much zippier.
In case you are not familiar with Ruby, it has a special construct called a symbol which is somewhat similar to a constant string. However you can create as many instances of it, but ruby runtime will ensure that multiple instances having the same character data will refer to the same runtime instance.
The design of any key will influence the performance of the hashmap primarily based on the performance of its hashcode and equals methods. The java.utils.HashMap implementation uses the result of hashCode() to narrow down the potential number of keys to be compared and then compares the keys based on whether they are the same instance (ie. occupy the same address space in memory) or in case they aren’t then by invoking the equals() method.
Thus if one wants to use Strings as keys, then there are at least two optimizations that could be potentially targeted :
(a) The hashcode could be cached rather having to be computed each time (Turns out this makes a positive but a rather small difference)
(b) Ensure that the same instance of strings get used for the same string data. (Turns out this does make a substantial difference).
The following two pieces of code indicate the difference.
Slower Code
map.put(new String("mykey"),/* .. some value .. */);
Object o = map.get(new String("mykey"));
Faster Code
String key = "mykey";
map.put(key,/* .. some value .. */);
// Note : In this case the same instance of the key is
// is used in both the get and the put
Object o = map.get(key);
The big reason why this makes a difference is the following line of code in java.util.HashMap
// The following line has two ampersand signs indicating a logical // and. Formatting is destroying the way it looks // (and I do not know how to fix it) if (e.hash == hash && ((k = e.key) == key || key.equals(k))) ...
Thus when comparing the keys, the code first tests whether they are identical and then they are equal. Obviously the test for identity is substantially inexpensive compared to that of equality. Thus the faster code shown above is faster since the keys are identical.
In order to be able to provide the same capability of ensuring that only one instance of a string key with a particular string data is constructed, while taking away the onus from the programmer of having to track the instance creation, a wrapper class called Symbol is used as shown below :
Update:I have updated the version to address Khalil’s and Dave’s concerns and suggestions. The important part of the modification is that the hashmap has been done away with and the getSymbol() method is modified. Earlier code which has been replaced has been commented out.
package com.dnene.utils.symbolmap;
/**
* License : Based on BSD Template
*
* Copyright (c) 2008, Dhananjay Nene
* All rights reserved.
*
* Redistribution and use in source and binary forms,
* with or without modification, are permitted provided
* that the following conditions are met:
*
* * Redistributions of source code must retain the
* above copyright notice, this list of conditions
* and the following disclaimer.
* * Redistributions in binary form must reproduce
* the above copyright notice, this list of
* conditions and the following disclaimer in the
* documentation and/or other materials provided
* with the distribution.
* * The name of the Dhananjay Nene may not be used
* to endorse or promote products derived from this
* software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
* CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
* MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
* CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
* SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
* OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
* POSSIBILITY OF SUCH DAMAGE.
*/
import java.util.HashMap;
import java.util.Map;
public final class Symbol
{
private final String t;
private final int hashcode;
// not required after update
// private static Map<String,Symbol> map = new HashMap<String,Symbol>();
// method simplified upon update
// public static Symbol getSymbol(String t)
// {
// Symbol symbol = map.get(t);
// if (symbol == null)
// {
// symbol = new Symbol(t);
// map.put(t, symbol);
// }
// return symbol;
// }
public static Symbol getSymbol(String t)
{
return new Symbol(t.intern());
}
private Symbol(String t)
{
this.t = t;
this.hashcode = t.hashCode();
}
public final String get()
{
return t;
}
@Override
public int hashCode()
{
return this.hashcode;
}
@Override
public final boolean equals(Object that)
{
return ((that instanceof Symbol) ?
(this.t == (((Symbol)that).t)) : false);
}
}
The code snippets shown earlier would be modified as follows to use Symbol
Symbol usage :
Symbol key = Symbol.getSymbol("mykey");
map.put(key,/* .. some value .. */);
// Note : In this case another instance of the symbol is created
// Note : You can also use key.get() in this case so long as you use it consistently
Symbol key = Symbol.getSymbol("mykey");
Object o = map.get(key);
Update :As per Dave’s suggestion in using String.intern(), the following could alternatively be used. Note that the performance benefits are to be had only if the symbol construction or the intern() call are made far less frequently than the calls to the getter on the map.
Symbol key = new String("mykey").intern();
map.put(key,/* .. some value .. */);
// ... elsewhere in code
Symbol key2 = new String("mykey").intern();
Object o = map.get(key2);
Performance Difference
In my benchmarks involving million keys, the get performance of maps using identical String keys and symbols was almost the same (the symbol based implentation was roughly either slower by 3% or faster by upto 10% with the average performance of the symbol implementation being faster by 4%). I did not benchmark the puts since I did not imagine they would change much. However the Symbol based implementation get method was consistently faster by at least 30% when compared to the String map performance when the string when the keys used during the ‘put’ where equal to but not identical to those used during the ‘get’. There is an overhead of creating a symbol instance compared to a string, but under most circumstances I believe this should be much more than offset by the gains.What surprised me was that the Symbol based String keys beat the get performance of using Long keys quite handsomely and consistently. I am not sure why it works so and am not sure if I had made a mistake. However since using Long keys was not something I was particularly focused on - I did not hunt down the reason for the performance difference between Symbol and Long based keys.
Update : adding a sample output of performance benchmarking sample run. The runs consist of hashmaps of million entries each, with each key being 16 characters long, and a million lookups getting done. The times mentioned are in nanoseconds for each run of a million lookups.
=== Comparison of Symbol to Long === Long lookup time : 246225003 Symbol lookup time : 148041217 Performance Ratio : 1.6632193 Reduction in time : 39% === Comparison of Symbol to Identical Strings === Usual lookup time : 151010200 Symbol lookup time : 148041217 Performance Ratio : 1.0200552 Reduction in time : 1% === Comparison of Symbol to String Copies === Copy lookup time : 231829056 Symbol lookup time : 148041217 Performance Ratio : 1.5659764 Reduction in time : 36%
Am adding the link to the source and the driver files for running your tests independently.
Symbol source and driver program
Field of use
This should be useful in a fairly large proportion of typical applications. In most most situations the possible universe of keys in the hashmap are known upfront either when writing the code or when starting up the application. If instead of creating hard coded strings or by using various string key parameters from say an XML file .. just use a Symbol instead. The construction is a little expensive but the map gets run much faster.This solution is not limited to a String. It could actually be used for any data structure which has a high cost of either hash code computation and / or equality check. In fact the version I wrote for myself was a Symbol<T>. The code shown above is only a specialised version where T is a String.
13 Responses
How could that be faster when you are just deferring the regular String lookup to the Symbol map and then only slightly optimizing *another* hash lookup. If you make a case to keep a hard reference to the Symbol object outside of the map, then you might as well do that for the String itself and get the same optimization.
@Jacob The *another* hash lookup you mention will try to check for the identity and then equality of keys. The class Symbol is constructed such that underlying strings are always identical (even if the symbol is constructed twice using two non-identical but equal strings). This is what makes the big difference.
Doesn’t String.intern() effectively do what you want?
Of course, this still seems like a pointless optimization. In my test of doing a million puts/gets, the difference in your two approaches about 100ms, or .104 milliseconds per operation. Not sure that would make a difference in most application contexts….
@Dave. Have updated the post with the sources to be able to verify my findings. I missed the Strings.intern() part. I think it will certainly make for a simpler Symbol implementation. However the Symbol implementation will still be required.
Looking at your tests, theres a couple issues: 1) you run the ‘usual’ test first without a JVM warmup– so that 1% probably shouldn’t even be there, I’d even say you’d hit a negative ratio there. 2) You aren’t accounting for the Symbol map lookup in your tests and are using a hard reference to the symbols. A real world example of the testSymbol would be to walk through each String key, do a Symbol.get(…) and then do the hashmap lookup.
@Jacob. I agree with the non-warming up and the non accounting of symbol map lookup during Symbol construction time. Here’s the rationale. Actually I am not so concerned with the timings for the usual (actually I should’ve called it identical). If a programmer can keep track of ensuring that the strings are identical - either explicitly or by always keeping a cached version of String.intern() result as suggested by Dave then Symbol class isn’t really required. The 1% keeps on varying between slightly negative and slightly positive across different runs (basically the additional cost of lookup and the cached value of hashcode are being traded off with each other). Yes the test does not reflect the cost of creation of a symbol. But in most programs, the number of times symbol is created can be made to be much much lower than the number of times it is used for lookups. What you suggest of as a real world example is actually not necessarily a good real world example - since it is quite feasible to ensure that the symbols get created far less often than the getter is called.
@Dave. To correct myself, I think if String.intern() is called each time a string key is used, one should get similar results without needing a Symbol class.
Well to begin with String being mostly an immutable object computing the hashcode once seems a natural and hashCode does cache the calculated value. The reason why fast code is indeed faster is because the compiler interns String literals http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#intern()
As to the static HashMap in the symbol class it is I am afraid a memory leak being a cache with no eviction strategy and one that can not be flushed, as a bonus getSymbol is not Thread safe! Static is indeed evil http://gbracha.blogspot.com/2008/02/cutting-out-static.html
@khalil Addressed your concerns in an updated version.
Calling intern can lead to permgen memory exhaustion see this bug issue on Xstream http://jira.codehaus.org/browse/XSTR-395 the fix involves using weak references to get around the cache size issue
If performance is the main issue and you are creating a Symbol class anyway, consider putting the associated values directly in Symbol’s fields. That will be as fast as it gets.
[…] /var/log/mind » Blog Archive » Turbocharge your string keyed hashmaps - In most most situations the possible universe of keys in the hashmap are known upfront either when writing the code or when starting up the application. If instead of creating hard coded strings or by using various string key parameters from say an XML fi Tags: actionscript, adobe, air, amazon, apple, aws, blog, business, cloudcomputing, Cluster, datacenter, development, ec2, ecma, flash, flex, foss, hashmap, ide, idea, intellij, internet, iptv, iso, java, jetbrains, Linux, mac, microsoft, mxml, networking, norway, ooxml, performance, politics, programming, ria, s3, security, standards, telecom, tips, tools, UVerse, virtualization, virtualpc, vmware, Web2.0, windows, xensource […]
I have many doubts about your optimization:
1. You should NEVER use new String(String). It forces the JVM to create a new String object.
2. String literals, such as “mykey”, are already automatically interned by Java.
3. The call to intern is slow, since it performs a lookup in a hashtable containing ALL String literals.
4. Hashcode is already stored in String, and calculated the first time it is needed (see source code of String.java)