令人惊叹的HashMap—数据结构与hash函数

看了下 java 8 中 HashMap 的源码，结合网上一些文章，大概明白了 HashMap 中几个一直很疑惑的地方。这里就记录一下 HashMap 的学习。

数据结构和一些参数的作用

HashMap 的数据结构“平平无奇”，就是我们熟悉的“数组+链表”的方式（链地址法），并且在链表长度超过某个值（默认为 8）后，将链表转换为红黑树以提高效率。

HashMap 存储数据的数组是：

1	transient Node<K,V>[] table;

也即一个 Node 数组，Node 实现 Entry 接口，每个 Node 存储一个键值对，如下：

final int hash;
final K key;
V value;
Node<K,V> next;

其中，hash 是存储的 node 中的 key 的哈希值，key、value 即键和值，而 next 则显然是构建链表的，指向链表的下一个节点。

HashMap 中还有一些默认参数，主要是：

/**
 * The default initial capacity - MUST be a power of two.
 */
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

/**
 * The maximum capacity, used if a higher value is implicitly specified
 * by either of the constructors with arguments.
 * MUST be a power of two <= 1<<30.
 */
static final int MAXIMUM_CAPACITY = 1 << 30;

/**
 * The load factor used when none specified in constructor.
 */
static final float DEFAULT_LOAD_FACTOR = 0.75f;

/**
 * The bin count threshold for using a tree rather than list for a
 * bin.  Bins are converted to trees when adding an element to a
 * bin with at least this many nodes. The value must be greater
 * than 2 and should be at least 8 to mesh with assumptions in
 * tree removal about conversion back to plain bins upon
 * shrinkage.
 */
static final int TREEIFY_THRESHOLD = 8;

/**
 * The bin count threshold for untreeifying a (split) bin during a
 * resize operation. Should be less than TREEIFY_THRESHOLD, and at
 * most 6 to mesh with shrinkage detection under removal.
 */
static final int UNTREEIFY_THRESHOLD = 6;

/**
 * The smallest table capacity for which bins may be treeified.
 * (Otherwise the table is resized if too many nodes in a bin.)
 * Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
 * between resizing and treeification thresholds.
 */
static final int MIN_TREEIFY_CAPACITY = 64;

注释已经讲的很清楚，注意的是容量 capacity 必须是 2 的整数次幂。

field 中注意的是：

/**
 * The next size value at which to resize (capacity * load factor).
 *
 * @serial
 */
// (The javadoc description is true upon serialization.
// Additionally, if the table array has not been allocated, this
// field holds the initial array capacity, or zero signifying
// DEFAULT_INITIAL_CAPACITY.)
int threshold;

当 size（k-v 对数量）超过了 threshold （capacity * load factor）时，将会进行一次扩容。

为什么容量 capacity 必须是 2^n —— 哈希函数和 rehash

之前一直对这个问题不理解，最近看了一篇文章讲解了这点才恍然大悟，不得不说，在这一约定下，HashMap 实现的非常精妙，尤其是在扩容之后对原有节点的 rehash 中。

先看一下 HashMap 的哈希函数，实现是这样的，

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

不管它什么高低位移位，总之，这一函数是根据 key 的 hashCode 计算的哈希值，但是这样算出来的值，显然是无法直接对应到 table 数组的索引的，也即无法直接确定元素位置，那么怎么求出元素处于的位置索引呢？

HashMap 中有一段出现了不知道多少次的 if 判断：

1 2	if ((p = tab[i = (n - 1) & hash]) == null) ...

这里的 n 是数组 table 的长度（table.length，注意不是 size（k-v 对的数量）），tab[i = (n - 1) & hash] 显然是取数组中的值，但是 (n - 1) & hash 怎么就变成了数组索引，让人有些不解。其实这里是一个数学上的现象。

对于 2^n，二进制形式中只有一个 1，其他位全为零（1 不会是在第零位），那么(2^n - 1) 的二进制形式，自然是高 p 位全为 0，低 q 位全为 1。n 即数组长度是 2^n，那么“(n - 1) & hash”的作用就类似于掩码，结果将只保留 n 的范围内的二进制位，(n - 1) & hash 实际上就相当于 hash % n

所以经过这样一个“magic”的操作，就拿到了 key 对应的数组索引，实现了一个哈希函数。

我们都学过，哈希函数中，如果取模的话，值应该是选一个合适大小的素数，比如小于长度的最大素数，但是 HashMap 使用容量取模作为哈希函数，却要求容量是 2 的整数次幂，不细看确实不解，真正让人惊叹的正是这一约定在 rehash 时带来的特性。

每次扩容后，按照约定，newCap = oldCap * 2，相应的，对原来存储的节点，需要进行 rehash，移动到在新的数组中的位置。

而 newCap 则就是 oldCap 左移了 1 位（右端补零），这样，与 (newCap - 1) 的 & 运算，如果 hash 在 newCap 的最高的 1 的位置的值是 0，那么 (newCap - 1) & hash 仍是原值，结果不变；而如果这一位置是 1，那么 (newCap - 1) & hash 则相当于增加了 newCap / 2 = oldCap，结果变成了 (oldCap - 1) & hash + oldCap

说起来比较绕，举个栗子：

假设原 n = 16 = (10000)B，那么 (n - 1) = 15 = (01111)B，假设 key1 的 hash = (00011)B，那么 (n - 1) & key1.hash 即
  01111
& 00011
—————————
  00011
所以 key1 的索引是 3；
假设 key2 的 hash = (10111)B，那么 (n - 1) & key2.hash 即
  01111
& 10111
—————————
  00111
所以 key2 的索引是 7；

而在扩容一次后，

n = 32 = (100000)B，那么 (n - 1) = 31 = (011111)B，key1 的 hash = (00011)B，那么 (n - 1) & key1.hash 即
  011111
& 000011
—————————
  000011
所以 key1 的索引仍然是 3；
key2 的 hash = (10111)B，那么 (n - 1) & key2.hash 即
  011111
& 010111
—————————
  010111
所以 key2 的索引是 (111)B + (10000)B = 7 + 16，即“原索引 + 原容量”

以上栗子中，对比扩容时 key1 和 key2 的情况，不难发现，扩容后索引要么是原值不变，要么是“原索引 + 原容量”，而两种情况只需要判断第 4 位的值是 0 还是 1 即可区分。

所以我们可以看到 resize() 函数中有这么一段“神秘莫测”的代码，使用 newCap & hash 判断元素位置是否需要移动：

Node<K,V> loHead = null, loTail = null;
Node<K,V> hiHead = null, hiTail = null;
Node<K,V> next;
do {
    next = e.next;
    if ((e.hash & oldCap) == 0) {
        if (loTail == null)
            loHead = e;
        else
            loTail.next = e;
        loTail = e;
    }
    else {
        if (hiTail == null)
            hiHead = e;
        else
            hiTail.next = e;
        hiTail = e;
    }
} while ((e = next) != null);
if (loTail != null) {
    loTail.next = null;
    newTab[j] = loHead;
}
if (hiTail != null) {
    hiTail.next = null;
    newTab[j + oldCap] = hiHead;
}

key 到底怎样才是相同

也是一段出现了不知道多少次的判断：

1 2	if (e.hash == hash && ((k = e.key) == key \|\| (key != null && key.equals(k))))

就是在以下情况下认为是相同的 key：

求出的 hash 值相同（根据 hashCode() 的值求出，即是比较 hashCode() 是否相同）
满足 1 的情况下，（key 是否是相同引用，或者都是 null（key 可以是 null，并且 null 和 null 也认为是相等）） || （key1.equals(key2) 返回 true）

搞清楚了 HashMap 的哈希函数和 HashMap 到底认为什么情况下 key 才是相同的这两个问题之后，那么 get/put 函数的大致流程也就清楚了（table 是否为空 -> 查找 key -> 遍历链表/查找红黑树，table 是否为空，空时 resize() -> 查找 key -> 遍历链表/查找红黑树 -> 找到替换/找不到放入 -> 链表长度大于 8 时转换为红黑树 -> 需要时进行扩容）。