If the buckets get too full, then we have to look through
a very long linked list.
And that's kind of defeating the point.
So here's an example where I have four buckets.
I have elephant and badger in my HashSet so far.
This is a pretty good situation, right?
Each bucket has zero or one elements.
Now we put two more elements into our HashSet.
buckets  elements
-------  --------
0        elephant
1        otter
2        badger
3        cat
This isn't too bad either.
Every bucket only has one element.
So if I wanna know, does this contain panda?
I can very quickly look at bucket number 1, it's not there, and
I know it's not in our collection.
If I wanna know if it contains cat, I look at bucket
number 3,
I find cat, and I very quickly know it's in our
collection.
What if I add koala? Well, that's not so bad.
buckets  elements
-------  --------
0        elephant
1        otter -> koala
2        badger
3        cat
Maybe now, instead of looking at only one element in
bucket number 1,
I need to look at two.
But at least I don't have to look at elephant, badger and
cat.
If I'm again looking for panda, it can only be in bucket
number 1 and
I don't have to look at anything other than otter and
koala.
But now I put alligator in bucket number 1, and you can
see maybe where this is going.
If bucket number 1 keeps getting bigger and bigger and
bigger, then I'm basically having to look through all of
those elements to find
something that should be in bucket number 1.
buckets  elements
-------  --------
0        elephant
1        otter -> koala -> alligator
2        badger
3        cat
If I start adding strings to other buckets,
right, the problem just gets bigger and bigger in every
single bucket.
How do we stop our buckets from getting too full?
The solution here is that
"the HashSet can automatically
resize the number of buckets."
The HashSet realizes that the buckets are getting
too full.
It's losing the advantage of the O(1) lookup for
elements.
And it'll just create more buckets (generally twice as many as before) and
then place the elements into the correct buckets.
So here's our basic HashSet implementation with separate
chaining.
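That code isn't reproduced in this transcript, so here's a minimal sketch of what a fixed-size, separate-chaining HashSet might look like, using the toy length-based hash from the examples above (the class and method names are my own assumptions, not the original code):

import java.util.LinkedList;

public class SimpleHashSet {
    // One LinkedList per bucket; each list chains the elements
    // that hash to that bucket.
    private LinkedList<String>[] buckets;

    @SuppressWarnings("unchecked")
    public SimpleHashSet(int numBuckets) {
        buckets = new LinkedList[numBuckets];
        for (int i = 0; i < numBuckets; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    // Toy hash from the examples: the string's length, mod the
    // number of buckets. A real implementation would use hashCode().
    private int bucketIndex(String s) {
        return s.length() % buckets.length;
    }

    public boolean contains(String s) {
        // Only the one bucket s could hash to needs to be scanned.
        return buckets[bucketIndex(s)].contains(s);
    }

    public void add(String s) {
        if (!contains(s)) {
            buckets[bucketIndex(s)].add(s);
        }
    }
}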
Now I'm going to create a "self-resizing HashSet".
This HashSet is going to realize that the buckets are
getting too full and
it needs more buckets.
loadFactor is another field in our HashSet class.
loadFactor represents the average number of elements per
bucket,
above which we want to resize.
loadFactor is a balance between space and time.
If the buckets get too full then we'll resize.
That takes time, of course, but
it may save us time down the road if the buckets are a
little more empty.
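In code, that boils down to a couple of fields and a threshold check, something like this sketch (the field names follow the transcript's description; the exact code is an assumption):

private LinkedList<String>[] buckets;  // the array of buckets
private int currentSize;               // number of elements in the set
private double loadFactor = 0.75;      // average elements per bucket we'll tolerate

// Resize once the average number of elements per bucket exceeds loadFactor.
private boolean overloaded() {
    return (double) currentSize / buckets.length > loadFactor;
}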
Let's see an example.
Here's a HashSet, we've added four elements so far.
Elephant, dog, cat and fish.
buckets  elements
-------  --------
0
1
2        elephant
3        cat -> dog
4        fish
5
At this point, I've decided that the loadFactor, the
threshold,
the average number of elements per bucket that I'm okay
with, is 0.75.
The number of buckets is buckets.length, which is 6, and
at this point our HashSet has four elements, so the
current size is 4.
We'll resize our HashSet, that is we'll add more buckets,
when the average number of elements per bucket exceeds
the loadFactor.
That is when current size divided by buckets.length is
greater than loadFactor.
At this point, the average number of elements per bucket
is 4 divided by 6.
4 elements, 6 buckets, that's 0.67.
That's less than the threshold I set of 0.75 so we're
okay.
We don't need to resize.
But now let's say we add woodchuck.
buckets  elements
-------  --------
0
1
2        elephant
3        woodchuck -> cat -> dog
4        fish
5
Woodchuck would end up in bucket number 3.
At this point, the currentSize is 5.
And now the average number of elements per bucket
is the currentSize divided by buckets.length.
That's 5 elements divided by 6 buckets is 0.83.
And this exceeds the loadFactor which was 0.75.
In order to address this problem, and to make the
buckets a little
more empty so that operations like determining whether a
bucket contains
an element will be a little less complex, I wanna resize
my HashSet.
Resizing the HashSet takes two steps.
First I'll double the number of buckets, I had 6 buckets,
now I'm going to have 12 buckets.
Note here that the loadFactor, which I set to 0.75, stays the same.
But the number of buckets has changed to 12,
and the number of elements stayed the same at 5.
5 divided by 12 is around 0.42, that's well under our
loadFactor,
so we're okay now.
But we're not done because some of these elements are in
the wrong bucket now.
For instance, elephant.
Elephant was in bucket number 2 because the number of
characters in elephant
is 8.
We have 6 buckets, and 8 mod 6 is 2.
That's why it ended up in bucket number 2.
But now that we have 12 buckets, 8 mod 12 is 8, so
elephant does not belong in bucket number 2 anymore.
Elephant belongs in bucket number 8.
What about woodchuck?
Woodchuck was the one that started this whole problem.
Woodchuck ended up in bucket number 3.
Because 9 mod 6 is 3.
But now we do 9 mod 12.
9 mod 12 is 9, woodchuck goes to bucket number 9.
And you see the advantage of all this.
Now bucket number 3 only has two elements whereas before
it had 3.
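To make the movement concrete, here's the index arithmetic with the toy length-based hash (illustrative values only, not code from the lesson):

int oldElephant  = "elephant".length() % 6;    // 8 % 6  == 2 -> old bucket 2
int newElephant  = "elephant".length() % 12;   // 8 % 12 == 8 -> new bucket 8
int oldWoodchuck = "woodchuck".length() % 6;   // 9 % 6  == 3 -> old bucket 3
int newWoodchuck = "woodchuck".length() % 12;  // 9 % 12 == 9 -> new bucket 9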
So here's our code,
where we had our HashSet with separate chaining that
didn't do any resizing.
Now, here's a new implementation where we use resizing.
Most of this code is the same,
we're still going to determine whether it contains the
value already.
If it doesn't, then we'll figure out which bucket it
should go into and
then add it to that bucket, add it to that LinkedList.
But now we increment the currentSize field.
currentSize was the field that kept track of the number
of elements in our HashSet.
We're going to increment it and then we're going to look
at the average load,
the average number of elements per bucket.
We'll do that division down here.
We have to do a little bit of casting here to make sure
that we get a double.
And then we'll compare that average load to the
loadFactor field,
which I set to 0.75 when I created this HashSet,
for instance.
If the average load is greater than the loadFactor,
that means there's too many elements per bucket on
average, and I need to reinsert.
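Putting that together, the add method with resizing might look like this sketch, building on the fixed-size version above (currentSize, loadFactor, and reinsertAll follow the transcript's description; the exact names are assumptions):

public void add(String s) {
    if (contains(s)) {
        return;                      // sets ignore duplicates
    }
    buckets[bucketIndex(s)].add(s);  // add to the right bucket's LinkedList
    currentSize++;                   // one more element in the set

    // Cast so the division yields a double rather than truncating.
    double averageLoad = (double) currentSize / buckets.length;
    if (averageLoad > loadFactor) {
        reinsertAll();               // too full on average: grow and rehash
    }
}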
So here's our implementation of the method to reinsert
all the elements.
First, I'll create a local variable called oldBuckets.
Which is referring to the buckets as they currently stand
before I start resizing everything.
Note I'm not creating a new array of linked lists just yet.
I'm just renaming buckets as oldBuckets.
Now remember buckets was a field in our class, I'm going
to now create a new array
of linked lists but this will have twice as many elements
as it did the first time.
Now I need to actually do the reinserting,
I'm going to iterate through all of the old buckets.
Each element in oldBuckets is a LinkedList of strings,
that is, a bucket.
I'll go through that bucket and get each element in that
bucket.
And now I'm gonna reinsert it into the newBuckets.
I will get its hashCode.
I will figure out which index it is.
And now I get the new bucket, the new LinkedList of
strings and
I'll add it to that new bucket.
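Here's a sketch of that reinsertion method, following the steps just described (again, the exact names are assumptions):

@SuppressWarnings("unchecked")
private void reinsertAll() {
    // Keep a reference to the old array; "rename" buckets as oldBuckets.
    LinkedList<String>[] oldBuckets = buckets;

    // Twice as many buckets as before, each starting empty.
    buckets = new LinkedList[oldBuckets.length * 2];
    for (int i = 0; i < buckets.length; i++) {
        buckets[i] = new LinkedList<>();
    }

    // Re-hash every element from every old bucket into its new bucket;
    // bucketIndex now uses the doubled buckets.length.
    for (LinkedList<String> oldBucket : oldBuckets) {
        for (String s : oldBucket) {
            buckets[bucketIndex(s)].add(s);
        }
    }
}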
So to recap, HashSets, as we've seen, are arrays of
LinkedLists, or buckets.
A self-resizing HashSet can realize, using some ratio
(the loadFactor), that the buckets are getting too full,
and then grow the number of buckets and reinsert the
elements so that lookups stay fast.

As an aside, the resize logic in Oracle JDK 7's HashMap
wasn't really well written, which is part of why the JDK
developers rewrote HashMap in Java 8. In Oracle JDK 7,
you can see that in the addEntry method (called from
put(k, v)), the resize method will only be called when
(size >= threshold) && (null != table[bucketIndex]),
which means that size has to reach the load factor
(i.e. 75%) of the capacity, AND the current bucket has a
collision. Therefore, the load factor is only part of the
story in Oracle JDK 7. In Oracle JDK 8, the latter
condition no longer exists.