Windows Vista Forums
Vista Forums Home Join Vista Forums Windows 7 Forum Vista Tutorials Tags
Welcome to Windows Vista Forums. Our forum is dedicated to helping you find solutions with any problems, errors or issues you are experiencing with Windows Vista. The Vista forum also covers news and updates and has an extensive Windows Vista tutorial section that covers a wide range of tips and tricks.

Go Back   Vista Forums > Misc Newsgroups > .NET General

Vista - List<> versus HashSet<>: Which is more efficient for lookups on small sets?

Reply
 
Old 04-06-2008   #1 (permalink)
Jules Winfield


 
 

List<> versus HashSet<>: Which is more efficient for lookups on small sets?

Assume I have a set of items, integers let's say, and I need to determine if
a particular integer exists within the set.

If the list contained 10,000 items, clearly I'd want to use some type of
dictionary class (Dictionary<> or HashSet<> or SortedList<>) instead of a
List<>. If the list only contained two or three items, a simple array or
List<> would probably be more efficient. Is there some general rule that
dictates the size at which one would be better off using a dictionary-based
class as opposed to an array or List<> for lookups?

Thanks,

Jules



My System SpecsSystem Spec
Old 04-06-2008   #2 (permalink)
Jeroen Mostert


 
 

Re: List<> versus HashSet<>: Which is more efficient for lookupson small sets?

Jules Winfield wrote:
Quote:

> Assume I have a set of items, integers let's say, and I need to determine if
> a particular integer exists within the set.
>
> If the list contained 10,000 items, clearly I'd want to use some type of
> dictionary class (Dictionary<> or HashSet<> or SortedList<>) instead of a
> List<>. If the list only contained two or three items, a simple array or
> List<> would probably be more efficient. Is there some general rule that
> dictates the size at which one would be better off using a dictionary-based
> class as opposed to an array or List<> for lookups?
>
No. Profile and see. Rules of thumb can be formulated, but it's still
heavily dependent on specific use (not to mention processor architecture,
cache sizes and all sorts of other things you really don't want to be
bothering with in general). I usually put the cutoff at 100 items as an
order of magnitude approximation (i.e. if I know it will never be more than
100 I use a list), but I don't kid myself thinking that this will always
yield optimal results.

The important thing to keep in mind is scalability. Assuming that a
collection will always be small is a dangerous thing to do and requires that
one can show why not only why this is so but also why it pays off to count
on it. It's obviously much more costly to gamble wrong when you assume
collections will be small than it is to gamble wrong when you assume they'll
be big.

--
J.
My System SpecsSystem Spec
Old 04-06-2008   #3 (permalink)
Barry Kelly


 
 

Re: List<> versus HashSet<>: Which is more efficient for lookups on small sets?

Jules Winfield wrote:
Quote:

> Assume I have a set of items, integers let's say, and I need to determine if
> a particular integer exists within the set.
>
> If the list contained 10,000 items, clearly I'd want to use some type of
> dictionary class (Dictionary<> or HashSet<> or SortedList<>) instead of a
> List<>. If the list only contained two or three items, a simple array or
> List<> would probably be more efficient. Is there some general rule that
> dictates the size at which one would be better off using a dictionary-based
> class as opposed to an array or List<> for lookups?
A general rule? Don't micro-optimize, and use a library collection with
the right API and, if possible, performance curve (i.e. big-Oh) for your
data set. The right API is more important: you can replace the
implementation later if measurements show that it matters.

For cases where it really does matter, you need to measure. From my
measurements (microbenchmarks with a hot cache), the switchover to
Dictionary<> or HashSet<> being more efficient than a simple array comes
pretty quickly, and they're not particularly slow to start with
(normally beating List<int> on almost anything):

Total times for 1310720 iterations of 2 probes in collections of size 4:
List<int> (linear scan) 0.186 sec
List<int> (binary search) 0.231 sec
HashSet<int> 0.140 sec
Dictionary<int,int> 0.154 sec
int[] (manual inline scan) 0.109 sec
int[] (Array.IndexOf<int>) 0.154 sec
int[] (Array.BinarySearch<int>) 0.190 sec

Total times for 327680 iterations of 10 probes in collections of size
30:
List<int> (linear scan) 0.412 sec
List<int> (binary search) 0.330 sec
HashSet<int> 0.165 sec
Dictionary<int,int> 0.181 sec
int[] (manual inline scan) 0.183 sec
int[] (Array.IndexOf<int>) 0.250 sec
int[] (Array.BinarySearch<int>) 0.285 sec

Total times for 327680 iterations of 10 probes in collections of size
100:
List<int> (linear scan) 0.903 sec
List<int> (binary search) 0.392 sec
HashSet<int> 0.170 sec
Dictionary<int,int> 0.181 sec
int[] (manual inline scan) 0.338 sec
int[] (Array.IndexOf<int>) 0.382 sec
int[] (Array.BinarySearch<int>) 0.353 sec

Bear in mind that linear scans through an array are probably slightly
kinder on the cache than HashSet, but linear scans only scale linearly.
HashSet is probably quite a bit kinder on cache than (e.g.)
SortedDictionary, but of course, it's not sorted.

* List<int> tests done with IndexOf & BinarySearch
* manual inline scan is a hand-coded for-loop
* The numbers above were gotten by doubling the iteration count,
(starting with 10), until the fastest time was greater than 1 ms.
* The profiling infrastructure is a constant baked into every time - it
includes calling through a delegate
* Times are best run of 5 on a Q6600 @ 2.8GHz, .NET 3.5, release mode.

-- Barry

--
http://barrkel.blogspot.com/
My System SpecsSystem Spec
Old 04-06-2008   #4 (permalink)
Steven Cheng [MSFT]


 
 

RE: List<> versus HashSet<>: Which is more efficient for lookups on small sets?

Hi Jules,

For such performance issue, it's hard to give a definite number or boundary
based on appearnce, I agree other members's suggestion on profiling the
application to get the actual result.

For general rules, if the count of elements is a small fixed number(such as
less than 100 and will not increase ), it's ok to use simple List or Array
since it just take little constant time. However, if the count may vary(in
large range) and the search is performing frequently against it, use a more
search-ready structure(such as binary sort tree or hashtable) is preferred.
I suggest you have a look at the book "Programming Pearls" which may give
you some good ideas on this.

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead


Delighting our customers is our #1 priority. We welcome your comments and
suggestions about how we can improve the support we provide to you. Please
feel free to let my manager know what you think of the level of service
provided. You can send feedback directly to my manager at:
msdnmg@xxxxxx.

==================================================
Get notification to my posts through email? Please refer to
http://msdn.microsoft.com/subscripti...ult.aspx#notif
ications.

==================================================
This posting is provided "AS IS" with no warranties, and confers no rights.
--------------------
Quote:

>NNTP-Posting-Date: Sun, 06 Apr 2008 10:53:04 -0500
>From: "Jules Winfield" <Jules.Winfield@xxxxxx>
>Newsgroups: microsoft.public.dotnet.general
>Subject: List<> versus HashSet<>: Which is more efficient for lookups on
small sets?
Quote:

>Date: Sun, 6 Apr 2008 08:53:04 -0700
Quote:

>Assume I have a set of items, integers let's say, and I need to determine
if
Quote:

>a particular integer exists within the set.
>
>If the list contained 10,000 items, clearly I'd want to use some type of
>dictionary class (Dictionary<> or HashSet<> or SortedList<>) instead of a
>List<>. If the list only contained two or three items, a simple array or
>List<> would probably be more efficient. Is there some general rule that
>dictates the size at which one would be better off using a
dictionary-based
Quote:

>class as opposed to an array or List<> for lookups?
>
>Thanks,
>
>Jules
>
>
>
My System SpecsSystem Spec
Old 04-07-2008   #5 (permalink)
Frans Bouma [C# MVP]


 
 

Re: List<> versus HashSet<>: Which is more efficient for lookups on small sets?

Barry Kelly wrote:
Quote:

> Jules Winfield wrote:
>
Quote:

> > Assume I have a set of items, integers let's say, and I need to
> > determine if a particular integer exists within the set.
> >
> > If the list contained 10,000 items, clearly I'd want to use some
> > type of dictionary class (Dictionary<> or HashSet<> or
> > SortedList<>) instead of a List<>. If the list only contained two
> > or three items, a simple array or List<> would probably be more
> > efficient. Is there some general rule that dictates the size at
> > which one would be better off using a dictionary-based class as
> > opposed to an array or List<> for lookups?
>
> A general rule? Don't micro-optimize, and use a library collection
> with the right API and, if possible, performance curve (i.e. big-Oh)
> for your data set. The right API is more important: you can replace
> the implementation later if measurements show that it matters.
Indeed, one should go for the construct with the best O() performance.
HashSet has O(1) and List<T> has O(n) for looking up an element, so
that's pretty clear. What I find interesting is that your benchmarks
don't show this. ->
Quote:

> For cases where it really does matter, you need to measure. From my
> measurements (microbenchmarks with a hot cache), the switchover to
> Dictionary<> or HashSet<> being more efficient than a simple array
> comes pretty quickly, and they're not particularly slow to start with
> (normally beating List<int> on almost anything):
>
> Total times for 1310720 iterations of 2 probes in collections of size
> 4: List<int> (linear scan) 0.186 sec
> List<int> (binary search) 0.231 sec
> HashSet<int> 0.140 sec
> Dictionary<int,int> 0.154 sec
> int[] (manual inline scan) 0.109 sec
> int[] (Array.IndexOf<int>) 0.154 sec
> int[] (Array.BinarySearch<int>) 0.190 sec
In a way this is a bit surprising. Especially since a List<int> uses
internally an int[] construct to store the elements.

So my question then is: what exactly did you lookup and where were
these values located in the set of 4 possible elements? O(1) is real
time, not amortized, for HashSet, but your test suggests otherwise
(amortized). If your tests in general had the element to find in the
first half of the linear list for example, it's not a fair benchmark.

FB
--
------------------------------------------------------------------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------
My System SpecsSystem Spec
Old 04-07-2008   #6 (permalink)
Barry Kelly


 
 

Re: List<> versus HashSet<>: Which is more efficient for lookups on small sets?

Frans Bouma [C# MVP] wrote:
Quote:

> Barry Kelly wrote:
>
Quote:

> > For cases where it really does matter, you need to measure. From my
> > measurements (microbenchmarks with a hot cache), the switchover to
> > Dictionary<> or HashSet<> being more efficient than a simple array
> > comes pretty quickly, and they're not particularly slow to start with
> > (normally beating List<int> on almost anything):
> >
> > Total times for 1310720 iterations of 2 probes in collections of size
> > 4: List<int> (linear scan) 0.186 sec
> > List<int> (binary search) 0.231 sec
> > HashSet<int> 0.140 sec
> > Dictionary<int,int> 0.154 sec
> > int[] (manual inline scan) 0.109 sec
> > int[] (Array.IndexOf<int>) 0.154 sec
> > int[] (Array.BinarySearch<int>) 0.190 sec
>
> In a way this is a bit surprising. Especially since a List<int> uses
> internally an int[] construct to store the elements.
>
> So my question then is: what exactly did you lookup and where were
> these values located in the set of 4 possible elements? O(1) is real
> time, not amortized, for HashSet, but your test suggests otherwise
> (amortized).
For small values of n, the fixed overhead dominates, I would think. For
larger values of n, some cache effects should start coming in.
Quote:

> If your tests in general had the element to find in the
> first half of the linear list for example, it's not a fair benchmark.
Sure, for a very small set of probes (2 in this case), the data will
probably be skewed - but the total time shouldn't be high in any case.
We are talking >1 million iterations in a fifth of a second or so.

The code was using new Random(42).Next(dataset_count) as its series for
numbers to probe for. Here's the full code I used:

---8<---
using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;

class App
{
const int ProfileIterCount = 10;
const int ProfileTestCount = 5;
const double MinTime = 0.1;

static double Profile(Action code, int iterCount)
{
Stopwatch w = Stopwatch.StartNew();
for (int i = 0; i < iterCount; ++i)
code();
return w.ElapsedTicks / (double) Stopwatch.Frequency;
}

static double Profile(Action setup, Action code, int iterCount)
{
double bestResult = Double.PositiveInfinity;

for (int i = 0; i < ProfileTestCount; ++i)
{
if (setup != null)
setup();
double result = Profile(code, iterCount);
if (result < bestResult)
bestResult = result;
}

return bestResult;
}

class ProfileItem
{
public string Name { get; set; }
public Action Code { get; set; }
public Action Setup { get; set; }
public double BestTime { get; set; }

// Time will be at least 0.1 seconds, therefore we can get away
// with three digits.
public override string ToString()
{
return string.Format("{0,-35} {1,7:F3} sec", Name,
BestTime);
}
}

static int Profile(params ProfileItem[] items)
{
int iterCount = ProfileIterCount;

too_fast:
foreach (var item in items)
{
item.BestTime = Profile(item.Setup, item.Code, iterCount);
if (item.BestTime < MinTime)
{
iterCount *= 2;
goto too_fast;
}
}

return iterCount;
}

static void Main(string[] args)
{
try
{
int itemCount = args.Length > 0 ? int.Parse(args[0]) : 5;
int probeCount = args.Length > 1 ? int.Parse(args[1])
: itemCount / 2;
List<int> intList = null;
HashSet<int> intHash = null;
Dictionary<int,int> intDict = null;
int[] intArray = null;
Random r = null;

ProfileItem[] items = new[]
{
new ProfileItem()
{
Name = "List<int> (linear scan)",
Setup = delegate
{
intList = new List<int>(
Enumerable.Range(0, itemCount));
r = new Random(42);
},
Code = delegate
{
for (int i = 0; i < probeCount; ++i)
{
int test = r.Next(intList.Count);
if (!intList.Contains(test))
Console.WriteLine("bad news");
}
},
},

new ProfileItem()
{
Name = "List<int> (binary search)",
Setup = delegate
{
intList = new List<int>(
Enumerable.Range(0, itemCount));
r = new Random(42);
},
Code = delegate
{
for (int i = 0; i < probeCount; ++i)
{
int test = r.Next(intList.Count);
if (intList.BinarySearch(test) < 0)
Console.WriteLine("bad news");
}
},
},

new ProfileItem()
{
Name = "HashSet<int>",
Setup = delegate
{
intHash = new HashSet<int>(
Enumerable.Range(0, itemCount));
r = new Random(42);
},
Code = delegate
{
for (int i = 0; i < probeCount; ++i)
{
int test = r.Next(intHash.Count);
if (!intHash.Contains(test))
Console.WriteLine("bad news");
}
},
},

new ProfileItem()
{
Name = "Dictionary<int,int>",
Setup = delegate
{
intDict = new Dictionary<int,int>();
for (int i = 0; i < itemCount; ++i)
intDict.Add(i, i);
r = new Random(42);
},
Code = delegate
{
for (int i = 0; i < probeCount; ++i)
{
int test = r.Next(intDict.Count);
if (!intDict.ContainsKey(test))
Console.WriteLine("bad news");
}
},
},

new ProfileItem()
{
Name = "int[] (manual inline scan)",
Setup = delegate
{
intArray = new int[itemCount];
for (int i = 0; i < itemCount; ++i)
intArray[i] = i;
r = new Random(42);
},
Code = delegate
{
for (int i = 0; i < probeCount; ++i)
{
int test = r.Next(intArray.Length);
int foundIndex = -1;
for (int j = 0; j < intArray.Length; ++j)
if (intArray[j] == test)
{
foundIndex = j;
break;
}
if (foundIndex < 0)
Console.WriteLine("bad news");
}
},
},

new ProfileItem()
{
Name = "int[] (Array.IndexOf<int>)",
Setup = delegate
{
intArray = new int[itemCount];
for (int i = 0; i < itemCount; ++i)
intArray[i] = i;
r = new Random(42);
},
Code = delegate
{
for (int i = 0; i < probeCount; ++i)
{
int test = r.Next(intArray.Length);
int foundIndex =
Array.IndexOf(intArray, test);
if (foundIndex < 0)
Console.WriteLine("bad news");
}
},
},

new ProfileItem()
{
Name = "int[] (Array.BinarySearch<int>)",
Setup = delegate
{
intArray = new int[itemCount];
for (int i = 0; i < itemCount; ++i)
intArray[i] = i;
r = new Random(42);
},
Code = delegate
{
for (int i = 0; i < probeCount; ++i)
{
int test = r.Next(intArray.Length);
int foundIndex =
Array.BinarySearch(intArray, test);
if (foundIndex < 0)
Console.WriteLine("bad news");
}
},
},
};

int count = Profile(items);
Console.WriteLine(
"Total times for {0} iterations of {1} probes in collections of" +
" size {2}:",
count, probeCount, itemCount);
for (int i = 0; i < items.Length; ++i)
Console.WriteLine(items[i]);
}
catch (Exception ex)
{
Console.Error.WriteLine(ex.Message);
}
}
}
--->8---

It's a micro-benchmark with an overhead in its measuring, so the usual
caveats apply.

Usage: <.exe> [item-count] [probe-count]

-- Barry

--
http://barrkel.blogspot.com/
My System SpecsSystem Spec
Old 04-07-2008   #7 (permalink)
Barry Kelly


 
 

Re: List<> versus HashSet<>: Which is more efficient for lookups on small sets?

Frans Bouma [C# MVP] wrote:
Quote:

> Barry Kelly wrote:
Quote:

> > Total times for 1310720 iterations of 2 probes in collections of size
> > 4:
I ran this again, with ranking (for comparison with later run):

---8<---
Total times for 1310720 iterations of 2 probes in collections of size 4:
List<int> (linear scan) 0.174 sec
List<int> (binary search) 0.213 sec
HashSet<int> 0.133 sec
Dictionary<int,int> 0.142 sec
int[] (manual inline scan) 0.102 sec
int[] (Array.IndexOf<int>) 0.157 sec
int[] (Array.BinarySearch<int>) 0.187 sec

Ranking:
int[] (manual inline scan) 0% slower than best
HashSet<int> 30% slower than best
Dictionary<int,int> 39% slower than best
int[] (Array.IndexOf<int>) 54% slower than best
List<int> (linear scan) 71% slower than best
int[] (Array.BinarySearch<int>) 84% slower than best
List<int> (binary search) 109% slower than best
--->8---
Quote:

> So my question then is: what exactly did you lookup and where were
> these values located in the set of 4 possible elements? O(1) is real
> time, not amortized, for HashSet, but your test suggests otherwise
> (amortized). If your tests in general had the element to find in the
> first half of the linear list for example, it's not a fair benchmark.
Here's another run using more probes, to even out the distribution; it
does change things, but not by a whole lot:

---8<---
Total times for 40960 iterations of 100 probes in collections of size 4:
List<int> (linear scan) 0.259 sec
List<int> (binary search) 0.308 sec
HashSet<int> 0.190 sec
Dictionary<int,int> 0.203 sec
int[] (manual inline scan) 0.141 sec
int[] (Array.IndexOf<int>) 0.244 sec
int[] (Array.BinarySearch<int>) 0.277 sec

Ranking:
int[] (manual inline scan) 0% slower than best
HashSet<int> 35% slower than best
Dictionary<int,int> 44% slower than best
int[] (Array.IndexOf<int>) 73% slower than best
List<int> (linear scan) 84% slower than best
int[] (Array.BinarySearch<int>) 97% slower than best
List<int> (binary search) 119% slower than best
--->8---

The ranking stays the same, and the percentage differences aren't that
much different either.

Added ranking code (for the program I posted earlier):

---8<---
Console.WriteLine();
Console.WriteLine("Ranking:");
items = items.OrderBy(item => item.BestTime).ToArray();
foreach (var item in items)
{
Console.WriteLine("{0,-37} {1,5:F0}% slower than best",
item.Name,
(item.BestTime - items[0].BestTime) * 100 /
items[0].BestTime);
}
--->8---

-- Barry

--
http://barrkel.blogspot.com/
My System SpecsSystem Spec
Old 04-07-2008   #8 (permalink)
Barry Kelly


 
 

Re: List<> versus HashSet<>: Which is more efficient for lookups on small sets?

Barry Kelly wrote:
Quote:

> int[] (manual inline scan) 0% slower than best
For "slower", read "excess time over best as a percentage of best time".

-- Barry

--
http://barrkel.blogspot.com/
My System SpecsSystem Spec
Reply

Thread Tools


Similar Threads
Thread Forum
Most efficient virtual machine? Virtual PC
iTunes opening windows as 'Small Icons' but want 'List' - how ? Vista file management
Erasing all saved names from lookups . . Vista performance & maintenance
A more efficient way to copy files? Vista file management
A small developer-oriented wish list for PowerShell PowerShell


Vista Forums is an independent web site and has not been authorized,
sponsored, or otherwise approved by Microsoft Corporation.
"Windows Vista", the Start Orb, and related materials are trademarks of Microsoft Corp.
© Designer Media Ltd

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46