Overview of the collections Package

by Doug Lea.

This overview is intended to accompany version 0.96b of the package of code (tar.gz format) (zip format) and documentation for the java collection classes. Please send comments and questions to dl@cs.oswego.edu.

Availabilty Note

The Sun Java Development Kit JDK1.2 finally includes a standard set of collection classes. While there are some design and implementation differences, the JDK1.2 package contains most of the same basic abstractions, structure, and functionality as this package. For this reason, this collections package will NOT be further updated. (The current version will of course remain freely available.) However, updates and variants of some of these classes appear in the current JDK1.2 collections framework, which I encourage you to use instead. (See also my util.concurrent package for a few other collection classes useful in concurrent programming.)

Contents

  1. Collections
  2. Updatable Collections
  3. Immutable Collections
  4. Exceptions
  5. Enumerations
  6. Checked Classes
  7. Links to Source Code
  8. ToDo List
  9. Version History

Collections

Collections are objects that hold other objects that are accessed, placed, and maintained under some set of rules. There are many different ways you could classify these different rules. This package uses one of the simplest, splitting them into four main categories, (Sets, Bags, Seqs, Maps), plus a few `mixin' variants. Top-level functionality is described entirely via interfaces, not classes. This allows some mixing and matching of capabilities at the implementation level, and also enables you to add your own alternative implementations of this functionality that will interoperate with the versions provided in this package.

As a user of the collection package, you do not need to know a lot about data structures, but you do need to know what kinds of access and placement rules you want in a particular situation. The basic choices are:

The base interface Collection reflects commonalities across all of these of kinds of rules. For reasons described below, all of the interfaces in the following diagram describe classes/objects that need not even be updatable (modifiable), so restrict themselves to operations that report properties of Collections or build new Collections with various properties.

Collection
Collection is the base interface for all collection classes defined in this package. Every collection can:
Sets
Sets are collections in which the number of occurrences of any object is at most one. They support an operation that causes includes to be true; i.e., adding an element only if it is not present.
Bags
Bags may contain multiple occurrences of each element. Bags can do anything that Sets can and more, except that they do not intrinsically guarantee that the number of occurrences of any element is at most one. (However, you can always use a Bag in a way that `manually' enforces that the number of occurrences is at most one.)
Seqs
Seqs maintain elements in sequential order. They are sequential lists that can be accessed by positional indices ranging from 0..size()-1. Other Collections that are not Seqs may guarantee some kind of sequential ordering of elements, but don't support access of these elments via indices.
Maps
Maps are collections in which each element has a key. Any kind of Object may serve as a key for an element. (While you could use regular Collections in which each element holds a pair, Maps are generally simpler, more efficient, and contain operations useful under the interpretation that the pairs represent keyed elements.)

Maps also serve as a way to use a certain field of objects of some class as a key for purposes of storing in a collection. For example, supposing you want to use some accessor prop() defined in your class as the basis for entering and finding elements, you could use a Map, entering pairs of (x.prop(), x).

ElementSortedCollections
A mixin for collections that always maintain elements in sorted order with respect to a give Comparator Note: SortedCollections rely on the ordering properties of elements NOT to change once they have been placed in the collection. They cannot in general deal with cases where you change any of their elements behind their backs in ways that break their sort order. (See however, implementation checks that can help you discover if this occurred.)

ElementSorted Collections provide a different way to use properties of objects as their access keys. You can define your class to implement interface Keyed, with a method key() that returns the key (for example it might just return prop()). Then use a Comparator that compares on keys(). The DefaultComparator class does this automatically before trying to perform comparsions, but you can write your own Comparator that does this in a more specialized and efficient way.)

KeySortedCollections
A mixin for Maps that always maintain keys in sorted order

UpdatableCollections

Updatable collections are `normal' mutable collections. They:

UpdatableSets
Updatable collections maintaining Set semantics. They support an `include' operation that adds an element if not present, plus a version that includes all elements from a supplied Eunumeration.
UpdatableBags
Updatable collections with an add(v) operation, that adds an occurrence of v to the collection, and addIfAbsent that only conditionally adds the element.
UpdatableSeqs
Seqs with standard indexed update operations (insertAt, replaceAt, removeAt). They also support method insertElementsAt(int index, Enumeration e) that adds all elements of e, in order, at position index. UpdatableSeqs also support special (often more efficient) methods operating on the front and back of the sequence,
UpdatableMaps
Maps with update operations, principally putAt(key, element) and removeAt(key).
SortableCollections
A `mixin' for collections that support a sort operation that accepts a Comparator. The difference between Sorted and Sortable is that (Element- or Key-) SortedCollections are always in sorted order, while Sortable ones may be sorted (via sort(Comparator cmp)), but may then be altered in ways so they are no longer sorted.
Generally, implementations are written at the point in the interface hierarchy that they most naturally support. They mainly use classic data structure implementation techniques, with an emphasize safety and correctness, but with enough efficiency that you will not usually have to build your own special-case data structures.

Currently, there are no `wrapped' implementation types that allow one implementation to serve as another kind of collection. You can always build these yourself. For example, it's possible to have an RBTree serve as the underlying implementation for a Set by making, say, an RBSet class holding a reference to an RBTree, and forwarding include as addIfAbsent, and so on. Similarly, there are not any Seq-like classes (like Stack or Queue) that ONLY support operations on the front or back, since you can just use UpdatableSeqs for such purposes merely by not using those other operations; or if you like, making a special class holding an UpdatableSeq and forwarding it only selected messages.

The reason there are so many different implementation types it that, in the collections business, it's not the case that one size fits all. Different Implementations have different time/space tradeoffs, and different operations they are particularly good and bad at. The basic strategy for building a collection package is to force all of these implementations to share common interfaces, so that only the person doing the initial construction of a collection object that will be passed around and used polymorphically has to care about these issues. If you do not care, just use the ones handed out by DefaultImplementations, which makes pretty reasonable choices. If you do care, be sure to look over the time complexities listed in the documentation for each implementation class.

Forcing implementations to share common interfaces is a standard OO trick, but sometimes leads to classes providing methods that they'd almost rather not support. For example LinkedLists support indexed access of the elements of the list. This is pretty slow. You'd only use a LinkedList if you only rarely need to access by index.

All current implementations are subclasses of UpdatableImpl a convenient abstract class that provides default implementations of many UpdatableCollection operations.

Also, all current UpdatableCollection classes implement constructive operations (excluding, adding, etc) by cloning full copies, which is not very fast.

Current implementations include:

HashedSets implement UpdatableSet
Standard implementation of hash tables of single elements.
RBTrees implement UpdatableBag, ElementSortedCollection
Implementation of red-black trees, a form of balanced sorted binary search tree.
LinkedLists implement UpdatableSeq, SortableCollection
Standard implementation of linked lists.
CircularLists implement UpdatableSeq
Standard implementation of double-linked circluar lists.
Dynarrays implement UpdatableSeq, SortableCollection
Buffered array that dynamically resizes when necessary.
LinkedBuffers implement UpdatableBag
An expandable list of array-based buffers.
HashedMaps implement Map
Hash Table of keyed elements,
LLMaps implement Map
LinkedLists of (key, element) pairs supporting Map operations.
RBMaps implement Map, KeySortedCollection
Red-Black trees of keyed elements
These implementation classes contain only methods described in their interfaces, plus, in most cases, a few special `tuning' methods that let you override default properties of the underlying data structures. These include:

Many of the implementation classes themselves rely on yet lower-level classes of basic data structures. These classes are also made public in this package, so that you can use them as the building-blocks of other Collection implementations. They currently include:

Cell
A simple box holding an Object
LLCell
Cells with next-pointers. Each cell knows how to do basic linked list operations like insert another LLCell after itself.
LLPair
An extension of LLCell in which each node has a key
CLCell
A cell with next- and prev- pointers, and operations so that objects maintain themselves in proper circular list order.
RBCell
A cell with red-black-tree pointers and color, and methods to add new RBCells, properly rebalancing the tree it is in upon insertion. And so on
RBPair
RBCells with keys.

Immutable Collections.

Immutable versions of collections are those that GUARANTEE not to support mutations. This is a stronger claim than is made by Collection, since any object known only to be a Collection MAY be coercible to Updatable. But an Immutable version is not. This makes Immutable classes exceedingly safe.

In Java, the concept of immutablity is best expressed as an in-the-small design pattern consisting of:

  1. A base interface describing some non-mutative functionality, for example our Collection, Set, etc., interfaces.
  2. Implementation classes that are final, and support only the operations described in the interface. The use of final means that when you think you have an immutable object, you really do -- it's not of some subclass that supports mutable operations as well.
  3. The use of the interface Immutable just to give a name to this property, so that, by convention, such classes declare implements Immutable. The actual interface Immutable is a null interface, listing no operations at all.

The collections package currently contains four implementation classes using this scheme. IncrSets IncrBags IncrSeqs IncrMaps are all implementation subclasses of abstract class IncrImpl that uses a `negative delta' algorithm to do lazy copying of collections formed via creational update methods. This algorithm can be applied to any kind of underlying UdatableCollection, so the Incr classes do not actually perform the underlying updates (they rely on UpdatableCollections supplied within constructors), they just arrange to do it with a minumum of actual copying. They normally update in a `lazy' fashion, making copies only when necessary. However, when you are using a received Immutable Collection as the basis for a very extensive set of update operations (none of which should be applied to the original), it is sometimes more efficient to build an UpdatableCollection, add all of the elements of the received ImmutableCollection to it, and then work off that. (Also, the benefits of lazy updates are not always worth their costs for small-sized collections. For those with only a few elements, it is probably better to just make copies.)

There are two standard usage patterns for Immutable Collections:

  1. As a much stronger analog of C++ `pass by const reference'; i.e., passing an updatable collection in a way that is guaranteed not to modify it. Each of the Immutable classes supports a constructor that allows you to use an UpdatableCollection in an immutable fashion. For example, suppose you have a HashedSet hs, (an implementation of UpdatableSet), and want to send it safely to some method SomeClass.someMethod(Set s). You can do this via c.someMethod(new IncrSet(hs)). Note, however, that constancy guarantees can only hold so long as you don't modify hs during the course of the call, as might happen if someMethod creates a new thread using hs and then returns, so that you then continue on modifying hs. If this is a potenital problem, use the next option.

  2. As representations of pure functional values (as used for example in languages like ML). Even though they are immutable, Immutable Collections still support constructive operations that create new collections differing from their arguments only in that they add, remove, or replace a given element. Each of the classes contain a default constructor that initializes them to use an appropriate DefaultImplementations object.

One of the main reasons for supporting all of the operations that make Immutable Collections possible is to provide tools that help keep threads from stomping on each others collections. This can be a serious problem in multithreaded Java programs. Synchronization (which is supported by all implementation classes) helps, but not enough to avoid some common errors. Immutable collections help even more. They let you pass around values that cannot be modified at all in other threads without you having to make full copies of them when you don't need to. This is the same reasoning that led to Java having two classes for Strings: String for pure immutable strings, and StringBuffer for updatable ones.

Exceptions

Collection methods are defined to throw only a small number of exception types:

java.util.NoSuchElementException
Thrown for attempts to access elements that do not exist.
IllegalElementExceptions (a subclass of IllegalArgumentException)
Thrown for attempts place null or invalid elements.
CorruptedEnumerationExceptions (A subclass of NoSuchElementException)
Thrown for attempts to access elements in enumerations that have been detectably compromised because the collection serving as the base of the collection has been modified in the course of traversal.
ImplementationErrors (a subclass of Error)
Thrown for detected failures due to coding errors (i.e., cases in which internal consistency checks or required postconditions fail).
java.lang.IllegalArgumentException
Thrown for other bad arguments.

Except for operations that accept Enumerations as arguments (for example UpdatableSeq.addElementsAt), exception throws generated from Collection operations are always accompanied by state ``rollbacks''. That is, they leave the object in a behaviorally equivalent state as it was before the operation commenced. (Operations accepting Enumerations may rollback, but do not promise to.)

Enumerations

Enumerations are the ``common currency'' of collections. Enumerations are generally ``lightweight'' and simple to use, not only within this package, but also across other Java classes that use the java.util.Enumeration interface. Because Collection methods use Enumerations as the basic element access construct, classes in this package are generally compatible with other existing Java collection classes. (Although by necessity, there is some duplication. For example, Dynamic array implementations are very similar to java.util.Vector, but different enough that they needed to be redone from scratch.)

All Collection classes actually produce CollectionEnumerations to report elements. CollectionEnumeration is a subinterface of the standard java.util.Enumeration that adds two methods: corrupted which reports true if the enumeration is detectably corrupted because of mutative operations occurring during traversals, and numberOfRemainingElements, which reports the number of elements yet to be traversed.

While they are handy, Enumerations sometimes require a bit of care. Any CollectionEnumeration produced by an UpdatableCollection can become corrupted if an update operation on the collection is performed during a traversal. The standard implementation of CollectionEnumeration classes is very conservative, and reports corruption if there is any change in the elements() set (as reflected by version changes). Thus, any attempt to obtain a nextElement can fail, raising a CorruptedEnumerationException. (Enumerations produced from Immutable collections can only fail in this way if they were constructed around an UpdatableCollection that remains asynchronously accessible via other means even after it has been wrapped within an Immutable Collection. As discussed elsewhere, while it is impossible to enforce, Immutable Collections should never be used in this way to begin with.)

Corruption exceptions occurring during operations such as Bag.addElements() are particularly tricky to revover from. Upon exceptions, these operations do not necessarily ``roll back'' to the state they were in before the operation, so recovery is hard to manage. (Although you can at least look at numberOfRemainingElements to find out how many elements were not processed.) If you are worried about such problems, set up your own code to manually traverse enumerations and perform the inner operations, but recovering in special ways upon exception.

You sometimes need to adjust standard algorithms to avoid the possibility of CorruptedEnumerationExceptions. For example, if you want to traverse through s.elements() for Set s and add other new elements depending on what's there, you CANNOT do it via code like:

  void f(UpdatableSet s) { // DO NOT DO THIS
    Enumeration e = s.elements();
    while (e.hasMoreElements()) 
      if (someProperty(e.nextElement())) 
        s.include(someNewObject()); // will cause exception for nextElement
  }

One alternative is to place new elements in an auxilliary collection during the traversal; for example:

  void f(UpdatableSet s) { 
    UpdatableBag aux = new LinkedBuffer();
    Enumeration e = s.elements();
    while (e.hasMoreElements()) 
      if (someProperty(e.nextElement())) 
        aux.add(someNewObject()); 
    s.includeElements(aux.elements()); // add them all at once   
  }

Or, you could instead use an ImmutableSet, returning the ultimate result:

  Set f(Set s) { 
    Enumeration e = s.elements();
    while (e.hasMoreElements()) 
      if (someProperty(e.nextElement())) 
        s = s.including(someNewObject()); 
    return s;
  }

Another alternative is to use a Seq instead of Sets or Bags when you need to use such algorithms, and use indexing instead of element() traversals. For example:

  void f(UpdatableSeq s) { 
    int originalSize = s.size();
    for (int i = 0; i < originalSize; ++i)
      if (someProperty(s.at(i)) 
        s.insertLast(someNewObject()); 
  }

Very few collection operations deal with other collections. Most instead deal with Enumerations. For example, there is no such operation as UpdatableSet.union(Set s) to union in the elements of another Set s. Instead there is UpdatableSet.includeElements(Enumeration e) that can be used to the same effect (via includeElements(s.elements())), but can also be used to include any set of Objects traversable from an any kind of Enumeration.

It is probably good practice to design methods and classes that use the collections package to also take Enumerations as arguments and return them as results instead of only accepting other collections. This increases the range of application of your classes.

Several specialized Enumeration classes are available to make it easier to deal in Enumerations rather than Collections or other special data structures. They include:

ArrayEnumeration
A utility class that allows you to traverse arrays as enumerations. For example:
  Object arr[] = { new Float(1.0f), new Float(2.0f) }
  Dynarray s = ...
  s.insertElementsAt(0, new ArrayEnumeration(arr));
 

FilteringEnumeration
A class that allows you to ignore elements in an enumeration that don't meet some criterion (expressed as a Predicate). For example, if you want to screen out everything but Panel objects from a collection coll that might hold things other than Panels, write something of the form:
  Enumeration e = coll.elements();
  Enumeration panels = new FilteringEnumeration(e, new IsPanel());
  while (panels.hasMoreElements()) 
   doSomethingWith((Panel)(panels.nextElement()));
  
To use this, you will also need to write a little class of the form:
  class IsPanel implements Predicate {
   boolean predicate(Object v) { return (v instanceof Panel); }
  }
  

MappingEnumeration
A class that allows you to transform elements from other enumerations before they are seen by their `consumers' by invoking an arbitrary Function) on each nextElement access. For example, if you want to process only the parent() fields of java.awt.Component elements held by a collection coll you could write something of the form:
  Enumeration e = coll.elements();
  Enumeration parents = new MappingEnumeration(e, new ParentFunction());
  while (parents.hasMoreElements()) 
   doSomethingWith((Container)(parents.nextElement()));
  
To use this, you will also need to write a little class of the form:
  class ParentFunction implements Function {
   Object function(Object v) {
     if (v instanceOf Component) return ((Component)v).getParent();
     else return null;
   }
  }
  

(You might also interpose a FilteringEnumeration here to screen out the resulting nulls.)

All this requires too much set-up to be reasonable in most situations, but is invaluable in others.

InterleavingEnumeration
A class that allows you to combine the elements of two different enumerations as if they were one enumeration before they are seen by their `consumers'. This sometimes allows you to avoid having to use a Collection object to temporarily combine two sets of Collection elements() that need to be collected together for common processing.

The elements are revealed (via nextElement()) in a purely interleaved fashion, alternating between the first and second enumerations unless one of them has been exhausted, in which case all remaining elements of the other are revealed until it too is exhausted.

For example, if you want to process together the elements of two Collections a and b, you could write something of the form:

  Enumeration items = new InterleavingEnumeration(a.elements(), b.elements());
  while (items.hasMoreElements()) 
   doSomethingWith(items.nextElement());
  

Checked Collections

A semiformal notation is used to try to nail down interfaces. This is not an exercise in formal methods. (In fact the best way formally specify interfaces and operations in `open' systems like Java is not a completely solved issue.) It is instead an attempt to be as unambiguous as possible about the effects of operations, which is easier to do with a bit of formalism. In fact, the main reason for spelling these out is to form a basis for testing. Nearly all claims made about operations in interfaces are tested for in corresponding Checked classes. So you can be reasonably sure that any claim made has been verified to hold, at least in the cases actually tested.

Checked versions of each of the major Updatable interfaces exist -- CheckedSet CheckedBag CheckedSeq CheckedMap . Each one `wraps' a normal collection object, and for each operation, asks the inner collection to perform it, checks as best it can that the effects match the specification, and then returns any results. When you have any doubt about any implementation, you can wrap it in a Checked version and watch for ImplementationError exceptions. (This is likely to slow down your program by a few orders of magnitude!)

Implementation classes are for the most part just informally documented, but contain links to the fuller descriptions of the methods they implement. However, each implementation class also supports a checkImplementation operation, that tests for internal representational consistency. While this is usually required only for testing purposes, it can come in handy in other situations as well. For example, it can help you discover whether someone changed the ordering property of an element in a SortedCollection or changed it in a way that no longer passes canInclude for most other collections.

Constructs used in the class documentation include:

Links to Source Code

Note: All of the code in this package carries the following header:

  Originally written by Doug Lea and released into the public domain. 
  Thanks for the assistance and support of Sun Microsystems Labs, Agorics 
  Inc, Loral, and everyone contributing, testing, and using this code.

This means that you can use it for any purpose whatsoever, with no restrictions.

Also, SimpleTest.java contains some very simple tests of most classes.

To Do List

History


Doug Lea
Last modified: Sat Mar 27 13:16:29 EST 1999