QCon - Persistent Data Structures and Managed References

Following on from yesterday's Clojure session, I was interested to attend Rich Hickey's second one on Persistent Data Structures and Managed References. He'd touched briefly on this topic and it whet my appetite. This talk wasn't specifically on Clojure, but on how one can use immutable data structures, separation of identity and value and managed references to solve many issues that concurrent programs face.

The key to concurrent software design is immutable data structures. If a data structure is immutable then it must always be in a consistent state. Multiple threads can never see a partial update since the values never change. If one considers the number 42, it never changes, it is always 42. Similarly if we think about a composite value, for example the date "12th March 2009", it also never changes. It makes no sense to set the month to be "January". There is another date, "12th January 2009", but this is distinct from the first one.

So, values are immutable, whether they are primitives or composite values.

Functions in functional programming (and indeed really in any form of software) work on values and derive new values. For example, the function "add" takes two immutable values and evaluates to a third one that is the sum of these two. It always produces the same output for the same input, it doesn't matter when it is called, in what order it is called in relation to other functions, it always does the same thing. It is an ideal construct for building highly concurrent programs.

However, no useful program can exist that doesn't have some side effect, even if that is producing some output to a terminal. Useful programs rely on state and must consider time as a dimension. Rich introduced the concept of separation of identity from value. Most of us have been "brainwashed" by programming languages to mash these two together into the form of a piece of memory or a variable. He described identity as "a logical entity we associate with a series of causally related values (state) over time".

So what we traditionally think of as variables, should really be modelled as an identify and it's associated value. This makes a lot of sense. When we want to change the state we simply associate a different value with the identity, we don't mutate the original value. A major advantage of this is that if we need to construct a new complex value then we don't need to stop the world and lock access to a variable while we update it to a new consistent value, we can construct the new consistent value in the background (while other threads continue to see the original logically consistent value) and simply swap the association from the identity to the new immutable value when we're ready.

The downside of course is that if we want the change the value of an identity then how can do this efficiently. If the value is a simple number then it's not too difficult. Changing the state of the balance of a bank account is as simple as associating a new number with it. However, if the state we're interested is the set of all stock in our warehouse, associating a new value that requires us to make a new value that might only differ from the original by one item. With traditional data structures we would copy the original value and make the change. Of course this is very inefficient and scales up linearly.

This is where persistent data structures come in. Based on the work of Phil Bagwell, Rich Hickey described how he used bit partitioned hash tries to make efficient "copies" of data structures, and this forms the basis of data storage within Clojure. Essentially all data is stored in a tree and when one makes a copy with a small change, one can create a tree with a new root and only the path to the changed item needs to be copied and modified. The rest of the tree's branches remain precisely the same. This significantly reduces the amount of copying that is required and makes multiple "versions" of a data structure entirely practical.

It's a completely different model from that used in traditional object oriented programming, but having seen it, it makes a great deal of sense. Certainly, immutable values are great for concurrent access - no need for locks - so anything that helps us mutate state but using immutable values that are always consistent is going to be a great help.

This of course leaves the question of how do you manage the mutation of the identity as its association with these immutable values changes over time. This however is a much much smaller problem than one traditionally has with standard mutable objects. With generally mutable objects one must handle the locking and synchronisation oneself, and must do it across the whole code base. With mutable identity references, one only has to manage the concurrent access to the reference, and this can be provided as a general purpose library - or in the case of Clojure part of the language itself. As I mentioned in my blog on Clojure yesterday, it provides four standard reference types: Refs, Agents, Atoms and Vars.

This approach to managing state looks like it will be increasingly important as the number of cores in our CPUs increase and I'm pretty excited about looking at how I can make use of similar techniques in my programs, be they in a functional language such as Clojure or in Java - An interesting side effect of Clojure being based on the JVM is that its data structures and references are available as Jars that can be used directly in Java.

This site uses cookies. Continue to use the site as normal if you are happy with this, or read more about cookies and how to manage them.