Preface: This document is now of historical interest only. Some of the accounts here of processor support for ordering and atomics are obsolete and should not be relied on. For updated accounts, see, among other sources, the guide to JDK9+ memory order modes. For current processor support, see C++ mappings.
This is an unofficial guide to implementing the new Java Memory Model (JMM) specified by JSR-133 . It provides at most brief backgrounds about why various rules exist, instead concentrating on their consequences for compilers and JVMs with respect to instruction reorderings, multiprocessor barrier instructions, and atomic operations. It includes a set of recommended recipes for complying to JSR-133. This guide is "unofficial" because it includes interpretations of particular processor properties and specifications. We cannot guarantee that the intepretations are correct. Also, processor specifications and implementations may change over time.
For a compiler writer, the JMM mainly consists of rules disallowing reorderings of certain instructions that access fields (where "fields" include array elements) as well as monitors (locks).
Can Reorder | 2nd operation | |||
1st operation | Normal Load Normal Store |
Volatile Load MonitorEnter |
Volatile Store MonitorExit |
|
Normal Load Normal Store |
No | |||
Volatile Load MonitorEnter |
No | No | No | |
Volatile store MonitorExit |
No | No |
Where:
The cells for Normal Loads are the same as for Normal Stores, those for Volatile Loads are the same as MonitorEnter, and those for Volatile Stores are same as MonitorExit, so they are collapsed together here (but are expanded out as needed in subsequent tables). We consider here only variables that are readable and writable as an atomic unit -- that is, no bit fields, unaligned accesses, or accesses larger than word sizes available on a platform.
Any number of other operations might be present between the indicated 1st and 2nd operations in the table. So, for example, the "No" in cell [Normal Store, Volatile Store] says that a non-volatile store cannot be reordered with ANY subsequent volatile store; at least any that can make a difference in multithreaded program semantics.
The JSR-133 specification is worded such that the rules for both volatiles and monitors apply only to those that may be accessed by multiple threads. If a compiler can somehow (usually only with great effort) prove that a lock is only accessible from a single thread, it may be eliminated. Similarly, a volatile field provably accessible from only a single thread acts as a normal field. More fine-grained analyses and optimizations are also possible, for example, those relying on provable inaccessibility from multiple threads only during certain intervals.
Blank cells in the table mean that the reordering is allowed if the accesses aren't otherwise dependent with respect to basic Java semantics (as specified in the JLS). For example even though the table doesn't say so, you can't reorder a load with a subsequent store to the same location. But you can reorder a load and store to two distinct locations, and may wish to do so in the course of various compiler transformations and optimizations. This includes cases that aren't usually thought of as reorderings; for example reusing a computed value based on a loaded field rather than reloading and recomputing the value acts as a reordering. However, the JMM spec permits transformations that eliminate avoidable dependencies, and in turn allow reorderings.
In all cases, permitted reorderings must maintain minimal Java safety properties even when accesses are incorrectly synchronized by programmers: All observed field values must be either the default zero/null "pre-construction" values, or those written by some thread. This usually entails zeroing all heap memory holding objects before it is used in constructors and never reordering other loads with the zeroing stores. A good way to do this is to zero out reclaimed memory within the garbage collector. See the JSR-133 spec for rules dealing with other corner cases surrounding safety guarantees.
The rules and properties described here are for accesses to Java-level fields. In practice, these will additionally interact with accesses to internal bookkeeping fields and data, for example object headers, GC tables, and dynamically generated code.
Loads and Stores of final fields act as "normal" accesses with respect to locks and volatiles, but impose two additional reordering rules:
These rules imply that reliable use of final fields by Java programmers requires that the load of a shared reference to an object with a final field itself be synchronized, volatile, or final, or derived from such a load, thus ultimately ordering the initializing stores in constructors with subsequent uses outside constructors.
Compilers and processors must both obey reordering rules. No particular effort is required to ensure that uniprocessors maintain proper ordering, since they all guarantee "as-if-sequential" consistency. But on multiprocessors, guaranteeing conformance often requires emitting barrier instructions. Even if a compiler optimizes away a field access (for example because a loaded value is not used), barriers must still be generated as if the access were still present. (Although see below about independently optimizing away barriers.)
Memory barriers are only indirectly related to higher-level notions described in memory models such as "acquire" and "release". And memory barriers are not themselves "synchronization barriers". And memory barriers are unrelated to the kinds of "write barriers" used in some garbage collectors. Memory barrier instructions directly control only the interaction of a CPU with its cache, with its write-buffer that holds stores waiting to be flushed to memory, and/or its buffer of waiting loads or speculatively executed instructions. These effects may lead to further interaction among caches, main memory and other processors. But there is nothing in the JMM that mandates any particular form of communication across processors so long as stores eventually become globally performed; i.e., visible across all processors, and that loads retrieve them when they are visible.
Nearly all processors support at least a coarse-grained barrier instruction, often just called a Fence, that guarantees that all loads and stores initiated before the fence will be strictly ordered before any load or store initiated after the fence. This is usually among the most time-consuming instructions on any given processor (often nearly as, or even more expensive than atomic instructions). Most processors additionally support more fine-grained barriers.
A property of memory barriers that takes some getting used to is that they apply BETWEEN memory accesses. Despite the names given for barrier instructions on some processors, the right/best barrier to use depends on the kinds of accesses it separates. Here's a common categorization of barrier types that maps pretty well to specific instructions (sometimes no-ops) on existing processors:
On all processors discussed below, it turns out that instructions that perform StoreLoad also obtain the other three barrier effects, so StoreLoad can serve as a general-purpose (but usually expensive) Fence. (This is an empirical fact, not a necessity.) The opposite doesn't hold though. It is NOT usually the case that issuing any combination of other barriers gives the equivalent of a StoreLoad.
The following table shows how these barriers correspond to JSR-133 ordering rules.
Required barriers | 2nd operation | |||
1st operation | Normal Load | Normal Store | Volatile Load MonitorEnter |
Volatile Store MonitorExit |
Normal Load | LoadStore | |||
Normal Store | StoreStore | |||
Volatile Load MonitorEnter |
LoadLoad | LoadStore | LoadLoad | LoadStore |
Volatile Store MonitorExit |
StoreLoad | StoreStore |
Plus the special final-field rule requiring a StoreStore barrier
in
x.finalField = v; StoreStore; sharedRef = x;
Here's an example showing placements.
Java |
Instructions |
class X { int a, b; volatile int v, u; void f() { int i, j; i = a; j = b; i = v; j = u; a = i; b = j; v = i; u = j; i = u; j = b; a = i; } } |
load a load b load v LoadLoad load u LoadStore store a store b StoreStore store v StoreStore store u StoreLoad load u LoadLoad LoadStore load b store a |
The need for LoadLoad and LoadStore barriers on some processors interacts
with their ordering guarantees for dependent instructions. On some
(most) processors, a load or store that is dependent on the value of a
previous load are ordered by the processor without need for an
explicit barrier. This commonly arises in two kinds of cases,
indirection:
Load x; Load x.field
and control
Load x; if (predicate(x)) Load or Store y;
Processors that do NOT respect indirection ordering in
particular require barriers for final field access for references
initially obtained through shared references:
x = sharedRef; ... ; LoadLoad; i = x.finalField;
Conversely, as discussed below, processors that DO respect data dependencies provide several opportunities to optimize away LoadLoad and LoadStore barrier instructions that would otherwise need to be issued. (However, dependency does NOT automatically remove the need for StoreLoad barriers on any processor.)
The kinds of barriers needed on different processors further interact with implementation of MonitorEnter and MonitorExit. Locking and/or unlocking usually entail the use of atomic conditional update operations CompareAndSwap (CAS) or LoadLinked/StoreConditional (LL/SC) that have the semantics of performing a volatile load followed by a volatile store. While CAS or LL/SC minimally suffice, some processors also support other atomic instructions (for example, an unconditional exchange) that can sometimes be used instead of or in conjunction with atomic conditional updates.
On all processors, atomic operations protect against read-after-write problems for the locations being read/updated. (Otherwise standard loop-until-success constructions wouldn't work in the desired way.) But processors differ in whether atomic instructions provide more general barrier properties than the implicit StoreLoad for their target locations. On some processors these instructions also intrinsically perform barriers that would otherwise be needed for MonitorEnter/Exit; on others some or all of these barriers must be specifically issued.
Volatiles and Monitors have to be separated to disentangle these effects, giving:
Required Barriers | 2nd operation | |||||
1st operation | Normal Load | Normal Store | Volatile Load | Volatile Store | MonitorEnter | MonitorExit |
Normal Load | LoadStore | LoadStore | ||||
Normal Store | StoreStore | StoreExit | ||||
Volatile Load | LoadLoad | LoadStore | LoadLoad | LoadStore | LoadEnter | LoadExit |
Volatile Store | StoreLoad | StoreStore | StoreEnter | StoreExit | ||
MonitorEnter | EnterLoad | EnterStore | EnterLoad | EnterStore | EnterEnter | EnterExit |
MonitorExit | ExitLoad | ExitStore | ExitEnter | ExitExit |
Plus the special final-field rule requiring a StoreStore barrier in:
x.finalField = v; StoreStore; sharedRef =
x;
In this table, "Enter" is the same as "Load" and "Exit" is the same as "Store", unless overridden by the use and nature of atomic instructions. In particular:
The other types are specializations that are unlikely to play a role in compilation (see below) and/or reduce to no-ops on current processors. For example, EnterEnter is needed to separate nested MonitorEnters when there are no intervening loads or stores. Here's an example showing placements of most types:
Java |
Instructions |
class X { int a; volatile int v; void f() { int i; synchronized(this) { i = a; a = i; } synchronized(this) { synchronized(this) { } } i = v; synchronized(this) { } v = i; synchronized(this) { } } } |
enter EnterLoad EnterStore load a store a LoadExit StoreExit exit ExitEnter enter EnterEnter enter EnterExit exit ExitExit exit ExitEnter ExitLoad load v LoadEnter enter EnterExit exit ExitEnter ExitStore store v StoreEnter enter EnterExit exit |
Java-level access to atomic conditional update operations will be available in JDK1.5 via JSR-166 (concurrency utilities) so compilers will need to issue associated code, using a variant of the above table that collapses MonitorEnter and MonitorExit -- semantically, and sometimes in practice, these Java-level atomic updates act as if they are surrounded by locks.
Here's a listing of processors that are commonly used in MPs, along with links to documents providing information about them. (Some require some clicking around from the linked site and/or free registration to access manuals). This isn't an exhaustive list, but it includes processors used in all current and near-future multiprocessor Java implementations I know of. The list and the properties of processors decribed below are not definitive. In some cases I'm just reporting what I read, and could have misread. Several reference manuals are not very clear about some properties relevant to the JMM. Please help make it definitive.
Good sources of hardware-specific information about barriers and related properties of machines not listed here are Hans Boehm's atomic_ops library, the Linux Kernel Source, and Linux Scalability Effort. Barriers needed in the linux kernel correspond in straightforward ways to those discussed here, and have been ported to most processors. For descriptions of the underlying models supported on different processors, see Sarita Adve et al, Recent Advances in Memory Consistency Models for Hardware Shared-Memory Systems and Sarita Adve and Kourosh Gharachorloo, Shared Memory Consistency Models: A Tutorial.
Here's how these processors support barriers and atomics:
Processor | LoadStore | LoadLoad | StoreStore | StoreLoad | Data dependency orders loads? |
Atomic Conditional |
Other Atomics |
Atomics provide barrier? |
sparc-TSO | no-op | no-op | no-op | membar (StoreLoad) |
yes | CAS: casa |
swap, ldstub |
full |
x86 | no-op | no-op | no-op | mfence or cpuid or locked insn |
yes | CAS: cmpxchg |
xchg, locked insn |
full |
ia64 | combine with st.rel or ld.acq |
ld.acq | st.rel | mf | yes | CAS: cmpxchg |
xchg, fetchadd |
target + acq/rel |
arm | dmb (see below) |
dmb (see below) |
dmb-st | dmb | indirection only |
LL/SC: ldrex/strex |
target only |
|
ppc | lwsync (see below) |
hwsync (see below) |
lwsync | hwsync | indirection only |
LL/SC: ldarx/stwcx |
target only |
|
alpha | mb | mb | wmb | mb | no | LL/SC: ldx_l/stx_c |
target only |
|
pa-risc | no-op | no-op | no-op | no-op | yes | build from ldcw |
ldcw | (NA) |
Many of these barriers usually reduce to no-ops. In fact, most of them reduce to no-ops, but in different ways under different processors and locking schemes. For the simplest examples, basic conformance to JSR-133 on x86 or sparc-TSO using CAS for locking amounts only to placing a StoreLoad barrier after volatile stores.
The conservative strategy above is likely to perform acceptably for many programs. The main performance issues surrounding volatiles occur for the StoreLoad barriers associated with stores. These ought to be relatively rare -- the main reason for using volatiles in concurrent programs is to avoid the need to use locks around reads, which is only an issue when reads greatly overwhelm writes. But this strategy can be improved in at least the following ways:
Original | => | Transformed | ||||
1st | ops | 2nd | => | 1st | ops | 2nd |
LoadLoad | [no loads] | LoadLoad | => | [no loads] | LoadLoad | |
LoadLoad | [no loads] | StoreLoad | => | [no loads] | StoreLoad | |
StoreStore | [no stores] | StoreStore | => | [no stores] | StoreStore | |
StoreStore | [no stores] | StoreLoad | => | [no stores] | StoreLoad | |
StoreLoad | [no loads] | LoadLoad | => | StoreLoad | [no loads] | |
StoreLoad | [no stores] | StoreStore | => | StoreLoad | [no stores] | |
StoreLoad | [no volatile loads] | StoreLoad | => | [no volatile loads] | StoreLoad |
Similar eliminations can be used for interactions with locks, but depend on how locks are implemented. Doing all this in the presence of loops, calls, and branches is left as an exercise for the reader. :-)
A translation of this page is available in japanese.