
Distributed Systems


For articles on related topics, see CLIENT-SERVER COMPUTING; COMMUNICATIONS AND COMPUTERS; DATA COMMUNICATIONS; INTERNET; NETWORKS, COMPUTER; and WORLD WIDE WEB.

Nearly all large software systems are by necessity distributed. For example, enterprise-wide business systems must support multiple users running common applications across different sites. A distributed system encompasses these applications, their underlying support software, the hardware they run on, and the communication links that connect the distributed hardware. The largest and best-known distributed system is the set of computers, software, and services comprising the World Wide Web, which is so pervasive that it coexists with and connects to most other existing distributed systems. The most common distributed systems are networked client/server systems (see CLIENT-SERVER COMPUTING). This article surveys the properties of distributed systems and provides synopses of relevant research and development topics in theoretical foundations and system engineering.

Properties

Distributed systems have these defining properties:

Multiple computers (nodes)
Software for the system and its applications executes on multiple independent computers (not merely multiple processors on the same computer, which is the realm of parallel computing). These nodes may range from information appliances to personal computers to high-performance workstations to file servers to mainframes to supercomputers. Each may primarily take the role of a client that requests services from others, a server that provides computation or resource access to others, or a peer that does both. A minimal distributed system may be as small as two nodes, provided that software connectivity is present.

Resource sharing
The most common reason for connecting a set of computers to operate as a distributed system is to allow them to share physical and computational resources: for example, printers, files, databases, mail services, stock quotes, collaborative applications, and so on. Distributed system components that support resource sharing play a role similar to that of operating systems, and are increasingly indistinguishable from them.

Concurrency
Each of the nodes in a distributed system provides independent functionality, and operates concurrently with all of the others (see CONCURRENT PROGRAMMING). More than one process (executing program) per node, and more than one thread (concurrently executing task) per process may participate as components in a system. Most components are reactive, continuously responding to commands from users and messages from other components. Systems as a whole may concurrently execute a number of related applications, all relying upon common infrastructure software establishing system-wide policies, protocols, and services. Like operating systems, distributed systems are designed never to terminate, and so should remain always at least partially available.

Message passing
Software on the different computers communicates via structured message passing disciplines built upon any of a number of networking protocols (for example TCP/IP (q.v.)) in turn running on any of a number of connection technologies (for example, Ethernet and modems). The nodes in most distributed systems are completely connected -- any node may send a message to any other node. Delivery is mediated by underlying routing algorithms and related networking support. Messages may consist of commands, requests for services, event notifications, multimedia data, file contents, and even entire programs. Note that most multiprocessors communicate by shared memory rather than message passing and therefore are not distributed.

Distributed systems also possess, to varying degrees, the following characteristic properties.

Heterogeneity
The nodes participating in a system may consist of diverse computing and communication hardware. The software comprising the system may be written using diverse programming languages and development tools. Some heterogeneity issues are addressed by agreeing upon common message formats and low-level protocols that can be readily implemented across different platforms (i.e., computers such as PCs, servers, and mainframes). Others may require construction of bridges that translate one set of formats and protocols to another. More thorough integration can be attained by requiring that all nodes support a common virtual machine that processes platform-independent program instructions. This approach is taken by systems that use the Java (q.v.) programming language.

Multiple protocols
Most distributed message passing differs significantly from the kinds of invocations (such as procedure calls) used within the confines of sequential programs. The most basic form of distributed communication is inherently asynchronous, similar to mailed letters in a postal system. Senders issue messages without relying on receipt or reply by their recipients. Distributed messages usually take much longer to reach recipients than do local invocations, sometimes reach recipients in a different order than they are sent, and may fail to reach them at all. Additional protocols are nearly always constructed from this basis. These may include semi-synchronous messaging, in which senders wait for an acknowledgment of message receipt before proceeding; procedural messaging, in which senders wait for full replies; time-out protocols, in which senders wait for replies only for a certain period before proceeding; callback protocols, in which receivers later issue different messages back to their senders; transactional protocols, in which all messages in a given session or transaction are processed in an all-or-none fashion (see TRANSACTION); and multicast protocols, in which senders simultaneously issue messages to a group of other nodes. These and other protocols are often extended and specialized to enhance reliability, security, and efficiency.
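The time-out variant above can be sketched in Java. The quote service, delay values, and default reply here are illustrative stand-ins, with the remote call simulated by a delayed local task:

```java
import java.util.concurrent.*;

public class TimeoutRequest {
    // Hypothetical remote call, simulated with a configurable delay; a real
    // sender would transmit a message over the network instead.
    static CompletableFuture<String> requestQuote(String symbol, long delayMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(delayMillis); } catch (InterruptedException e) { }
            return "42.00";
        });
    }

    // Time-out protocol: the sender waits for a reply only for a bounded
    // period before proceeding without it.
    static String quoteOrDefault(String symbol, long delayMillis, long waitMillis) {
        try {
            return requestQuote(symbol, delayMillis).get(waitMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            return "UNAVAILABLE";   // proceed without the reply
        }
    }

    public static void main(String[] args) {
        System.out.println(quoteOrDefault("ACME", 0, 500));    // fast server: 42.00
        System.out.println(quoteOrDefault("ACME", 2000, 100)); // slow server: UNAVAILABLE
    }
}
```

The same future-based structure yields the procedural (wait indefinitely) and semi-synchronous (wait only for an acknowledgment) variants by changing what the sender waits for.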

Openness
Most sequential programs are closed: their configurations never change after execution commences. Most distributed systems are to some degree open: an unbounded number of nodes, components, and applications may be added or changed even while the system is running. This provides the extensibility necessary to accommodate expansion, and the ability to evolve and cope with the changing world in which a system resides. Openness requires that each component obey a certain minimal set of policies, conventions, and protocols to ensure interoperability among updated or added components. Historically, the most successful open systems have been those with the most minimal requirements. For example, the simplicity of the HTTP protocol was a major factor in the success of the World Wide Web. Standards organizations such as ISO and ANSI, along with industrial consortia such as the OMG (Object Management Group), establish the basic format and protocol standards underlying many interoperability guarantees. Individual distributed systems additionally rely on context-specific or domain-dependent policies and mechanisms.

Fault tolerance
A program running on a single computer is, at best, only as reliable as that computer. But most distributed systems remain at least partially available and functional even if some of their nodes, applications, or communication links fail or misbehave. In addition to outright failures, applications may suffer from unacceptably low quality of service due to bandwidth shortages, network contention, software overhead, or other system limitations. Because failures are relatively common and take so many different forms, fault tolerance requirements present some of the most central, yet difficult challenges in the construction of distributed systems (see FAULT TOLERANT COMPUTING).

Persistence
At least some data and programs are maintained on persistent media that outlast the execution of any given application. Persistence may be arranged at the level of file systems, database systems, or programming language run-time support mechanisms.

Security
Only authorized users may access sensitive data or perform critical operations. Security in distributed systems is intrinsically a multilevel issue, ranging from the basic safety guarantees provided by the hardware and operating systems residing on each node, to message encryption and authentication (q.v.) protocols, to mechanisms supporting larger social policy issues concerning privacy, appropriateness of content, and individual responsibility.

In open systems, a second, complementary sense of security arises: Users may not trust, or may not wish to use, unfamiliar components unless they have independent evidence about their safety and utility. Techniques for addressing trustworthiness include the use of digital certificates, and sandboxing untrusted components by disallowing their code from performing potentially dangerous operations such as modifying disk files.

Isolation
Each component is logically or physically autonomous, and communicates with others only via structured message protocols. In addition, groups of components may be segregated for purposes of functionality, performance, or security. For example, while the connectivity of a corporate distributed system may extend to the entire Internet, its essential functionality could be segregated (often by a firewall) to an intranet operating only within the company, and communicating with other parts of the system via a restricted secure protocol.

Decentralized control
No single computer is necessarily responsible for configuration, management, or policy control for the system as a whole. Distributed systems are instead federations (domains joined by protocol) of autonomous agents that agree on enough common policies and protocols to provide a given aggregate functionality. Some aspects of decentralization exist by choice, for example to provide fault tolerance. Others exist by necessity, because centralized control does not scale to the number of nodes and connections supported by contemporary systems. However, roles and tools for administering system-wide policies may be restricted to particular users.

Theoretical Foundations

Computational models
Distributed systems cannot be modeled adequately as Turing Machines (q.v.). Unlike Turing Machines, distributed systems may be open, arbitrarily extensible by adding new nodes or functionality, and reactive, continuously responding to changing environments. One overarching framework that encompasses most current approaches to modeling distributed systems is Wegner's (1997) notion of an interaction machine, an abstract characterization that encompasses any object with state and the ability to send and receive messages. Particular formalizations such as the pi calculus are used to explore rigorously the emergent properties of distributed systems, for example those surrounding security. Refinements geared toward more practical engineering efforts include two-tiered models in which each node, process, or thread in a distributed system is modeled as an active object, possessing an autonomous thread of control. Active objects are in turn structured using sets of passive objects that conform to a given sequential model of object-oriented computation.

Specification
Distributed systems do not merely compute a single function or perform a single action. Instead they perform a never-ending stream of diverse operations. Specification of the functionality of distributed systems cannot rely solely on the use of techniques (such as those based on preconditions and postconditions (see DISCRETE MATHEMATICS)) that describe inputs and outputs of sequential programs. Specifications must additionally describe ongoing properties of the system as a whole. Most approaches rely ultimately on temporal logic or related modal logics that provide at least two forms of specification (see MODEL CHECKING): safety properties, stating that nothing bad ever happens, and liveness properties, stating that something good eventually happens. A different approach to specification is to pose requirements in terms of abstract computational models that obey simpler, more understandable, and more formally tractable properties than do real systems, thus allowing simulation and analysis of the main properties of interest. These models can be further refined to deal with additional complexities and constraints.

Fundamental limitations
The implementation of several desired properties of distributed systems runs up against inherent limitations that have been uncovered in theoretical studies. These mainly surround the ability of a set of independent nodes to establish some global property together, often based on the notion of consensus: that a set of nodes all agree about a given predicate. Consensus plays a central role, for example, in fault tolerance, where some nodes must agree that another node has failed. As shown by Fischer, Lynch, and Paterson (among others; see references), no algorithm can ensure that agreement will be reached under all conditions in asynchronous message passing systems. (Informally, among the reasons is that any apparently failed node might not actually have failed, but instead be proceeding very slowly.) Even more severe limitations apply to Byzantine failures, in which faulty nodes misbehave rather than halt. Such theoretical results have helped uncover algorithms and protocols that achieve desired results with a high enough probability, or under a small enough set of restrictions, that they perform extremely well in practice.

Distributed algorithms
Many distributed systems rely on a common set of basic algorithms and protocols that are employed to solve problems including the detection of termination of a distributed computation, election of a leader of a group of nodes, synchronization of redundant computations performed for the sake of fault tolerance, coordination of database transactions, and mutually exclusive access to shared resources. More specialized algorithmic problem domains include distributed simulation (see SIMULATION), electronic commerce (q.v.), digital libraries (q.v.), distributed multimedia (see MULTIMEDIA), and collaborative groupware (q.v.) applications. Research efforts across these domains entail the discovery of new algorithms and the formal analysis of their correctness and performance. For example, one approach to fault tolerance relies on virtual synchrony, a set of algorithms that ensure (within certain limitations) that all members of a group of nodes remain in agreement about the ordering of messages sent to the members.
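As an illustrative sketch of one such algorithm, a ring-based leader election (in the style of Chang-Roberts) can be simulated in a single process; the message passing between nodes is modeled here with arrays rather than a real network:

```java
public class RingElection {
    // Simulated ring election: each node forwards the larger of its own
    // identifier and the one it received; after the token circulates for
    // as many rounds as there are nodes, every node holds the maximum
    // identifier, and that node is the elected leader.
    static int elect(int[] ids) {
        int n = ids.length;
        int[] token = ids.clone();              // message in transit at each node
        for (int round = 0; round < n; round++) {
            int[] next = new int[n];
            for (int i = 0; i < n; i++) {
                // node (i+1)%n receives the token from node i
                next[(i + 1) % n] = Math.max(token[i], ids[(i + 1) % n]);
            }
            token = next;
        }
        return token[0];                        // every node agrees on the leader
    }

    public static void main(String[] args) {
        System.out.println(elect(new int[]{3, 17, 5, 9}));  // 17
    }
}
```

A real implementation would of course interleave these rounds with node failures and message delays, which is precisely what makes such algorithms subtle.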

System Engineering

Distributed systems used in commerce, industrial automation, and information services are among the largest and most complex systems in existence. These systems provide essential services relied upon by society at large, and are rapidly becoming as economically, politically, and socially important as shipping, railway, highway, and telecommunication systems have ever been. Successful development requires adherence to sound design principles and engineering practices, and reliance on an increasingly standardized set of decomposition and structuring rules that in part borrow from and extend object-oriented software design methods (see OBJECT-ORIENTED ANALYSIS and DESIGN) .

The earliest, yet still common, form of a distributed system is a client/server design, in which one or more independent servers perform centralized tasks such as database maintenance. Clients each execute programs that provide user-interface and computational capabilities, but communicate via database queries with servers whenever accessing or updating shared data. An examination of the limitations of the simplest, most fragile client/server systems reveals the main problems that are addressed in contemporary scalable, structured distributed programming frameworks:

Fixed addresses
When there are only a few fixed nodes in a system, and each performs a single dedicated task, all communication might be performed by issuing packets to fixed Ethernet addresses or broadcasting them on a local network. But these tactics do not scale; they are the distributed analogs of using raw memory addresses to locate data and instructions.
Ad hoc messaging
A small, fixed set of nodes with dedicated functionality can communicate by sending hand-crafted messages that are known to be of a form acceptable by recipients. This practice does not scale to systems providing possibly many services on possibly many nodes; it is the distributed analog of coding low-level program jumps rather than using structured object-oriented interfaces and method invocations.
Monolithic components
In the most fragile systems, each node runs a single, very large program. Such monolithic software components are difficult to design, implement, and test, and even more difficult to reuse, extend, and update.
Fixed architecture
The communication patterns, protocols, and policies seen in pure client/server designs are sometimes sensible choices for database-centric applications, but they are by no means universally appropriate. For example, a fixed architecture could not support the sorts of applications deploying mobile agents.

Solutions to these problems and limitations mainly reflect experience-driven design knowledge accrued during the historical progression from custom small-scale systems, to specialized systems such as network file systems (for example NFS; see FILE SERVER), to enterprise-level business systems, to the kinds of global multipurpose systems currently being developed. Engineering support for many of the development practices, services, and components described in the remainder of this article is increasingly provided by standardized distributed programming frameworks based upon OMG CORBA, OSF DCE, Microsoft DCOM, and the Java programming language.

Names and identifiers

Contemporary distributed systems rely on naming services that maintain sets of logical names and map them to physical addresses and associated low-level protocols. The most common and familiar naming service is DNS (Domain Name System), which provides a basis for the web page naming scheme used on the World Wide Web. DNS maps Internet names (for example www.sun.com) to Internet addresses, which are then further translated to hardware-level addresses and connection protocols. Most distributed systems augment general-purpose naming systems in order to maintain mappings from the services supported by the system to the nodes, processes, and software objects that provide them. Most components are not given human-readable names, but are instead assigned arbitrary object identifiers that are used by brokers (mediators that perform services on behalf of other components) and related components when locating services.

Naming services are usually implemented by distributed algorithms in which each node knows of only a small subset of the name space, but also knows of other nodes to query when an attempted lookup fails. Distributed name spaces are structured in a mainly hierarchical fashion that simplifies usage, streamlines lookup algorithms, and permits federation among different name services. Name space mappings need not be restricted to nodes or objects. For example, names may be associated with groups that are accessed through channels with multiple connection points.
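The lookup scheme just described can be sketched as follows; the names, addresses, and two-level federation are illustrative assumptions:

```java
import java.util.*;

public class NameServer {
    // Each server knows only part of the name space, but also knows
    // another server to query when an attempted local lookup fails.
    private final Map<String, String> table = new HashMap<>();
    private final NameServer parent;     // next server to query; null at the root

    NameServer(NameServer parent) { this.parent = parent; }

    void bind(String name, String address) { table.put(name, address); }

    String lookup(String name) {
        String addr = table.get(name);
        if (addr != null) return addr;                      // resolved locally
        return parent != null ? parent.lookup(name) : null; // delegate upward
    }

    public static void main(String[] args) {
        NameServer root = new NameServer(null);
        root.bind("printer", "10.0.0.7:515");
        NameServer local = new NameServer(root);
        local.bind("files", "10.0.1.2:2049");
        System.out.println(local.lookup("files"));    // resolved locally
        System.out.println(local.lookup("printer"));  // delegated to root
    }
}
```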

Interfaces and implementations

Interfaces are formal declarations of operations (methods) supported by implementation objects. Most distributed systems rely on a standard means for defining interfaces describing sets of services, expressed in an Interface Definition Language (IDL). Systems maintain interface descriptions, along with bindings to available implementations, in repositories that are tied to naming services in order to provide lookup capabilities. Object interfaces are very similar to classes in object-oriented programming (q.v.) languages. Each interface consists of a set of service declarations. Each service is declared in a similar fashion as an object-oriented method: a named operation that may carry arguments, results, and exceptions. Arguments and results may consist of arbitrary data, including control parameters, names or references to other components, image data, implementation code for other components, and descriptions of other interfaces. IDLs differ from object-oriented languages in that they do not permit definition of programming details indicating how a declared service is implemented. Some IDLs also support declarations of special message passing protocols that must be used when sending and receiving messages of the indicated form.

Special development tools can be used to generate code that connects declared services to implementation code written in a standard programming language, and possibly consisting of many programming-language level objects, methods, or modules. IDL-based tools typically generate code that enables components to be invoked via a particular Remote Procedure Call (RPC) or Remote Message Invocation (RMI; also known as Object RPC) protocol discipline. (The main difference between RPC and RMI is that an RMI message recipient is specified by an identifier that must be resolved by a broker; this is the role of the ORB -- Object Request Broker -- in CORBA.) Most tools create proxy objects that locate ultimate message recipients, encode (marshal, or serialize) selectors and data into lower-level buffers and packets, and transport packets. At the recipient site, symmetrical dispatch objects decode (unmarshal) packets and invoke the desired service in the corresponding local component.
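The marshalling and unmarshalling steps performed by proxy and dispatch objects can be sketched as follows, using in-memory buffers in place of real transport; the deposit selector is an illustrative assumption:

```java
import java.io.*;

public class Marshal {
    // Sketch of the marshalling performed by a proxy object: encode a
    // service selector and its argument into a byte buffer suitable for
    // transport. (In-memory streams cannot actually fail, so the checked
    // IOException is wrapped.)
    static byte[] marshal(String selector, long arg) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(selector);
            out.writeLong(arg);
            return buf.toByteArray();
        } catch (IOException cannotHappen) {
            throw new RuntimeException(cannotHappen);
        }
    }

    // The symmetrical decoding performed by a dispatch object on receipt:
    // recover the selector and argument and reconstruct the invocation.
    static String unmarshal(byte[] packet) {
        try {
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(packet));
            return in.readUTF() + "(" + in.readLong() + ")";
        } catch (IOException cannotHappen) {
            throw new RuntimeException(cannotHappen);
        }
    }

    public static void main(String[] args) {
        byte[] packet = marshal("deposit", 100L);
        System.out.println(unmarshal(packet));   // deposit(100)
    }
}
```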

Some programming languages, most notably Java, possess an interface construct that enables programmers to specify an interface, implement it, and arrange the underlying connections without the need for a separate IDL. For example, a Java interface used in an oversimplified banking system might declare the following operations to perform balance inquiries, deposits, and withdrawals on a particular account:

public interface Account {
  long getBalance(UserID id)            throws UnauthorizedAccessException;
  void deposit(long amount, UserID id)  throws UnauthorizedAccessException;
  void withdraw(long amount, UserID id) throws UnauthorizedAccessException, 
                                               InsufficientFundsException;
}

Interfaces, naming services, and associated mechanisms and constructs together make distributed system programming more similar to sequential programming. Higher-level languages and tools can be used to make distributed programming constructs indistinguishable from sequential ones. For example, a program statement such as myAccount.getBalance(myID) need not itself reveal whether it is just a local invocation within a sequential program, or a distributed invocation. This transparency (q.v.) shields developers from needing to know the location of the service, whether it has moved, the underlying protocol used to obtain it, whether it is replicated for the sake of fault tolerance, the presence of security measures such as encryption, and so on. Transparency simplifies usage of scripting languages whose main role is to glue together sets of existing distributed components to build applications. However, most development languages and tools that are used to build the underlying components provide nontransparent programming abstractions. Since distributed message passing may differ arbitrarily from local invocation with respect to semantics, latencies, and failure modes, most languages keep them separate. This forces developers to deal with distribution-specific issues as they arise, while also providing programmatic support for more limited senses of transparency that may be desirable in particular systems.
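As a concrete illustration of this transparency, a minimal local implementation of the Account interface above services the invocation myAccount.getBalance(myID) exactly as a generated remote proxy for the same interface would. The UserID and exception classes here are hypothetical stand-ins, repeated so the sketch is self-contained:

```java
// Hypothetical supporting classes for the Account interface above.
class UserID {
    final String name;
    UserID(String name) { this.name = name; }
}
class UnauthorizedAccessException extends Exception {}
class InsufficientFundsException extends Exception {}

interface Account {
    long getBalance(UserID id) throws UnauthorizedAccessException;
    void deposit(long amount, UserID id) throws UnauthorizedAccessException;
    void withdraw(long amount, UserID id) throws UnauthorizedAccessException,
                                                 InsufficientFundsException;
}

// A minimal in-memory implementation. A caller cannot tell whether this
// local object or a remote proxy is servicing its invocations.
class LocalAccount implements Account {
    private final String owner;
    private long balance;
    LocalAccount(String owner) { this.owner = owner; }

    private void check(UserID id) throws UnauthorizedAccessException {
        if (!owner.equals(id.name)) throw new UnauthorizedAccessException();
    }
    public long getBalance(UserID id) throws UnauthorizedAccessException {
        check(id);
        return balance;
    }
    public void deposit(long amount, UserID id) throws UnauthorizedAccessException {
        check(id);
        balance += amount;
    }
    public void withdraw(long amount, UserID id)
            throws UnauthorizedAccessException, InsufficientFundsException {
        check(id);
        if (amount > balance) throw new InsufficientFundsException();
        balance -= amount;
    }
}

public class BankDemo {
    static long demo() {
        try {
            Account myAccount = new LocalAccount("alice");
            UserID myID = new UserID("alice");
            myAccount.deposit(100, myID);
            myAccount.withdraw(30, myID);
            return myAccount.getBalance(myID);
        } catch (Exception e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());   // 70
    }
}
```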

Component management

Systems composed out of many small-granularity interfaces, classes, and objects are generally more reusable, reliable, and economical than those built using only a few custom monolithic programs. For example a Bank object may be composed mainly of sets of Account objects. In some systems, just about any software object may be a candidate for independent use as a distributed component. Such practices lead to the existence of many more components than there are computers in a system. (This may be mitigated by the increasing deployment of information appliances and other plentiful small computers.)

Object components are managed by lifecycle services that are somewhat analogous to, but extend, virtual memory techniques that allow computers to act as if they have more memory than they do. Rather than having each node or process support a single service component, the system arranges that each component is made available whenever it is needed, without occupying computational resources when it is not needed. Many aspects of lifecycle support can be provided as a set of services that are otherwise constructed in the same way as any other distributed component. Support may include:

Configuration
Establishing or modifying parameters and bindings that control policies and protocols employed by one or more components; for example those surrounding authentication and quality of service.
Activation
Creating a new instance of a component in either a new process or an existing process.
Deactivation
Suspending a component and saving its state persistently for possible later re-activation.
Mobility
Transferring a component from one node to another.
Load balancing
Placing components on or moving components to the least busy nodes.
Replication
Establishing multiple copies of a component for the sake of performance or reliability.
Garbage collection (q.v.)
Destroying and releasing resources of a component that is no longer being used in the system
Versioning
Replacing a faulty or outdated implementation of a component with a new implementation.
Utilities
Supplying general-purpose functionality needed by other components, for example performing clock synchronization across nodes.

Nearly all aspects of object management, and distributed programming more generally, entail performance engineering. The most central performance issue in distributed systems is the most obvious one: distributed message passing can sometimes be one million times slower than local invocation. Some of this overhead remains even when using the fastest available computing and communication technologies. This observation has led to development of increasingly efficient schemes for performing each of the sub-operations employed in message transmission. Performance optimizations that can normally be applied without compromising the integrity of system designs include:

Data caching
Saving copies of previously requested remote data and reusing them unless they have changed.
Message aggregation
Combining frequent small messages into less frequent larger ones.
Component clustering
Statically or dynamically enhancing locality by placing heavily communicating components (or their replicates) as close together as possible: in the same process, on the same machine, or on the same local network.
Protocol streamlining
Using weaker but faster protocols when possible. For example, external requests may be screened through expensive authentication checks only once upon entry into a system so they do not need to be checked on each internal message send involved in processing the requests.
Algorithmic improvements
Using algorithms that are less impacted by communication latency. For example, optimistic techniques assume the success of rarely-failing requests without waiting for verification, but are also prepared to perform expensive rollbacks or retries upon eventual notification of failure.
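The data caching optimization above can be sketched as follows; the stock quote service and its fixed reply are simulated stand-ins, with a counter standing in for the cost of remote fetches:

```java
import java.util.*;

public class QuoteCache {
    // Data caching sketch: save copies of previously requested remote
    // data and reuse them until invalidated.
    private final Map<String, String> cache = new HashMap<>();
    int remoteFetches = 0;   // counts the (simulated) expensive remote calls

    private String remoteFetch(String symbol) {   // simulated remote call
        remoteFetches++;
        return symbol + ":42.00";
    }

    String get(String symbol) {
        // Reuse the cached copy if present; otherwise fetch and save it.
        return cache.computeIfAbsent(symbol, this::remoteFetch);
    }

    void invalidate(String symbol) { cache.remove(symbol); }  // on change

    public static void main(String[] args) {
        QuoteCache c = new QuoteCache();
        c.get("ACME"); c.get("ACME"); c.get("ACME");
        System.out.println(c.remoteFetches);   // 1: two requests served from cache
    }
}
```

The hard part in practice is the invalidation step: the cache must learn that remote data have changed, which itself requires one of the messaging protocols described earlier.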

Architectural styles

Even when they are based on common infrastructures, different distributed systems or subsystems may be composed in accord with vastly different architectural styles. Systems or subsystems may be based on one or more design patterns including the following.

Peer services
Nearly every system includes at least some components that communicate via classic procedural request-reply mechanisms. For example, one component may handle a user's request for a particular stock quote by issuing an RPC or RMI, awaiting the result, and then displaying it to the user. However, service-based systems are by no means limited to pure client/server designs. For example, the recipient of the stock quote request may in turn exchange messages with several other peer components before it can compute the requested value.

Fault-tolerant services
There are two main approaches to improving the fault tolerance of services: replication and persistence. Replication entails cloning components and ensuring that all replicates process the same messages in the same order. If any one of them fails, others will still be able to continue. Persistence-based solutions rely on checkpointing: periodically saving the states of components so that they can be resurrected in the event of failure. There are many variants that involve both replication and persistence; for example standby techniques in which replicates persistently log all actions to the primary host, and then execute them all at once upon failure in order to achieve the correct state. Since there is no upper limit on how much replication or checkpointing is enough, and since solutions can be relatively costly in terms of resources, performance, and system complexity, any application of fault-tolerance measures involves engineering tradeoffs.
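The checkpointing approach above can be sketched as follows; an in-memory map stands in for the persistent medium, and a single balance field for component state:

```java
import java.util.*;

public class Checkpointed {
    // Checkpointing sketch: periodically save component state so that it
    // can be resurrected after a failure. The "stable storage" here is an
    // in-memory map standing in for a persistent medium.
    private final Map<String, Long> stable = new HashMap<>();
    long balance;          // volatile component state

    void checkpoint(String key) { stable.put(key, balance); }

    void restore(String key) {  // resurrect the last saved state after a crash
        balance = stable.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        Checkpointed c = new Checkpointed();
        c.balance = 100;
        c.checkpoint("acct");
        c.balance = 999;       // updates made after the checkpoint...
        c.restore("acct");     // ...are lost when the state is resurrected
        System.out.println(c.balance);  // 100
    }
}
```

The demonstration also shows the engineering tradeoff mentioned above: work performed between checkpoints is lost on failure, so checkpoint frequency trades recovery fidelity against runtime cost.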

Transactional services
Transactional protocols extend the concurrency control and persistence support used in monolithic databases to the realm of distributed systems. Most transactions operate on sets of service requests that must be performed in an all-or-none fashion. For example, a bank transfer operation consists of two requests: to withdraw money from one account and to deposit it in another. These two requests should fail as a unit if any problems arise. Distributed transactions differ from their sequential counterparts mainly by virtue of employing multiphase commit protocols that deal with constituent operations performed on different nodes, as well as interactions between distributed failure handling and concurrency control techniques.
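The all-or-none behavior of the bank transfer example can be sketched as follows; the in-memory accounts and local rollback are simplifying assumptions standing in for a real multiphase commit across nodes:

```java
public class Transfer {
    // All-or-none transfer sketch: the withdraw and deposit succeed or
    // fail as a unit. A distributed transaction would coordinate the two
    // operations on different nodes via a multiphase commit protocol;
    // here the rollback is simulated locally.
    static boolean transfer(long[] accounts, int from, int to, long amount) {
        long saved = accounts[from];             // checkpoint for rollback
        accounts[from] -= amount;
        if (accounts[from] < 0) {                // a constituent operation fails
            accounts[from] = saved;              // roll back: no partial effect
            return false;
        }
        accounts[to] += amount;
        return true;                             // both operations committed
    }

    public static void main(String[] args) {
        long[] accounts = {100, 0};
        System.out.println(transfer(accounts, 0, 1, 250));    // false
        System.out.println(accounts[0] + " " + accounts[1]);  // 100 0: unchanged
    }
}
```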

Legacy services
The successes of even the earliest distributed system components are in part responsible for the fact that relatively few existing distributed systems are structured in an ideal fashion. Nearly every system must accommodate legacy components, subsystems, and applications. For example, some old transaction processing systems and databases would be too difficult, time-consuming, or disruptive to redesign and re-implement. Some legacy software may be gracefully integrated by retrospectively defining interfaces and retrofitting structured messaging protocols. Other components resist such efforts and are dealt with in an ad hoc fashion.

Asynchronous messages
A number of push protocols extend the primitive asynchronous message passing style in which components issue messages without necessarily expecting replies. This style is analogous to mail systems, as well as to radio and television broadcasting. For example, a publish/subscribe stock quote service may periodically multicast quote updates to a group of subscriber components. Such protocols may involve many intermediate nodes that hierarchically route and distribute messages, as well as those employed to enhance fault tolerance. Event-based systems are structured in a similar fashion. For example, a remote sensing component may periodically issue events indicating sensed changes in the world. Perhaps the most widespread push-based system is the Usenet news system, which propagates postings to news servers across the Internet so that users may more quickly and conveniently access local copies. Similar protocols are used in software distribution systems that propagate program updates to all subscribers.
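A minimal publish/subscribe sketch follows; the quote format and in-process subscriber list are illustrative assumptions standing in for a real multicast group:

```java
import java.util.*;
import java.util.function.Consumer;

public class QuotePublisher {
    // Publish/subscribe sketch: the publisher multicasts each update to
    // every registered subscriber without expecting replies.
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    void subscribe(Consumer<String> s) { subscribers.add(s); }

    void publish(String quote) {
        for (Consumer<String> s : subscribers) s.accept(quote);  // one-way push
    }

    public static void main(String[] args) {
        QuotePublisher pub = new QuotePublisher();
        pub.subscribe(q -> System.out.println("display: " + q));
        pub.subscribe(q -> System.out.println("logger:  " + q));
        pub.publish("ACME 42.00");
    }
}
```

In a distributed setting, each Consumer would stand for a remote subscriber reached through routing intermediaries rather than a local callback.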

Collaborative groups
In electronic calendar, white-board, and groupware systems, sets of otherwise independent components must occasionally coordinate efforts to reach a common goal; for example to schedule a meeting. Among other options, collaborative systems may be based on shared persistent channels. Participants transiently enlist in channels maintaining messages or data needed by all group members for the course of a session. A channel may be structured for example as a tuple space, consisting of records or objects that may be entered, removed, and read by any member (see COORDINATION LANGUAGES).
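A tuple space channel of the kind just described can be sketched as follows; the string tuples and prefix matching are simplifying assumptions (real tuple spaces match on structured fields, and block takers until a matching entry appears):

```java
import java.util.*;

public class TupleSpace {
    // Minimal tuple-space sketch: any group member may enter, read, or
    // remove entries from a shared channel.
    private final List<String> space = new ArrayList<>();

    synchronized void out(String tuple) { space.add(tuple); }   // enter

    synchronized String rd(String prefix) {                     // non-destructive read
        for (String t : space) if (t.startsWith(prefix)) return t;
        return null;
    }

    synchronized String in(String prefix) {                     // destructive take
        String t = rd(prefix);
        if (t != null) space.remove(t);
        return t;
    }

    public static void main(String[] args) {
        TupleSpace ts = new TupleSpace();
        ts.out("meeting Tuesday 3pm");
        System.out.println(ts.rd("meeting"));  // still present after a read
        System.out.println(ts.in("meeting"));  // removed by the take
        System.out.println(ts.rd("meeting"));  // null: entry is gone
    }
}
```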

Content-based processing
In the World Wide Web, as well as many multimedia systems, each server has a very simple interface, often consisting only of an operation that returns the information content specified by an identifier such as a URL (Uniform Resource Locator). However, this content is self-describing: messages indicate the nature of the software needed to display or otherwise use the data. Clients in turn map this description to locally available software (for example an image rendering program), if present, and use it to process the content. Associating content description (also known as metadata) with messages provides the flexibility needed to deal with ever-growing media types and formats, but at the expense of possible failures when clients do not possess the software needed to handle a new content type.

Mobile code
Mobile code systems extend content-based systems by employing active messages that include not descriptions, but the actual software (for example a set of Java classes) or instructions needed by a component to perform a given function. Agent systems further extend these capabilities by permitting some degree of autonomy to mobile code, so that it may in turn spawn additional agents on other nodes while in the process of, for example, searching for the best price for an item requested by a consumer (see MULTI-AGENT SYSTEMS). Often, this code is structured to allow disconnected operation, in which some processing proceeds even when other nodes are temporarily unreachable.

References

Jos Marlowe, Doug Lea, and Malcolm Atkinson

