4 Local and Distributed Computing
The major differences between local and distributed computing concern the areas of latency, memory access, partial failure, and concurrency.1 The difference in latency is the most obvious, but in many ways is the least fundamental. The often overlooked differences concerning memory access, partial failure, and concurrency are far more difficult to explain away, and the differences concerning partial failure and concurrency make unifying the local and remote computing models impossible without making unacceptable compromises.
4.1 Latency
The most obvious difference between a local object invocation and the invocation of an operation on a remote (or possibly remote) object has to do with the latency of the two calls. The difference between the two is currently between four and five orders of magnitude, and given the relative rates at which processor speed and network latency are changing, the difference in the future promises to be at best no better, and will likely be worse. It is this disparity in efficiency that is often seen as the essential difference between local and distributed computing.
Ignoring the difference between the performance of local and remote invocations can lead to designs whose implementations are virtually assured of having performance problems because the design requires a large amount of communication between components that are in different address spaces and on different machines. Ignoring the difference in the time it takes to make a remote object invocation and the time it takes to make a local object invocation is to ignore one of the major design areas of an application. Designing an application properly requires determining, by understanding the application being designed, what objects can be made remote and what objects must be clustered together.
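To make the cost of a communication-heavy design concrete, the following back-of-the-envelope sketch compares three placements of the same work. The per-call figures and the call count are assumptions chosen only to reflect a gap of four orders of magnitude; they are not measurements.

```python
# Illustrative figures only (assumptions, not measurements).
LOCAL_CALL_S = 100e-9   # assumed cost of one local invocation: 100 ns
REMOTE_CALL_S = 1e-3    # assumed cost of one remote invocation: 1 ms (10^4 times slower)


def total_time(calls: int, per_call_s: float) -> float:
    """Time spent purely on invocation overhead for `calls` invocations."""
    return calls * per_call_s


# A hypothetical design that makes 10,000 fine-grained invocations per request.
CALLS_PER_REQUEST = 10_000

local_cost = total_time(CALLS_PER_REQUEST, LOCAL_CALL_S)     # all objects co-located
chatty_cost = total_time(CALLS_PER_REQUEST, REMOTE_CALL_S)   # every call crosses the network
# Same work, but clustered so that only 10 coarse-grained calls cross the network.
clustered_cost = (total_time(10, REMOTE_CALL_S)
                  + total_time(CALLS_PER_REQUEST - 10, LOCAL_CALL_S))

print(f"all local : {local_cost * 1e3:.3f} ms")
print(f"all remote: {chatty_cost * 1e3:.3f} ms")
print(f"clustered : {clustered_cost * 1e3:.3f} ms")
```

Under these assumed figures, the fully remote placement spends roughly ten seconds on invocation overhead alone, while clustering the objects that talk constantly brings the total back to the order of ten milliseconds. This is the kind of decision about which objects to make remote that the paragraph above describes.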
The vision outlined earlier, however, has an answer to this objection. The answer is two-pronged. The first prong is to rely on the steadily increasing speed of the underlying hardware to make the difference in latency irrelevant. This, it is often argued, is what has happened to efficiency concerns having to do with everything from high level languages to virtual memory. Designing at the cutting edge has always required that the hardware catch up before the design is efficient enough for the real world. Arguments from efficiency seem to have gone out of style in software engineering, since in the past such concerns have always been answered by speed increases in the underlying hardware.
The second prong of the reply is to admit to the need for tools that will allow one to see what the pattern of communication is between the objects that make up an application. Once such tools are available, it will be a matter of tuning to bring objects that are in constant contact to the same address space, while moving those that are in relatively infrequent contact to wherever is most convenient. Since the vision allows all objects to communicate using the same underlying mechanism, such tuning will be possible by simply altering the implementation details (such as object location) of the relevant objects. However, it is important to get the application correct first, and after that one can worry about efficiency.
Whether or not it will ever become possible to mask the efficiency difference between a local object invocation and a distributed object invocation is not answerable a priori. Fully masking the distinction would require not only advances in the technology underlying remote object invocation, but would also require changes to the general programming model used by developers.
If the only difference between local and distributed object invocations were the amount of time it took to make the call, one could strive for a future in which the two kinds of calls would be conceptually indistinguishable. Whether the technology of distributed computing has moved far enough along to allow one to plan products based on such technology would be a matter of judgement, and rational people could disagree as to the wisdom of such an approach.
However, the difference in latency between the two kinds of calls is only the most obvious difference. Indeed, this difference is not really the fundamental difference between the two kinds of calls, and even if it were possible to develop the technology of distributed calls to an extent that the difference in latency between the two sorts of calls was minimal, it would be unwise to construct a programming paradigm that treated the two calls as essentially similar. In fact, the difference in latency between local and remote calls, because it is so obvious, has been the only difference most see between the two, and has tended to mask the more irreconcilable differences.
4.2 Memory Access
A more fundamental (but still obvious) difference between local and remote computing concerns the access to memory in the two cases, specifically in the use of pointers. Simply put, pointers in a local address space are not valid in another (remote) address space. The system can paper over this difference, but for such an approach to be successful, the transparency must be complete. Two choices exist: either all memory access must be controlled by the underlying system, or the programmer must be aware of the different types of access, local and remote. There is no in-between.
If the desire is to completely unify the programming model, making remote accesses behave as if they were in fact local, the underlying mechanism must totally control all memory access. Providing distributed shared memory is one way of completely relieving the programmer from worrying about remote memory access (or the difference between local and remote). Using the object-oriented paradigm to the fullest, and requiring the programmer to build an application with “objects all the way down” (that is, only object references or values are passed as method arguments), is another way to eliminate the boundary between local and remote computing. The layer underneath can exploit this approach by marshalling and unmarshalling method arguments and return values for inter-address-space transmission.
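A small sketch may help show why marshalling changes what a call means. Here Python's `pickle` module stands in for the marshalling layer, and the helper names (`call_locally`, `call_as_if_remote`, `append_item`) are hypothetical. No real network is involved; the round trip through bytes only imitates crossing into another address space.

```python
import pickle


def append_item(items: list) -> None:
    """Mutates its argument in place, relying on memory shared with the caller."""
    items.append("added")


def call_locally(func, arg):
    # Local call: caller and callee share one address space, so the callee
    # receives a reference to the caller's own object.
    func(arg)


def call_as_if_remote(func, arg):
    # Stand-in for a remote call: the argument is marshalled to bytes and
    # unmarshalled into a fresh object, as it would be in another address space.
    wire_bytes = pickle.dumps(arg)
    remote_copy = pickle.loads(wire_bytes)
    func(remote_copy)


local_list = []
call_locally(append_item, local_list)        # the caller's list is modified

remote_list = []
call_as_if_remote(append_item, remote_list)  # only the unmarshalled copy is modified

print(local_list, remote_list)
```

The same function behaves differently depending on how its argument reached it: code that communicates through a shared reference works in the local case and silently does nothing useful once the argument has been marshalled.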
But adding a layer that allows the replacement of all pointers to objects with object references only permits the developer to adopt a unified model of object interaction. Such a unified model cannot be enforced unless one also removes the ability to get address-space-relative pointers from the language used by the developer. Such an approach erects a barrier to programmers who want to start writing distributed applications, in that it requires that those programmers learn a new style of programming which does not use address-space-relative pointers. In requiring that programmers learn such a language, moreover, one gives up the complete transparency between local and distributed computing.
Even if one were to provide a language that did not allow obtaining address-space-relative pointers to objects (or returned an object reference whenever such a pointer was requested), one would need to provide an equivalent way of making cross-address space reference to entities other than objects. Most programmers use pointers as references for many different kinds of entities. These pointers must either be replaced with something that can be used in cross-address space calls or the programmer will need to be aware of the difference between such calls (which will either not allow pointers to such entities, or do something special with those pointers) and local calls. Again, while this could be done, it does violate the doctrine of complete unity between local and remote calls. Because of memory access constraints, the two have to differ.
The danger lies in promoting the myth that “remote access and local access are exactly the same” and not enforcing the myth. An underlying mechanism that does not unify all memory accesses while still promoting this myth is both misleading and prone to error. Programmers buying into the myth may believe that they do not have to change the way they think about programming. The programmer is therefore quite likely to make the mistake of using a pointer in the wrong context, producing incorrect results. “Remote is just like local,” such programmers think, “so we have just one unified programming model.” Seemingly, programmers need not change their style of programming. In an incomplete implementation of the underlying mechanism, or one that allows an implementation language that in turn allows direct access to local memory, the system does not take care of all memory accesses, and errors are bound to occur. These errors occur because the programmer is not aware of the difference between local and remote access and what is actually happening “under the covers.”
The alternative is to explain the difference between local and remote access, making the programmer aware that remote address space access is very different from local access. Even if some of the pain is taken away by using an interface definition language like that specified in [1] and having it generate an intelligent language mapping for operation invocation on distributed objects, the programmer aware of the difference will not make the mistake of using pointers for cross-address space access. The programmer will know it is incorrect. Because the difference is not masked, the programmer is able to learn when to use one method of access and when to use the other.
Just as with latency, it is logically possible that the difference between local and remote memory access could be completely papered over and a single model of both presented to the programmer. When we turn to the problems introduced to distributed computing by partial failure and concurrency, however, it is not clear that such a unification is even conceptually possible.
4.3 Partial failure and concurrency
While unlikely, it is at least logically possible that the differences in latency and memory access between local computing and distributed computing could be masked. It is not clear that such a masking could be done in such a way that the local computing paradigm could be used to produce distributed applications, but it might still be possible to allow some new programming technique to be used for both activities. Such a masking does not even seem to be logically possible, however, in the case of partial failure and concurrency. These aspects appear to be different in kind in the case of distributed and local computing.2
Partial failure is a central reality of distributed computing. Both the local and the distributed world contain components that are subject to periodic failure. In the case of local computing, such failures are either total, affecting all of the entities that are working together in an application, or detectable by some central resource allocator (such as the operating system on the local machine).
This is not the case in distributed computing, where one component (machine, network link) can fail while the others continue. Not only is the failure of the distributed components independent, but there is no common agent that is able to determine what component has failed and inform the other components of that failure, no global state that can be examined that allows determination of exactly what error has occurred. In a distributed system, the failure of a network link is indistinguishable from the failure of a processor on the other side of that link.
These sorts of failures are not the same as mere exception raising or the inability to complete a task, which can occur in the case of local computing. This type of failure is caused when a machine crashes during the execution of an object invocation or a network link goes down, occurrences that cause the target object to simply disappear rather than return control to the caller. A central problem in distributed computing is ensuring that the state of the whole system is consistent after such a failure; this is a problem that simply does not occur in local computing.
The reality of partial failure has a profound effect on how one designs interfaces and on the semantics of the operations in an interface. Partial failure requires that programs deal with indeterminacy. When a local component fails, it is possible to know the state of the system that caused the failure and the state of the system after the failure. No such determination can be made in the case of a distributed system. Instead, the interfaces that are used for the communication must be designed in such a way that it is possible for the objects to react in a consistent way to possible partial failures.
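The indeterminacy described above can be illustrated with a small simulation. Everything here is hypothetical: the `Account` class, the `remote_deposit` helper, and the injected `fault` parameter are stand-ins for a real remote object and a real network, and `TimeoutError` stands in for whatever signal a real caller would receive.

```python
class Account:
    """Hypothetical remote object; lives in the 'server' address space."""

    def __init__(self) -> None:
        self.balance = 0

    def deposit(self, amount: int) -> int:
        self.balance += amount
        return self.balance


def remote_deposit(account: Account, amount: int, fault: str = "none") -> int:
    """Simulated remote invocation with an injected fault.

    fault = "request_lost": the request never reaches the server.
    fault = "reply_lost":   the server runs the operation, but the reply is lost.
    """
    if fault == "request_lost":
        raise TimeoutError("no reply")
    result = account.deposit(amount)
    if fault == "reply_lost":
        raise TimeoutError("no reply")
    return result


def observe(fault: str):
    """Return (what the caller saw, what actually happened on the server)."""
    account = Account()
    try:
        seen = remote_deposit(account, 100, fault)
    except TimeoutError as exc:
        seen = f"TimeoutError: {exc}"
    return seen, account.balance


caller_view_a, server_state_a = observe("request_lost")
caller_view_b, server_state_b = observe("reply_lost")
print(caller_view_a, server_state_a)
print(caller_view_b, server_state_b)
```

In both runs the caller sees exactly the same thing, yet the deposit was applied in one case and not in the other. The simulation can inspect `account.balance` directly only because it is not actually distributed; a real caller has no such view.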
Being robust in the face of partial failure requires some expression at the interface level. Merely improving the implementation of one component is not sufficient. The interfaces that connect the components must be able to state whenever possible the cause of failure, and there must be interfaces that allow reconstruction of a reasonable state when failure occurs and the cause cannot be determined.
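One common way to express failure handling at the interface level, offered here only as an illustration and not as the authors' proposal, is to make the caller supply a request identifier so that an operation can be retried safely when its outcome is unknown. The classes `Account` and `FlakyLink` and the function `deposit_with_retry` are hypothetical names, and the lost replies are simulated.

```python
import uuid


class Account:
    """Hypothetical server-side object whose interface is designed for retries."""

    def __init__(self) -> None:
        self.balance = 0
        self._completed = {}  # request_id -> result of the first execution

    def deposit(self, request_id: str, amount: int) -> int:
        # A repeated request_id means the caller is retrying: return the
        # recorded result instead of applying the deposit a second time.
        if request_id in self._completed:
            return self._completed[request_id]
        self.balance += amount
        self._completed[request_id] = self.balance
        return self.balance


class FlakyLink:
    """Simulated network link that loses the reply to the first N calls."""

    def __init__(self, account: Account, replies_to_lose: int) -> None:
        self.account = account
        self.replies_to_lose = replies_to_lose

    def deposit(self, request_id: str, amount: int) -> int:
        result = self.account.deposit(request_id, amount)  # server does the work
        if self.replies_to_lose > 0:
            self.replies_to_lose -= 1
            raise TimeoutError("reply lost")
        return result


def deposit_with_retry(link: FlakyLink, amount: int, attempts: int = 3) -> int:
    request_id = str(uuid.uuid4())  # one id shared by all attempts of this logical call
    for _ in range(attempts):
        try:
            return link.deposit(request_id, amount)
        except TimeoutError:
            continue  # outcome unknown; retrying is safe because of request_id
    raise TimeoutError("gave up; outcome still unknown")


account = Account()
link = FlakyLink(account, replies_to_lose=2)
final_balance = deposit_with_retry(link, 100)
print(final_balance, account.balance)
```

The point of the sketch is that the extra `request_id` parameter is part of the interface, not the implementation. A purely local `deposit(amount)` has no need for it, which is the kind of cost the unified model would impose on every object.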
If an object is coresident in an address space with its caller, partial failure is not possible. A function may not complete normally, but it always completes. There is no indeterminism about how much of the computation completed. Partial completion can occur only as a result of circumstances that will cause the other components to fail.
The addition of partial failure as a possibility in the case of distributed computing does not mean that a single object model cannot be used for both distributed computing and local computing. The question is not “can you make remote method invocation look like local method invocation?” but rather “what is the price of making remote method invocation identical to local method invocation?” One of two paths must be chosen if one is going to have a unified model.
The first path is to treat all objects as if they were local and design all interfaces as if the objects calling them, and being called by them, were local. The result of choosing this path is that the resulting model, when used to produce distributed systems, is essentially indeterministic in the face of partial failure and consequently fragile and nonrobust. This path essentially requires ignoring the extra failure modes of distributed computing. Since one can’t get rid of those failures, the price of adopting the model is to require that such failures are unhandled and catastrophic.
The other path is to design all interfaces as if they were remote. That is, the semantics and operations are all designed to be deterministic in the face of failure, both total and partial. However, this introduces unnecessary guarantees and semantics for objects that are never intended to be used remotely. Like the approach to memory access that attempts to require that all access is through system-defined references instead of pointers, this approach must also either rely on the discipline of the programmers using the system or change the implementation language so that all of the forms of distributed indeterminacy are forced to be dealt with on all object invocations.
This approach would also defeat the overall purpose of unifying the object models. The real reason for attempting such a unification is to make distributed computing more like local computing and thus make distributed computing easier. This second approach to unifying the models makes local computing as complex as distributed computing. Rather than encouraging the production of distributed applications, such a model will discourage its own adoption by making all object-based computing more difficult.
Similar arguments hold for concurrency. Distributed objects by their nature must handle concurrent method invocations. The same dichotomy applies if one insists on a unified programming model. Either all objects must bear the weight of concurrency semantics, or all objects must ignore the problem and hope for the best when distributed. Again, this is an interface issue and not solely an implementation issue, since dealing with concurrency can take place only by passing information from one object to another through the agency of the interface. So either the overall programming model must ignore significant modes of failure, resulting in a fragile system; or the overall programming model must assume a worst-case complexity model for all objects within a program, making the production of any program, distributed or not, more difficult.
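As a rough illustration of what "bearing the weight of concurrency semantics" means for a single object, the sketch below guards an object's state with a lock so that overlapping invocations do not corrupt it. Threads within one process are used only as a stand-in for independent callers; the `Counter` class and `run_callers` helper are hypothetical.

```python
import threading


class Counter:
    """Hypothetical object that may receive invocations from many callers at once."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # The read-modify-write must be atomic with respect to other callers.
        with self._lock:
            self._value += 1

    def value(self) -> int:
        with self._lock:
            return self._value


def run_callers(counter: Counter, callers: int, calls_each: int) -> None:
    """Invoke counter.increment() from several threads at the same time."""

    def caller() -> None:
        for _ in range(calls_each):
            counter.increment()

    threads = [threading.Thread(target=caller) for _ in range(callers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


counter = Counter()
run_callers(counter, callers=8, calls_each=1_000)
print(counter.value())
```

Every object written this way pays for the lock whether or not it is ever invoked concurrently, which is the worst-case complexity that a unified model would push onto all objects.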
One might argue that a multi-threaded application needs to deal with these same issues. However, there is a subtle difference. In a multi-threaded application, there is no real source of indeterminacy of invocations of operations. The application programmer has complete control over invocation order when desired. A distributed system by its nature introduces truly asynchronous operation invocations. Further, a non-distributed system, even when multi-threaded, is layered on top of a single operating system that can aid the communication between objects and can be used to determine and aid in synchronization and in the recovery of failure. A distributed system, on the other hand, has no single point of resource allocation, synchronization, or failure recovery, and thus is conceptually very different.