Experiences with Data Distribution Management in Large-Scale Federations
Bill Helfinstine
Deborah Wilbert
Mark Torpey
Wayne Civinskas
Lockheed Martin Information Systems
Advanced Simulation Center
164 Middlesex Turnpike
Burlington, MA 01803
781-505-9500
Keywords:
HLA, RTI, DDM, scalability, Joint Experimentation
ABSTRACT: The JSAF Federation is one of several simulation tools at US Joint Forces Command used to support large-scale simulation exercises for Joint Experimentation. Given both the large-scale nature of joint operations (i.e., theater-level conflict) and the forward-looking aspect of the Joint Experimentation process, these experiments inherently involve the simulation of many thousands of weapons systems platforms. Although some experiments are performed at an aggregated level of representation, other experiments require representation at the weapons system platform level. Two recent experiments, J9901 and Attack Operations 2000 (AO00), required the simulation of 20,000 to 100,000 objects using over 100 federates.
The HLA Data Distribution Management (DDM) service provides an abstract, application-driven data filtering capability. The JSAF Federation has taken advantage of this capability to provide a highly scalable, widely distributed simulation system. This paper discusses why DDM is important to scalability, how the JSAF Federation uses its filtering capabilities in a data-driven fashion, and the RTI features and network hardware capabilities that were used to build this system.
1. Introduction
Joint Experimentation is an iterative process of collecting, developing and exploring concepts to identify and recommend the best value-added solutions for changes in the doctrine, organization, training, materiel, leadership and people required to achieve significant advances in future joint operational capabilities. The US Joint Forces Command supports large-scale simulation exercises for Joint Experimentation. This kind of exercise is invaluable for conducting theatre-level training in joint operations as well as for providing experimental data to evaluate the potential effectiveness of future weapons systems and strategic approaches.
The JSAF federation, which is one of several simulation tools used at US Joint Forces Command, unites a diverse set of federates to conduct large-scale distributed simulation at the weapons system platform level. Because of the joint nature of the exercises, different participants bring different federates which have been developed to suit specific needs.
Joint experiments tend to be conducted among geographically dispersed sites, as each participant brings local resources to bear. Thus, WANs, and their associated costs and limitations, are usually incorporated into the communications infrastructure of joint experiments.
Theatre-level exercises operating at the platform level require the simulation of many entities. For example, the Joint Operations 1999–1 (J9901) exercise and the Attack Operations 2000 (AO00) exercise required the simulation of 20,000 and 100,000 platforms respectively [1]. These exercises were designed to investigate human performance in the scenario of Critical Mobile Target detection, given differing levels of computer analysis support, human-machine interfaces, communications capabilities, and attack capabilities.
1.1 Joint Experimentation and its data requirements
The exact amount of data transferred in a simulation depends on the design of the scenario used by the federation. To construct a scenario that demonstrates joint operations in an interesting fashion, the geographic area played must be quite large. A corollary is that enough vehicles must be played to populate such a large expanse of terrain, particularly when long-range radars are played. Without sufficient background clutter to hide in, a Critical Mobile Target scenario cannot be realistically simulated. Furthermore, since it is important to find individual vehicles in such a scenario, it has become a design requirement that these simulations happen at the platform level of representation. This leads to the conclusion that joint exercises must publish very large volumes of dynamic data, either in the form of changing attributes or ongoing interactions.
Based on these data requirements, as well as the fact that we run with human operators as part of the experiment, our federations have ended up with a very DIS-like design [1]. We do not use the time management services of the RTI, since we must run in real-time in order to include the human operators. We do not use the reliable messaging services of the RTI, since we have such a large quantity of data to send, and the overhead of reliable messaging is too high at that load. And most importantly, each federate must filter out as much data as possible in order to keep from becoming overloaded.
Joint exercises also require the use of many federates. While the JSAF federation has succeeded in populating the battlespace with the needed entities using a limited number of federates, many additional federates are needed to directly support the exercise participants. Even more importantly, joint exercise participants operate at a high level, and the federates they use provide a theatre-level display of the battlespace. For example, the display may represent the entire sum of situational awareness for a given force. Thus, these federates may need to know the state of all (or at least a significant fraction of) the entities in the federation execution.
This aspect of joint experiments turns out to be a major constraining factor. These display federates need to respond in a timely fashion to commands (e.g., pan and zoom) from users. It is very challenging for the federates to remain responsive at the theatre-level, where the state of thousands of entities may need to be displayed simultaneously. Thus these federates need to be spared unnecessary processing whenever possible, and at the same time be able to change their interests quickly and efficiently.
Without any filtering, simply processing all the incoming state information of a large theatre-level exercise would completely occupy a federate. Thus, it is incumbent upon distributed systems meeting the need of joint experiments to reduce the total amount of irrelevant information which must be processed by federates.
1.2 Declaration Management
One mechanism provided by the RTI, which on the surface appears promising in the context of joint experimentation, is static subscription-based filtering offered by the Declaration Management service. If a division of responsibility between federates can be reflected in the FOM, Declaration Management can be used to form disjoint publications and subscriptions. In this case, the flow of simulation data between disjoint federates could be greatly reduced or even completely suppressed.
One might think that at the theatre level, the responsibilities of the federates would be disjoint in domain (e.g., air vs. sea), in function (e.g., logistics vs. attrition) or in other dimensions. Perhaps some federates in a joint exercise would be so loosely coupled that there is little or no need to exchange simulation data. However, such an arrangement does not explore the joint qualities of the scenario in a meaningful way. Thus this solution does not fit well with the problem domain. Furthermore, in practice, the federates that most need filtering (theatre-level displays) must subscribe to the most simulation classes for potential display. While DM can be used to filter out some traffic, it does not offer an adequate solution for the system bottleneck.
1.3 Data Distribution Management
The other service offered by the RTI for relevance filtering of simulation data is the Data Distribution Management service. Because this service provides dynamic value-based filtering, rather than static hierarchy-based filtering, it is well suited for federates which may need to display any entity, but at a given time will likely only be viewing a subset of them.
Although the exact nature of DDM has changed during the development of the HLA specification, the basic capability has remained the same. DDM allows the federate to dynamically specify abstract regions of interest that are overlap-matched against out-of-band data sent along with the object attributes or interaction parameters in order to filter the simulation data presented to the federate.
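The overlap matching described above can be illustrated with a minimal sketch. It is not the RTI API: the Region class, the deliver function, the federate names, and the use of closed ranges in a two-dimensional routing space are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical axis-aligned extent: one (low, high) pair per dimension."""
    bounds: tuple

    def overlaps(self, other: "Region") -> bool:
        # Two extents overlap only if their ranges intersect in every dimension.
        # Closed ranges are an assumption of this sketch.
        return all(a_lo <= b_hi and b_lo <= a_hi
                   for (a_lo, a_hi), (b_lo, b_hi) in zip(self.bounds, other.bounds))


def deliver(update_region, subscriptions):
    """Return the federates whose subscription region overlaps the update region."""
    return [name for name, sub in subscriptions.items()
            if sub.overlaps(update_region)]


# Two hypothetical display federates watching different parts of the map.
subscriptions = {
    "display_north": Region(((0, 50), (50, 100))),
    "display_south": Region(((0, 50), (0, 40))),
}
# A vehicle update tagged with a point region at x=10, y=70.
vehicle_update = Region(((10, 10), (70, 70)))
print(deliver(vehicle_update, subscriptions))  # ['display_north']
```

Only the federate whose region overlaps the update in every dimension is handed the data. A display that pans or zooms would simply replace its Region, which is the dynamic behaviour that distinguishes DDM from class-based filtering.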
For example, in J9901 and AO00, theatre-level displays would subscribe for only those vehicles whose type and location were of interest to the user. As the user panned or zoomed the map or selected different kinds of vehicles to display from a menu (red/blue, aggregate/actual, tracks on/off, etc.), the federate dynamically updated its subscription. With many federates of this nature (fifty or more) participating in the exercise, the interests of the federation as a whole were updated quite frequently.
1.4 IP Multicast
One technology that has been used quite heavily to implement best-effort DDM services is IP multicasting [2] [3]. The basic unit of abstraction of IP multicasting is known as a multicast group, which is simply an IP address that falls within a specified range of values. When a machine wants to receive data sent to a particular multicast group, it subscribes to that group. Data is sent to a particular group by a sending machine, and all machines subscribed to that group will receive that data. On LAN technologies such as Ethernet, the network inherently supports multicasting, and different network interfaces can handle differing quantities of filtering in hardware. Additionally, wide area networks are capable of using IP multicast through the use of additional subscription and routing protocols that modern routers support, albeit with the higher latencies inherent in WANs.
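For readers unfamiliar with the mechanics, subscribing to and sending to a multicast group can be sketched with the standard sockets interface. This is a generic illustration rather than code from any RTI; the function names are hypothetical, and any group address and port passed in are the caller's choice.

```python
import ipaddress
import socket
import struct


def membership_request(group: str, interface: str = "0.0.0.0") -> bytes:
    """Build the ip_mreq structure passed to IP_ADD_MEMBERSHIP."""
    if not ipaddress.ip_address(group).is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    # ip_mreq is two packed IPv4 addresses: the group, then the local interface.
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(interface))


def open_receiver(group: str, port: int) -> socket.socket:
    """Create a UDP socket that is subscribed to one multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    return sock


def send_to_group(payload: bytes, group: str, port: int, ttl: int = 1) -> None:
    """Send one best-effort datagram to every current subscriber of the group."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP) as sock:
        # A TTL of 1 keeps the datagram on the local network segment.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(payload, (group, port))
```

A receiver would call open_receiver with a group and port of its choosing and then read from the returned socket with recvfrom; unsubscribing uses IP_DROP_MEMBERSHIP with the same request structure. Each such join or leave is one of the subscription messages that the network hardware must process.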
There are a number of different algorithms that have been used to implement DDM, some of which do a good deal of computation to reduce network bandwidth requirements. However, these computations can become a bottleneck in themselves. For example, it has been shown in [4] [5] that dynamic configuration of the connectivity grid between federates to provide optimal filtering of data is an NP-complete problem. Joint experimentation, with its large numbers of entities, federates, and even sites, is particularly vulnerable to inefficient algorithms, and therefore an algorithm without a polynomial time bound, such as the Multicast Grouping method, is not appropriate for such exercises.
Thus, the approach used to implement DDM in the RTI is critical for performance of joint federation executions as a whole. DDM is required to reduce simulation data processing, which is a known constraint, but at the same time, the complexity of the implementation of DDM must not in itself become the bottleneck.
Luckily, the RTI is free to implement heuristic approaches to DDM that reduce but do not necessarily optimize data flow. These may offer much improved performance profiles. Also, DDM can opt to filter irrelevant simulation data at many different points: at the source (if no federate is currently interested), at a router (if no remote federates require forwarding of the simulation information), at the network interface (if DDM has appropriately arranged multicast addresses for filtering at the receiver) or in the receiving RTI software itself if necessary.
Past implementations of DDM have leveraged this flexibility to provide meaningful reductions in required federate data processing at manageable computation costs. The use of DDM has enabled larger scale exercises than were previously possible with DIS-style state update broadcasts, and requisite application-level processing. The following sections of this paper describe our experiences using different implementations of DDM to successfully conduct these large joint experiments.
2. Best Effort DDM implementations using IP Multicast
The ability to send data to a dynamically controllable subset of possible receivers is a powerful tool that can be used in a number of ways to implement the DDM abstraction. However, there are a number of limitations of the current implementations of IP multicast that constrain its use. In particular, the total number of multicast groups that can be used simultaneously is fewer than 6000 in current (2001) commercial router implementations. Furthermore, the rate at which group membership can change is limited as well, with current routers being capable of fewer than 1000 subscribes or unsubscribes per second from all the client machines. In the routers and switches used in the J9901 and AO00 exercises, those limits were only 3000 groups and 300 subscription messages per second, which proved even more limiting than current technology.
Thus, the question of how to map the DDM Region and Routing Space abstractions onto multicast groups has been, and still is, the subject of a good deal of research [6] [7] [8] [9] [10] [11] [12] [13]. Furthermore, each design used to implement this mapping has its own limitations and drawbacks. There are two general categories of such mappings: statically assigned grids, and dynamic assignment based on connectivity requirements.
2.1 Statically Assigned Grids
The simplest way to map abstract regions onto multicast groups is to divide the entire routing space into an n-dimensional grid, and assign a multicast group to each n-dimensional grid square. When a message is sent, the interests of the receivers are then taken into account implicitly, because each receiver simply moves its multicast subscriptions to cover the groups whose grid squares are overlapped by its subscription regions. This is the approach taken by the STOW RTI-s prototype, as well as current versions of RTI-NG 1.3 [10] [14].
This design has the advantage of requiring very little computation to do the mapping, as well as requiring no extra communication between the senders and receivers. Therefore, its scaling characteristics are very good, which is a major advantage for joint experimentation.
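A static grid mapping of this kind can be sketched in a few lines. The sketch assumes a two-dimensional routing space; the grid extent, the cell count, the base group address, and the row-major numbering are hypothetical choices for illustration and are not taken from RTI-s or RTI-NG.

```python
import ipaddress

GRID_MIN, GRID_MAX = 0.0, 100.0  # hypothetical extent of each dimension
CELLS_PER_DIM = 10               # hypothetical grid resolution
BASE_GROUP = ipaddress.IPv4Address("239.1.0.0")  # hypothetical first group


def cell_index(value):
    """Index of the grid cell containing a coordinate, clamped to the grid."""
    width = (GRID_MAX - GRID_MIN) / CELLS_PER_DIM
    return min(CELLS_PER_DIM - 1, max(0, int((value - GRID_MIN) // width)))


def group_for_cell(ix, iy):
    """Each cell owns one multicast group, numbered in row-major order."""
    return str(BASE_GROUP + iy * CELLS_PER_DIM + ix)


def groups_for_region(x_lo, x_hi, y_lo, y_hi):
    """All groups whose cells are overlapped by a rectangular region."""
    return {group_for_cell(ix, iy)
            for ix in range(cell_index(x_lo), cell_index(x_hi) + 1)
            for iy in range(cell_index(y_lo), cell_index(y_hi) + 1)}


# A sender with a point region at (25, 35) uses exactly one group ...
send_groups = groups_for_region(25, 25, 35, 35)
# ... and a receiver interested in x 20..40, y 30..50 joins nine groups.
recv_groups = groups_for_region(20, 40, 30, 50)
print(send_groups <= recv_groups)  # True
```

The mapping is pure arithmetic on the region bounds, so neither side needs to know anything about the other's regions. The match between sender and receiver happens entirely in the network.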
However, its disadvantages can be problematic. One characteristic of a typical exercise is that the density of objects across the space is very non-uniform, with rear areas containing only logistics vehicles, and scattered battles containing a large number of platforms in a small space. Therefore, to get good coverage of regions of high density, a large number of grid cells is wasted on regions of low density, where they are much less necessary. Since multicast groups are a limited resource, this can be a critical problem. One way of working around this limitation is to provide a mechanism to specify areas of higher grid density within the grid [15]. This allows more efficient use of the limited multicast group resource, at the expense of significantly reduced dynamism in the simulation scenario.
Further, this design leads to scalability problems in the presence of non-point publication regions. In order to satisfy the receipt requirements, the data must be duplicated and sent to each group overlapping the region, which leads to obvious inefficiencies.
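The cost of non-point publication regions can be made concrete with a small calculation. The sketch below assumes a uniform grid with square cells and closed region ranges; the function names and the numbers are illustrative only.

```python
import math


def cells_overlapped(lo, hi, cell_width):
    """Number of grid cells touched by the closed range [lo, hi] in one dimension."""
    return math.floor(hi / cell_width) - math.floor(lo / cell_width) + 1


def sends_per_update(region, cell_width):
    """Copies of one update needed so that every overlapped group receives it."""
    count = 1
    for lo, hi in region:
        count *= cells_overlapped(lo, hi, cell_width)
    return count


point_region = ((25, 25), (35, 35))  # a single vehicle position
wide_region = ((5, 45), (5, 45))     # e.g. the footprint of a long-range sensor

print(sends_per_update(point_region, cell_width=10))  # 1
print(sends_per_update(wide_region, cell_width=10))   # 25
print(sends_per_update(wide_region, cell_width=5))    # 81
```

The last line shows the tension in choosing a grid density: finer cells give receivers tighter filtering, but multiply the number of copies a sender with a wide region must transmit.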
Finally, this design inherently results in federates receiving more data than they asked for. Since regions are implicitly expanded to grid boundaries when subscribing and publishing, receive-side filtering must be done to remove the extraneous data. This effect can be reduced but not eliminated by careful selection of grid densities, but that carries its own penalty of reduced dynamism.
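The expansion to grid boundaries and the resulting receive-side filtering can be sketched as follows. The cell width, the function names, and the sample updates are hypothetical, and the sketch assumes a two-dimensional space with point updates.

```python
import math

CELL = 10.0  # hypothetical grid cell width


def expand_to_grid(lo, hi):
    """Grow a range outward to the enclosing grid cell boundaries."""
    return math.floor(lo / CELL) * CELL, math.ceil(hi / CELL) * CELL


def receive_side_filter(updates, sub_x, sub_y):
    """Keep only the updates inside the region the federate actually asked for."""
    return [(x, y) for x, y in updates
            if sub_x[0] <= x <= sub_x[1] and sub_y[0] <= y <= sub_y[1]]


# The federate asks for x 12..18, y 12..18, but the grid delivers the whole cell.
sub_x, sub_y = (12, 18), (12, 18)
print(expand_to_grid(*sub_x))  # (10.0, 20.0)

# Everything published into that cell arrives and must be examined.
arrived = [(11, 11), (15, 15), (19, 13)]
print(receive_side_filter(arrived, sub_x, sub_y))  # [(15, 15)]
```

In this example two of the three updates that reached the federate were outside its requested region, which is the overhead that finer grid cells would reduce.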