Interaction within Multimodal Environments in a Collaborative Setting
Glenn A. Martin, Jason Daly and Casey Thurston
Institute for Simulation & Training, University of Central Florida
3280 Progress Dr., Orlando, FL 32826
{martin, jdaly, cthursto}@ist.ucf.edu
Abstract
Much attention has been given to the topic of multimodal environments of late. Research in haptics is thriving, and even some olfactory research is appearing. However, while there has been research in collaborative environments in the general sense, little focus has been given to the interaction present in multimodal environments within a collaborative setting. This paper presents research into a network architecture for collaborative, multimodal environments with a focus on interaction.
Within a dynamic world, three types of human interaction are possible: user-to-user, world-to-user, and user-to-world. The first two form what are fundamentally multimodal environments; the user receives feedback through various modalities both from other users and from the world. The last forms the area of dynamic environments, in which the user can affect the world itself. The architecture presented here provides a flexible mechanism for studying multimodal environments in this distributed sense. Within such an architecture, latency and synchronization are key concepts when bringing multimodal environments to a collaborative setting.
1 Introduction
Virtual environments have existed for many years. However, despite numerous distributed simulations, including training uses in many domains, there is still relatively little interaction within these virtual environments. Most allow only navigation and perhaps some minimal interaction. For example, military simulations allow basic combat-oriented interaction (shooting, throwing a grenade, etc.), but only a few of them incorporate advanced modalities (such as haptics or olfaction) or dynamic changes to the environment itself.
2 Types of Interaction
Within a distributed, interactive virtual environment there are many components to address: the user, the environment and other users. We have split the types of interaction into four categories: user-to-user, world-to-user, user-to-world and world-to-world. The first two types form what are fundamentally multimodal environments. Whether from another user or from the world, the user receives feedback through the various modalities. In the case of user-to-user it may be grabbing a shoulder or shooting an opponent; for world-to-user it may be bumping into a piece of furniture in the virtual environment.
The latter two types of interaction form what we call dynamic environments. For example, the user can act upon the world by moving an object or changing the characteristics of an object (e.g., watering a virtual plant, making the soil damper). Similarly, within a world-to-world interaction a rain model might change the soil moisture. In this paper, however, we focus on user-to-world interactions.
3 Multimodal Environments
In terms of virtual environments, a multimodal environment is one that the user experiences and interacts with using more than one sense or interaction technique. Strictly speaking, an environment that provides visual and auditory feedback is a multimodal environment; however, the more interesting environments are those that employ haptic and/or olfactory feedback in addition to visual and auditory. Gustatory (taste) feedback is also possible, but there is almost no evidence of it in the literature, so we will not discuss it here. On the interaction side, systems that take input using more than one technique can also be considered multimodal. For example, a system that provides a joystick for locomotion as well as a speech recognition system for accepting commands to alter the environment can be called a multimodal environment. In this section, we focus on the senses and how the user experiences a multimodal environment, as well as a software architecture that supports the modeling and design of multimodal environments in a seamless way. Though it is difficult to develop a clear metric to determine whether adding more senses contributes to a greater sense of presence or immersion, studies have suggested that it does (Meehan, Insko, Whitton & Brooks, 2002). In addition, Stanney et al. (2004) have shown that distributing information across multiple sensory channels allows the user to process more data than presenting the same amount of information over a single channel.
3.1 Visual
Apart from a few specialized applications, a virtual environment always provides visual feedback. The visual sense is essential for navigation and detailed exploration of the environment. While other senses may introduce an event, entity, or object to investigate, it is ultimately the visual sense that will be used to analyze and deal with the object of interest. For this reason, an environment designed to be multimodal cannot ignore or diminish the importance of vision, and an appropriate amount of resources should be spent on creating high-fidelity visuals. Visual feedback is essential to all types of collaborative interaction. Signals, gestures, and other visual interactions allow user-user collaboration. The primary means for the user to experience the world is through visual exploration. Finally, the ability for the user to visually perceive the world is a prerequisite for user-world interaction.
3.2 Auditory
Most modern environments also provide some level of auditory feedback. It may be as simple as a beep or click when a certain event happens, or it may involve realistic, spatialized sounds coordinated with virtual objects and enhanced with environmental effects. The auditory sense is a very useful and natural channel for informing the user of events that occur outside his or her field of view. Simple speech is probably the most natural user-user interaction in a collaborative environment. With a high-fidelity acoustic rendering, the user may also be able to identify and keep track of one or more sound sources without relying on vision (some examples are a bird in a tree, a splashing stream, or gunshots from a sniper positioned in a second-story window). Also, environmental effects such as reflection and reverberation can give clues to the size and composition of the immediate environment or room. These are two examples of world-user interactions using the auditory channel.
3.3 Haptic
Haptics, or the sense of touch, has more recently been introduced into virtual environments. The term “haptic feedback” covers many different types of feedback. One is tactile feedback, which consists of physical stimuli to alert the user’s sense of touch to an event or environmental feature. Tactile feedback can take the form of vibrators for general tactile events, thermal heating pads for providing thermal feedback, or electromechanical tactile displays for allowing the user to feel the physical characteristics of a surface. The other form of haptics is force feedback. This type of feedback physically exerts force on the user’s body or resists his or her movements. Force feedback allows the user to grasp virtual objects and feel their structure or to explore a deformable virtual surface in high detail with a stylus. Haptics opens up a whole new range of user-user interactions. Using a force-feedback device, one user can grasp or tap another user’s shoulder, which the other user may perceive via vibrotactile feedback. This kind of non-verbal communication may be useful for soldier training simulations. Similarly, a wide range of world-user interactions open up with haptics. Haptics may allow a user to feel low-frequency vibrations from explosions, detect heat coming from a fire behind a closed door, or feel the bones underneath the skin in a medical simulation. Highly realistic user-world interactions can also be facilitated by haptic devices.
3.4 Olfactory
Olfaction, or the sense of smell, is just beginning to appear in multimodal environments. The psychological effects of certain scents are well documented in the literature, so one could infer that these effects, as well as others, may be useful for creating more realistic environments or for producing certain desired responses. Technologically, olfaction is somewhat restricted in that any scents to be presented must be packaged and delivered using some form of technology. While these technologies have progressed somewhat in recent years, they can still only deliver a few distinct scents each. It would be desirable to be able to use a small set of basic scents that could combine to create a wide variety of compound scents; however, no conclusive work has been done to accomplish this in a general sense. Another issue is that scents presented to the user must be able to dissipate quickly or otherwise be removed from the environment. If scents are allowed to linger and combine, they could lead to simulator sickness issues, especially if the scents used are relatively strong. In terms of interaction, olfaction can be a particularly powerful world-user interaction method if the scent used has a natural association with the virtual scent source.
3.5 Software Architecture
Each modality requires a certain software technique to efficiently model the world and present it to the user. The visual sense is well-studied and several techniques are available to handle visual modeling and rendering. The scene graph is one such technique and works well in a general-purpose software system. The auditory sense has also been dealt with in detail, and several different audio libraries have emerged as a result. All of them tend to model the world using a single listener and many sound sources, providing essential features such as spatialization and attenuation by distance. Depending on the library, other features such as reflections, reverberation, and occlusion may be available as well. Haptics is not nearly as well-developed as the previous two modalities, and the task of characterizing haptic interactions and providing a generic software platform to support these interactions is quite challenging. Fundamental to haptic interactions is collision detection, so a large part of a haptic system would require surface modeling (similar to visual modeling) and intersection tests between surfaces. Furthermore, because haptic interactions require very high frame rates (around 1000 Hz), these intersection tests and collision responses must be handled very efficiently. Finally, olfaction is very new to virtual environments; however, the techniques required are very similar to those used for the auditory sense. The same source/detector paradigm works well for sources of scent in an unobstructed environment. When the environment is divided into different rooms, occlusion of scent sources may be needed (depending on the “ventilation” properties of the virtual environment).
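As a concrete illustration of the source/detector paradigm, the following C++ sketch (all names are hypothetical and not taken from any particular library) models a scent source and a detector attached to the user, with delivery strength attenuated linearly by distance, much as the audio libraries attenuate sound sources:

#include <cmath>

struct Vec3 { double x, y, z; };

static double distance(const Vec3& a, const Vec3& b)
{
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// A scent source has a position, an emission strength and a range beyond
// which it cannot be detected (analogous to an attenuated sound source).
struct ScentSource
{
    Vec3   position;
    double strength;   // 0..1 emission level at the source
    double maxRange;   // meters
};

// A scent detector plays the role of the audio listener; it is attached
// to the user's avatar.
struct ScentDetector
{
    Vec3 position;

    // Delivery strength (0..1) for one source; the scent hardware driver
    // would scale its output by this value each frame.
    double sample(const ScentSource& src) const
    {
        double d = distance(position, src.position);
        if (d >= src.maxRange)
            return 0.0;
        return src.strength * (1.0 - d / src.maxRange);   // linear falloff
    }
};

Occlusion between rooms could then be added by further scaling the sampled value by a per-room “ventilation” factor, as noted above.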
To deal with multiple modalities at once, it helps to have a software architecture that models the world as a seamless whole, but contains all the necessary features to produce the appropriate stimuli over the various modalities. For example, a virtual object is typically visible, but it may also produce sound or a scent, and it may be possible to grasp it or throw it at another user. In a system that models the world separately for each modality, this would require two distinct models for the visual and haptic senses, plus coordination of the motion of the visual and haptic models, as well as the positions of the object’s sound source and scent source. With a unified architecture, the virtual object can be represented as a single object, containing attributes that enable it to be rendered over the various modalities. A geometric representation can be used for both the visual and haptic senses (although the haptic model may need to be simplified for fast haptic rendering). All the parameters of a sound source can be encapsulated into another attribute, while the scent parameters for the scent source can be carried by a third attribute. On the user’s side, the structure is similar. Attributes encapsulating the user’s viewpoint, sound listener, and scent detector are attached to the virtual object which represents the user (the user’s avatar). See Figure 1.
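A minimal sketch of this unified representation might look as follows (the class names are illustrative only, not those of an existing toolkit): a single scene component carries its geometry plus whatever per-modality attributes the object needs, so that one transform drives every modality.

#include <memory>
#include <string>
#include <vector>

struct Transform { double position[3]; double orientation[4]; };

// Base class for anything that can be attached to a scene component.
// Each attribute refreshes itself from its component's latest transform.
struct Attribute
{
    virtual ~Attribute() = default;
    virtual void update(const Transform& worldTransform) {}
};

// Per-modality attributes for a virtual object...
struct SoundSourceAttribute    : Attribute { std::string clip;  double gain; };
struct ScentSourceAttribute    : Attribute { std::string scent; double strength; };
struct HapticGeometryAttribute : Attribute { std::string simplifiedMesh; };

// ...and for the user's avatar.
struct ViewpointAttribute     : Attribute {};
struct ListenerAttribute      : Attribute {};
struct ScentDetectorAttribute : Attribute {};

// A scene-graph component: visual geometry (shared with haptics where
// appropriate) plus any multimodal attributes and child components.
struct Component
{
    Transform transform;
    std::string visualGeometry;
    std::vector<std::unique_ptr<Attribute>> attributes;
    std::vector<std::unique_ptr<Component>> children;
};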
This structure is essentially an augmented scene graph. Besides the normal scene graph hierarchy of scene, groups, and geometry, there are various attributes that can be attached to the components of the scene graph. Certain components in the scene form the root of subgraphs that represent virtual objects of interest. To give a virtual object multimodal features, we need only create and attach the multimodal attributes that we need. After the scene graph is manipulated in preparation for drawing a new visual frame, the various multimodal attributes are updated based on the latest position, orientation, and other relevant state of the component to which they are attached. Thus, the visual rendering is handled just as it would be in a visual-only scene graph system. Audio rendering is updated based on the new global positions and orientations of the virtual object and listener, and environmental effects can be set based on attributes located elsewhere in the scene. Scent generation is handled in a similar manner, with the strength of scent deployment scaled based on the distance between the user and the various scent sources.
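Continuing the hypothetical types from the previous sketch, this per-frame update can be expressed as a simple traversal that runs after the visual update and pushes each component’s new state into its attached attributes (for brevity, transforms are treated here as already being in world space):

// Called once per frame, after the scene graph has been updated and the
// visual frame prepared; recursively refreshes every multimodal attribute
// (sound source positions, scent deployment strength, etc.).
void updateMultimodalAttributes(Component& node)
{
    for (auto& attr : node.attributes)
        attr->update(node.transform);

    for (auto& child : node.children)
        updateMultimodalAttributes(*child);
}

The audio and scent back ends would then commit the new state, applying environmental effects and distance-based scaling respectively.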
Haptics is the only modality that does not directly benefit from this structure. As mentioned previously, haptic interactions are rather difficult to generalize into a supporting software architecture. We are currently progressing toward this goal, however. Haptic interactions can be split into three phases: the event that causes a haptic interaction, the mapping from event to hardware interface, and the activation of the specific hardware to render the event. The event comes from a taxonomy of possible haptic events, and the rendering depends on the various technologies available. Any architecture supporting haptic interfaces must be extensible enough to allow additions to both the kinds of events that can occur (for new interactions not previously considered) and the software interface to the technologies available (for new devices created later). The most challenging aspect of haptic rendering is the mapping from event to hardware. Different mappings must be used for different types of events. For example, a positional tactile event occurs when a virtual object touches the user’s body. This can result from a wide variety of events, such as a simple collision (walking into a stationary virtual object), non-verbal communication between users (tapping your friend on the shoulder), or checking a downed soldier’s pulse and body temperature. In contrast, there are directional tactile events, such as feeling the blast of a nearby grenade explosion, or feeling the heat emanating from a campfire. Both of these event types could be rendered by some kind of tactile device (a vibrator or pneumatic actuator) or a thermal device in the case of heat effects. An entirely different kind of event occurs when you consider kinesthetic interactions, such as grasping a virtual object with a force-feedback glove or simulating the inertia of manipulating a heavy virtual object with a force-feedback joystick. However, the system must still map these interaction events to hardware feedback.
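The three phases could be sketched roughly as follows (again, all types are hypothetical): events are drawn from a small taxonomy, a mapper routes each event to the devices able to render it, and concrete device drivers perform the actual rendering.

#include <vector>

// Phase 1: events drawn from a taxonomy of haptic interactions.
enum class HapticEventType
{
    PositionalTactile,   // contact at a body location (collision, shoulder tap)
    DirectionalTactile,  // blast, heat or vibration arriving from a direction
    Kinesthetic          // grasping or inertia rendered via force feedback
};

struct HapticEvent
{
    HapticEventType type;
    double location[3];   // body-relative contact point or direction vector
    double magnitude;     // normalized intensity, 0..1
};

// Phase 3: abstract device interface; concrete drivers wrap real hardware
// and can be added later for new devices.
struct HapticDevice
{
    virtual ~HapticDevice() = default;
    virtual bool supports(HapticEventType type) const = 0;
    virtual void render(const HapticEvent& e) = 0;
};

// Phase 2: the mapper routes each event to every device able to render it.
struct HapticMapper
{
    std::vector<HapticDevice*> devices;

    void dispatch(const HapticEvent& e)
    {
        for (HapticDevice* d : devices)
            if (d->supports(e.type))
                d->render(e);
    }
};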
While the mappings from event to device will need to vary by the type of event and the specific effect that is desired, it is possible to envision different classes of hardware that may be able to use the same mapping and deliver very similar effects. The differences in these classes come only from the capabilities and features of the hardware. Vibrotactile devices are a good example: some can vary frequency, some can vary amplitude, and some can only be turned on and off. The same mapping for a shoulder tap event, for example, can be used for all of these devices. The desired effect for a vibratory device would probably be a low-gain pulse at a frequency that makes the effect as mild as possible. Some devices could reproduce this with high fidelity, while others could only manage to turn the device on and off a few times. Nevertheless, the mapping from event to device would be the same. This is one area that could be handled by a general-purpose haptic software infrastructure, and this is part of the work we are carrying out at this time.
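Building on the previous sketch, a hypothetical vibrotactile driver class could reuse the same shoulder-tap mapping across hardware with different capabilities, degrading the effect to whatever the device supports:

struct VibrotactileCaps
{
    bool canVaryFrequency;
    bool canVaryAmplitude;
};

struct VibrotactileDevice : HapticDevice
{
    VibrotactileCaps caps;

    bool supports(HapticEventType t) const override
    {
        return t == HapticEventType::PositionalTactile ||
               t == HapticEventType::DirectionalTactile;
    }

    void render(const HapticEvent& e) override
    {
        // Desired effect for a tap: a brief, low-gain pulse. Devices that
        // cannot vary amplitude or frequency fall back to fixed values or
        // plain on/off pulsing; the event-to-device mapping is unchanged.
        double amplitude = caps.canVaryAmplitude ? 0.2 * e.magnitude : 1.0;
        double frequency = caps.canVaryFrequency ? 80.0 : 0.0;  // 0 = fixed
        pulse(amplitude, frequency);
    }

    void pulse(double amplitude, double frequencyHz)
    {
        // Drive the actuator; an on/off-only device would ignore both
        // parameters and simply toggle a few times.
    }
};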
This vision of a haptic software architecture fits in with the multimodal architecture detailed above in a loosely-coupled manner. The simulation would need to detect haptic interactions between users, or between the world and the user, characterize these events, and send them to a separate haptic mapping and rendering module. This is similar to the manner in which the sound listener and sources are updated (albeit more complex). The listener and/or the sound sources move, and the new locations must be sent to the sound hardware to keep the sound feedback consistent with the system’s model of the world. Similarly, if the user interacts physically with the world or another user, the interactions must be rendered haptically for the feedback to be consistent with the model of the world.
4 Dynamic Environments
In contrast to simple static environments (including the multimodal environments described in the previous section), dynamic environments allow the user to interact with them and alter their composition and characteristics. This is the essence of user-world interactions. A system that allows the user to interact with the environment is much more engaging than one that never changes, no matter what the user does. In this section, we present the various types of dynamic user-world interactions and describe some of the available software that supports them.