THE GenSim PROJECT: A proposal for a highly interactive, general-purpose, Total Sensory Emursion simulation environment.

I. GenSim Project Goals

The goal of the GenSim project is the construction of a total sensory emursion (TSE) artificial reality simulator. The entire sensory field of the user (encompassing sight, sound, touch, smell, pain, taste, balance, hot/cold, g-force, ESP, etc.) would be controlled and totally programmable by the simulator. In this way any arbitrary real or imaginary environment could be experienced by the user. In the GenSim environment the user interface is totally naturalistic and general. Inputs to the simulator consist of bodily movements and forces (ranging from walking motions to eye movements to pressure exerted on an object), user vocalizations and other user sounds, and any other inputs considered relevant, such as brain waves, skin temperature or resistance, pulse rate, etc. GenSim allows the user to interact with the simulator just as he interacts with real life. For instance, consider a simulation of the Amazon jungle. The user could walk around the environment just as he would if he were actually there. The simulation controller constantly scans the user for motion and other inputs. As the user turns his head, the visual display is updated for the new viewing angle and position. The sweet perfume of a wild orchid fills the humid atmosphere. Suddenly the roar of a jaguar, Doppler-shifted by rapid relative motion, is introduced into the audio field at 9 o'clock. As the user hears the ripping of the cat's powerful jaws, the pain transmitters are activated...

The ultimate goal of the GenSim simulation designer would be to satisfy the Jarvis simulation criterion. You may recall the Turing Test for machine intelligence. The Turing Test was a criterion for the evaluation of computer simulation of human intellectual behaviour.
If a man conversing with a Teletype could not distinguish whether he was talking to a computer program or a human, then such a computer could be said to be "intelligent"; that is, to produce an accurate simulation of the human intellect. The Jarvis Criterion generalizes this concept to any simulation environment: a simulation passes the Jarvis test if, from the user's perspective, it is indistinguishable from the reality it purports to represent. In an updated version of a successful Turing Test which would satisfy the expanded Jarvis Criterion, the man conversing with the Teletype would not only be unable to tell whether he was talking to a human or a computer, but would also be unable to decide whether the Teletype, the room, or in fact he himself existed in the form observed!

The GenSim system relies on Total Sensory Emursion (TSE) for its user output displays. In its purest form there is no hardware cockpit, control panel, view window, etc. Since we control the user's sensory field, we simulate in software the cockpit, control room, jungle, or whatever scenery is required. In this way GenSim is totally software-reconfigurable: one second the user is in an F-15, and the next instant he is transported to a medieval dungeon. This is technically feasible today for audio-visual stimuli; tactile simulation is very difficult with today's technology. Perhaps GenSim simulators of the future will have programmable force fields to provide true tactile simulation, or perhaps they will utilize Direct Brain Stimulus (DBS) for all user outputs, eliminating external sensory stimulation hardware entirely and, in essence, creating a direct mind-machine link. The applications of GenSim encompass a broad spectrum of activities, ranging from the recreational (games of all categories, from sports to war simulation), to the educational (from participatory historical studies, to lecture-hall simulation, to walking around an atom or perhaps the Milky Way), to job training (from combat pilots to bus drivers).
A wide range of experiential applications can also be envisioned, ranging from lunar tourism to religious ritual.

II. Implementations

There are two basic approaches to GenSim implementation: internal or external sensory-field stimulus/response. Internal systems would communicate directly with the central nervous system (brain) for both inputs and outputs, while external systems would rely on stimulating the body's external sensory receptors and externally monitoring bodily outputs.

A. Internal I/O

Direct Brain Stimulus/Output (DBS/DBO) holds the greatest promise for ultimate simplicity and generalization to all sensory inputs and outputs. Such a direct man-machine link offers the ultimate in fidelity and accuracy, obviating the need for bulky external stimulation apparatus. Of course, such techniques also appear to be relatively far off on the technological horizon. In a typical DBS system, normal input from the sense organs would be neutralized and an artificial signal injected at the proper site in the nervous system. DBO output systems would interface with and decode central nervous system outputs, eliminating the need for transducer suits, audio pickups, etc. For human safety and confinement purposes it may be necessary to decouple these outputs from the bodily functions they control. Although an extremely promising and versatile technique, the state of the art in DBS/DBO is rather primitive. Current DBS research has focused on the hearing- or visually-impaired: hearing has been partially restored by electrical stimulation of the inner ear, and similar work with the optic nerve has been demonstrated. Major areas of needed research involve signal introduction techniques and human central nervous system signaling protocols. Currently, surgically implanted electrodes are used for signal introduction, a drawback that needs to be corrected.
In the DBO area, work has primarily been in prosthetic control for the handicapped and in medical monitoring systems (brain waves, heart rates, etc.). Much research needs to be done in the DBS/DBO areas before they can be considered practical. But who knows: in some future GenSim sports-game grid, the toughest players may be quadriplegics using DBS/DBO I/O.

B. External I/O

External sensory stimulators and inputs have been well developed over the years, especially in the audio-visual area. The biggest problem here is not in the actual interface, but in processing the input signals and generating simulation outputs in real time (more on this later). Audio output would be well served by a Walkman-type stereo headphone system. Psychoacoustic research could decide whether this treatment of each ear as a point receiver is sufficient, or whether more channels are needed. Microphones mounted on the user's body would provide adequate audio input. Like a video equivalent of the Walkman, the external video stimulator is a headset mounted in front of the eyes, encompassing the user's entire visual field. The ideal external video stimulator would be an electronically programmable holographic generator. Such a device could provide true, full-color, binocular perspective with all depth-cue (focal) information. A more primitive but realistic approach would involve twin color displays (CRT or other technology) as a stereo visual output device. Such a display would be fixed-focus, resulting in a loss of depth information. However, if by scanning the user's eyes we could determine his focal plane, then by utilizing a variable-focus optical system we could alter the focal plane of the display to match the user's. The system would then blur images in the display that are out of the focal plane, in effect computing the perceived focal change. Other possible ways of retaining focal information could include systems with electronically controlled mirrors or lenses.
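As a back-of-envelope illustration of the eye-scanning idea, the user's focal plane could be estimated from the vergence of the two eyes. The sketch below is hypothetical and not part of the proposal: the interpupillary distance, the symmetric-fixation geometry, and the function name are all illustrative assumptions.

```python
import math

def focal_distance(ipd_m, left_gaze_deg, right_gaze_deg):
    """Estimate the user's focal-plane distance from eye vergence.

    ipd_m: interpupillary distance in meters (about 0.063 is typical).
    left_gaze_deg / right_gaze_deg: inward rotation of each eye from
    straight ahead, in degrees (symmetric fixation assumed).
    """
    # Total vergence angle between the two lines of sight.
    vergence = math.radians(left_gaze_deg + right_gaze_deg)
    if vergence <= 0:
        return float("inf")  # eyes parallel: focused at infinity
    # Isosceles-triangle geometry: half the IPD over the tangent of
    # half the vergence angle gives the fixation distance.
    return (ipd_m / 2) / math.tan(vergence / 2)
```

With a 63 mm interpupillary distance and each eye rotated 1.8 degrees inward, the estimated focal plane comes out at roughly one meter; the variable-focus optics would then be driven to that distance.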
Visual input to the system could easily be accomplished through a television camera; here again, however, the problem is not the interface but the signal processing. Other I/O areas such as smell, taste, touch, and balance all represent unique challenges to the fledgling GenSim designer. Successful external I/O in these areas requires such far-out concepts as force-field generators, or perhaps some sort of programmable space suit. Anybody know a good universal odor-generation algorithm?

C. 1984 Implementation

The state of the art in 1984 only allows us to realistically implement a subset of the GenSim concept, primarily focusing on those areas which are technologically advanced (cheap and dirty)... namely audio and video. We leave as an exercise to the future the implementation of such things as general-purpose olfactory stimulators and programmable tactile force fields. Simulator-to-human output will consist of stereo audio and stereo video channels in a user headset with a wireless transmission link. Video is provided by dual color CRTs on a fixed focal plane providing a stereo image. The audio field is generated with a speaker unit for each ear. Human-to-simulator inputs recognized are audio (by way of a headset microphone) and bodily motion (scanned by means of a keypoint detection system). Keypoint sensors are attached to bodily points of interest (head, hands, feet, joints, etc.), and a 3-D triangulation system locates these points in 3-space. Input data is transmitted through a wireless link to the simulation controller. Audio input would be processed by a standard limited-vocabulary, isolated-word speech recognition system. The simplest audio output processing scheme would involve treating all audio sources in the environment as point sources. Intensity would diminish with the square of the distance from the source, and also with ear angle away from the perpendicular.
A transmission coefficient would be associated with each ear angle in relation to the source. In this way amplitude could be calculated for all source-receiver relationships, and computation would be quite minimal. Of course, such a simplistic model would only be truly accurate in an area of empty space, and could not simulate environmental acoustical effects such as surface absorption, reflection, diffusion, etc. Truly massive computer resources (in 1984 terms) would be required to rigorously simulate such effects using ray tracing or other algorithms in real time. The Jarvis criterion does not require us to be rigorously correct, however; all that is required is that the average user be fooled. Surely some clever approximations will be formulated to allow us to simulate virtually all common audio-environmental interactions to a degree sufficient to baffle all but the most perceptive eggheads.

If rigorous sound output processing approaches an infinitely compute-bound situation, video output processing approaches infinity times infinity in complexity. Current real-time graphics systems (as found in flight simulation, video games, and the like) are really only capable of processing a few thousand to tens of thousands of surface details (typically represented by a point or polygon vertex) in real time at acceptable frame rates (>= 30 Hz). Accurately representing just the bark on a small tree trunk would be beyond the range of today's systems. In our 1984 version of GenSim we would utilize two currently available polygon-type systems at 512 x 512 resolution with full anti-aliasing. Each system would be dedicated to an eye to form a stereo image. User movement and speech input would provide the data defining the viewpoint location and angle.
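The point-source audio model described above can be sketched in a few lines. The transmission-coefficient table here is an illustrative placeholder, not measured psychoacoustic data, and the function name is hypothetical.

```python
import math

# Hypothetical transmission coefficients, indexed by the source's angle
# off the ear's perpendicular axis (0 deg = directly on-axis). Real
# values would come from psychoacoustic measurement.
TRANSMISSION = {0: 1.00, 45: 0.85, 90: 0.60, 135: 0.40, 180: 0.30}

def ear_amplitude(source_power, distance_m, ear_angle_deg):
    """Signal level heard at one ear from a point source.

    Intensity falls off with the square of the distance (spherical
    spreading); the result is then scaled by the transmission
    coefficient for the nearest tabulated ear angle.
    """
    intensity = source_power / (4 * math.pi * distance_m ** 2)
    nearest = min(TRANSMISSION, key=lambda a: abs(a - ear_angle_deg))
    return intensity * TRANSMISSION[nearest]
```

As the model predicts, doubling the distance cuts the received intensity to a quarter, and a source at 90 degrees off-axis is attenuated relative to one dead ahead; per-ear differences like these give the listener his directional cues.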
A magnetic-disk database could provide an extensive simulator domain; however, any given viewpoint would exhibit the fairly sparse, fuzzy, blocks-world look common to current simulators, due to limited detail-processing capabilities. The simulator controller coordinates all the various I/O and communications subsystems at a fairly high level, relying on the video, audio, motion-scanner, and speech-input processors to do the real dirty work.

The simulation environment would ideally be a completely empty, programmable 3-space, so there would be no fixed hardware reality getting in our way. Lacking such an environment on Terra Firma, a useful but more limiting substitute would be an infinite 2-D plane, such as that approximated by a large parking lot or a large, flat indoor room such as "The Fog" in the film THX-1138. This would allow simulatees to explore vast realms, though with some risk of collision between users. Each simulatee would don his headset and be assigned a transmission channel and simulation controller. The user would have complete freedom of motion within the domain and could run, jump, kick, sit down, go to sleep, whatever. The biggest disadvantages of such a setup would be the lack of terrain variance, the risk of simulatee collision, and the need for a very large space.

An alternative to the infinite-plane approach is some sort of confinement scheme. One fairly obvious idea is to create a two-dimensional treadmill. The user would walk on an x-y scrolling mat of, say, 4 x 4 meters. The mat angle would be mechanically pitched to simulate various terrain slopes. A safety cage would confine the user to prevent falls from the platform. The treadmill mat would scroll as the user walked, keeping him roughly centered on the platform. An alternative proposal would place the user inside a large sphere which rotates freely in all dimensions with programmable resistance.
The user moves by rolling the ball from the inside, not unlike a hamster running in a wheel. The differing spinning resistances emulate varying terrains.

D. A Future Video Implementation

To truly satisfy the Jarvis criterion for the area of video output, we need an output system that can match the resolving capabilities of the human eye (on the order of 10^3 to 10^4 lines). Such a visual field could contain at minimum 1,000,000 and certainly no more than 100,000,000 elements of detail. Of course, we may have to process many times this number of elements in producing the final image (allowing for obscured surfaces, averaging of distant points, etc.). Through clever clipping and deresolution algorithms for distant imagery, we could limit our data to no more than a billion elements per frame for a super-high-resolution image, down to less than 10,000,000 for lower-resolution systems. For the sake of discussion, let us assume we could achieve the lower limit of 10,000,000 active elements of screen detail, to be processed at a 30 Hz rate. If we ignore dynamic lighting effects (changing light sources, moving objects casting shadows, reflections, refractions, etc.), we would minimally need the equivalent of a 4x4 matrix computation for each data element each frame. This corresponds to an arithmetic computational load of 4.8 billion multiplies/sec., or about 210 picoseconds per operation. This is approximately 100 times the speed of today's mainframes, and several thousand times the speed of the fastest microprocessors. Of course, such speeds may well be within the range of 21st-century mainframes. RAM requirements for a 10,000,000-element buffer with 32-bit X, Y, Z, and color precision would be 16 bytes/element, for a total of 160 Mbytes, a relatively modest requirement. If we were to provide an environmental resolution of 50 lines per cm, this works out to up to 25,000,000 picture elements per square meter.
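The back-of-envelope figures above can be checked with a few lines of arithmetic (the 16 multiplies per element assume one 4x4 matrix applied to a 4-vector, as the text suggests):

```python
# Sanity-check the computational and storage estimates above.
elements = 10_000_000        # active elements of screen detail per frame
frame_rate = 30              # frames per second (Hz)
multiplies_per_element = 16  # one 4x4 matrix times a 4-vector

multiplies_per_sec = elements * frame_rate * multiplies_per_element
print(multiplies_per_sec)             # 4800000000, i.e. 4.8 billion/sec
print(1 / multiplies_per_sec * 1e12)  # ~208 picoseconds per multiply

bytes_per_element = 16  # 32-bit X, Y, Z, and color = 4 x 4 bytes
print(elements * bytes_per_element / 1e6)  # 160.0 (Mbytes of buffer RAM)

lines_per_cm = 50
per_square_meter = (lines_per_cm * 100) ** 2
print(per_square_meter)  # 25000000 picture elements per square meter
```

The numbers all agree with the estimates in the text, with the per-operation time rounding to the stated 210 picoseconds.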
So for an environment the size of a football field (10,000 square meters of gross surface area) we would need a database of 4 terabytes (4,000 gigabytes). Utilizing optical disk technology, such a database is feasible today (400 ten-gigabyte drives), although quite expensive. Such a system will quite obviously involve massive data bandwidth: a user merely rotating 360 degrees may require access to a large part of the 4 terabytes, and data rates greater than 1 terabyte/sec. may need to be handled. One way to greatly reduce the data flow would be to store precomputed "deresed" (reduced-resolution) zones. For distant areas we would access progressively further deresed data, vastly reducing our data rate and computing requirements, at the cost of somewhat increased data storage. A logical way to deal with such a compute-bound, data-rate-bound system is to inject a large amount of parallelism into the design. A matrix of processors, each linked to a disk drive within a matrix of drives, could be implemented. Each processor would be dynamically allocated by the master video subsystem controller to a certain volume of space in the present view. Spaces close to the observer would tend to have many more processors per unit volume (because of the fully resed data to handle) than distant deresed areas. As the user moves through the space, processors whose volume is no longer in view would be reallocated to areas coming into view, or to areas being upresed. The reallocated processors would then access their respective disks for new data. The processors create a common video image by communicating final point-color data to a common z-buffer. The z-buffer arbitrates obscurations, translucencies, transparencies, inter-zonal averaging, and antialiasing (if necessary). A complete subsystem consisting of a processing matrix and z-buffer is needed to form each eye's image.

Another problem we have to address is database acquisition. Where are we going to get all this data?
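The z-buffer's obscuration arbitration is the conventional hidden-surface algorithm, and its core can be sketched briefly. This is a minimal illustration handling opaque points only, leaving aside the translucency, averaging, and antialiasing cases mentioned above; the function name and data layout are illustrative.

```python
def zbuffer_composite(width, height, points):
    """Merge point-color data from many processors into one image.

    points: iterable of (x, y, z, color) tuples, where smaller z is
    closer to the viewer. Each pixel keeps the color of the nearest
    point seen so far, which is exactly how the z-buffer arbitrates
    obscuration regardless of the order points arrive in.
    """
    INF = float("inf")
    depth = [[INF] * width for _ in range(height)]
    image = [[None] * width for _ in range(height)]
    for x, y, z, color in points:
        if 0 <= x < width and 0 <= y < height and z < depth[y][x]:
            depth[y][x] = z
            image[y][x] = color
    return image
```

Because each point carries its own depth, the processor matrix can stream its results to the buffer in any order and the nearest surface still wins, which is what makes the scheme attractive for the parallel design described above.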
Will we require 20,000 data-age slaves to construct such an electronic pyramid? Obviously some form of automated, or human-assisted automated, approach must be formulated before we can don our GenSim headsets. One approach that would be quite automated is something I refer to as a Reality Scanner. This device scans an actual object from a sufficient number of viewpoints and then constructs a 3-D database of its surface points (x, y, z, and color properties), not unlike the image processing performed by the human eyes. Technologies could involve ultrasound, radar, laser, visible light, etc. Such scanners could provide a wide range of natural, and also supernatural, data through data manipulation and distortion by computers. Another approach to data acquisition would be computer generation of scenery. Geometric and other basic mathematical environments could be readily formed, as could the more irregular and surrealistic environs generated through pseudo-random algorithms, such as the well-known fractal scenes. This brings us back to our first idea, the human approach. Of course, hand-entering 250 billion pixels is quite out of the question. By providing an artist's tool-kit approach, the artist will be able to manipulate and control both computer-generated data and Reality Scanner input, allowing the creation of made-to-order worlds, both imaginary and "real".

III. Conclusions

The GenSim project represents a challenge to those bold and imaginative enough to advance simulation technology from its present armchair, view-window state. The Total Sensory Emursion environment provides for a totally software-programmable simulation, allowing complete reconfigurability at electronic speeds. Naturalistic input/output allows for high levels of user interaction without specialized training. With continuing cost reductions in digital computers and information storage, it is only a matter of time before GenSim subset systems are implemented.
The only question remaining is... are you ready to put on your headset?