Since the advent of modern computer-aided design software, engineers have been divorced from the highly collaborative environment they previously enjoyed. Today's complex designs require modern software tools, and the realities of a global economy often constrain engineers to remote collaboration. These conditions make it highly impractical to collaborate locally around physical models. Various approaches to creating new collaboration tools and software that alleviate these issues have been tried previously. However, past solutions either relied on expensive hardware that is not widely available or used standard two-dimensional (2D) monitors to share three-dimensional (3D) information. Recently, new low-cost virtual reality (VR) hardware has been introduced that creates a highly immersive 3D experience at a small fraction of the cost of previous hardware. This work demonstrates an immersive collaborative environment, built on a network of this hardware, that allows users to interact virtually with gestures, and presents a study showing its advantages over traditional video conferencing software.
Introduction
As discussed by many researchers, communication during the engineering design process can be one of the biggest challenges faced by a design team [1]. This problem is further compounded by the realities of a global economy, which may require the design team to work with supplier team members who are not always physically present [2]. These distributed design teams rely heavily on a variety of modern communication tools to create a cohesive, workable design; however, the collaborative environment that currently exists is much more limited than that of the 1960s and 1970s, when designers could hold an impromptu meeting around a physical design artifact, such as a drawing or scale model, to share a new idea or address a problem. Today's design challenges have also become more complex, requiring more sophisticated tools to solve, and producing more complicated three-dimensional (3D) designs [1]. However, these distributed design teams are often limited to communicating with remote members via modern video conferencing software such as Skype or WebEx. At their best, these systems typically use a shared two-dimensional (2D) view of one participant's screen or application and a single mouse pointer for all the participants to share.
Virtual reality (VR) has been investigated as a solution to various engineering design problems as early as 1993 [3], and even to improve engineering communication and collaboration as early as 2000 [4]. However, these early systems utilized technology that is expensive and incurs high overhead costs to run, making it prohibitively expensive for companies to provide their design teams with general access [5]. Recently, new consumer-grade VR hardware has been produced with much lower costs and fewer barriers to entry [6]. This new hardware could pave the way for more general access to VR technology and hence VR tools for design teams. This work presents a VR collaboration tool built with new consumer-grade VR hardware as well as a study comparing the performance of the new VR tool against current video conferencing software. This paper will proceed as follows: Section 2 will review related research, Sec. 3 will provide an overview of system design and implementation, Sec. 4 will explain the methodology used in the comparative study, and Sec. 5 will discuss the results and implications of the study.
Motivation
The literature from the past 25 years records an array of investigations on using VR to improve various product and process design activities. The areas of virtual manufacturing, virtual assembly, and virtual disassembly have received some of the heaviest focus [7–10]; however, the use of VR as a communication tool has also long been a topic of interest and investigation [11]. This is likely due to VR's advanced visualization capabilities.
An important underpinning of effective communication and collaboration is creating a shared understanding of the problem. Chellali et al. refer to this as a common frame of reference (COFOR) [12]. Depending on the issues under discussion, language can sometimes be sufficient for establishing the COFOR. Sometimes, vocabulary must grow to provide finer distinctions. One example of this is that languages that developed in arctic areas have many more words for distinguishing snow and ice characteristics than do languages that developed in more temperate areas [13,14]. A similar trend is observed in many subdisciplines of engineering, where specialized vocabulary, often called jargon, develops. However, as engineering problems and solutions become more complex, language and jargon alone are not sufficient to fully establish a COFOR, and collaborators turn to visual representations. Until the advent of modern computer-aided engineering (CAE) software in the 1980s [15], most visual representations were either 2D drawings or physical 3D models. Modern CAE tools allow designers to create digital 3D models of a particular design without having to build a physical model.
Unfortunately, these complex 3D designs are still most often viewed on a 2D computer screen, which requires advanced spatial reasoning skills to piece together a 3D mental model of the design [16]. Past studies with VR have shown that people are able to understand these complex 3D designs faster and more completely in an immersive virtual environment (VE) where they can see the design in 3D. Satter and Butler showed that users of an immersive stereoscopic display were more adept at navigating the environment as well as finding and repairing errors inside the environment [17]. Berg and Vance showed that a design team, using an immersive VR environment for three design reviews, was able to identify issues and propose solutions not found or solved with traditional CAE tools [7]. In addition, the design team commented that the immersive environment encouraged and increased team engagement. Bochenek and Ragusa found that design reviews that included stakeholders, such as the end consumer, were more successful when held in an immersive VE than in a more traditional 2D presentation [5].
In 2000, General Motors in conjunction with HRL Laboratories created a system for Distributed Design Review in Virtual Environments, termed DDRIVE, in order to leverage the benefits of VR for remote collaboration [4]. DDRIVE linked the cave automatic virtual environment (CAVE) at HRL with the CAVE at GM and the CAVE at the University of Illinois. It also allowed people to participate in the distributed design review via a more standard workstation, but only in a 2D manner. While this project proved successful, Daily et al. found that the network capacity of the time was a limiting factor [4]. Bochenek and Ragusa also point out that given the high cost of traditional immersive VR hardware, careful cost/benefit analyses need to be performed for each potential application [5].
Twenty-five years later, network bandwidth has improved significantly. However, traditional CAVE setups are still large capital investments for companies and hence have not seen widespread adoption or availability, limiting their potential impact. Recently, new consumer-grade VR hardware has been introduced to the market, which provides an immersive experience similar to a CAVE for a single person. However, unlike a traditional CAVE, this new hardware is significantly cheaper and easier to deploy requiring little to no modification to existing spaces [6].
Given the advantages immersive VR has shown previously in the literature, this new low-cost consumer VR hardware could prove an important enabler for companies and design teams to begin leveraging VR CAE tools in routine daily tasks. However, in contrast to a traditional CAVE, where a group can use the CAVE together and see each other, the currently available consumer VR hardware takes the form of a head-mounted display (HMD), which only a single participant can use at a time and which blocks their view of the outside world. This work demonstrates a system where multiple users can enter a shared immersive VE, with each participant in their own headset connected via a network. The VE also includes gesture support to create a collaborative environment similar to that of a CAVE. This work also presents a study that was performed on the system to evaluate its effectiveness as a communication tool compared to commonly used tools such as Skype.
System Architecture
Overview.
At a high level, the system is built on a client–server architecture as shown in Fig. 1. A client–server architecture was selected over peer-to-peer because it accommodates, with fewer complications, the security and data protection requirements of large companies. Each client is made up of a single VR hardware set (discussed in detail in Sec. 3.2) and a client program, which exchanges various data packets with the server. In this way, the server can distribute the local data from each client to the other clients in the session. A client can then use the information it receives from the server to update its local version of the VE and animate avatars for each participant. This architecture provides a way of creating a synchronized environment for each participant regardless of where the participant is located. Hence, all participants could be in the same physical location (as a CAVE would require) or spread across the country.
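To make this relay behavior concrete, the following is a minimal sketch, not the paper's actual code, of a server that forwards each client's state updates to every other connected client. The names StateMessage, ClientLink, and RelayServer are hypothetical; the real implementation used Unity's built-in networking (see the Software subsection below).

```csharp
// Illustrative sketch: a relay that forwards each client's state updates to
// every other connected client. Transport details are omitted.
using System.Collections.Generic;

public class StateMessage
{
    public int SenderId;     // which participant produced this update
    public byte[] Payload;   // serialized avatar or gesture data
}

public class ClientLink
{
    public int Id;
    public Queue<StateMessage> Outbox = new Queue<StateMessage>();

    public void Send(StateMessage msg) => Outbox.Enqueue(msg);
}

public class RelayServer
{
    private readonly List<ClientLink> clients = new List<ClientLink>();

    public void Register(ClientLink client) => clients.Add(client);

    // Called whenever a state update arrives from any client.
    public void OnStateReceived(StateMessage msg)
    {
        foreach (var client in clients)
        {
            if (client.Id != msg.SenderId)   // echo to everyone except the sender
                client.Send(msg);
        }
    }
}
```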
While this architecture could support sending data from a variety of inputs (such as motion capture systems, controllers, HMDs, and position sensors), the hardware included in the implementation evaluated here was limited to VR HMDs and motion capture systems for the hands and fingers. The number of users and sensors that could be supported by this system is highly dependent on factors such as client and server processor power, network bandwidth, sensor sample rate, message send rate, etc. These factors would need to be evaluated on an application-by-application basis to determine the limits of the number of users and peripheral devices that could be supported. The hardware used in this system allows the software to create a set of virtual hands for each user as their avatar and send the appropriate data across the network to recreate each user's hands in the VE. This allowed participants in the VE to see the hands of every other participant and see where or what another participant might be pointing or gesturing to. An example of this is shown in Fig. 2. With a more complete motion capture system, more data about the user could be captured and used to create and animate higher fidelity avatars including items such as legs, arms, and a torso. Technology such as that demonstrated by Li et al. could be used to track facial expressions and animate the face of the avatar to match the human participant it represents [18].
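As an illustration of how such hand-tracking data might be packaged for transport, the sketch below flattens a single hand pose into a float array and rebuilds it on the receiving side. The palm-plus-fingertips layout and the names HandPose and HandPoseSerializer are assumptions for illustration only, not the system's exact message format.

```csharp
// Hedged sketch of flattening a local hand pose for network transport and
// rebuilding it on a remote client (Unity types assumed).
using UnityEngine;

public struct HandPose
{
    public Vector3 PalmPosition;
    public Quaternion PalmRotation;
    public Vector3[] FingertipPositions;   // five fingertips, thumb through pinky
}

public static class HandPoseSerializer
{
    // 3 floats for palm position + 4 for palm rotation + 5 * 3 for fingertips
    public const int FloatCount = 3 + 4 + 15;

    public static float[] ToFloats(HandPose pose)
    {
        var data = new float[FloatCount];
        int i = 0;
        data[i++] = pose.PalmPosition.x; data[i++] = pose.PalmPosition.y; data[i++] = pose.PalmPosition.z;
        data[i++] = pose.PalmRotation.x; data[i++] = pose.PalmRotation.y;
        data[i++] = pose.PalmRotation.z; data[i++] = pose.PalmRotation.w;
        foreach (var tip in pose.FingertipPositions)
        {
            data[i++] = tip.x; data[i++] = tip.y; data[i++] = tip.z;
        }
        return data;
    }

    public static HandPose FromFloats(float[] data)
    {
        var pose = new HandPose { FingertipPositions = new Vector3[5] };
        int i = 0;
        pose.PalmPosition = new Vector3(data[i++], data[i++], data[i++]);
        pose.PalmRotation = new Quaternion(data[i++], data[i++], data[i++], data[i++]);
        for (int f = 0; f < 5; f++)
            pose.FingertipPositions[f] = new Vector3(data[i++], data[i++], data[i++]);
        return pose;
    }
}
```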
While streaming audio through a network is also possible and has been solved many times, playing audio in virtual reality so that it feels like a cohesive part of the VE is a nontrivial issue, which has been the focus of much research. The main issue is the unique shape of each person's ears, which alters the waveforms of the external sounds we hear and gives our brains additional cues for deducing the location of a sound's origin [19]. Currently, measuring the unique audio signature of each person's ear is a time-consuming task requiring specialized hardware. However, there are plans to commercialize a much quicker process, which would make the technology more accessible [20]. Given the issues surrounding virtual audio and the fact that our implementation was set up such that users would be in close physical proximity, audio streaming was considered outside the scope of this work and not implemented.
In addition to creating an avatar for each participant, design artifacts such as 3D models were added to the VE. In combination with the avatars, this allows a group of collaborators to meet around a 3D design artifact and see each collaborator's gestures (such as pointing to a location, indicating a size, or showing the movement of a mechanism) in relation to the model. Additionally, the system recognized special input gestures, which allowed a collaborator to draw and delete 3D free-form curves. This functionality provides collaborators a 3D sketch-pad to quickly communicate new ideas, modifications to a design, or even to highlight and annotate design artifacts in the VE.
Hardware.
At a minimum, each client needs a VR HMD and a computer with a network connection to power the HMD and connect to the server. In this implementation, HTC Vive headsets were used in conjunction with upgraded Windows workstation computers (3.5 GHz, 8 GB RAM, NVIDIA GeForce GTX 1060). The HTC Vive was chosen for its capability to precisely follow a user as they walk around a physical space and move them correspondingly in the VE. This functionality allows users to walk around and view the VE from many angles while communicating. Additional sensors and inputs can be added to increase the fidelity and functionality of the VE. Since the system presented was primarily interested in adding hand gestures to existing communication, motion capture of the hands and fingers was deemed sufficient, and a Leap Motion™ controller was used to provide this functionality. The Leap Motion controller was mounted to the HTC Vive HMD and connected to the computer via USB 3.0 as shown in Fig. 3.
The computer used for the server was a standard workstation computer (2.66 GHz, 12 GB RAM). Both the server and the client machines were connected through the local college network with 10 Mbps or greater network connection speeds.
Software.
Software development was done in Unity (version 5.4) using standard assets and libraries where possible. Most notably, Unity's built-in networking libraries and assets were leveraged to provide the backbone of the networking communication. Where standard libraries and assets were not sufficient, third-party libraries and assets were used. In particular, libraries, assets, and examples provided by Leap Motion were used to interface with the Leap Motion controller. The Orion V3 beta drivers, which provide enhanced tracking, reduced drop-out, and a larger tracking area than previous drivers, were used for the Leap Motion controller.
Where standard and third-party libraries were insufficient, additional scripts were added or existing ones modified. For example, Leap Motion libraries were modified to expose the raw location and orientation of the constituent pieces of the local avatar. These data were then serialized, sent over the network, and used to create an identically located and oriented avatar on the remote clients. Additionally, assets were created to recognize the special gestures that would trigger input to the system. For example, the pointing gesture, shown in Fig. 4 (index finger extended, all other digits closed), would start drawing a 3D free-form curve and continue drawing the curve until the gesture was broken. A pinching gesture, shown in Fig. 5 (thumb and index finger touching; middle, ring, and pinky fingers extended), was used to delete a specific free-form curve across all connected clients. A gesture consisting of a double "thumbs-up," shown in Fig. 6 (all eight fingers closed, both thumbs extended and pointed toward the ceiling), was used as a general clear, which would delete all the free-form curves across all connected clients. The appropriate data for each of these input gestures were also serialized and sent across the network to replicate the effect of the gesture on the remote clients. Hence, as a collaborator drew a free-form curve locally, the curve points were serialized and sent to the other clients through the server. Once the data reached a client, they were reconstructed into a free-form curve matching the original. Each curve was also uniquely named such that it could be uniquely identified on each client. Thus, when a collaborator deleted a particular curve, messages were sent across the network deleting the curve from all clients, keeping the VE consistent.
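The sketch below illustrates, in simplified form, the three input-gesture rules just described. It operates on a hypothetical per-hand summary (FingerState) rather than the raw Leap Motion API, and the names and thresholds are illustrative assumptions only.

```csharp
// Minimal sketch of the input-gesture rules: pointing draws a curve, pinching
// deletes a curve, and a double thumbs-up clears all curves.
public struct FingerState
{
    public bool ThumbExtended, IndexExtended, MiddleExtended, RingExtended, PinkyExtended;
    public float PinchDistance;     // distance between thumb tip and index tip, in meters
    public bool ThumbPointingUp;    // thumb roughly aligned with world up
}

public enum Gesture { None, DrawCurve, DeleteCurve, ClearAll }

public static class GestureClassifier
{
    const float PinchThreshold = 0.02f;   // ~2 cm, illustrative value

    // Pointing: index extended, all other digits closed -> draw a free-form curve.
    static bool IsPointing(FingerState h) =>
        h.IndexExtended && !h.ThumbExtended && !h.MiddleExtended && !h.RingExtended && !h.PinkyExtended;

    // Pinch: thumb and index touching, remaining fingers extended -> delete a curve.
    static bool IsPinching(FingerState h) =>
        h.PinchDistance < PinchThreshold && h.MiddleExtended && h.RingExtended && h.PinkyExtended;

    // Thumbs-up: all fingers closed, thumb extended and pointing up.
    static bool IsThumbsUp(FingerState h) =>
        h.ThumbExtended && h.ThumbPointingUp &&
        !h.IndexExtended && !h.MiddleExtended && !h.RingExtended && !h.PinkyExtended;

    public static Gesture Classify(FingerState left, FingerState right)
    {
        if (IsThumbsUp(left) && IsThumbsUp(right)) return Gesture.ClearAll;   // double thumbs-up
        if (IsPointing(left) || IsPointing(right)) return Gesture.DrawCurve;
        if (IsPinching(left) || IsPinching(right)) return Gesture.DeleteCurve;
        return Gesture.None;
    }
}
```

In the actual system, a recognized drawing gesture streams the traced curve points to the server for replication on the other clients, and each completed curve carries a unique name so that a later delete gesture can remove the same curve everywhere.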
Methodology
Overview.
In order to test the hypothesis that the system described in Sec. 3 would improve communication relative to current video conferencing tools, an experimental study was conducted to assess the change in both communication time and accuracy when using the collaborative VE to communicate 3D geometry data. Skype was used as the control to represent traditional collaborative software. It was hypothesized that, when collaborators used the VE, less time would be required to communicate and the information would be communicated more accurately.
In order to test this hypothesis, participants were recruited and required to bring a friend or colleague with them to form a team of two. During the study, each team participated in a single session lasting approximately 1 h, during which the two team members would be randomly assigned the roles of teacher and student. The teacher would be given a 3D printed model of the movement path of the tool head of a robotic arm (shown in Fig. 7). After learning the path, the teacher would use either Skype or the collaborative VE described in Sec. 3 to teach their teammate the path. Once the teammates felt they knew the path, the communication link between the team would be broken, and both team members would use the VE in a noncollaborative mode to record their understanding of the original path. The recorded data were used to analyze communication accuracy and were always recorded in the VE regardless of whether communication was done in Skype or the VE. Additionally, the time that a teacher spent teaching the path to their teammate was recorded and used to analyze the amount of time required to communicate a path in Skype versus in the VE.
Setup.
Since this implementation of the collaborative VE does not include support for audio, the audio streaming support in Skype was not used. Instead, the physical areas for each client were placed in close proximity such that participants could speak to each other normally, as seen in Fig. 8. However, when communicating, the line of sight between participants was blocked such that all visual communication went through the collaboration tool for that test. This kept the audio communication in both environments comparable and eliminated confounding factors in the analysis. In addition, the tests were conducted in a quiet area in order to minimize confounding background noise.
Six unique paths were created with varying degrees of geometric complexity, such as lines versus curves, dimensionality, number of segments, and patterned versus free-form shape. The 3D printed models that were given to the teacher to memorize are shown in Fig. 7. Path 1 was one-dimensional, consisting of a straight line. Path 2 was 3D, consisting of three straight lines. Path 3 was 3D, consisting of one straight line and three similar arcs. Path 4 was 2D, consisting of one "S"-shaped curve. Path 5 was 3D, consisting of one long free-form curve. Path 6 was 3D, consisting of one stylized spiral. The tutorial path, used for explanation and training purposes, was also 3D and consisted of five straight line segments.
Of those six paths, each team was assigned to teach four: two in the VE and two in Skype. The order and collaboration environment were randomly assigned such that each model was taught by 24 teams, 12 times in the VE and 12 times in Skype. In addition, the order in which the environments were used was randomized such that half of the teams used the VE first and half used Skype first. The roles of teacher and student were also randomized such that each participant taught once in Skype and once in the VE.
Experimental Session.
Each team was asked to participate in a single 1-h session. When a team arrived, they were given an overview of the experiment and signed the appropriate waivers. After signing the waivers, the participants were given an overview of the Skype collaboration environment and the tools available. After explaining Skype, the team was given an overview of the VE, how the headsets and motion capture worked, and the tools available for collaboration in the VE. After the team had been familiarized with both systems, they were given time to practice with the VE and the input gestures to draw and delete free-form curves.
After the familiarization phase, participants were given a tutorial on how their final input would be recorded for accuracy analysis. During the tutorial, the tutorial path was placed in the VE and participants were asked to trace it. Each participant's trace of the tutorial path was recorded and used as baseline accuracy data. An example is shown in Fig. 9.
After participants had completed familiarization and the tutorial, the experimental task began. One teammate was randomly assigned to be the teacher and given a model to memorize. After memorization, both teammates were set up in the randomly assigned collaboration environment. A timer was started when the teacher began the explanation and was stopped once the student felt they could reproduce the path. After the timer was stopped, no further communication or clarification was allowed. In order to keep experimental sessions under 1 h, the teaching time was limited to 4 min. After the explanation, participants used the VE in a noncollaborative mode to record their understanding of the path. The full experimental task (starting with random teacher selection) was repeated until the team had taught all four of their assigned paths.
After teaching all four paths, both team members were asked to complete a short survey about their experience with the tools. This survey completed the experimental session.
In order to more closely approximate a real-life collaboration situation, participants had access to a 3D model of the robotic arm and tool head in both Skype and the VE. In Skype, they had access to the model through the Autodesk A360 viewer. This application provides standard viewing controls such as pan, rotate, and zoom. Using Skype, participants were able to share their screen, show their teammate their view of the model, and indicate locations they were referencing with their mouse. For the Skype environment, webcams were also provided so that participants could video chat and use their hands and arms to demonstrate the path if preferred. A model of the robotic arm and tool head was placed in the VE as well, scaled to full size. In the VE, participants were able to use their hands (as virtual avatars) to point, touch, and indicate locations on the model. In addition, participants could use the drawing tools to mark, highlight, and draw in the VE.
Participants.
Seventy-four participants were recruited from Brigham Young University's School of Engineering and Technology to form 37 teams. Thirty-six datasets were collected, with the 37th team used to replace a corrupted dataset from an earlier team. Participants ranged from college freshmen to graduate students. Of the 74 participants, 58 had fewer than 5 h of previous experience with VR and only six had more than 10 h. In order to encourage participation, each participant was compensated $10 for completing the 1-h session. In order to encourage participants to explain the path clearly and quickly, participants were measured on the speed of their teaching and the accuracy of their response paths. To incentivize quick, high-quality work, the two top-performing teams were awarded VR headsets and gift cards. Although participants were allowed to terminate their participation early for any reason, none did.
Results and Discussion
Data Processing.
The accuracy of communication was quantified by comparing pairs of recorded curves and accumulating the distance between corresponding measurement points:

$$\mathrm{Acc}_{A\text{--}B}=\sum_{n=1}^{N}\left\lVert \mathbf{p}_{A,n}-\mathbf{p}_{B,n}\right\rVert$$

where A–B is either teacher–model, student–model, or student–teacher, and $\mathbf{p}_{A,n}$ is the nth measurement point on curve A.
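For reference, a brief sketch of this accumulation in code is given below. It assumes the two curves have already been resampled to the same number of ordered points; the class and method names (AccuracyMetric, CurveError) are hypothetical.

```csharp
// Hedged sketch of the accuracy measure: the error between two curves is the
// summed distance between corresponding measurement points (Unity types assumed).
using System;
using UnityEngine;

public static class AccuracyMetric
{
    public static float CurveError(Vector3[] curveA, Vector3[] curveB)
    {
        if (curveA.Length != curveB.Length)
            throw new ArgumentException("Curves must be resampled to the same number of points.");

        float total = 0f;
        for (int n = 0; n < curveA.Length; n++)
            total += Vector3.Distance(curveA[n], curveB[n]);   // point-to-point distance
        return total;
    }
}
```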
Results.
Table 1 presents Student's t-tests, which indicate that less time was required to teach paths 2–5 in the VE than in Skype. The time reduction for path 4 is interesting since path 4 is two-dimensional and hence an environment such as Skype should be sufficient to fully convey the information. Also of interest is the fact that, despite being a 3D path, the time to teach path 6 was not statistically different. This result may in part be explained by the fact that path 6 is a spiral, and hence the general shape of the path was easy to describe verbally. Finally, although not statistically significant, it is also interesting to note that, on average, path 1 took longer to explain in the VE than in Skype. This result likely stems from the relative simplicity of teaching path 1 and the fact that, in the VE, participants had trouble with the sensor not picking up an input gesture. Hence, they would spend several seconds adjusting their pose until the sensor recognized the gesture. Although these delays existed in all the VE time measurements, only path 1 showed a negative effect due to the extremely short time required to teach it.
Model | 1 | 2 | 3 | 4 | 5 | 6
---|---|---|---|---|---|---
Skype mean time (s) | 17.17 | 167.95 | 168.61 | 121.06 | 172.95 | 129.69
VE mean time (s) | 26.64 | 98.71 | 92.19 | 82.00 | 108.47 | 102.20
Percent reduction | −55.15 | 41.22 | 45.32 | 32.27 | 37.28 | 21.20
p-value | 0.911 | 0.008 | 0.001 | 0.0691 | 0.002 | 0.1832
Table 2 presents the mean accuracy values, calculated as described in Sec. 5.1, along with p-values from Student's t-test means comparisons. From Table 2, it can be seen that there was a statistically significant improvement in the accuracy of the student's recording relative to the teacher's recording for paths 1, 3, and 5. Paths 1, 3, and 4 also had statistically significant improvements in the accuracy of the student's recording relative to the original model (which the students never saw). It is interesting to note that path 4 shows a statistically significant improvement in the teacher's accuracy relative to the original model, and in general (though not statistically significant with the current amount of data), this trend holds for most of the paths. This could potentially be explained by teachers who taught in the VE having additional practice drawing the path in the VE, by the fact that they did not have to fit their mental model to 2D views of the 3D model, by the shorter time between memorizing the model and recording their answer, or by some combination of these factors. Path 5 is also interesting because the student's accuracy relative to the original is not significantly different, while their accuracy relative to the teacher is better. This is likely caused by the difficulty of memorizing path 5: during the study, it was observed that parts of path 5 were frequently forgotten or taught incorrectly regardless of the system being used for collaboration.
Comparison | Skype Acc. | VE Acc. | p-value |
---|---|---|---|
Path 1 | |||
Teacher model | 3476 | 3221 | 0.3181 |
Student model | 3832 | 3071 | 0.0937 |
Student–Teacher model | 4585 | 3923 | 0.0792 |
Path 2 | |||
Teacher model | 14,112 | 10,867 | 0.2370 |
Student model | 15,926 | 12,126 | 0.3944 |
Student–Teacher model | 17,883 | 11,844 | 0.1107 |
Path 3 | |||
Teacher model | 11,581 | 10,344 | 0.2003 |
Student model | 16,488 | 12,516 | 0.0337 |
Student–Teacher model | 15,961 | 11,858 | 0.0415 |
Path 4 | |||
Teacher model | 10,673 | 8107 | 0.0652 |
Student model | 11,777 | 9210 | 0.0809 |
Student–Teacher model | 12,586 | 10,247 | 0.1521 |
Path 5 | |||
Teacher model | 16,215 | 19,041 | 0.8818 |
Student model | 21,512 | 20,919 | 0.4256 |
Student–Teacher model | 17,309 | 11,689 | 0.0022 |
Path 6 | |||
Teacher model | 14,040 | 14,618 | 0.5956 |
Student model | 18,082 | 17,030 | 0.3677 |
Student–Teacher model | 16,788 | 14,104 | 0.1678 |
Figure 11 shows the frequency of participants' responses to the question of how suitable each environment was for communicating complex 3D information. As shown in Fig. 11, Skype was typically rated as somewhat suitable while the VE was typically rated as ideal or very suitable. Table 3 shows the frequency of the various positive and negative reasons given when participants were asked why they assigned each environment a particular suitability rating. The most common reason for rating Skype's suitability lower than the VE's was that Skype provided only a 2D view of the 3D information, requiring the student to extrapolate 3D position and shape from multiple 2D views, whereas the VE allows native 3D viewing. Additional common reasons for rating the VE as more suitable included the ability to draw and annotate the 3D model in the VE and the ability to gesture (point, sweep, demarcate, etc.) with their hands in 3D instead of using a 2D mouse pointer in Skype. Reasons for marking Skype higher on the suitability scale included the ability to see the other person's face in Skype and the high precision of the mouse input. The most common reason for marking the VE lower on the suitability scale was the difficulty some participants had getting the VE to recognize their input gestures to draw or delete curves.
Environment | Type | Count | Reason
---|---|---|---
Skype | Pro | 7 | Could see the other person
Skype | Pro | 6 | Precision from mouse input
Skype | Con | 32 | Viewing was 2D
Skype | Con | 13 | Communication less detailed
Skype | Con | 3 | Difficulty controlling software
VE | Pro | 36 | Could view in 3D
VE | Pro | 26 | Could draw
VE | Pro | 25 | Could gesture (point, etc.)
VE | Pro | 16 | More interactive
VE | Pro | 5 | Simultaneous interaction
VE | Con | 7 | Poor input gesture recognition
Additionally, it was noted that while there was a high amount of gesturing in both systems, there were significant qualitative differences. In skype, participants usually used the mouse to demonstrate the locations of key points (such as major direction changes) and how they were connected. Often, this information was repeated from multiple angles and any additional detail was either described verbally or left out altogether. In contrast, when using the VE, participants would often quickly draw a rough sketch of the path and then use gestures to communicate refining information such as curvature and more specific locations. In addition, the verbal communication in the VE typically centered on the gestures being made. For example: “this high,” “looks like this,” and “curves more like this” (with specific actions corresponding to the adjective or pronoun).
Figure 12 shows the frequency of responses regarding how effective participants felt either mouse gestures in Skype or hand gestures in the VE were for conveying the path information. Responses are also broken out by the effectiveness of the participant's own gestures versus the effectiveness of the partner's gestures. It is interesting to note that participants typically rated their own gestures as less effective than their partner's regardless of the system. However, despite this self-deprecation, participants felt their own hand gestures in the VE were significantly more effective at communicating the path to their partner than their partner's mouse gestures were in Skype.
Figure 13 shows participants' environment preferences for communicating complex 3D data. From their responses, we see that 69 of the 74 respondents (93%) preferred the VE, 1 (1%) preferred Skype, and 4 (5%) said they would prefer something else, such as showing the path in person on a physical model or drawing with pen and paper.
Discussion.
Taken together, the results presented above indicate that, in many cases, a collaborative VE that supports gestures and communication tools can provide significant collaboration benefits over current video conferencing systems, and at much more reasonable costs than similarly capable CAVE systems. In many situations, even when the information is only 2D, there are improvements to both the amount of time required to communicate and the accuracy of communication. From the participants' feedback, this appears to be due to 3D viewing instead of 2D, communication tools such as drawing, and support for communication gestures such as pointing and waving. In addition to the results presented above, 96% of participants stated that they enjoyed using the VE, 90% believed that VR tools could improve their engineering work, and 86% stated they would like to have access to VR tools in their future workplace. While these survey responses may not be representative of the larger population due to the volunteer nature of the participants, the results are sufficiently congruous to provide compelling motivation for further study of the wider applicability of these techniques. Furthermore, these results concur with some of the findings of recent research into communication in VEs regarding the novel benefits of drawing in 3D and both the advantages and shortcomings of collaborating in VR [21]. Likewise, the teacher–student model presented herein supports elements of a meta-analysis finding that VEs are promising tools for students to learn certain tasks and principles [22]. Finally, as human–computer interfaces improve and VR gestural support becomes faster and more precise, the potential for training professionals through VE systems will expand, as has already been observed in doctors acquiring laparoscopic skills through virtual training [23].
Limitations.
There are two significant limitations to this work. First and foremost, as mentioned previously, the participants in this study were recruited on a volunteer basis and were teamed with someone with whom they were familiar. This means that the sample is not representative of the general population and may have a higher predisposition to prefer VR. However, given the widespread interest and commercial competition in VR and augmented reality products, this predisposition likely does not differ significantly from that of the general population of a similar age. In addition, it is this younger population that forms the future of the engineering workforce. Also, working with a familiar teammate may not be fully representative of a real work environment, since collaborators often come from outside a person's circle of contacts. However, this familiarity could be both advantageous and disadvantageous: familiarity may increase understanding and reduce the time needed to communicate, but it may also lead to excessive socialization, which reduces focus on the task.
Further, we found very little to no cybersickness in our participant group. We hypothesize that this is due to a number of factors, such as the quality of the tracking provided by the Vive, the low-speed dynamics of the environment (i.e., not a rollercoaster, flight simulator, or similar), and the population used in this study (early adopters interested in VR); this may not be the case in all experiments, as observed in other VR studies [24].
Second, this work did not include audio streaming. Before a collaborative VE could fully support a physically distributed workforce, audio streaming would need to be implemented and tested. Depending on the specifics of the network connecting the system, a noticeable delay could be introduced, which could hamper collaboration effectiveness. However, this delay would be present regardless of whether the collaborators were using a video conference or a collaborative VE. In addition, since more of the information needed to be conveyed verbally in the video-call environment, network audio delays may be more detrimental to the video-call environment than to a collaborative VE.
Conclusions and Future Work
As discussed in Sec. 2, the literature is replete with potential applications and proven benefits of VR systems, especially as collaborative tools. However, until the recent release of low-cost consumer-grade VR hardware, VR setups were prohibitively expensive, and hence use of the hardware was limited to the most important tasks. Now, VR hardware has reached a tipping point where deploying it to large portions of a company's workforce is feasible. In turn, this higher availability would make a collaborative VE, such as the one presented here, not just feasible but beneficial by improving communication, collaboration, and design understanding. These improvements could reduce the number of costly turn-backs and design changes required after a design freeze. Long term, distributed collaborative VEs could lay the foundation for a globally distributed workforce that is strengthened by the talents, values, and expertise of many cultures and countries while allowing everyone to work together as if they were located locally. In fact, widespread availability of VR and the related technology of augmented reality could produce changes to our global society as monumental as the invention of the airplane, the computer, the Internet, and smart devices, all of which have significantly changed and improved how humans communicate, collaborate, design, engineer, and manufacture.
Future Work.
While this work has demonstrated the feasibility and benefits of a collaborative VE using low-cost consumer-grade hardware, additional investigations remain. When asked what improvements and changes they would like to see in the VE, participants most commonly requested improved hand tracking for gestural input such as drawing or deleting curves. Additional common requests included an expanded toolbox (color selection, points, straight lines, grid snapping, and shape primitives); the ability to modify and delete portions of previously created curves; more precise input, such as the HTC Vive controllers, for drawing curves; more active control of the viewpoint, such as zooming in and out; and improved graphics.
Additional future work includes the integration of more sensors to create a richer, more immersive experience. Sensors to integrate include full-body motion capture, which could be used to create full-body avatars; real-time facial capture, which would allow animation of the avatar's face to convey an additional depth of communication; and, as mentioned previously, 3D audio streaming to allow the system to be fully distributed.
Funding Data
Division of Computer and Network Systems (Grant No. 1067940).