In 2016, I worked on a research project called “Exploring Reality”, or ExR for short. The idea was to enable 360-degree security monitoring of public areas without compromising people’s privacy. For example, imagine a playground where children are playing and people are walking around. By placing cameras that continuously monitor the area, we essentially violate those people’s privacy: anyone with access to the security camera recordings can see everything every one of those children and passers-by is doing.
The question is, can we find a way to let security personnel identify suspicious activity while maintaining people’s privacy?
The idea I investigated was built on the following three steps:
Scroll to the bottom for the TL;DR!
The first step was to map out a physical area and create a virtual replica of it. For simplicity, I chose my office as my physical area and set out to map it.
At the time, AR technology was taking its first steps towards commercialization. Google had just launched the Project Tango research tablet, aimed at giving developers and researchers a single platform on which to explore and experiment with AR technologies. The lessons learned from Project Tango later fed into Google’s ARCore.
Google’s Project Tango Tablet
I used a pre-existing tool called Phi.3D for Tango, created by DotProduct LLC and later replaced by Dot3D for Tango. The Tango version was a free beta. It started out nicely, but I quickly found that there was a size limit on the mapping, so I couldn’t map the entire office in one go; instead, I had to map various sections separately and then figure out how to stitch them together. The result of the scanning was a bunch of partially overlapping point clouds.
To stitch the separate point clouds together, I turned to PCL (the Point Cloud Library). I first used the open-source CloudCompare software to import the scan results created by Phi.3D and convert them into an ASCII cloud format that I taught PCL to read. Using PCL, I wrote some code that unified all the individual point clouds into a single one and then constructed a mesh from the individual vertices. This process took a very long time (several hours) and produced a single mesh with around 11 million vertices and 20 million faces. The mesh was too heavy to view even on my laptop, so it was certainly not suitable for use on a mobile phone.
Initial point-clouds overlaid on top of each other
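To make the merge step concrete, here is a minimal sketch of concatenating pre-aligned ASCII clouds. The one-`x y z`-triple-per-line format, the struct, and the function names are my own illustration, not the actual CloudCompare/PCL pipeline I used:

```cpp
#include <istream>
#include <sstream>
#include <vector>

struct Point { double x, y, z; };

// Parse one scan: one "x y z" triple per line (format is an assumption).
std::vector<Point> parseAsciiCloud(std::istream& in) {
    std::vector<Point> cloud;
    Point p;
    while (in >> p.x >> p.y >> p.z) cloud.push_back(p);
    return cloud;
}

// Concatenate partial clouds; assumes they already share one coordinate
// frame, i.e. registration/alignment has been done beforehand.
std::vector<Point> mergeClouds(const std::vector<std::vector<Point>>& scans) {
    std::vector<Point> merged;
    for (const auto& s : scans)
        merged.insert(merged.end(), s.begin(), s.end());
    return merged;
}
```

Note that simple concatenation only works once the partial scans are registered into a common frame; that alignment was the hard part of the stitching.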
I then proceeded to write my own C++ code that applied voxel subsampling to each of the individual point clouds, used PCL’s Greedy Triangulation algorithm to build a mesh from each one, and finally combined the meshes into a single model. This process took around 10 minutes to run, and I ended up with a mesh of around 5 million faces that looked virtually the same and worked nicely within the Unity 3D engine.
Final office mesh after voxel sub-sampling simplification, stitching and reconstruction
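The voxel subsampling step can be sketched in self-contained C++ like this; it is a stand-in for PCL’s VoxelGrid filter, snapping points to a grid of a given edge length and keeping one centroid per occupied cell:

```cpp
#include <cmath>
#include <cstdint>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

struct Point3 { double x, y, z; };

// Downsample a cloud by averaging all points that fall into the same
// voxel of edge length `voxel`.
std::vector<Point3> voxelSubsample(const std::vector<Point3>& cloud,
                                   double voxel) {
    // Accumulate per-cell coordinate sums and point counts.
    std::map<std::tuple<int64_t, int64_t, int64_t>,
             std::pair<Point3, int>> cells;
    for (const auto& p : cloud) {
        auto key = std::make_tuple(
            (int64_t)std::floor(p.x / voxel),
            (int64_t)std::floor(p.y / voxel),
            (int64_t)std::floor(p.z / voxel));
        auto& cell = cells[key];  // value-initialized to {{0,0,0}, 0}
        cell.first.x += p.x;
        cell.first.y += p.y;
        cell.first.z += p.z;
        cell.second += 1;
    }
    // Emit one centroid per occupied voxel.
    std::vector<Point3> out;
    out.reserve(cells.size());
    for (const auto& [key, cell] : cells)
        out.push_back({cell.first.x / cell.second,
                       cell.first.y / cell.second,
                       cell.first.z / cell.second});
    return out;
}
```

Larger voxel sizes shrink the cloud more aggressively at the cost of surface detail, which is the trade-off that got the mesh from ~20 million faces down to something Unity could handle.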
So, I now had a usable virtual representation of a real-world location, my office!
The second step was to capture the movements of a real person. For this I chose to use a PrimeSense depth camera. PrimeSense was an Israeli company, later acquired by Apple, that developed the 3D sensing technology behind the first-generation Xbox Kinect camera. I used the PrimeSense PSDK 5.0 (PSMP05000) camera in conjunction with the PrimeSense OpenNI 2 SDK and the PrimeSense NITE 2 Middleware, which provide very convenient utility functions such as pose estimation and skeletal-structure detection.
PrimeSense Depth Camera
Putting these together, I was able to quickly write some C++ code that uses the camera to identify the skeletal structure of the user standing in front of it. I then developed a simple Node.JS service, backed by a MariaDB database, to store the positions and orientations of all the detected joints at each point in time. The service allowed clients to either get real-time data from the camera reader or query the database to replay past data.
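For illustration, a per-joint sample sent from the camera reader to the collector service might look like the following. The field names and the JSON shape are assumptions for the sketch, not the project’s actual wire format:

```cpp
#include <sstream>
#include <string>

// One detected joint at one point in time, roughly matching what the
// NITE skeleton tracker reports per joint (names are illustrative).
struct JointSample {
    std::string joint;      // e.g. "head", "left_hand"
    double x, y, z;         // position in camera space
    double qx, qy, qz, qw;  // orientation quaternion
    long long timestampMs;  // capture time
};

// Serialize one sample as a JSON object (hand-rolled for the sketch).
std::string toJson(const JointSample& s) {
    std::ostringstream os;
    os << "{\"joint\":\"" << s.joint << "\","
       << "\"pos\":[" << s.x << "," << s.y << "," << s.z << "],"
       << "\"rot\":[" << s.qx << "," << s.qy << "," << s.qz << ","
       << s.qw << "],"
       << "\"t\":" << s.timestampMs << "}";
    return os.str();
}
```

One such record per joint per frame is all the service needs to store in order to support both live streaming and later replay queries.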
The final step was to map the captured movements to a virtual avatar within the virtual environment. For this I used the Unity 3D engine.
I started by importing the mapping I had captured of my office into Unity and added a human avatar to the scene. Next, I wrote some C# code that allowed my Unity application to connect to the collector Node.JS service and either receive live data or query past data for replay.
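When replaying stored data, the avatar’s position between two recorded frames can be linearly interpolated so the motion looks smooth rather than stepping from sample to sample. The actual replay ran in Unity C#; this is a small C++ sketch of the same idea, with names of my own choosing:

```cpp
#include <algorithm>

struct Vec3 { double x, y, z; };

// Interpolate between samples (t0, a) and (t1, b) at query time t,
// clamping outside the interval so the avatar holds its pose at the ends.
Vec3 lerpSample(double t0, const Vec3& a,
                double t1, const Vec3& b, double t) {
    double u = (t1 == t0) ? 0.0 : (t - t0) / (t1 - t0);
    u = std::clamp(u, 0.0, 1.0);
    return { a.x + (b.x - a.x) * u,
             a.y + (b.y - a.y) * u,
             a.z + (b.z - a.z) * u };
}
```

Joint orientations would need quaternion interpolation (slerp) rather than per-component lerp, which Unity provides out of the box.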
As I wanted to place the viewer within the scene and let them walk around while watching the event unfold, I integrated Google’s Cardboard SDK so I could build the app to run on a mobile phone and create a real VR experience. This also gave me the ability to map phone movements to the Unity camera, giving users the feeling that they were moving within the scene.
Final Reality Replication — The latency seems a bit high, but that’s just an overloaded laptop recording multiple feeds at the same time; in reality this was all smooth and real-time
You may notice from the images that the capture was actually a mirror image of the real movement. I opted not to fix this, as my goal was to prove the viability of the technology and I had a limited timeframe in which to do it.
The entire process took me almost 3 months to complete.
This project demonstrated that we could monitor an area and capture all the movements within it. We could then give security personnel the ability to walk around the area in VR and view avatars re-enacting the movements of the people who were present in the real area. When suspicious activity was detected, the security person could view the original footage of the specific individual whose movements were represented by the selected avatar. Thus, they could identify the suspect while maintaining the privacy of unrelated people in the surrounding environment.