Omnidata (Steerable Datasets)

A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans

Ainaz Eftekhar*, Alexander Sax*, Roman Bachmann, Jitendra Malik, Amir Zamir

ICCV 2021

Overview


The Omnidata annotator is a pipeline to resample comprehensive 3D scans from the real world into static multi-task vision datasets. Because this resampling is parametric, we can control, or steer, the resulting datasets. This enables interesting lines of research, such as studying the effects of the different sampling parameters. The resampled data can also be used to train strong and robust vision models (results, demo).

For example, we created a starter dataset of 14 million images sampled from 2000 scanned spaces. Familiar architectures trained on this generated starter dataset reached state-of-the-art performance on multiple common vision tasks and benchmarks, despite having seen no benchmark or non-pipeline data. The depth estimation network outperforms MiDaS, and the surface normal estimation network is the first to achieve human-level performance for in-the-wild surface normal estimation (at least according to one metric on the OASIS benchmark).

With 3D scanners becoming increasingly prevalent (e.g. on iPhones and iPads), we expect 3D scans to be a rich source of data in the future. We're therefore open-sourcing everything in order to make it easier to do research with steerable datasets. The Dockerized pipeline with CLI and its (mostly Python) code, PyTorch dataloaders for the resulting data, the starter dataset, download scripts, and other utilities are available in the linked GitHub repos above.


Introductory video (5 min)

Why make a 3D scan → 2D image pipeline?

i

A means to train really strong vision models

↪ Tooling

ii

Avenue to a "dataset design guide" for computer vision

↪ Demo

iii

Matched-pairs analysis for tasks / domains / data

↪ ex. MTL Experiments

iv

Large datasets for non-recognition spatial tasks

↪ Starter Data

i. A means to train state-of-the-art models

This isn't meant to be something that works in 5 years--it's meant to be something that works now. Below, we show that the pipeline can create datasets for several different computer vision tasks--each capable of training existing approaches to performance that is competitive with, or better than, the state of the art.

Depth estimation on OASIS images. We trained the method of MiDaS DPT-Hybrid (Ranftl et al. 2021), but only on a starter dataset output by our pipeline (Omni).

The results above look comparable to the original, and the quantitative results tell the same story: the version trained on the starter data outperformed the original mix of 10 existing depth-specific datasets when evaluated zero-shot on NYU (Silberman et al. 2014) and OASIS (Chen et al. 2020).

Surface normals extracted from depth predictions. The high-resolution meshes in the starter dataset also seem to produce networks that make more precise shape predictions, as shown by the surface normal vectors extracted from the predictions in the bottom row.
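As a rough illustration of what "extracting surface normals from a depth prediction" can mean, here is a minimal sketch of one common approach: back-project each pixel with a pinhole camera model and take the cross product of neighboring 3D differences. This is not necessarily the procedure used for the figure. The function names and the intrinsics fx, fy, cx, cy are assumptions made for this example.

```python
import math

def backproject(depth, fx, fy, cx, cy):
    """Lift every pixel (u, v) with depth z to a 3D point in camera space."""
    return [[((u - cx) * z / fx, (v - cy) * z / fy, z)
             for u, z in enumerate(row)]
            for v, row in enumerate(depth)]

def normals_from_depth(depth, fx, fy, cx, cy):
    """Estimate a unit normal per pixel from forward differences.

    Returns an (H-1) x (W-1) grid; an entry is None where the local
    surface patch is degenerate (zero-length cross product).
    """
    pts = backproject(depth, fx, fy, cx, cy)
    h, w = len(depth), len(depth[0])
    normals = []
    for v in range(h - 1):
        row = []
        for u in range(w - 1):
            p = pts[v][u]
            du = [a - b for a, b in zip(pts[v][u + 1], p)]  # step along image x
            dv = [a - b for a, b in zip(pts[v + 1][u], p)]  # step along image y
            n = (du[1] * dv[2] - du[2] * dv[1],
                 du[2] * dv[0] - du[0] * dv[2],
                 du[0] * dv[1] - du[1] * dv[0])
            length = math.sqrt(sum(c * c for c in n))
            if length == 0.0:
                row.append(None)
                continue
            # Flip so the normal faces the camera, which sits at the origin.
            sign = -1.0 if sum(a * b for a, b in zip(n, p)) > 0 else 1.0
            row.append(tuple(sign * c / length for c in n))
        normals.append(row)
    return normals

# A flat wall 2 m in front of the camera (hypothetical 4x4 depth map).
flat_wall = [[2.0] * 4 for _ in range(4)]
wall_normals = normals_from_depth(flat_wall, fx=100.0, fy=100.0, cx=2.0, cy=2.0)
```

Because the normals come from differences between neighboring depth values, small errors in the predicted depth show up directly as noisy normals, which is why this kind of visualization is a sensitive check of shape precision.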

For surface normal estimation (below), the generated dataset also yields networks that show state-of-the-art zero-shot performance on OASIS. On one of the metrics, the network reaches human-level performance (though not on the other metrics). The networks don't seem human-level, but they are qualitatively better than fancier approaches trained on existing datasets:

Surface normal prediction on OASIS images. Neither model saw OASIS images during training.

Surface normal prediction on OASIS images. The Omnidata-trained model outperformed the baseline model trained on OASIS data itself.

The annotator can be used for many downstream tasks: it produces labels for 21 different mid-level cues. These vision-based cues can also be used to improve performance on visuomotor tasks such as navigation and manipulation, including on actual physical robots.

For a complete list of labels and how they are produced, see the annotator GitHub. To try out the pretrained models, upload an image to a live demo or download the PyTorch weights. Or train your own model using the starter data and dataloaders.


ii. Avenue to a dataset 'design guide'

Most existing vision datasets capture images once. Someone or something decides which images to take and which not to, what sensors to use, what subjects to frame, etc. After capture, these design decisions become fixed and difficult to change. These biases in the training data become encoded in models trained on the data, and affect the resulting models' ability to generalize to other types of situations.
By capturing as much information as possible and then parametrically resampling that data into 2D images, we can probe the effects of different sampling distributions and data domains. For example, previous research has identified various types of selection bias, such as photographer’s bias and viewpoint bias. Choice of sensor (e.g. RGB vs. LIDAR) also affects what information is available to the model.
These choices have real impact. For example, selecting different aperture sizes changes the makeup of images (below and left), in effect making a dataset more or less object-centric. Play with some of these effects in our dataset design demo or make one yourself with one of the one-line examples in our annotator quickstart.

Dataset field-of-view influences image content.

e.g. FoV is correlated with object-level focus.
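To make the link between field of view and object-level focus concrete, here is a minimal sketch under an idealized pinhole model: the width of scene visible at distance d is 2·d·tan(FoV/2), so a narrower field of view lets the same object fill more of the frame. The function names are assumptions made for this example and are not part of the annotator.

```python
import math

def visible_width(fov_degrees, distance):
    """Width of the scene slice seen at `distance` by an ideal pinhole camera."""
    if not 0.0 < fov_degrees < 180.0:
        raise ValueError("field of view must be in (0, 180) degrees")
    return 2.0 * distance * math.tan(math.radians(fov_degrees) / 2.0)

def image_fraction(object_width, fov_degrees, distance):
    """Fraction of the image width covered by a fronto-parallel object (capped at 1)."""
    return min(1.0, object_width / visible_width(fov_degrees, distance))

# A 1 m wide object seen from 2 m away, with a narrow and a wide field of view.
narrow = image_fraction(1.0, fov_degrees=30.0, distance=2.0)
wide = image_fraction(1.0, fov_degrees=90.0, distance=2.0)
```

Here `narrow` is larger than `wide`: the same object dominates the narrow-FoV image and is a small part of the wide-FoV one.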




iii. Matched-pair analysis

What causes a model pretrained on ImageNet to transfer better to other downstream tasks compared to alternatives such as a depth model trained on NYU? We can create diverse multi-annotated datasets, which simplifies answering questions such as this one by constructing matched-pairs analyses or other types of controlled randomized studies.

Cross-task comparisons are complicated by confounding factors. Comparing two pretrained models' utility for transfer learning is difficult when the two models were trained on disjoint datasets with different parameters: domains, numbers of images, sensor types, resolutions, etc.




iv. Large datasets for even non-recognition spatial tasks

As 3D sensors become cheaper and more ubiquitous, we expect that 3D data will become more common, too. The Omnidata pipeline enables creating large annotated datasets from these 3D scans. This will make it possible to create very-large-scale datasets for non-recognition tasks like depth estimation--and even for tasks where there are no direct sensors (e.g. curvature estimation).

13 of 21 mid-level cues from the annotator. Each label/cue is produced for each RGB view/point combination, and there are guaranteed to be at least k views of each point.

There is already enough data available to tackle points i-iii above for several common computer vision tasks. We show some examples of each in the paper, using a starter dataset we annotated from 3D-scanned data that is already publicly available (Replica, Taskonomy, Hypersim, Google Scanned Objects, BlendedMVG, Habitat 2.0, and CLEVR). It's possible to annotate other datasets, too (e.g. CARLA for self-driving, as shown in the intro video).
For complete information about that starter dataset (including tools to download it), see the data docs.


Annotator Overview:

This section provides a brief overview of the annotation pipeline, but you can try a one-line example and get much more (a Docker container, code and documentation, PyTorch dataloaders) over at the main annotator repo.

The annotator takes in one of the following inputs and generates a static vision dataset of multiple mid-level cues (21 in the first release).

Annotator: inputs and outputs.

The annotator generates images and videos of aligned mid-level cues, given an untextured mesh, a texture/aligned RGB images, and an optional pre-generated camera pose file. A 3D pointcloud can be used as well: simply mesh the pointcloud using a standard mesher like COLMAP (result shown above).

The pipeline works by generating camera locations and point-of-interest locations (subject to parametric multi-view constraints). Then, for each combination of camera + point, the annotator generates views (either images or videos).

Static views of camera/point combinations. Multi-view constraints guarantee at least k views of each point.
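The sampling idea can be sketched in a few lines. This is a toy illustration under loud assumptions, not the annotator's implementation: cameras and points are drawn uniformly in a cube, a camera "sees" a point whenever it lies within max_dist (occlusion and field of view are ignored), and only points with at least k such views are kept. The function sample_views and all of its parameters are hypothetical.

```python
import math
import random

def sample_views(n_cameras, n_points, k, max_dist, size=10.0, seed=0):
    """Sample cameras and points in a cube; keep points seen by >= k cameras.

    Visibility is approximated by distance only (an assumption of this sketch).
    Returns (cameras, points, views), where views maps a point index to the
    list of camera indices that see it.
    """
    rng = random.Random(seed)

    def rand_xyz():
        return tuple(rng.uniform(0.0, size) for _ in range(3))

    cameras = [rand_xyz() for _ in range(n_cameras)]
    points = [rand_xyz() for _ in range(n_points)]

    views = {}
    for pi, p in enumerate(points):
        visible = [ci for ci, c in enumerate(cameras)
                   if math.dist(c, p) <= max_dist]
        if len(visible) >= k:  # multi-view constraint: at least k views
            views[pi] = visible
    return cameras, points, views

cameras, points, views = sample_views(n_cameras=50, n_points=200, k=3, max_dist=5.0)
# Each (camera, point) pair is one view for which labels would be rendered.
pairs = [(ci, pi) for pi, cams in views.items() for ci in cams]
```

A real pipeline would test visibility against the mesh (so that walls block the line of sight), but the structure is the same: sample, filter by the multi-view constraint, then render one view per surviving camera/point pair.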

Videos of interpolated trajectories. The annotator can also generate videos by interpolating between cameras.
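One simple way to interpolate between two cameras is to move the camera position linearly and keep it fixated on a point of interest. The sketch below assumes exactly that; the annotator's actual interpolation scheme may differ, and the function names are made up for this example.

```python
import math

def lerp(a, b, t):
    """Linear interpolation between two 3D positions, with t in [0, 1]."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))

def look_direction(camera, target):
    """Unit vector pointing from the camera position toward the target."""
    d = tuple(t - c for c, t in zip(camera, target))
    length = math.sqrt(sum(x * x for x in d))
    if length == 0.0:
        raise ValueError("camera coincides with target")
    return tuple(x / length for x in d)

def interpolate_trajectory(cam_a, cam_b, target, n_frames):
    """Return n_frames (position, view direction) pairs from cam_a to cam_b."""
    if n_frames < 2:
        raise ValueError("need at least two frames")
    frames = []
    for i in range(n_frames):
        pos = lerp(cam_a, cam_b, i / (n_frames - 1))
        frames.append((pos, look_direction(pos, target)))
    return frames

# Five frames sliding between two cameras while fixating on one point.
frames = interpolate_trajectory((0.0, 0.0, 1.5), (4.0, 0.0, 1.5),
                                target=(2.0, 3.0, 1.0), n_frames=5)
```

Rendering every cue at each interpolated pose then gives a video in which all labels stay aligned frame by frame.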




All mid-level cues are available for each frame. The following figure shows a few of these cues on a building from the Replica dataset.

5 of 21 outputs (video sampling).


How does the pipeline do this? It creates cameras, points, and views in 4 stages (below). For more information, check out the paper or annotator repo.




Omnidata Ecosystem:

We're open-sourcing everything that we used in the paper. We organized this into 4 primary components: the annotator, the starter dataset, the tooling (dataloaders, pretrained models, MiDaS training code, etc.), and a code dump for future reference. Click below to navigate directly to the repository.

Annotator

The annotator GitHub contains examples, documentation, a Dockerized runnable container, and the raw code.


↪ Annotator GitHub

Starter Data

The omnitools CLI contains parallelized scripts to download and manipulate some or all of the 14-million-image starter dataset. These scripts can also be reused to manipulate data generated from the annotator.


↪ Starter Data

Tooling

The tooling repo contains many of the tools that we found useful during the project: PyTorch dataloaders for annotator-produced data, data transformations, training pipelines, and our reimplementation of MiDaS.


↪ Tooling GitHub

Paper Code Dump

All the code from the paper, preserved for posterity.


↪ Paper Code GitHub

Paper

Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans.
Eftekhar*, Sax*, Bachmann, Malik, and Zamir.
ICCV 2021



Team

Ainaz Eftekhar

Sharif University of Technology

Alexander (Sasha) Sax

UC Berkeley

Roman Bachmann

EPFL

Jitendra Malik

UC Berkeley

Amir Zamir

EPFL