Robustness of Vision Transformers for Depth Estimation

Explaining how transformer-based vision models work for real-time depth estimation

Deeraj Manjaray
7 min read · Mar 9, 2024

In recent years, 3D computer vision has gained significant traction due to its numerous advantages over traditional 2D approaches.

Applications such as robotics, autonomous driving, surgical assistance, crop health monitoring, and inventory monitoring can benefit greatly from the capabilities of 3D Computer Vision.

While traditional 2D Computer Vision remains popular for tasks like Object Detection and Segmentation, 3D Computer Vision offers distinct advantages.

Moreover, Generative AI is now venturing into the realm of 3D space.

As we work on Industry 4.0 (now evolving into Industry 5.0) projects, the adoption of real-time artificial intelligence, particularly computer vision, has a substantial impact on forward-thinking businesses. Environmental considerations, technical requirements, standard operating procedures, business constraints, and other relevant factors must also be taken into account.

Since Alexey Dosovitskiy and his team successfully applied a transformer model to various image recognition benchmarks, there has been a surge of research and development in the field of Vision Transformers (ViTs). Unlike traditional Convolutional Neural Networks (CNNs), ViTs operate directly on sequences of image patches, treating them as tokens in a sequence.
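As a rough sketch of that tokenisation step, an image can be cut into patches and linearly projected into tokens. The patch size and embedding dimension below are illustrative (roughly matching ViT-Base), not a full ViT implementation:

import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768                       # illustrative values, roughly ViT-Base
img = torch.randn(1, 3, 224, 224)                     # dummy RGB image

# Cut the image into non-overlapping 16x16 patches and flatten each one
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)

# Linearly project every flattened patch to a token embedding
to_token = nn.Linear(3 * patch_size * patch_size, embed_dim)
tokens = to_token(patches)                            # (1, 196, 768): a sequence of patch tokens
print(tokens.shape)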

Vision Transformer Architectures:

  1. Encoder-only models: the original ViT is an encoder-only model; it processes the input image patches but does not, by itself, produce dense per-pixel predictions.
  2. Encoder-decoder models: the Dense Prediction Transformer (DPT) adds a convolutional decoder that combines the image-like feature representations produced by the transformer encoder into full-resolution predictions.

Depth Estimation using the Dense Prediction Transformer (DPT):

In the DPT architecture, the decoder is responsible for reassembling tokens from various stages of the ViT into image-like representations at different resolutions. These features are then progressively fused by convolutional fusion blocks into the final dense prediction.
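Conceptually, the "reassemble" step amounts to reshaping a token sequence back onto the image grid. A minimal sketch (dimensions are illustrative; the real DPT also handles the readout token and applies further convolutions and fusion blocks):

import torch

B, H, W, patch, C = 1, 384, 384, 16, 768                   # illustrative sizes
tokens = torch.randn(B, (H // patch) * (W // patch), C)    # output of a ViT stage: (B, N, C)

# Reassemble the token sequence into an image-like feature map of shape (B, C, H/p, W/p)
feature_map = tokens.transpose(1, 2).reshape(B, C, H // patch, W // patch)
print(feature_map.shape)                                   # torch.Size([1, 768, 24, 24])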

Depth Prediction for Augmented/Virtual Reality:

Accurate depth estimation from images using 3D vision models is crucial for creating augmented reality (AR) and virtual reality (VR) experiences. Applications like virtual furniture placement, glasses fitting, and industrial worker training simulations require precise depth cues for accurate spatial positioning and occlusion handling of virtual objects in real-world scenes.

Challenges in Depth Prediction:
While transformer-based depth prediction models have shown promising results, their performance can be affected by several factors, such as:

  1. Image quality
  2. Scene complexity
  3. Focal length
  4. Distance from the camera
  5. Image size
  6. Pixel conversion
  7. Lighting conditions

For real-time performance, computational requirements vary depending on the target devices and hardware.
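As one example of trading a little precision for speed on a GPU, inference can be run under mixed precision with autocast. This is only a sketch: it assumes a CUDA device, and that midas and input_batch stand for the MiDaS model and a pre-processed input batch produced by the loading and transform code shown later in this post:

import torch

# Assumed: `midas` is the loaded MiDaS model and `input_batch` a pre-processed image tensor
device = torch.device("cuda")
midas.to(device).eval()

# Run the forward pass in float16 where it is numerically safe
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    prediction = midas(input_batch.to(device))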

Additional challenges that need to be addressed include:

  • Occlusion handling (see the sketch after this list):
    → Detecting occlusions
    → Handling partial occlusions
    → Handling complete occlusions
    → Evaluating occlusion performance
  • Computational resources:
    → Model compression
    → GPU acceleration
    → Optimisation techniques
  • Camera calibration and synchronisation between the RGB camera and the depth estimation pipeline, which is crucial for AR/VR applications.
  • Meeting the demands of low latency and high frame rates in diverse real-world environments, which calls for model compression, GPU acceleration, and other optimisations.
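For occlusion handling in particular, a common approach is a per-pixel depth test: a virtual object's pixel is kept only where it is closer to the camera than the estimated real-scene depth. A minimal sketch, assuming both depth maps are expressed in the same units and camera frame (the array names here are illustrative, not part of any particular library):

import numpy as np

def composite_with_occlusion(real_rgb, real_depth, virt_rgb, virt_depth):
    """Per-pixel depth test: keep a virtual pixel only where it is
    closer to the camera than the estimated real-scene depth."""
    virt_visible = virt_depth < real_depth            # boolean occlusion mask
    mask = virt_visible[..., None]                    # broadcast the mask over RGB channels
    return np.where(mask, virt_rgb, real_rgb)

# Illustrative usage with random data (H x W = 480 x 640)
real_rgb = np.random.rand(480, 640, 3)
virt_rgb = np.random.rand(480, 640, 3)
real_depth = np.random.rand(480, 640) * 5.0           # estimated scene depth, metrically scaled
virt_depth = np.random.rand(480, 640) * 5.0           # rendered depth of the virtual object
out = composite_with_occlusion(real_rgb, real_depth, virt_rgb, virt_depth)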

Now, let’s look at a brief demo of the DPT model proposed in “Vision Transformers for Dense Prediction” by René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun.

Abstract:

The DPT assembles tokens from various stages of the vision transformer and combines them into image-like representations at various resolutions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution, providing a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions compared to fully-convolutional networks. Experimental results show substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, it achieves an improvement of up to 28% in relative performance compared to a state-of-the-art fully-convolutional network. For semantic segmentation on the ADE20K dataset, it sets a new state of the art with 49.02% mIoU. The architecture can also be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context, where it likewise sets new state-of-the-art results.

Demo:

Loading DPT model from Torch Hub:

model_type = "DPT_Large"     # MiDaS v3 - Large     (highest accuracy, slowest inference speed)
#model_type = "DPT_Hybrid" # MiDaS v3 - Hybrid (medium accuracy, medium inference speed)
#model_type = "MiDaS_small" # MiDaS v2.1 - Small (lowest accuracy, highest inference speed)

midas = torch.hub.load("intel-isl/MiDaS", model_type)
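Before running inference, move the model to the available device and switch it to evaluation mode, following the standard MiDaS hub example; the device and midas names are reused by the pipeline further below:

# Move the model to GPU if available, otherwise CPU, and set evaluation mode
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
midas.to(device)
midas.eval()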

MiDaS was trained on up to 12 datasets (ReDWeb, DIML, Movies, MegaDepth, WSVD, TartanAir, HRWSI, ApolloScape, BlendedMVS, IRS, KITTI, NYU Depth V2) with multi-objective optimisation.

The original model that was trained on 5 datasets (MIX 5 in the paper) can be found here.

The figure below shows an overview of the different MiDaS models; the bubble size scales with the number of parameters (source: Improvements vs FPS).

Define Transforms:

midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform
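Note that the transform must match the chosen model: the DPT variants use dpt_transform, while MiDaS_small uses small_transform. A small guard, following the MiDaS hub example:

# Pick the transform that matches the selected model variant
if model_type in ("DPT_Large", "DPT_Hybrid"):
    transform = midas_transforms.dpt_transform
else:
    transform = midas_transforms.small_transform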

Create a pipeline for evaluating the model on images:

import cv2
import numpy as np

def pipeline(img_path):
    img = cv2.imread(img_path)                        # Read the image using OpenCV
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)        # Convert the image from BGR to RGB
    ip_batch = transform(img).to(device)              # Transform the image and move the tensor to GPU/CPU
    with torch.no_grad():                             # Disable gradient calculation (no backpropagation)
        predict = midas(ip_batch)                     # Forward pass only
    predict = torch.nn.functional.interpolate(
        predict.unsqueeze(1),                         # Resize the predicted tensor
        size=img.shape[:2],                           # to the original image size
        mode="bicubic",                               # using bicubic interpolation
        align_corners=False,
    ).squeeze()
    op = predict.cpu().numpy()
    return op

def rescale_output(pred):
    pred_min = np.min(pred)                           # Minimum of the input array
    pred_max = np.max(pred)                           # Maximum of the input array
    rescaled = (pred - pred_min) / (pred_max - pred_min)   # Rescale to the range [0, 1]
    return rescaled

def get_img_array(filename):
    im = cv2.imread(filename)                         # Convert the image to a NumPy array
    im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)          # and scale its values between 0 and 1
    x = np.clip(np.asarray(im, dtype=float) / 255, 0, 1)
    return x
  1. Next, run depth estimation using the pipeline function on each image file.
  2. Rescale the depth output using the rescale_output function.
  3. Visualise the original input image alongside the coloured depth prediction.
  4. Horizontally stack the input image and the coloured depth map.
  5. Display the stacked image using Matplotlib.
import os
import matplotlib.pyplot as plt

plasma = plt.cm.plasma                                # Colormap used to colour the depth maps

# folder_path is assumed to point at a directory of input images
for filename in os.listdir(folder_path):
    filename = os.path.join(folder_path, filename)

    # Run depth estimation on the input image
    output = pipeline(filename)

    # Rescale the depth output to relative depth in [0, 1]
    output = rescale_output(output)
    depth_orig = output.copy()

    # Colour the depth output with the plasma colormap
    colored_depth = plasma(output)[:, :, :3]

    # Stack the input image and the depth prediction side by side
    imgs = []
    input_img_array = get_img_array(filename)
    imgs.append(input_img_array)
    imgs.append(colored_depth)
    img_set = np.hstack(imgs)

    plt.figure(figsize=(18, 18))
    plt.imshow(img_set)
    plt.show()
Depth Estimation of Window
Depth Estimation of Door

Now, convert a depth map into a 3D point cloud:

from kornia.geometry.depth import depth_to_3d

def threeD_point_cloud(depth_map):
    height, width = depth_map.shape[:2]               # Get the height and width of the depth map

    # Define the camera matrix (focal length and principal point)
    camera_matrix = torch.FloatTensor([[587, 0, width / 2],
                                       [0, 587, height / 2],
                                       [0, 0, -1]]).unsqueeze(0)

    # Scale the depth map to the range [0, 1]
    depth_map_min = np.min(depth_map)
    depth_map_max = np.max(depth_map)
    scaled_depth = (depth_map - depth_map_min) / (depth_map_max - depth_map_min)

    # Convert the scaled depth map to a Torch tensor of shape (1, 1, H, W)
    scaled_depth = torch.from_numpy(scaled_depth).float().view(1, 1, height, width)

    # Convert depth to 3D points using the camera matrix
    point_clouds = depth_to_3d(scaled_depth, camera_matrix, normalize_points=False)

    # Rearrange the dimensions to (H, W, 3) and convert to a NumPy array
    cloud_room = point_clouds.permute(0, 2, 3, 1)[0].numpy()

    # Return the 3D point cloud
    return cloud_room

Important: depth_to_3d (from Kornia) returns a tensor with a 3D point per pixel, at the same resolution as the input depth map.

Note: The depth map is then scaled to the range [0, 1] to ensure consistent processing.
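Under the pinhole camera model, this conversion is just a back-projection of each pixel (u, v) with depth Z through the intrinsics: X = (u - cx) * Z / fx and Y = (v - cy) * Z / fy. A minimal NumPy sketch of the same idea (fx, fy, cx, cy mirror the camera matrix used above; this is an illustration, not Kornia's exact implementation):

import numpy as np

def backproject(depth_map, fx=587.0, fy=587.0, cx=None, cy=None):
    """Back-project a depth map into an (H, W, 3) array of 3D points (pinhole model)."""
    h, w = depth_map.shape
    cx = w / 2 if cx is None else cx
    cy = h / 2 if cy is None else cy
    u, v = np.meshgrid(np.arange(w), np.arange(h))    # per-pixel coordinates
    z = depth_map
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)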

Now flatten the returned point cloud into a 2D array where each row represents a point in 3D space:

cloud_room = threeD_point_cloud(depth_orig)   # depth map produced by the pipeline above
pcd_room = cloud_room.reshape(-1, 3)          # one 3D point per row

PointCloud object:

I chose Open3D: A Modern Library for 3D Data Processing; its backend is highly optimised and set up for parallelisation.

import open3d as o3d

pcd = o3d.geometry.PointCloud()                       # Create a PointCloud object

# Assign the points from the pcd_room array to the PointCloud object
pcd.points = o3d.utility.Vector3dVector(pcd_room)

# Write the PointCloud object to a PLY file
o3d.io.write_point_cloud("window.ply", pcd)           # Save the point cloud

Read and visualise it:

pcd = o3d.io.read_point_cloud("window.ply")
o3d.visualization.draw_geometries([pcd])
Visualizer of window.ply

Thank you for taking the time to read! Don’t forget to 👏 if you liked the article.

A Note on This Article

The insights and perspectives shared in this article are drawn from my personal experiences. As with any subjective matter, there may be differing viewpoints or approaches.

If you have any questions, concerns, or alternative perspectives to offer, I’d be glad to hear from you. An open dialogue allows us all to gain a deeper understanding of the topic at hand.

Feel free to share your thoughts or feedback in the comments below or reach out to me directly. I’m always eager to learn and grow through respectful discourse.

Hungry for AI? Follow, bite-sized brilliance awaits! ⚡

🔔 Follow Me: LinkedIn | GitHub | Twitter

