VGGT

2025-08-02 13:30:31

Introduction

  • VGGT is an end-to-end transformer model, conceptually similar to Stable Diffusion or GPT but applied to 3D reconstruction rather than generation.
  • It takes a sequence of images as input.
  • It outputs camera poses, depth maps, point maps, and tracking information.
  • It can process hundreds of images within one second.
  • It infers the complete set of 3D attributes without requiring post-optimization.

Background

  1. Traditional 3D reconstruction relies on visual-geometry methods such as Bundle Adjustment (BA).

    BA (Bundle Adjustment) is a mathematical optimization technique that refines the 3D structure and camera poses by minimizing reprojection errors across multiple views: the error between the observed and the predicted image point coordinates is minimized, typically with the Levenberg-Marquardt (LM) algorithm. A minimal sketch of this objective is given at the end of this section.

  2. Machine learning methods address individual tasks such as feature matching and monocular depth prediction.

  3. VGGSfM: integrates deep learning components into the SfM (Structure from Motion) pipeline to make it end-to-end trainable.

  4. VGGT: predicts a full set of 3D attributes, including camera parameters, depth maps, point maps, and 3D point tracks.

Tracking Any Point (TAP): track points of interest across a video sequence.
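
To make the BA objective concrete, here is a minimal sketch of the reprojection-error term for a single pinhole camera; the variable names (K, R, t, X, x_obs) and the toy values are illustrative assumptions, and a real BA solver would minimize this jointly over all cameras and points with Levenberg-Marquardt.

```python
import torch

def reprojection_error(K, R, t, X, x_obs):
    """Sum of squared reprojection errors for one camera.

    K:     (3, 3) intrinsic matrix
    R, t:  (3, 3) rotation and (3,) translation (world -> camera)
    X:     (M, 3) 3D points in world coordinates
    x_obs: (M, 2) observed 2D points in pixels
    """
    X_cam = X @ R.T + t                       # transform points into the camera frame
    x_hom = X_cam @ K.T                       # apply intrinsics
    x_proj = x_hom[:, :2] / x_hom[:, 2:3]     # perspective division
    return ((x_proj - x_obs) ** 2).sum()

# Toy example: bundle adjustment would refine R, t (and X) to minimize this error;
# here we only evaluate the objective for a perfect estimate, so it prints 0.
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
R, t = torch.eye(3), torch.zeros(3)
X = torch.tensor([[0., 0., 5.], [1., -1., 4.]])
x_obs = torch.tensor([[320., 240.], [445., 115.]])
print(reprojection_error(K, R, t, X, x_obs))  # tensor(0.)
```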

Model Architecture

Aggregator

The model first processes input images using an aggregator.

  • DINO: Used for image patchification to get local tokens (t_I).
  • Augmentation tokens:
    • A camera token (t_g) is added.
    • A register token (t_R) is added.
  • Concat: The tokens are concatenated, and positional embeddings are added (a minimal token-assembly sketch follows this list).
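
Below is a minimal sketch of how the per-frame token sequence might be assembled. The patch embedding is a stand-in for DINO, and the hidden size, single register token, and class/parameter names are assumptions for illustration, not the actual VGGT code.

```python
import torch
import torch.nn as nn

class TokenAggregator(nn.Module):
    """Builds the per-frame token sequence: [camera token t_g, register token t_R, patch tokens t_I]."""

    def __init__(self, dim=1024, patch=14, img=224):
        super().__init__()
        # stand-in for the DINO patch embedding (a real setup would use pretrained DINO features)
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.camera_token = nn.Parameter(torch.zeros(1, 1, dim))    # t_g
        self.register_token = nn.Parameter(torch.zeros(1, 1, dim))  # t_R

    def forward(self, images):                             # images: (B, N, 3, H, W)
        B, N, _, _, _ = images.shape
        x = self.patchify(images.flatten(0, 1))            # (B*N, dim, H/p, W/p)
        t_I = x.flatten(2).transpose(1, 2) + self.pos_embed  # local tokens t_I + positional embeddings
        t_g = self.camera_token.expand(B * N, -1, -1)
        t_R = self.register_token.expand(B * N, -1, -1)
        tokens = torch.cat([t_g, t_R, t_I], dim=1)         # (B*N, 2 + P, dim)
        return tokens.view(B, N, tokens.shape[1], -1)      # (B, N, 2 + P, dim)

tokens = TokenAggregator()(torch.randn(2, 4, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 4, 258, 1024])
```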

Alternating Attention

The token sequence is processed through L layers of alternating attention blocks.

  • Global Attention: Attention is computed across tokens from all frames. Tensor shape: [B, N*P, C].
  • Frame Attention: Attention is computed among tokens within each individual frame. Tensor shape: [B*N, P, C].

The camera token (t_1g) and register token (t_1R) of the first frame are learnable tokens distinct from those of the remaining frames, which allows the model to set the first frame as the coordinate system. The register tokens (t_R) are discarded during the prediction phase.
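
The key operation in alternating attention is reshaping the same tensor between two views: (B, N*P, C) for global attention over all frames and (B*N, P, C) for frame attention within each frame. A minimal sketch of one such block, using standard nn.MultiheadAttention as a stand-in for the model's own transformer layers:

```python
import torch
import torch.nn as nn

class AlternatingAttention(nn.Module):
    """One global-attention + one frame-attention block over tokens of shape (B, N, P, C)."""

    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, N, P, C)
        B, N, P, C = x.shape

        # Global attention: every token attends to tokens of all frames -> (B, N*P, C)
        g = x.reshape(B, N * P, C)
        h = self.norm1(g)
        g = g + self.global_attn(h, h, h)[0]
        x = g.reshape(B, N, P, C)

        # Frame attention: tokens only attend within their own frame -> (B*N, P, C)
        f = x.reshape(B * N, P, C)
        h = self.norm2(f)
        f = f + self.frame_attn(h, h, h)[0]
        return f.reshape(B, N, P, C)

x = torch.randn(2, 4, 258, 1024)
print(AlternatingAttention()(x).shape)  # torch.Size([2, 4, 258, 1024])
```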

Prediction Heads

After the attention blocks, the tokens are passed to different prediction heads to generate the final outputs.

  • Camera Head:
    • Uses the camera token (t_ig).
    • Predicts the camera parameters (g) using self-attention and a linear layer.
  • DPT Head:
    • Uses the image tokens (t_iI).
    • Employs a DPT (Dense Prediction Transformer) to get feature maps F_i (with shape CxHxW).
    • A 3x3 convolution predicts the depth map (D_i), point map (P_i), tracking features (T_i), and a confidence map.
  • Track Head:
    • For a given query point (y_j) in a query image (I_q).
    • It performs a bilinear sample on the feature map (T_q) to get the feature (f_y).
    • This feature is correlated with other feature maps (T_i) to get correlation maps.
    • These maps are fed into a self-attention layer to predict the corresponding 2D points (y_i); a minimal sketch of the sampling and correlation step follows this list.
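
A minimal sketch of the sampling and correlation step: bilinearly sample the query point's feature f_y from T_q with grid_sample, then correlate it with every frame's feature map T_i. The normalization and function names here are illustrative assumptions, not the paper's exact design; in the full head, the resulting correlation maps go through self-attention layers that regress the 2D points y_i.

```python
import torch
import torch.nn.functional as F

def correlation_maps(T, q_idx, y_q):
    """T: (N, C, H, W) tracking feature maps; y_q: (x, y) query point in pixels of frame q_idx.

    Returns (N, H, W) correlation maps between the query feature f_y and every frame.
    """
    N, C, H, W = T.shape
    # normalize the pixel coordinate to [-1, 1] as required by grid_sample
    gx = 2.0 * float(y_q[0]) / (W - 1) - 1.0
    gy = 2.0 * float(y_q[1]) / (H - 1) - 1.0
    grid = torch.tensor([[[[gx, gy]]]], dtype=T.dtype)                 # (1, 1, 1, 2)
    f_y = F.grid_sample(T[q_idx:q_idx + 1], grid, align_corners=True)  # bilinear sample -> (1, C, 1, 1)
    # dot-product correlation of f_y with every spatial location of every frame
    corr = (T * f_y).sum(dim=1) / C ** 0.5                             # (N, H, W)
    return corr

T = torch.randn(4, 128, 37, 37)                                 # tracking features T_i for 4 frames
print(correlation_maps(T, q_idx=0, y_q=(18.0, 12.0)).shape)     # torch.Size([4, 37, 37])
```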

Training

The model is trained with a composite loss function; the camera loss uses the Huber loss. A minimal sketch of these loss terms follows the formulas below.

  • Total Loss:

    $L = L_{camera} + L_{depth} + L_{pmap} + \lambda L_{track}$

  • Camera Loss:

    $L_{camera} = \sum_{i=1}^{N} \left\| \hat{g}_{i} - g_{i} \right\|_{c}$

  • Depth Loss:

    $L_{depth} = \sum_{i=1}^{N} \left\| \Sigma_{i}^{D} \odot (\hat{D}_{i} - D_{i}) \right\| + \left\| \Sigma_{i}^{D} \odot (\nabla \hat{D}_{i} - \nabla D_{i}) \right\| - \alpha \log \Sigma_{i}^{D}$

  • Point Map Loss:

    $L_{pmap} = \sum_{i=1}^{N} \left\| \Sigma_{i}^{P} \odot (\hat{P}_{i} - P_{i}) \right\| + \left\| \Sigma_{i}^{P} \odot (\nabla \hat{P}_{i} - \nabla P_{i}) \right\| - \alpha \log \Sigma_{i}^{P}$

  • Track Loss:

    $L_{track} = \sum_{j=1}^{M} \sum_{i=1}^{N} \left| y_{j,i} - \hat{y}_{j,i} \right|$
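
To make the formulas above concrete, here is a minimal sketch of the camera and depth terms, with Σ_i^D taken to be the confidence map predicted by the DPT head and the gradient terms approximated by finite differences. The weights, reductions, and function names are illustrative assumptions; the point-map and track terms follow the same pattern.

```python
import torch
import torch.nn.functional as F

def grad_xy(x):
    """Finite-difference gradients of a (B, 1, H, W) map along width and height."""
    return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]

def camera_and_depth_loss(g_hat, g, D_hat, D, sigma_D, alpha=0.2):
    """Huber camera loss + confidence-weighted depth loss (point-map/track terms omitted)."""
    # L_camera: Huber norm between predicted and ground-truth camera parameters
    l_cam = F.huber_loss(g_hat, g, reduction="sum")

    # L_depth: confidence-weighted absolute error, gradient terms, and -alpha * log(sigma)
    l_depth = (sigma_D * (D_hat - D).abs()).mean()
    for gh, gt in zip(grad_xy(D_hat), grad_xy(D)):
        sig = sigma_D[..., : gh.shape[-2], : gh.shape[-1]]      # crop sigma to the gradient's shape
        l_depth = l_depth + (sig * (gh - gt).abs()).mean()
    l_depth = l_depth - alpha * sigma_D.log().mean()

    return l_cam + l_depth        # the full objective adds L_pmap and lambda * L_track

g_hat, g = torch.randn(4, 9), torch.randn(4, 9)                 # toy camera parameters for N=4 frames
D_hat, D = torch.rand(4, 1, 32, 32), torch.rand(4, 1, 32, 32)   # predicted / ground-truth depth maps
sigma_D = torch.rand(4, 1, 32, 32) + 0.1                        # predicted confidence map
print(camera_and_depth_loss(g_hat, g, D_hat, D, sigma_D))
```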

Limitations

  • Does not support fisheye or panoramic images.
  • Reconstruction quality is low when input images have extreme rotations.
  • Inference memory usage increases rapidly as the number of input images grows.