3D Object Detection Architecture

Inputs

RGB Image: raw image data carrying color and spatial information
Point Cloud: raw 3D points carrying geometric information
3D Bounding Boxes: ground-truth corners of shape (N × 8 × 3)

RGB Preprocessing

Image Loading: load rgb.jpg and convert BGR to RGB
Transformations: resize to 480×608; training-time augmentation with flip, rotate, and crop
Normalization: mean [0.485, 0.456, 0.406], std [0.229, 0.224, 0.225] (ImageNet statistics)
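
For concreteness, here is a minimal sketch of this pipeline, assuming OpenCV loading and torchvision transforms; the specific augmentation parameters (rotation angle, crop padding) are illustrative assumptions, not values from the source.

```python
import cv2
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def load_rgb(path, train=False):
    img = cv2.imread(path)                      # OpenCV loads BGR
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # convert to RGB
    tfms = [transforms.ToPILImage(),
            transforms.Resize((480, 608))]
    if train:  # augmentation applied at training time only
        tfms += [transforms.RandomHorizontalFlip(),
                 transforms.RandomRotation(10),                  # assumed angle
                 transforms.RandomCrop((480, 608), padding=16)]  # assumed padding
    tfms += [transforms.ToTensor(),
             transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD)]
    return transforms.Compose(tfms)(img)        # (3, 480, 608) tensor
```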

Point Cloud Preprocessing

Loading: load pc.npy and reshape to (N, 3)
Filtering: remove NaN points and keep only points with z > 0, so only valid points remain
Sampling/Padding: randomly subsample to at most 8192 points, or zero-pad up to that size
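
A minimal NumPy sketch of this step; the load_point_cloud helper is hypothetical, and reading the filter as keep-z>0 is an interpretation of the diagram.

```python
import numpy as np

def load_point_cloud(path, max_points=8192):
    pc = np.load(path).reshape(-1, 3)                  # (N, 3)
    # Keep valid points: no NaN coordinates, positive depth (assumed reading).
    valid = ~np.isnan(pc).any(axis=1) & (pc[:, 2] > 0)
    pc = pc[valid]
    if len(pc) > max_points:                           # random subsample
        pc = pc[np.random.choice(len(pc), max_points, replace=False)]
    elif len(pc) < max_points:                         # zero-pad to fixed size
        pad = np.zeros((max_points - len(pc), 3), dtype=pc.dtype)
        pc = np.concatenate([pc, pad], axis=0)
    return pc                                          # (8192, 3)
```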

Bounding Box Preprocessing

Loading: load bbox3d.npy, corner coordinates of shape (N × 8 × 3)
Parametric Conversion: convert corners to (center, size, quaternion) via PCA-based fitting
Padding/Truncation: zero-pad or truncate to a maximum of 21 objects
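
One way the PCA-based fitting could look, sketched with NumPy and SciPy. corners_to_params is a hypothetical helper; the assignment of principal axes to (w, h, l) and the handedness fix are convention choices not confirmed by the source.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def corners_to_params(corners):
    """Fit (center, size, quaternion) to one box's 8 corners, shape (8, 3)."""
    center = corners.mean(axis=0)
    centered = corners - center
    # Principal axes from the covariance of the corner coordinates.
    _, axes = np.linalg.eigh(np.cov(centered.T))
    if np.linalg.det(axes) < 0:
        axes[:, 0] *= -1                 # enforce a right-handed frame
    local = centered @ axes              # corners expressed in the box frame
    size = local.max(axis=0) - local.min(axis=0)   # extents along each axis
    quat = Rotation.from_matrix(axes).as_quat()    # SciPy order: (x, y, z, w)
    quat = quat[[3, 0, 1, 2]]            # reorder to (q_w, q_x, q_y, q_z)
    return center, size, quat
```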

RGB Processing Branch

Backbone: EfficientNet-B3, pre-trained on ImageNet, producing 1536-D features
Feature Projection: 1536-D → 512-D via Linear + ReLU + Dropout
Output: 512-dimensional semantic RGB features
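
A sketch of this branch, assuming the backbone is instantiated through timm (num_classes=0 returns the pooled 1536-D feature vector for efficientnet_b3); the repository's actual wrapper may differ. The 0.2 dropout matches the hyperparameters listed below.

```python
import timm
import torch.nn as nn

class RGBBranch(nn.Module):
    def __init__(self, dim=512, dropout=0.2):
        super().__init__()
        # num_classes=0 -> pooled 1536-D features instead of class logits.
        self.backbone = timm.create_model(
            'efficientnet_b3', pretrained=True, num_classes=0)
        self.proj = nn.Sequential(
            nn.Linear(1536, dim), nn.ReLU(), nn.Dropout(dropout))

    def forward(self, img):                   # img: (B, 3, 480, 608)
        return self.proj(self.backbone(img))  # (B, 512) semantic features
```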

Point Cloud Processing Branch

Backbone: DGCNN (Dynamic Graph CNN) with k = 20 nearest neighbors
Edge Convolutions: 3 EdgeConv layers with channels [64, 64, 64] for graph feature learning
Global Pooling: permutation-invariant max pooling to a 1024-D global feature
Feature Projection: 1024-D → 512-D via Linear + ReLU + Dropout
Output: 512-dimensional geometric point features
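
A compact PyTorch sketch of this branch: dynamic kNN edge features, three EdgeConv layers, and a global max pool. Placing a 192 → 1024 embedding convolution before the pool follows the standard DGCNN design and is an assumption here, since the diagram only states the pooled size.

```python
import torch
import torch.nn as nn

def knn_edge_features(x, k=20):
    """EdgeConv input (x_j - x_i, x_i) over a dynamic kNN graph.
    x: (B, C, N) point features -> (B, 2C, N, k) edge features."""
    pts = x.transpose(1, 2)                                   # (B, N, C)
    idx = torch.cdist(pts, pts).topk(k, dim=-1, largest=False).indices
    B, N, C = pts.shape
    neighbors = torch.gather(
        pts.unsqueeze(1).expand(B, N, N, C), 2,
        idx.unsqueeze(-1).expand(B, N, k, C))                 # (B, N, k, C)
    central = pts.unsqueeze(2).expand(B, N, k, C)
    feats = torch.cat([neighbors - central, central], dim=-1)
    return feats.permute(0, 3, 1, 2)                          # (B, 2C, N, k)

class DGCNNBackbone(nn.Module):
    def __init__(self, k=20):
        super().__init__()
        self.k = k
        self.convs = nn.ModuleList()
        in_c = 3
        for out_c in (64, 64, 64):           # the three EdgeConv layers
            self.convs.append(nn.Sequential(
                nn.Conv2d(2 * in_c, out_c, 1, bias=False),
                nn.BatchNorm2d(out_c), nn.LeakyReLU(0.2)))
            in_c = out_c
        # 3 x 64 concatenated -> 1024-D embedding (assumed placement).
        self.embed = nn.Sequential(
            nn.Conv1d(192, 1024, 1, bias=False),
            nn.BatchNorm1d(1024), nn.LeakyReLU(0.2))

    def forward(self, x):                    # x: (B, 3, N)
        outs = []
        for conv in self.convs:
            e = conv(knn_edge_features(x, self.k))  # (B, C_out, N, k)
            x = e.max(dim=-1).values                # aggregate over neighbors
            outs.append(x)
        x = self.embed(torch.cat(outs, dim=1))      # (B, 1024, N)
        return x.max(dim=-1).values                 # permutation-invariant (B, 1024)
```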

Transformer Fusion

The 512-D RGB and point cloud features are fused by a Transformer encoder with 4 layers and 8 attention heads, producing 512-D fused features.
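
A sketch of one plausible fusion module, assuming each modality contributes one 512-D token and the encoder output is mean-pooled back to 512-D; both choices are assumptions, not details from the source.

```python
import torch
import torch.nn as nn

class TransformerFusion(nn.Module):
    def __init__(self, dim=512, heads=8, layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, rgb_feat, pc_feat):                 # each (B, 512)
        tokens = torch.stack([rgb_feat, pc_feat], dim=1)  # (B, 2, 512)
        return self.encoder(tokens).mean(dim=1)           # (B, 512) fused
```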

Outputs

3D Bounding Box: (x, y, z, w, h, l, q_w, q_x, q_y, q_z), i.e. center, size, and orientation quaternion
Confidence Score: objectness probability from a sigmoid output
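
A sketch of an output head consistent with this description (up to 21 objects, 10 box parameters plus a sigmoid confidence each); the actual head structure is not specified by the source.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, dim=512, max_objects=21):
        super().__init__()
        self.max_objects = max_objects
        self.box_head = nn.Linear(dim, max_objects * 10)   # 10 box params each
        self.conf_head = nn.Linear(dim, max_objects)       # objectness logits

    def forward(self, fused):                              # fused: (B, 512)
        B = fused.shape[0]
        boxes = self.box_head(fused).view(B, self.max_objects, 10)
        conf = torch.sigmoid(self.conf_head(fused))        # (B, 21) in [0, 1]
        return boxes, conf
```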

Note: DGCNN uses dynamic graph updates with k=20 neighbors to capture local geometric structures in point clouds.
Model Specifications
Input Modalities: RGB Image, Point Cloud (N×3, max 8192), 3D Bounding Boxes (N×8×3)
RGB Preprocessing: Resize to 480×608, Augment (Train), Normalize
Point Cloud Preprocessing: Remove NaN, keep z > 0, Sample/Pad to 8192 Points
Bounding Box Preprocessing: Corners to (center, size, quat), Pad to 21 Objects
RGB Backbone: EfficientNet-B3, 1536D → 512D via Linear Projection
Point Cloud Backbone: DGCNN, 3 EdgeConv Layers [64, 64, 64], k=20
Fusion Method: Transformer Encoder (4 layers, 8 heads, 512D output)
Output: Up to 21 objects, each with (x, y, z, w, h, l, q_w, q_x, q_y, q_z) + Confidence
Training Dataset: Custom Dataset (data/dl_challenge)
Loss Function: L1 Loss (Bounding Box) + Binary Cross-Entropy (Confidence)
Hyperparameters: Batch Size=4, Epochs=250, LR=1e-4, Dropout=0.2
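
A sketch of the stated loss (L1 on box parameters, binary cross-entropy on confidence), assuming padded slots are masked out of the box term and act as negatives for the confidence term:

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_boxes, pred_conf, gt_boxes, gt_mask):
    """pred_boxes/gt_boxes: (B, 21, 10); pred_conf/gt_mask: (B, 21) floats."""
    # L1 box loss, averaged over real (non-padded) objects only.
    per_box = F.l1_loss(pred_boxes, gt_boxes, reduction='none').sum(-1)
    box_loss = (per_box * gt_mask).sum() / gt_mask.sum().clamp(min=1)
    # BCE objectness loss; padded slots serve as negative examples.
    conf_loss = F.binary_cross_entropy(pred_conf, gt_mask)
    return box_loss + conf_loss
```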