Dinomaly

Dinomaly: Vision Transformer-based Anomaly Detection with Feature Reconstruction.

This module implements the Dinomaly model for anomaly detection using a Vision Transformer encoder-decoder architecture. The model leverages pre-trained DINOv2 features and employs a reconstruction-based approach to detect anomalies by comparing encoder and decoder features.

Dinomaly extracts features from multiple intermediate layers of a DINOv2 Vision Transformer, compresses them through a bottleneck MLP, and reconstructs them using a Vision Transformer decoder. Anomaly detection is performed by computing cosine similarity between encoder and decoder features at multiple scales.

The model is particularly effective for visual anomaly detection tasks where the goal is to identify regions or images that deviate from normal patterns learned during training.

Example

>>> from anomalib.data import MVTecAD
>>> from anomalib.models import Dinomaly
>>> from anomalib.engine import Engine

>>> datamodule = MVTecAD()
>>> model = Dinomaly()
>>> engine = Engine()

>>> engine.fit(model, datamodule=datamodule)
>>> predictions = engine.predict(model, datamodule=datamodule)

Notes

The model uses DINOv2 Vision Transformer as the backbone encoder
Features are extracted from intermediate layers (typically layers 2-9 for base models)
A bottleneck MLP compresses multi-layer features before reconstruction
Anomaly maps are computed using cosine similarity between encoder-decoder features
The model supports both unsupervised anomaly detection and localization

See also

anomalib.models.image.dinomaly.torch_model.DinomalyModel: PyTorch implementation of the Dinomaly model.

class anomalib.models.image.dinomaly.lightning_model.Dinomaly(encoder_name='vit_base_patch14_reg4_dinov2', bottleneck_dropout=0.2, decoder_depth=8, target_layers=None, fuse_layer_encoder=None, fuse_layer_decoder=None, remove_class_token=False, use_context_recentering=False, precision=PrecisionType.FLOAT32, pre_processor=True, post_processor=True, evaluator=True, visualizer=True)

Bases: AnomalibModule

Dinomaly Lightning Module for Vision Transformer-based Anomaly Detection.

This lightning module trains the Dinomaly anomaly detection model (DinomalyModel). During training, the decoder learns to reconstruct normal features. During inference, the trained decoder is expected to successfully reconstruct normal regions of feature maps, but fail to reconstruct anomalous regions as it has not seen such patterns.

Parameters:

encoder_name (str) – Name of the Vision Transformer encoder to use. Supports DINOv2 variants (small, base, large) with different patch sizes. Defaults to “vit_base_patch14_reg4_dinov2”.
bottleneck_dropout (float) – Dropout rate for the bottleneck MLP layer. Helps prevent overfitting during feature compression. Defaults to 0.2.
decoder_depth (int) – Number of Vision Transformer decoder layers. More layers allow for more complex reconstruction. Defaults to 8.
target_layers (list[int] | None) – List of encoder layer indices to extract features from. If None, uses [2, 3, 4, 5, 6, 7, 8, 9] for base models and [4, 6, 8, 10, 12, 14, 16, 18] for large models.
fuse_layer_encoder (list[list[int]] | None) – Groupings of encoder layers for feature fusion. If None, uses [[0, 1, 2, 3], [4, 5, 6, 7]].
fuse_layer_decoder (list[list[int]] | None) – Groupings of decoder layers for feature fusion. If None, uses [[0, 1, 2, 3], [4, 5, 6, 7]].
remove_class_token (bool) – Whether to remove class token from features before processing. Defaults to False.
use_context_recentering (bool) – Whether to apply Context-Aware Recentering from Dinomaly2. When enabled, the class token is subtracted from patch features before reconstruction. Most beneficial in multi-class settings. Incompatible with remove_class_token=True. Defaults to False.
precision (str | PrecisionType) – Numerical precision for model parameters. Supports “float16” and “float32”. Defaults to “float32”.
pre_processor (PreProcessor | bool) – Pre-processor instance or flag to use default. Defaults to True.
post_processor (PostProcessor | bool) – Post-processor instance or flag to use default. Defaults to True.
evaluator (Evaluator | bool) – Evaluator instance or flag to use default. Defaults to True.
visualizer (Visualizer | bool) – Visualizer instance or flag to use default. Defaults to True.
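The relationship between target_layers and the fuse groupings can be illustrated with a small framework-free sketch. It assumes that each index in a fuse group is a position in target_layers, and that a group is fused by an element-wise mean; both are assumptions made for illustration, and resolve_groups and fuse_mean are hypothetical helpers, not part of anomalib.

```python
# Defaults quoted in the parameter list above (base-model case).
target_layers = [2, 3, 4, 5, 6, 7, 8, 9]
fuse_layer_encoder = [[0, 1, 2, 3], [4, 5, 6, 7]]


def resolve_groups(target_layers, fuse_groups):
    """Map each fuse group to the encoder layer indices it would cover.

    Assumption: entries in a fuse group are positions in `target_layers`.
    """
    return [[target_layers[i] for i in group] for group in fuse_groups]


def fuse_mean(features):
    """Fuse equally sized feature vectors by element-wise mean (assumed rule)."""
    count = len(features)
    return [sum(values) / count for values in zip(*features)]


groups = resolve_groups(target_layers, fuse_layer_encoder)
print(groups)  # [[2, 3, 4, 5], [6, 7, 8, 9]]

# Toy 2-dimensional "features", one per layer in the first group.
toy = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
print(fuse_mean(toy))  # [4.0, 3.0]
```

Under these assumptions, the default groupings yield two fused feature maps, one from the shallower four target layers and one from the deeper four.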

Example

>>> from anomalib.data import MVTecAD
>>> from anomalib.engine import Engine
>>> from anomalib.models import Dinomaly
>>>
>>> # Basic usage with default parameters
>>> model = Dinomaly()
>>>
>>> # Custom configuration
>>> model = Dinomaly(
...     encoder_name="vit_large_patch14_reg4_dinov2",
...     decoder_depth=12,
...     bottleneck_dropout=0.1,
... )
>>>
>>> # Training with datamodule
>>> datamodule = MVTecAD()
>>> engine = Engine()
>>> engine.fit(model, datamodule=datamodule)

Note

The model requires significant GPU memory due to the Vision Transformer architecture. Consider using gradient checkpointing or smaller model variants for memory-constrained environments.

configure_optimizers()

Configure optimizer and learning rate scheduler for Dinomaly training.

Sets up the training configuration with frozen DINOv2 encoder and trainable bottleneck and decoder components. Uses StableAdamW optimizer with warm cosine learning rate scheduling.

The total number of training steps is determined dynamically from the trainer configuration, supporting both max_steps and max_epochs settings.

Returns: Tuple containing optimizer and scheduler configurations.
Return type: Union[Optimizer, Sequence[Optimizer], tuple[Sequence[Optimizer], Sequence[Union[LRScheduler, ReduceLROnPlateau, LRSchedulerConfig]]], OptimizerConfig, OptimizerLRSchedulerConfig, Sequence[OptimizerConfig], Sequence[OptimizerLRSchedulerConfig], None]
Raises: ValueError – If neither max_epochs nor max_steps is defined.
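How a total step count might be derived from the trainer settings can be sketched as follows. This is a minimal illustration, not the module's code: resolve_total_steps and steps_per_epoch are hypothetical names, the precedence of max_steps over max_epochs is an assumption, and None or a non-positive value is assumed to mean "not set".

```python
def resolve_total_steps(max_steps, max_epochs, steps_per_epoch):
    """Pick a total step count from trainer-style settings.

    Assumptions: None or a non-positive value means "not set", and
    max_steps takes precedence when both settings are given.
    """
    if max_steps is not None and max_steps > 0:
        return max_steps
    if max_epochs is not None and max_epochs > 0:
        return max_epochs * steps_per_epoch
    raise ValueError("Either max_epochs or max_steps must be defined.")


print(resolve_total_steps(5000, None, steps_per_epoch=100))  # 5000
print(resolve_total_steps(-1, 3, steps_per_epoch=100))       # 300
```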

Note

DINOv2 encoder parameters are frozen to preserve pre-trained features
Only bottleneck MLP and decoder parameters are trained
Uses truncated normal initialization for Linear layers
Learning rate schedule: warmup (100 steps) + cosine decay
Base learning rate: 2e-3, final learning rate: 2e-4
Total steps determined from trainer’s max_steps or max_epochs
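The schedule described in the note can be approximated with a short sketch. The numeric constants come from the note; the linear shape of the warmup and its start at zero are assumptions, and warm_cosine_lr is a hypothetical helper rather than the scheduler class the module uses.

```python
import math

BASE_LR = 2e-3      # from the note above
FINAL_LR = 2e-4     # from the note above
WARMUP_STEPS = 100  # from the note above


def warm_cosine_lr(step, total_steps):
    """Learning rate at `step` under linear warmup followed by cosine decay.

    Assumption: warmup rises linearly from 0 to BASE_LR.
    """
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (BASE_LR - FINAL_LR) * cosine


for step in (0, 50, 100, 2550, 5000):
    print(step, round(warm_cosine_lr(step, total_steps=5000), 6))
```

With these assumptions, the rate peaks at the base value once warmup ends and reaches the final value at the last step.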

classmethod configure_pre_processor(image_size=None, crop_size=None)

Configure the default pre-processor for Dinomaly.

Sets up image preprocessing pipeline including resizing, center cropping, and normalization with ImageNet statistics. The preprocessing is optimized for DINOv2 Vision Transformer models.

Parameters:

image_size (tuple[int, int] | None) – Target size for image resizing as (height, width). Defaults to (448, 448).
crop_size (int | None) – Target size for center cropping (assumes square crop). Should be smaller than image_size. Defaults to 392.

Returns:

Configured pre-processor with transforms for Dinomaly.

Return type:

PreProcessor

Raises:

ValueError – If crop_size is larger than the minimum dimension of image_size.

Note

The default ImageNet normalization statistics are used:

Mean: [0.485, 0.456, 0.406]
Std: [0.229, 0.224, 0.225]
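The size constraint and the normalization can be checked by hand with a small sketch. check_sizes and normalize_pixel are hypothetical helpers; the patch size of 14 is read from the default encoder name, and the divisibility check is an added assumption that the documentation does not state.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def check_sizes(image_size=(448, 448), crop_size=392, patch_size=14):
    """Validate the crop and return the resulting patch grid (rows, cols)."""
    if crop_size > min(image_size):
        raise ValueError("crop_size must not exceed the smaller image dimension")
    # Assumption: the crop should split evenly into ViT patches.
    if crop_size % patch_size != 0:
        raise ValueError("crop_size should be a multiple of the patch size")
    side = crop_size // patch_size
    return side, side


def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]


print(check_sizes())  # (28, 28)
print(normalize_pixel([0.485, 0.456, 0.406]))  # [0.0, 0.0, 0.0]
```

The default 392-pixel crop divides into a 28 x 28 grid of 14-pixel patches, which is one plausible reason for that default.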

property learning_type: LearningType

Return the learning type of the model.

Dinomaly is an unsupervised anomaly detection model that learns normal data patterns without requiring anomaly labels during training.

Returns: Always returns LearningType.ONE_CLASS for unsupervised learning.
Return type: LearningType

Note

This property may be subject to change if supervised training support is introduced in future versions.

on_load_checkpoint(checkpoint)

Make checkpoints trained before the timm-encoder migration loadable.

The frozen DINOv2 encoder was migrated from a custom Vision Transformer to a frozen TimmFeatureExtractor. The legacy encoder weights are dropped and replaced by the current timm encoder weights so the strict state-dict load still succeeds. See restore_frozen_encoder_weights().

Parameters: checkpoint (dict[str, Any]) – The checkpoint dictionary being loaded, modified in place.
Return type: None
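The in-place replacement described above can be pictured with plain dictionaries. This is only a sketch: replace_encoder_weights is a hypothetical helper, the "model.encoder." key prefix is invented for illustration, and the actual logic lives in restore_frozen_encoder_weights().

```python
def replace_encoder_weights(checkpoint, current_state, prefix="model.encoder."):
    """Swap legacy encoder entries for current ones, in place.

    `prefix` is a hypothetical key prefix chosen for this illustration.
    """
    state = checkpoint["state_dict"]
    for key in [k for k in state if k.startswith(prefix)]:
        del state[key]  # drop legacy encoder weights
    for key, value in current_state.items():
        if key.startswith(prefix):
            state[key] = value  # insert the current frozen-encoder weights


# Toy checkpoint: one legacy encoder key and one decoder key.
checkpoint = {"state_dict": {"model.encoder.old_block.weight": 1, "model.decoder.weight": 2}}
current = {"model.encoder.new_block.weight": 3, "model.decoder.weight": 99}

replace_encoder_weights(checkpoint, current)
print(sorted(checkpoint["state_dict"]))
# ['model.decoder.weight', 'model.encoder.new_block.weight']
```

Trained decoder weights are left untouched in this sketch; only keys under the encoder prefix are swapped, so a strict load would find exactly the keys the current model expects.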

property trainer_arguments: dict[str, Any]

Return Dinomaly-specific trainer arguments.

Provides configuration arguments optimized for Dinomaly training, excluding max_steps to allow users to set their own training duration.

Returns:

Dictionary of trainer arguments with strategy configuration for optimal training performance. Does not include max_steps so it can be set by the engine or user.

Return type:

dict[str, Any]

Note

The max_steps is intentionally excluded to allow user override.

training_step(batch, *args, **kwargs)

Training step for the Dinomaly model.

Performs a single training iteration by computing feature reconstruction loss between encoder and decoder features. Uses a progressive cosine similarity loss with hard mining to focus training on difficult examples.

Parameters:

batch (Batch) – Input batch containing images and metadata.
*args – Additional positional arguments (unused).
**kwargs – Additional keyword arguments (unused).

Returns:

Dictionary containing the computed loss value.

Return type:

Union[Tensor, Mapping[str, Any], None]

Raises:

ValueError – If model output doesn’t contain required features during training.

Note

The loss function uses progressive weight scheduling: the hard-mining percentage increases from 0 to 0.9 over 1000 steps, so training focuses on increasingly difficult examples as it progresses.
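The ramp described in the note can be written down as a one-line schedule. The endpoints (0 and 0.9) and the 1000-step horizon come from the note; the linear shape of the ramp is an assumption, and mining_fraction is a hypothetical helper.

```python
P_FINAL = 0.9          # from the note above
SCHEDULE_STEPS = 1000  # from the note above


def mining_fraction(global_step):
    """Hard-mining fraction at a given step (linear ramp assumed)."""
    return min(P_FINAL * global_step / SCHEDULE_STEPS, P_FINAL)


for step in (0, 250, 500, 1000, 5000):
    print(step, round(mining_fraction(step), 3))
```

After step 1000 the fraction stays at its final value for the rest of training.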

validation_step(batch, *args, **kwargs)

Validation step for the Dinomaly model.

Performs inference on the validation batch to compute anomaly scores and anomaly maps. The model operates in evaluation mode to generate predictions for anomaly detection evaluation.

Parameters:

batch (Batch) – Input batch containing images and metadata.
*args – Additional positional arguments (unused).
**kwargs – Additional keyword arguments (unused).

Returns:

Updated batch with pred_score (anomaly scores) and anomaly_map (pixel-level anomaly maps) predictions.

Return type:

Union[Tensor, Mapping[str, Any], None]

Raises:

Exception – If an error occurs during validation inference.

Note

During validation, the model returns InferenceBatch with anomaly scores and maps computed from encoder-decoder feature comparisons.

PyTorch model for the Dinomaly model implementation.

Based on the PyTorch implementation of “Dinomaly” by guojiajeremy.
Reference: guojiajeremy/Dinomaly
License: MIT

See also

anomalib.models.image.dinomaly.lightning_model.Dinomaly: Dinomaly Lightning model.

class anomalib.models.image.dinomaly.torch_model.DinomalyModel(encoder_name='vit_base_patch14_reg4_dinov2', bottleneck_dropout=0.2, decoder_depth=8, target_layers=None, fuse_layer_encoder=None, fuse_layer_decoder=None, remove_class_token=False, use_context_recentering=False)

Bases: Module

DinomalyModel: Vision Transformer-based anomaly detection model from Dinomaly.

This is a Vision Transformer-based anomaly detection model that uses an encoder-bottleneck-decoder architecture for feature reconstruction.

The architecture comprises three main components:

Encoder: A pre-trained Vision Transformer (ViT), by default a ViT-Base/14 based dinov2-reg model, which extracts universal and discriminative features from input images.
Bottleneck: A simple MLP that collects feature representations from the encoder’s middle-level layers.
Decoder: Composed of Transformer layers (8 by default), it learns to reconstruct the middle-level features.

Parameters:

encoder_name (str) – Name of the Vision Transformer encoder to use. Supports DINOv2 variants. Defaults to “vit_base_patch14_reg4_dinov2”.
bottleneck_dropout (float) – Dropout rate for the bottleneck MLP layer. Defaults to 0.2.
decoder_depth (int) – Number of Vision Transformer decoder layers. Defaults to 8.
target_layers (list[int] | None) – List of encoder layer indices to extract features from. If None, uses [2, 3, 4, 5, 6, 7, 8, 9] for base models. For large models, uses [4, 6, 8, 10, 12, 14, 16, 18].
fuse_layer_encoder (list[list[int]] | None) – Layer groupings for encoder feature fusion. If None, uses [[0, 1, 2, 3], [4, 5, 6, 7]].
fuse_layer_decoder (list[list[int]] | None) – Layer groupings for decoder feature fusion. If None, uses [[0, 1, 2, 3], [4, 5, 6, 7]].
remove_class_token (bool) – Whether to remove class token from features before processing. Defaults to False.
use_context_recentering (bool) – Whether to apply Context-Aware Recentering from Dinomaly2. When enabled, the class token is subtracted from patch features before reconstruction, conditioning the feature space on class-specific context. This is particularly beneficial for multi-class anomaly detection settings. Incompatible with remove_class_token=True. Defaults to False.
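The recentering step and its incompatibility with remove_class_token can be sketched without any framework. recenter and check_flags are hypothetical helpers written for this illustration; only the subtraction of the class token from the patch features and the flag conflict are taken from the parameter descriptions above.

```python
def recenter(class_token, patch_tokens):
    """Subtract the class token from every patch token."""
    return [[p - c for p, c in zip(patch, class_token)] for patch in patch_tokens]


def check_flags(remove_class_token, use_context_recentering):
    """Reject the flag combination the parameter list calls incompatible."""
    if remove_class_token and use_context_recentering:
        raise ValueError("use_context_recentering needs the class token to be kept")


check_flags(remove_class_token=False, use_context_recentering=True)  # accepted

cls = [1.0, 2.0]
patches = [[1.0, 2.0], [3.0, 5.0]]
print(recenter(cls, patches))  # [[0.0, 0.0], [2.0, 3.0]]
```

The conflict follows directly from the definition: recentering needs the class token as its reference point, so the token cannot be discarded first.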

Example

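>>> import torch
>>> from anomalib.models.image.dinomaly.torch_model import DinomalyModel
>>>
>>> model = DinomalyModel(
...     encoder_name="vit_base_patch14_reg4_dinov2",
...     decoder_depth=8,
...     bottleneck_dropout=0.2
... )
>>> batch = torch.rand(1, 3, 392, 392)  # one RGB image at the default crop size
>>> features = model(batch)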

static calculate_anomaly_maps(source_feature_maps, target_feature_maps, out_size=392)

Calculate anomaly maps by comparing encoder and decoder features.

Computes pixel-level anomaly maps by calculating cosine similarity between corresponding encoder (source) and decoder (target) feature maps. Lower cosine similarity indicates a higher anomaly likelihood.

Parameters:

source_feature_maps (list[Tensor]) – List of encoder feature maps from different layer groups.
target_feature_maps (list[Tensor]) – List of decoder feature maps from different layer groups.
out_size (int | tuple[int, int]) – Output size for anomaly maps. Defaults to 392.

Returns:

Tuple containing:

anomaly_map: Combined anomaly map averaged across all feature scales
anomaly_map_list: List of individual anomaly maps for each feature scale

Return type:

tuple[Tensor, list[Tensor]]
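The comparison can be illustrated on toy data with plain Python lists. This sketch assumes the per-position anomaly value is 1 minus the cosine similarity, and it omits the resizing to out_size, so all maps here already share one size. cosine_similarity, anomaly_map_for_scale, and combine are hypothetical helpers, not the method's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def anomaly_map_for_scale(source, target):
    """1 - cosine similarity at each position of one scale (assumed formula).

    `source` and `target` are grids (rows x cols) of feature vectors.
    """
    return [
        [1.0 - cosine_similarity(s, t) for s, t in zip(src_row, tgt_row)]
        for src_row, tgt_row in zip(source, target)
    ]


def combine(maps):
    """Average per-scale maps element-wise."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / n for c in range(cols)] for r in range(rows)]


# Two scales, each a 1 x 2 grid of 2-dimensional features.
encoder = [[[[3.0, 4.0], [1.0, 0.0]]], [[[1.0, 0.0], [1.0, 0.0]]]]
decoder = [[[[3.0, 4.0], [0.0, 1.0]]], [[[1.0, 0.0], [-1.0, 0.0]]]]

per_scale = [anomaly_map_for_scale(s, t) for s, t in zip(encoder, decoder)]
print(per_scale)           # [[[0.0, 1.0]], [[0.0, 2.0]]]
print(combine(per_scale))  # [[0.0, 1.5]]
```

The first position is reconstructed perfectly at both scales and scores 0; the second is reconstructed poorly at both and receives the high averaged score.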

forward(batch, global_step=None)

Forward pass of the Dinomaly model.

During training, the model extracts features from the encoder and decoder and returns them for loss computation. During inference, it computes anomaly maps by comparing encoder and decoder features using cosine similarity, applies Gaussian smoothing, and returns anomaly scores and maps.

Parameters:

batch (Tensor) – Input batch of images with shape (B, C, H, W).
global_step (int | None) – Current training step, used for loss computation.

Returns:

During training: Dictionary containing encoder and decoder features for loss computation.
During inference: InferenceBatch with pred_score (anomaly scores) and anomaly_map (pixel-level anomaly maps).

Return type:

Tensor | InferenceBatch

get_encoder_decoder_outputs(x)

Extract and process features through encoder and decoder.

This method processes input images through the DINOv2 encoder to extract features from target layers, fuses them through a bottleneck MLP, and reconstructs them using the decoder. Features are reshaped for spatial anomaly map computation.

Parameters:

x (Tensor) – Input images with shape (B, C, H, W).

Returns:

Tuple containing:

en: List of fused encoder features reshaped to spatial dimensions
de: List of fused decoder features reshaped to spatial dimensions

Return type:

tuple[list[Tensor], list[Tensor]]
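The reshaping to spatial dimensions can be pictured for a single image as follows. The sketch assumes that only patch tokens remain (class and any other non-patch tokens already removed) and that tokens are stored in row-major order; tokens_to_grid is a hypothetical helper.

```python
import math


def tokens_to_grid(tokens):
    """Reshape N patch tokens (row-major) into a sqrt(N) x sqrt(N) grid."""
    side = math.isqrt(len(tokens))
    if side * side != len(tokens):
        raise ValueError("token count is not a perfect square")
    return [tokens[r * side:(r + 1) * side] for r in range(side)]


# A 392 x 392 input with 14 x 14 patches yields 28 * 28 = 784 patch tokens.
num_patches = (392 // 14) ** 2
tokens = [[float(i)] for i in range(num_patches)]  # toy 1-dimensional features

grid = tokens_to_grid(tokens)
print(len(grid), len(grid[0]))  # 28 28
print(grid[1][0])               # [28.0]
```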
