INP-Former


Exploring Intrinsic Normal Prototypes within a Single Image for Universal Anomaly Detection.

INP-Former is trained on normal images using a feature-reconstruction framework based on DINOv2.

A frozen pre-trained encoder produces patch tokens, and an INP Extractor uses M learnable query tokens with cross-attention over those patch tokens to aggregate them into M Intrinsic Normal Prototypes (INPs) per image; an INP Coherence Loss pulls each patch feature toward its nearest INP so the INPs reliably capture that image’s normal patterns.
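The aggregation step can be pictured as ordinary cross-attention in which the M queries attend over the N patch tokens. The sketch below is a minimal pure-Python illustration, not the anomalib implementation. It assumes a single head, no learned projections, and softmax weights, and it uses plain lists in place of tensors. The name `extract_inps` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def extract_inps(queries, patch_tokens):
    """Aggregate N patch tokens into M prototypes (hypothetical helper).

    queries: M vectors of dimension D (learnable in the real model).
    patch_tokens: N vectors of dimension D from the frozen encoder.
    """
    dim = len(patch_tokens[0])
    prototypes = []
    for query in queries:
        # Scaled dot-product score between this query and every patch token.
        scores = [
            sum(q * t for q, t in zip(query, token)) / math.sqrt(dim)
            for token in patch_tokens
        ]
        weights = softmax(scores)
        # Each prototype is a convex combination of the patch tokens.
        prototypes.append(
            [sum(w * token[d] for w, token in zip(weights, patch_tokens)) for d in range(dim)]
        )
    return prototypes


patch_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # N = 4 tokens
queries = [[5.0, 0.0], [0.0, 5.0]]  # M = 2 queries
inps = extract_inps(queries, patch_tokens)
```

Because the prototypes are built only from the current image's patch tokens, they differ from image to image, which is the sense in which they are "intrinsic".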

A Bottleneck fuses multi-scale encoder features, and an INP-Guided Decoder reconstructs them using the INPs as keys/values in cross-attention (with the first residual connection removed and a ReLU on attention weights), so its output is constrained to lie in the span of normal prototypes and anomalous queries cannot be reconstructed. A Soft Mining Loss upweights hard-to-reconstruct tokens during training, and at test time the per-token discrepancy between encoder features and decoder outputs is used as the anomaly score and map.
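The effect of ReLU attention weights without a residual path can be sketched as follows. This is an illustration under simplifying assumptions (single head, no learned projections, no weight normalization), and `relu_cross_attention` is a hypothetical name rather than an anomalib function. The output is a non-negative combination of the INPs, so a query that aligns with no prototype receives zero weights and cannot be reproduced.

```python
import math


def relu_cross_attention(query, inps):
    """Reconstruct one token from the INPs only (hypothetical helper).

    The INPs serve as both keys and values, and there is no residual
    connection, so the query itself never leaks into the output.
    """
    dim = len(query)
    # ReLU instead of softmax: uncorrelated or negatively correlated INPs get weight 0.
    weights = [
        max(0.0, sum(q * k for q, k in zip(query, inp)) / math.sqrt(dim))
        for inp in inps
    ]
    output = [sum(w * inp[d] for w, inp in zip(weights, inps)) for d in range(dim)]
    return output, weights


inps = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two prototypes spanning the x-y plane
normal_out, normal_w = relu_cross_attention([0.8, 0.6, 0.0], inps)
anomalous_out, anomalous_w = relu_cross_attention([0.0, 0.0, 1.0], inps)
```

The second query is orthogonal to both prototypes, so its reconstruction collapses to the zero vector and the encoder-decoder discrepancy for that token is large.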

Example

>>> from anomalib.data import MVTecAD
>>> from anomalib.models import InpFormer
>>> from anomalib.engine import Engine

>>> datamodule = MVTecAD()
>>> model = InpFormer()
>>> engine = Engine()

>>> engine.fit(model, datamodule=datamodule)
>>> predictions = engine.predict(model, datamodule=datamodule)

Notes

The model uses DINOv2 Vision Transformer as the backbone encoder
Features are extracted from intermediate layers (typically layers 2-9 for base models)
A bottleneck MLP compresses multi-layer features before reconstruction
Anomaly maps are computed using cosine similarity between encoder-decoder features
The model supports both unsupervised anomaly detection and localization

See also

anomalib.models.image.inp_former.torch_model.InpFormerModel: PyTorch implementation of the InpFormer model.

class anomalib.models.image.inp_former.lightning_model.InpFormer(encoder_name='vit_base_patch14_reg4_dinov2', target_layers=None, fuse_layer_encoder=None, fuse_layer_decoder=None, remove_class_token=True, inp_num=6, precision=PrecisionType.FLOAT32, pre_processor=True, post_processor=True, evaluator=True, visualizer=True)

Bases: AnomalibModule

InpFormer Lightning Module for Vision Transformer-based Anomaly Detection.

This Lightning module trains the INP-Former anomaly detection model (InpFormerModel). During training, the decoder learns to reconstruct normal features from Intrinsic Normal Prototypes (INPs) extracted from each image by an INP Extractor. During inference, INPs extracted from the test image let the decoder reconstruct normal regions but not anomalous ones, and the per-token reconstruction error serves as the anomaly score.

Parameters:

encoder_name (str) – Name of the Vision Transformer encoder to use. Supports DINOv2 variants (small, base, large) with different patch sizes. Defaults to “vit_base_patch14_reg4_dinov2”.
target_layers (list[int] | None) – List of encoder layer indices to extract features from. If None, uses [2, 3, 4, 5, 6, 7, 8, 9] for base models and [4, 6, 8, 10, 12, 14, 16, 18] for large models.
fuse_layer_encoder (list[list[int]] | None) – Groupings of encoder layers for feature fusion. If None, uses [[0, 1, 2, 3], [4, 5, 6, 7]].
fuse_layer_decoder (list[list[int]] | None) – Groupings of decoder layers for feature fusion. If None, uses [[0, 1, 2, 3], [4, 5, 6, 7]].
remove_class_token (bool) – Whether to remove class token from features before processing. Defaults to True.
inp_num (int) – Number of Intrinsic Normal Prototypes (INPs) to extract per image. Defaults to 6.
precision (str | PrecisionType) – Precision type for model computations. Can be either a string ("float32", "float16") or a PrecisionType enum value. Defaults to PrecisionType.FLOAT32.
pre_processor (PreProcessor | bool) – Pre-processor instance or flag to use default. Defaults to True.
post_processor (PostProcessor | bool) – Post-processor instance or flag to use default. Defaults to True.
evaluator (Evaluator | bool) – Evaluator instance or flag to use default. Defaults to True.
visualizer (Visualizer | bool) – Visualizer instance or flag to use default. Defaults to True.
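The relationship between target_layers and the two fusion groupings can be sketched as below. The sketch assumes that each index in a fusion group is a position in the list of extracted features (so group [0, 1, 2, 3] refers to the first four target layers) and that fusion is an element-wise average. Both helper names are hypothetical.

```python
def resolve_fusion_groups(target_layers, fuse_groups):
    """Map each fusion group to the encoder layers it combines (hypothetical helper)."""
    return [[target_layers[i] for i in group] for group in fuse_groups]


def fuse(features):
    """Element-wise average of equally shaped feature vectors."""
    return [sum(values) / len(features) for values in zip(*features)]


base_layers = [2, 3, 4, 5, 6, 7, 8, 9]        # documented default for base models
default_groups = [[0, 1, 2, 3], [4, 5, 6, 7]]  # documented default grouping
layer_groups = resolve_fusion_groups(base_layers, default_groups)
fused = fuse([[1.0, 2.0], [3.0, 4.0]])
```

Under that assumption the defaults produce two feature scales for a base model, one averaged over layers 2 to 5 and one over layers 6 to 9.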

Example

>>> from anomalib.data import MVTecAD
>>> from anomalib.engine import Engine
>>> from anomalib.models import InpFormer
>>>
>>> # Basic usage with default parameters
>>> model = InpFormer()
>>>
>>> # Custom configuration
>>> model = InpFormer(
...     encoder_name="vit_large_patch14_reg4_dinov2",
...     inp_num=6
... )
>>>
>>> # Training with datamodule
>>> datamodule = MVTecAD()
>>> engine = Engine()
>>> engine.fit(model, datamodule=datamodule)

Note

The model requires significant GPU memory due to the Vision Transformer architecture. Consider using gradient checkpointing or smaller model variants for memory-constrained environments.

configure_optimizers()

Configure optimizer and learning rate scheduler for INP-Former training.

Sets up the training configuration with frozen DINOv2 encoder and trainable bottleneck and decoder components. Uses StableAdamW optimizer with warm cosine learning rate scheduling.

The total number of training steps is determined dynamically from the trainer configuration, supporting both max_steps and max_epochs settings.

Returns: Tuple containing optimizer and scheduler configurations.
Return type: Union[Optimizer, Sequence[Optimizer], tuple[Sequence[Optimizer], Sequence[Union[LRScheduler, ReduceLROnPlateau, LRSchedulerConfig]]], OptimizerConfig, OptimizerLRSchedulerConfig, Sequence[OptimizerConfig], Sequence[OptimizerLRSchedulerConfig], None]
Raises: ValueError – If neither max_epochs nor max_steps is defined.
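How the step count and a warm cosine schedule fit together can be sketched as follows. This is not the anomalib scheduler. The function names, the warm-up length, and the learning-rate values are illustrative assumptions, and the sketch treats a non-positive max_steps as "not set".

```python
import math


def resolve_total_steps(max_steps, max_epochs, steps_per_epoch):
    """Pick the training length from whichever trainer setting is defined."""
    if max_steps is not None and max_steps > 0:
        return max_steps
    if max_epochs is not None and max_epochs > 0:
        return max_epochs * steps_per_epoch
    msg = "Either max_epochs or max_steps must be defined."
    raise ValueError(msg)


def warm_cosine_lr(step, total_steps, warmup_steps, base_lr, final_lr):
    """Linear warm-up to base_lr, then cosine decay towards final_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = math.cos(math.pi * min(1.0, progress))
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + cosine)


total = resolve_total_steps(max_steps=-1, max_epochs=10, steps_per_epoch=50)
schedule = [
    warm_cosine_lr(s, total, warmup_steps=100, base_lr=2e-3, final_lr=2e-4)
    for s in range(total)
]
```

The learning rate rises linearly over the first 100 steps, peaks at base_lr, and then decays along a half cosine for the remaining steps.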

classmethod configure_pre_processor(image_size=None, crop_size=None)

Configure the default pre-processor for InpFormer.

Sets up image preprocessing pipeline including resizing, center cropping, and normalization with ImageNet statistics. The preprocessing is optimized for DINOv2 Vision Transformer models.

Parameters:

image_size (tuple[int, int] | None) – Target size for image resizing as (height, width). Defaults to (448, 448).
crop_size (int | None) – Target size for center cropping (assumes square crop). Should be smaller than image_size. Defaults to 392.

Returns:

Configured pre-processor with transforms for InpFormer.

Return type:

PreProcessor

Raises:

ValueError – If crop_size is larger than the minimum dimension of image_size.

Note

The default ImageNet normalization statistics are used:

Mean: [0.485, 0.456, 0.406]
Std: [0.229, 0.224, 0.225]
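The arithmetic behind the default sizes can be checked with a small sketch. The helper names are hypothetical, the patch size of 14 is read off the default encoder name (patch14), and the divisibility check is an extra assumption of this sketch rather than documented behavior of the pre-processor.

```python
def check_sizes(image_size=(448, 448), crop_size=392, patch_size=14):
    """Validate the resize/crop pair and return the resulting patch grid."""
    if crop_size > min(image_size):
        msg = f"crop_size {crop_size} exceeds the smallest side of image_size {image_size}"
        raise ValueError(msg)
    if crop_size % patch_size:
        # Assumption of this sketch: the crop should tile evenly into patches.
        msg = f"crop_size {crop_size} is not a multiple of patch_size {patch_size}"
        raise ValueError(msg)
    side = crop_size // patch_size
    return side, side * side


def normalize_pixel(rgb, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Apply ImageNet normalization to one RGB pixel with values in [0, 1]."""
    return [(value - m) / s for value, m, s in zip(rgb, mean, std)]


grid_side, num_patches = check_sizes()
centered = normalize_pixel((0.485, 0.456, 0.406))
```

With the defaults, a 392 x 392 crop yields a 28 x 28 grid, that is 784 patch tokens per image.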

property learning_type: LearningType

Return the learning type of the model.

INP-Former is an unsupervised anomaly detection model that learns normal data patterns without requiring anomaly labels during training.

Returns: Always returns LearningType.ONE_CLASS for unsupervised learning.
Return type: LearningType

Note

This property may be subject to change if supervised training support is introduced in future versions.

on_load_checkpoint(checkpoint)

Make checkpoints trained before the timm-encoder migration loadable.

Older InpFormer checkpoints built the frozen encoder from a custom Vision Transformer (anomalib.models.image.dinomaly.components.vision_transformer); it is now a frozen TimmFeatureExtractor. The legacy encoder weights are dropped and replaced by the current timm encoder weights so the strict state-dict load still succeeds. See restore_frozen_encoder_weights().

Parameters: checkpoint (dict[str, Any]) – The checkpoint dictionary being loaded, modified in place.
Return type: None
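The in-place replacement described above can be sketched with plain dictionaries. This illustrates the idea only and is not the code of restore_frozen_encoder_weights(). The helper name, the key prefix, and the parameter names are hypothetical. Only the "state_dict" entry is a standard part of a Lightning checkpoint.

```python
def replace_encoder_weights(checkpoint, current_encoder_state, prefix="model.encoder."):
    """Swap legacy encoder entries for the current ones, in place (hypothetical helper)."""
    state_dict = checkpoint["state_dict"]
    # Drop every weight that belonged to the legacy encoder.
    for key in [k for k in state_dict if k.startswith(prefix)]:
        del state_dict[key]
    # Insert the weights of the encoder the model uses now.
    for name, value in current_encoder_state.items():
        state_dict[prefix + name] = value


checkpoint = {
    "state_dict": {
        "model.encoder.blocks.0.attn.qkv.weight": "legacy",  # placeholder value
        "model.decoder.0.weight": "trained",                  # placeholder value
    },
}
replace_encoder_weights(checkpoint, {"feature_extractor.blocks.0.weight": "current"})
```

After the call the state dict holds exactly the keys the current model expects, so a strict load no longer fails on missing or unexpected encoder entries, while the trained decoder weights are untouched.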

property trainer_arguments: dict[str, Any]

Return INP-Former-specific trainer arguments.

Provides configuration arguments optimized for INP-Former training, omitting the training duration so that users can set it themselves.

Returns:

Dictionary of trainer arguments with strategy configuration for optimal training performance. Does not include max_epochs so it can be set by the engine or user.

Return type:

dict[str, Any]

Note

The max_epochs is intentionally excluded to allow user override.

training_step(batch, *args, **kwargs)

Training step for the InpFormer model.

Performs a single training iteration by computing feature reconstruction loss between encoder and decoder features. Uses cosine similarity loss with hard mining and adaptive weighting based on distance ratios.

Parameters:

batch (Batch) – Input batch containing images and metadata.
*args – Additional positional arguments (unused).
**kwargs – Additional keyword arguments (unused).

Returns:

Dictionary containing the computed loss value.

Return type:

Union[Tensor, Mapping[str, Any], None]

Raises:

ValueError – If model output doesn’t contain required features during training.
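One plausible form of a mining-weighted cosine loss is sketched below. The exact weighting used by anomalib may differ. The function names, the weight formula (distance divided by the mean distance, raised to a power), and gamma=3.0 are assumptions made for illustration. In a real training loop the weights would normally be treated as constants (detached from the gradient).

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two vectors, clamped at zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return max(0.0, 1.0 - dot / norm)


def soft_mining_loss(encoder_tokens, decoder_tokens, gamma=3.0):
    """Mean cosine distance with per-token weights (d / mean(d)) ** gamma."""
    distances = [cosine_distance(e, d) for e, d in zip(encoder_tokens, decoder_tokens)]
    mean_distance = sum(distances) / len(distances)
    if mean_distance == 0.0:
        # Perfect reconstruction everywhere: nothing to upweight.
        return 0.0, [1.0] * len(distances)
    weights = [(d / mean_distance) ** gamma for d in distances]
    loss = sum(w * d for w, d in zip(weights, distances)) / len(distances)
    return loss, weights


encoder_tokens = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
decoder_tokens = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]  # exact, close, and poor reconstructions
loss, weights = soft_mining_loss(encoder_tokens, decoder_tokens)
```

Tokens that are already well reconstructed receive weights below one and contribute little, while the poorly reconstructed token dominates the loss.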

validation_step(batch, *args, **kwargs)

Validation step for the InpFormer model.

Performs inference on the validation batch to compute anomaly scores and anomaly maps. The model operates in evaluation mode to generate predictions for anomaly detection evaluation.

Parameters:

batch (Batch) – Input batch containing images and metadata.
*args – Additional positional arguments (unused).
**kwargs – Additional keyword arguments (unused).

Returns:

Updated batch with pred_score (anomaly scores) and anomaly_map (pixel-level anomaly maps) predictions.

Return type:

Union[Tensor, Mapping[str, Any], None]

Raises:

Exception – If an error occurs during validation inference.

Note

During validation, the model returns InferenceBatch with anomaly scores and maps computed from encoder-decoder feature comparisons.

PyTorch model for the INP-Former model implementation.

Based on the PyTorch implementation of “INP-Former” by luow23.
Reference: luow23/INP-Former
License: MIT

See also

anomalib.models.image.inp_former.lightning_model.InpFormer: INP-Former Lightning model.

class anomalib.models.image.inp_former.torch_model.InpFormerModel(encoder_name, inp_num=6, target_layers=None, fuse_layer_encoder=None, fuse_layer_decoder=None, remove_class_token=False)

Bases: Module

PyTorch module implementing the INP-Former anomaly detection model.

The model consists of four components: a frozen pre-trained Vision Transformer encoder, an INP Extractor that aggregates encoder patch tokens into a small set of Intrinsic Normal Prototypes (INPs) via cross-attention with learnable queries, a bottleneck that fuses multi-scale encoder features, and an INP-Guided Decoder that reconstructs normal features using the INPs as keys and values. Anomaly scores are computed from the per-token cosine discrepancy between encoder and decoder features at multiple scales.

Parameters:

encoder_name (str) – Name of the pre-trained Vision Transformer backbone to use as the encoder (e.g., a DINOv2 variant).
inp_num (int) – Number of Intrinsic Normal Prototypes to extract per image. Defaults to 6.
target_layers (list[int] | None) – Indices of encoder layers from which to extract intermediate features for reconstruction. Defaults to None.
fuse_layer_encoder (list[list[int]] | None) – Groups of encoder layer indices to fuse together when forming the multi-scale encoder feature targets. Defaults to None.
fuse_layer_decoder (list[list[int]] | None) – Groups of decoder layer indices to fuse together when forming the multi-scale decoder outputs. Defaults to None.
remove_class_token (bool) – If True, the class token is dropped from the patch token sequence before INP extraction and reconstruction. Defaults to False.

static calculate_anomaly_maps(source_feature_maps, target_feature_maps, out_size=392)

Calculate anomaly maps by comparing encoder and decoder features.

Computes pixel-level anomaly maps by calculating cosine similarity between corresponding encoder (source) and decoder (target) feature maps. Lower cosine similarity indicates a higher anomaly likelihood.

Parameters:

source_feature_maps (list[Tensor]) – List of encoder feature maps from different layer groups.
target_feature_maps (list[Tensor]) – List of decoder feature maps from different layer groups.
out_size (int | tuple[int, int]) – Output size for anomaly maps. Defaults to 392.

Returns:

Tuple containing:

anomaly_map: Combined anomaly map averaged across all feature scales
anomaly_map_list: List of individual anomaly maps for each feature scale

Return type:

tuple[Tensor, list[Tensor]]
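The core of the computation can be sketched without tensors. The sketch works on flattened token grids and omits the resizing to out_size, so it only shows the per-position score (one minus cosine similarity) and the averaging across scales. The helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def anomaly_maps(source_scales, target_scales):
    """Return (combined_map, per_scale_maps) for flattened token grids.

    Each scale is a list of token vectors. Every scale must hold the same
    number of tokens so the maps can be averaged position by position.
    """
    per_scale = [
        [1.0 - cosine_similarity(s, t) for s, t in zip(source, target)]
        for source, target in zip(source_scales, target_scales)
    ]
    combined = [sum(values) / len(per_scale) for values in zip(*per_scale)]
    return combined, per_scale


encoder = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]  # 2 scales, 2 tokens each
decoder = [[[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 3.0]]]
combined, per_scale = anomaly_maps(encoder, decoder)
```

Cosine similarity ignores vector length, so the rescaled decoder tokens in the second scale still score zero. Only the direction mismatch at the second position of the first scale raises the combined map.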

forward(batch)

Forward pass of InpFormerModel.

During training, the model extracts features from the encoder and decoder and returns them for loss computation. During inference, it computes anomaly maps by comparing encoder and decoder features using cosine similarity, applies Gaussian smoothing, and returns anomaly scores and maps.

Parameters:

batch (Tensor) – Input batch of images with shape (B, C, H, W).

Returns:

During training: Encoder and decoder features, INP coherence loss.
During inference: InferenceBatch with pred_score (anomaly scores) and anomaly_map (pixel-level anomaly maps).

Return type:

Tensor | InferenceBatch

get_encoder_decoder_inploss(x)

Extract and process features through encoder and decoder.

This method processes input images through the DINOv2 encoder to extract features from the target layers, fuses them through a bottleneck MLP, and reconstructs them using the decoder. Features are reshaped for spatial anomaly map computation. The INP coherence loss is computed alongside and returned with the features.

Parameters:

x (Tensor) – Input images with shape (B, C, H, W).

Returns:

Tuple containing:

en: List of fused encoder features reshaped to spatial dimensions
de: List of fused decoder features reshaped to spatial dimensions
inp_loss: INP coherence loss to guide INP Extractor

Return type:

tuple[list[Tensor], list[Tensor], Tensor]

get_inp_loss(query, keys)

INP coherence loss helps to ensure that INPs represent normal features.

It minimizes the distances between individual normal features and the corresponding nearest INP.

Parameters:

query (Tensor) – Fused encoder features (element-wise average).
keys (Tensor) – Prototype tokens, i.e. the extracted INPs.

Returns:

INP coherence loss.

Return type:

Tensor
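The nearest-prototype idea can be sketched in a few lines. The sketch assumes cosine distance as the distance measure and uses plain lists instead of tensors. The function names are hypothetical and the anomalib implementation may differ in detail.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def inp_coherence_loss(features, inps):
    """Average distance from each feature to its nearest prototype."""
    nearest = [min(cosine_distance(f, p) for p in inps) for f in features]
    return sum(nearest) / len(nearest)


features = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # query tokens
inps = [[1.0, 0.0], [0.0, 1.0]]                  # prototype tokens (keys)
loss = inp_coherence_loss(features, inps)
```

The first two features point in the same direction as a prototype and contribute nothing. Only the diagonal feature, which lies between the two prototypes, adds to the loss.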
