AI VAD#

AI-VAD: Attribute-based Representations for Accurate and Interpretable Video Anomaly Detection.

This module implements the AI-VAD model as described in the paper “AI-VAD: Attribute-based Representations for Accurate and Interpretable Video Anomaly Detection.”

The model extracts regions of interest from video frames using object detection and foreground detection, then computes attribute-based representations including velocity, pose and deep features for anomaly detection.

Example

>>> from anomalib.models.video import AiVad
>>> from anomalib.data import Avenue
>>> from anomalib.data.utils import VideoTargetFrame
>>> from anomalib.engine import Engine
>>> # Initialize model and datamodule
>>> datamodule = Avenue(
...     clip_length_in_frames=2,
...     frames_between_clips=1,
...     target_frame=VideoTargetFrame.LAST
... )
>>> model = AiVad()
>>> # Train using the engine
>>> engine = Engine()
>>> engine.fit(model=model, datamodule=datamodule)

Reference:

Tal Reiss, Yedid Hoshen. “AI-VAD: Attribute-based Representations for Accurate and Interpretable Video Anomaly Detection.” arXiv preprint arXiv:2212.00789 (2022). https://arxiv.org/pdf/2212.00789.pdf

class anomalib.models.video.ai_vad.lightning_model.AiVad(box_score_thresh=0.7, persons_only=False, min_bbox_area=100, max_bbox_overlap=0.65, enable_foreground_detections=True, foreground_kernel_size=3, foreground_binary_threshold=18, n_velocity_bins=1, use_velocity_features=True, use_pose_features=True, use_deep_features=True, n_components_velocity=2, n_neighbors_pose=1, n_neighbors_deep=1, pre_processor=True, post_processor=True, **kwargs)#

Bases: MemoryBankMixin, AnomalibModule

AI-VAD: Attribute-based Representations for Video Anomaly Detection.

This model extracts regions of interest from video frames using object detection and foreground detection, then computes attribute-based representations including velocity, pose and deep features for anomaly detection.

Parameters:
  • box_score_thresh (float, optional) – Confidence threshold for bounding box predictions. Defaults to 0.7.

  • persons_only (bool, optional) – When enabled, only regions labeled as person are included. Defaults to False.

  • min_bbox_area (int, optional) – Minimum bounding box area. Regions with surface area lower than this value are excluded. Defaults to 100.

  • max_bbox_overlap (float, optional) – Maximum allowed overlap between bounding boxes. Defaults to 0.65.

  • enable_foreground_detections (bool, optional) – Add additional foreground detections based on pixel difference between consecutive frames. Defaults to True.

  • foreground_kernel_size (int, optional) – Gaussian kernel size used in foreground detection. Defaults to 3.

  • foreground_binary_threshold (int, optional) – Value between 0 and 255 which acts as binary threshold in foreground detection. Defaults to 18.

  • n_velocity_bins (int, optional) – Number of discrete bins used for velocity histogram features. Defaults to 1.

  • use_velocity_features (bool, optional) – Flag indicating if velocity features should be used. Defaults to True.

  • use_pose_features (bool, optional) – Flag indicating if pose features should be used. Defaults to True.

  • use_deep_features (bool, optional) – Flag indicating if deep features should be used. Defaults to True.

  • n_components_velocity (int, optional) – Number of components used by GMM density estimation for velocity features. Defaults to 2.

  • n_neighbors_pose (int, optional) – Number of neighbors used in KNN density estimation for pose features. Defaults to 1.

  • n_neighbors_deep (int, optional) – Number of neighbors used in KNN density estimation for deep features. Defaults to 1.

  • pre_processor (PreProcessor | bool, optional) – Pre-processor instance or bool flag to enable default pre-processor. Defaults to True.

  • post_processor (PostProcessor | bool, optional) – Post-processor instance or bool flag to enable default post-processor. Defaults to True.

  • **kwargs – Additional keyword arguments passed to the parent class.

Example

>>> from anomalib.models.video import AiVad
>>> from anomalib.data import Avenue
>>> from anomalib.data.utils import VideoTargetFrame
>>> from anomalib.engine import Engine
>>> # Initialize model and datamodule
>>> datamodule = Avenue(
...     clip_length_in_frames=2,
...     frames_between_clips=1,
...     target_frame=VideoTargetFrame.LAST
... )
>>> model = AiVad()
>>> # Train using the engine
>>> engine = Engine()
>>> engine.fit(model=model, datamodule=datamodule)
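
Any of the constructor arguments listed above can be overridden at construction time. A minimal sketch, with parameter values chosen purely for illustration:

>>> model = AiVad(
...     persons_only=True,
...     n_velocity_bins=8,
...     n_components_velocity=5,
... )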

Note

The model follows a one-class learning approach and does not require optimization during training. Instead, it builds density estimators based on extracted features from normal samples.
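
Once the density estimators are fitted, inference follows the usual anomalib workflow. A minimal sketch, assuming the standard Engine.predict API and the objects created in the example above:

>>> predictions = engine.predict(model=model, datamodule=datamodule)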

static configure_optimizers()#

AI-VAD training does not involve fine-tuning of neural network weights, so no optimizers are needed.

Return type:

None

static configure_post_processor()#

Configure the post-processor for AI-VAD.

Returns:

One-class post-processor instance.

Return type:

PostProcessor

classmethod configure_pre_processor(image_size=None)#

Configure the pre-processor for AI-VAD.

AI-VAD does not need a pre-processor or transforms, as the region- and feature-extractors apply their own transforms.

Parameters:

image_size (tuple[int, int] | None, optional) – Image size (unused). Defaults to None.

Returns:

Empty pre-processor instance.

Return type:

PreProcessor

fit()#

Fit the density estimators to the extracted features from the training set.

Raises:

ValueError – If no regions were extracted during training.

Return type:

None

property learning_type: LearningType#

Get the learning type of the model.

Returns:

Learning type of the model (ONE_CLASS).

Return type:

LearningType

property trainer_arguments: dict[str, Any]#

Get AI-VAD specific trainer arguments.

Returns:

Dictionary of trainer arguments.

Return type:

dict[str, Any]

training_step(batch)#

Training Step of AI-VAD.

Extract features from the batch of clips and update the density estimators.

Parameters:

batch (VideoBatch) – Batch containing video frames and metadata.

Return type:

None

validation_step(batch, *args, **kwargs)#

Perform the validation step of AI-VAD.

Extract boxes and box scores from the input batch.

Parameters:
  • batch (VideoBatch) – Input batch containing video frames and metadata.

  • *args – Additional arguments (unused).

  • **kwargs – Additional keyword arguments (unused).

Returns:

Batch dictionary with added predictions and anomaly maps.

Return type:

STEP_OUTPUT

PyTorch model for the AI-VAD implementation.

This module implements the AI-VAD model as described in the paper “AI-VAD: Attribute-based Representations for Accurate and Interpretable Video Anomaly Detection.”

Example

>>> from anomalib.models.video import AiVad
>>> from anomalib.data import Avenue
>>> from anomalib.data.utils import VideoTargetFrame
>>> from anomalib.engine import Engine
>>> # Initialize model and datamodule
>>> datamodule = Avenue(
...     clip_length_in_frames=2,
...     frames_between_clips=1,
...     target_frame=VideoTargetFrame.LAST
... )
>>> model = AiVad()
>>> # Train using the engine
>>> engine = Engine()
>>> engine.fit(model=model, datamodule=datamodule)

Reference:

Tal Reiss, Yedid Hoshen. “AI-VAD: Attribute-based Representations for Accurate and Interpretable Video Anomaly Detection.” arXiv preprint arXiv:2212.00789 (2022). https://arxiv.org/pdf/2212.00789.pdf

class anomalib.models.video.ai_vad.torch_model.AiVadModel(box_score_thresh=0.8, persons_only=False, min_bbox_area=100, max_bbox_overlap=0.65, enable_foreground_detections=True, foreground_kernel_size=3, foreground_binary_threshold=18, n_velocity_bins=8, use_velocity_features=True, use_pose_features=True, use_deep_features=True, n_components_velocity=5, n_neighbors_pose=1, n_neighbors_deep=1)#

Bases: Module

AI-VAD model.

The model consists of several stages:

  1. Flow extraction between consecutive frames

  2. Region extraction using object detection and foreground detection

  3. Feature extraction including velocity, pose and deep features

  4. Density estimation for anomaly detection

Parameters:
  • box_score_thresh (float, optional) – Confidence threshold for region extraction stage. Defaults to 0.8.

  • persons_only (bool, optional) – When enabled, only regions labeled as person are included. Defaults to False.

  • min_bbox_area (int, optional) – Minimum bounding box area. Regions with a surface area lower than this value are excluded. Defaults to 100.

  • max_bbox_overlap (float, optional) – Maximum allowed overlap between bounding boxes. Defaults to 0.65.

  • enable_foreground_detections (bool, optional) – Add additional foreground detections based on pixel difference between consecutive frames. Defaults to True.

  • foreground_kernel_size (int, optional) – Gaussian kernel size used in foreground detection. Defaults to 3.

  • foreground_binary_threshold (int, optional) – Value between 0 and 255 which acts as binary threshold in foreground detection. Defaults to 18.

  • n_velocity_bins (int, optional) – Number of discrete bins used for velocity histogram features. Defaults to 8.

  • use_velocity_features (bool, optional) – Flag indicating if velocity features should be used. Defaults to True.

  • use_pose_features (bool, optional) – Flag indicating if pose features should be used. Defaults to True.

  • use_deep_features (bool, optional) – Flag indicating if deep features should be used. Defaults to True.

  • n_components_velocity (int, optional) – Number of components used by GMM density estimation for velocity features. Defaults to 5.

  • n_neighbors_pose (int, optional) – Number of neighbors used in KNN density estimation for pose features. Defaults to 1.

  • n_neighbors_deep (int, optional) – Number of neighbors used in KNN density estimation for deep features. Defaults to 1.

Raises:

ValueError – If none of the feature types (velocity, pose, deep) are enabled.

Example

>>> from anomalib.models.video.ai_vad.torch_model import AiVadModel
>>> model = AiVadModel()
>>> batch = torch.randn(32, 2, 3, 256, 256)  # (N, L, C, H, W)
>>> output = model(batch)
>>> output.pred_score.shape
torch.Size([32])
>>> output.anomaly_map.shape
torch.Size([32, 256, 256])
forward(batch)#

Forward pass through AI-VAD model.

The forward pass consists of the following steps:

  1. Extract the first and last frame from the input clip

  2. Extract optical flow between the frames and detect regions of interest

  3. Extract features (velocity, pose and deep) for each region

  4. Estimate density and compute anomaly scores

Parameters:

batch (torch.Tensor) – Input tensor of shape (N, L, C, H, W), where N is the batch size, L the sequence length, C the number of channels, and H and W the frame height and width.

Returns:

Batch containing:
  • pred_score: Per-image anomaly scores of shape (N,)

  • anomaly_map: Per-pixel anomaly scores of shape (N, H, W)

Return type:

InferenceBatch

Example

>>> batch = torch.randn(32, 2, 3, 256, 256)
>>> model = AiVadModel()
>>> output = model(batch)
>>> output.pred_score.shape, output.anomaly_map.shape
(torch.Size([32]), torch.Size([32, 256, 256]))
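
The same four stages can also be traced by hand with the sub-modules documented below. The following is an illustrative sketch only: it omits the density-estimation stage, which requires estimators fitted on normal training features, and the extractors download pretrained detection and flow weights, so it is not meant to run as a doctest.

>>> import torch
>>> from anomalib.models.video.ai_vad.flow import FlowExtractor
>>> from anomalib.models.video.ai_vad.regions import RegionExtractor
>>> from anomalib.models.video.ai_vad.features import FeatureExtractor
>>> clips = torch.randn(4, 2, 3, 256, 256)  # (N, L, C, H, W)
>>> # 1. First and last frame of each clip
>>> first_frame, last_frame = clips[:, 0], clips[:, -1]
>>> # 2. Optical flow and regions of interest
>>> flow = FlowExtractor()(first_frame, last_frame)
>>> regions = RegionExtractor()(first_frame, last_frame)
>>> # 3. Velocity, pose and deep features per detected region
>>> features = FeatureExtractor()(last_frame, flow, regions)
>>> # 4. A fitted CombinedDensityEstimator (see the density module below)
>>> #    would then score the per-region features.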

Feature extraction module for AI-VAD model implementation.

This module implements the feature extraction stage of the AI-VAD model. It extracts three types of features from video regions:

  • Velocity features: Histogram of optical flow magnitudes

  • Pose features: Human keypoint detections using KeypointRCNN

  • Deep features: CLIP embeddings of region crops

Example

>>> from anomalib.models.video.ai_vad.features import FeatureExtractor
>>> import torch
>>> extractor = FeatureExtractor()
>>> frames = torch.randn(32, 2, 3, 256, 256)  # (N, L, C, H, W)
>>> flow = torch.randn(32, 2, 256, 256)  # (N, 2, H, W)
>>> regions = [{"boxes": torch.randn(5, 4)}] * 32  # List of region dicts
>>> features = extractor(frames, flow, regions)

The module provides the following components:
  • DeepExtractor: CLIP embeddings of region crops

  • FeatureExtractor: Combines the individual feature extractors based on the enabled feature types

  • FeatureType: Enum of the available feature streams

  • PoseExtractor: Keypoint features extracted with a KeypointRCNN model

  • VelocityExtractor: Histogram features from optical flow magnitudes and directions

class anomalib.models.video.ai_vad.features.DeepExtractor#

Bases: Module

Deep feature extractor.

Extracts deep (appearance) features from input regions using a CLIP vision encoder.

The extractor uses a pre-trained ViT-B/16 CLIP model to encode image regions into a 512-dimensional feature space. Input regions are resized to 224x224 and normalized using CLIP’s default preprocessing.

Example

>>> import torch
>>> from anomalib.models.video.ai_vad.features import DeepExtractor
>>> extractor = DeepExtractor()
>>> batch = torch.randn(32, 3, 256, 256)  # (N, C, H, W)
>>> boxes = torch.tensor([[0, 10, 20, 50, 60]])  # (M, 5) with batch indices
>>> features = extractor(batch, boxes, batch_size=32)
>>> features.shape
torch.Size([1, 512])
forward(batch, boxes, batch_size)#

Extract deep features using CLIP encoder.

Parameters:
  • batch (torch.Tensor) – Batch of RGB input images of shape (N, 3, H, W)

  • boxes (torch.Tensor) – Bounding box coordinates of shape (M, 5). First column indicates batch index of the bbox, remaining columns are coordinates [x1, y1, x2, y2].

  • batch_size (int) – Number of images in the batch.

Returns:

Deep feature tensor of shape (M, 512), where M is the number of input regions and 512 is the CLIP feature dimension. Returns an empty tensor if no valid regions are found.

Return type:

torch.Tensor
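
The (M, 5) box format with a leading batch index is shared by the deep, pose and velocity extractors. An illustrative sketch of building it from per-image (M_i, 4) boxes:

>>> import torch
>>> per_image_boxes = [torch.tensor([[10.0, 20.0, 50.0, 60.0]]),
...                    torch.tensor([[5.0, 5.0, 40.0, 40.0]])]
>>> boxes = torch.cat([
...     torch.cat([torch.full((b.shape[0], 1), i, dtype=b.dtype), b], dim=1)
...     for i, b in enumerate(per_image_boxes)
... ])
>>> boxes.shape
torch.Size([2, 5])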

class anomalib.models.video.ai_vad.features.FeatureExtractor(n_velocity_bins=8, use_velocity_features=True, use_pose_features=True, use_deep_features=True)#

Bases: Module

Feature extractor for AI-VAD.

Extracts velocity, pose and deep features from video regions based on the enabled feature types.

Parameters:
  • n_velocity_bins (int, optional) – Number of discrete bins used for velocity histogram features. Defaults to 8.

  • use_velocity_features (bool, optional) – Flag indicating if velocity features should be used. Defaults to True.

  • use_pose_features (bool, optional) – Flag indicating if pose features should be used. Defaults to True.

  • use_deep_features (bool, optional) – Flag indicating if deep features should be used. Defaults to True.

Raises:

ValueError – If none of the feature types (velocity, pose, deep) are enabled.

Example

>>> import torch
>>> from anomalib.models.video.ai_vad.features import FeatureExtractor
>>> extractor = FeatureExtractor()
>>> rgb_batch = torch.randn(32, 3, 256, 256)  # (N, C, H, W)
>>> flow_batch = torch.randn(32, 2, 256, 256)  # (N, 2, H, W)
>>> regions = [{"boxes": torch.randn(5, 4)}] * 32  # List of region dicts
>>> features = extractor(rgb_batch, flow_batch, regions)
>>> # Returns list of dicts with keys: velocity, pose, deep
forward(rgb_batch, flow_batch, regions)#

Forward pass through the feature extractor.

Extract any combination of velocity, pose and deep features depending on configuration.

Parameters:
  • rgb_batch (torch.Tensor) – Batch of RGB images of shape (N, 3, H, W).

  • flow_batch (torch.Tensor) – Batch of optical flow images of shape (N, 2, H, W).

  • regions (list[dict]) – Region information per image in batch. Each dict contains bounding boxes of shape (M, 4).

Returns:

Feature dictionary per image in the batch. Each dict contains the enabled feature types as keys, with the corresponding feature tensors as values.

Return type:

list[dict]

Example

>>> import torch
>>> from anomalib.models.video.ai_vad.features import FeatureExtractor
>>> extractor = FeatureExtractor()
>>> rgb_batch = torch.randn(32, 3, 256, 256)  # (N, C, H, W)
>>> flow_batch = torch.randn(32, 2, 256, 256)  # (N, 2, H, W)
>>> regions = [{"boxes": torch.randn(5, 4)}] * 32  # List of region dicts
>>> features = extractor(rgb_batch, flow_batch, regions)
>>> features[0].keys()  # Features for first image
dict_keys(['velocity', 'pose', 'deep'])
class anomalib.models.video.ai_vad.features.FeatureType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)#

Bases: str, Enum

Names of the different feature streams used in AI-VAD.

This enum defines the available feature types that can be extracted from video regions in the AI-VAD model.

POSE#

Keypoint features extracted using KeypointRCNN model

VELOCITY#

Histogram features computed from optical flow magnitudes

DEEP#

Visual embedding features extracted using CLIP model

Example

>>> from anomalib.models.video.ai_vad.features import FeatureType
>>> feature_type = FeatureType.POSE
>>> feature_type
<FeatureType.POSE: 'pose'>
>>> feature_type == "pose"
True
>>> feature_type in [FeatureType.POSE, FeatureType.VELOCITY]
True
class anomalib.models.video.ai_vad.features.PoseExtractor(*args, **kwargs)#

Bases: Module

Pose feature extractor.

Extracts pose features based on estimated body landmark keypoints using a KeypointRCNN model.

Example

>>> import torch
>>> from anomalib.models.video.ai_vad.features import PoseExtractor
>>> extractor = PoseExtractor()
>>> batch = torch.randn(2, 3, 256, 256)  # (N, C, H, W)
>>> boxes = torch.tensor([[0, 10, 10, 50, 50], [1, 20, 20, 60, 60]])
>>> features = extractor(batch, boxes)
>>> # Returns list of pose feature tensors for each image
forward(batch, boxes)#

Extract pose features using a human keypoint estimation model.

The method performs the following steps:

  1. Transform the input images

  2. Extract backbone features

  3. Pool ROI features for each box

  4. Predict keypoint locations

  5. Post-process the predictions

Parameters:
  • batch (torch.Tensor) – Batch of RGB input images of shape (N, 3, H, W).

  • boxes (torch.Tensor) – Bounding box coordinates of shape (M, 5). First column indicates batch index of the bbox, remaining columns are coordinates [x1, y1, x2, y2].

Returns:

List of pose feature tensors, one per image, where each tensor contains normalized keypoint coordinates.

Return type:

list[torch.Tensor]

class anomalib.models.video.ai_vad.features.VelocityExtractor(n_bins=8)#

Bases: Module

Velocity feature extractor.

Extracts histograms of optical flow magnitude and direction from video regions. The histograms capture motion patterns by binning flow vectors based on their direction and weighting by magnitude.

Parameters:

n_bins (int, optional) – Number of direction bins used for the feature histograms. Defaults to 8.

Example

>>> import torch
>>> from anomalib.models.video.ai_vad.features import VelocityExtractor
>>> extractor = VelocityExtractor(n_bins=8)
>>> flows = torch.randn(32, 2, 256, 256)  # (N, 2, H, W)
>>> boxes = torch.tensor([[0, 10, 20, 50, 60]])  # (M, 5) with batch indices
>>> features = extractor(flows, boxes)
>>> features.shape
torch.Size([1, 8])
forward(flows, boxes)#

Extract velocity features by computing flow direction histograms.

For each region, computes a histogram of optical flow directions weighted by flow magnitudes. The flow vectors are converted from cartesian to polar coordinates, with directions binned into n_bins equal intervals between -π and π. The histogram values are normalized by the bin counts.

Parameters:
  • flows (torch.Tensor) – Batch of optical flow images of shape (N, 2, H, W), where the second dimension contains x and y flow components.

  • boxes (torch.Tensor) – Bounding box coordinates of shape (M, 5). First column indicates batch index of the bbox, remaining columns are coordinates [x1, y1, x2, y2].

Returns:

Velocity feature tensor of shape (M, n_bins), where M is the number of input regions. Returns an empty tensor if no valid regions are found.

Return type:

torch.Tensor
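
As a rough illustration of the histogram computation described above (not the library's exact implementation, which also crops the flow map per region and handles empty inputs), the feature for a single region is conceptually:

>>> import torch
>>> flow_region = torch.randn(2, 40, 40)  # (2, h, w) flow inside one bounding box
>>> magnitude = torch.linalg.norm(flow_region, dim=0)
>>> direction = torch.atan2(flow_region[1], flow_region[0])  # in [-pi, pi]
>>> n_bins = 8
>>> edges = torch.linspace(-torch.pi, torch.pi, n_bins + 1)
>>> idx = torch.bucketize(direction.flatten(), edges[1:-1])
>>> hist = torch.zeros(n_bins).scatter_add_(0, idx, magnitude.flatten())
>>> counts = torch.bincount(idx, minlength=n_bins).clamp(min=1)
>>> velocity_feature = hist / counts  # magnitude-weighted, count-normalized histogram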

Region extraction module for the AI-VAD model implementation.

This module implements the region extraction stage of the AI-VAD model. It extracts regions of interest from video frames using object detection and foreground detection.

Example

>>> from anomalib.models.video.ai_vad.regions import RegionExtractor
>>> import torch
>>> extractor = RegionExtractor()
>>> frames = torch.randn(32, 2, 3, 256, 256)  # (N, L, C, H, W)
>>> regions = extractor(frames)
The module provides the following components:
  • RegionExtractor: Main class that handles region extraction using object detection and foreground detection

class anomalib.models.video.ai_vad.regions.RegionExtractor(box_score_thresh=0.8, persons_only=False, min_bbox_area=100, max_bbox_overlap=0.65, enable_foreground_detections=True, foreground_kernel_size=3, foreground_binary_threshold=18)#

Bases: Module

Region extractor for AI-VAD.

This class extracts regions of interest from video frames using object detection and foreground detection. It uses a Mask R-CNN model for object detection and can optionally detect foreground regions based on frame differences.

Parameters:
  • box_score_thresh (float, optional) – Confidence threshold for bounding box predictions. Defaults to 0.8.

  • persons_only (bool, optional) – When enabled, only regions labeled as person are included. Defaults to False.

  • min_bbox_area (int, optional) – Minimum bounding box area. Regions with a surface area lower than this value are excluded. Defaults to 100.

  • max_bbox_overlap (float, optional) – Maximum allowed overlap between bounding boxes. Defaults to 0.65.

  • enable_foreground_detections (bool, optional) – Add additional foreground detections based on pixel difference between consecutive frames. Defaults to True.

  • foreground_kernel_size (int, optional) – Gaussian kernel size used in foreground detection. Defaults to 3.

  • foreground_binary_threshold (int, optional) – Value between 0 and 255 which acts as binary threshold in foreground detection. Defaults to 18.

Example

>>> import torch
>>> from anomalib.models.video.ai_vad.regions import RegionExtractor
>>> extractor = RegionExtractor()
>>> first_frame = torch.randn(2, 3, 256, 256)  # (N, C, H, W)
>>> last_frame = torch.randn(2, 3, 256, 256)  # (N, C, H, W)
>>> regions = extractor(first_frame, last_frame)
>>> # Returns list of dicts with keys: boxes, labels, scores, masks
forward(first_frame, last_frame)#

Perform forward-pass through region extractor.

The forward pass consists of:

  1. Object detection on the last frame using Mask R-CNN

  2. Optional foreground detection by comparing the first and last frames

  3. Post-processing to filter and refine the detections

Parameters:
  • first_frame (torch.Tensor) – Batch of input images of shape (N, C, H, W) forming the first frames in the clip.

  • last_frame (torch.Tensor) – Batch of input images of shape (N, C, H, W) forming the last frame in the clip.

Returns:

List of Mask R-CNN predictions for each image in the batch. Each dict contains:
  • boxes (torch.Tensor): Detected bounding boxes

  • labels (torch.Tensor): Class labels for each detection

  • scores (torch.Tensor): Confidence scores for each detection

  • masks (torch.Tensor): Instance segmentation masks

Return type:

list[dict]
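
The foreground-detection branch referred to above is, conceptually, a blurred frame difference followed by a binary threshold. A hedged sketch of that idea (the library's actual implementation may differ in details such as grayscale conversion and how connected components are turned into boxes):

>>> import torch
>>> from torchvision.transforms.functional import gaussian_blur
>>> first = torch.rand(1, 3, 256, 256)  # frames scaled to [0, 1]
>>> last = torch.rand(1, 3, 256, 256)
>>> diff = (last - first).abs().mean(dim=1, keepdim=True) * 255  # per-pixel difference
>>> blurred = gaussian_blur(diff, kernel_size=3)  # foreground_kernel_size
>>> foreground_mask = blurred > 18  # foreground_binary_threshold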

post_process_bbox_detections(regions)#

Post-process the region detections.

The region detections are filtered based on class label, bbox area and overlap with other regions.

Parameters:

regions (list[dict[str, torch.Tensor]]) – Region detections for a batch of images, generated by the region extraction module.

Returns:

Filtered regions containing only valid detections based on the filtering criteria.

Return type:

list[dict[str, torch.Tensor]]

static subsample_regions(regions, indices)#

Subsample the items in a region dictionary based on a Tensor of indices.

Parameters:
  • regions (dict[str, torch.Tensor]) – Region detections for a single image in the batch.

  • indices (torch.Tensor) – Indices of region detections that should be kept.

Returns:

Subsampled region detections containing only the specified indices.

Return type:

dict[str, torch.Tensor]

Optical Flow extraction module for AI-VAD implementation.

This module implements the optical flow extraction stage of the AI-VAD model. It uses RAFT (Recurrent All-Pairs Field Transforms) to compute dense optical flow between consecutive video frames.

Example

>>> from anomalib.models.video.ai_vad.flow import FlowExtractor
>>> import torch
>>> extractor = FlowExtractor()
>>> first_frame = torch.randn(32, 3, 256, 256)  # (N, C, H, W)
>>> last_frame = torch.randn(32, 3, 256, 256)  # (N, C, H, W)
>>> flow = extractor(first_frame, last_frame)
>>> flow.shape
torch.Size([32, 2, 256, 256])
The module provides the following components:
  • FlowExtractor: Main class that handles optical flow computation using RAFT model

class anomalib.models.video.ai_vad.flow.FlowExtractor(*args, **kwargs)#

Bases: Module

Optical Flow extractor.

Computes the pixel displacement between 2 consecutive frames from a video clip.

forward(first_frame, last_frame)#

Forward pass through the flow extractor.

Parameters:
  • first_frame (torch.Tensor) – Batch of starting frames of shape (N, 3, H, W).

  • last_frame (torch.Tensor) – Batch of last frames of shape (N, 3, H, W).

Returns:

Estimated optical flow map of shape (N, 2, H, W).

Return type:

Tensor

pre_process(first_frame, last_frame)#

Resize inputs to dimensions required by backbone.

Parameters:
  • first_frame (torch.Tensor) – Starting frame of optical flow computation.

  • last_frame (torch.Tensor) – Last frame of optical flow computation.

Returns:

Preprocessed first and last frame.

Return type:

tuple[torch.Tensor, torch.Tensor]
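
The flow map returned by the extractor stores the x and y displacement of every pixel; its magnitude and direction, which feed the velocity features, can be recovered with basic tensor operations. An illustrative sketch:

>>> import torch
>>> flow = torch.randn(32, 2, 256, 256)  # e.g. output of FlowExtractor
>>> magnitude = torch.linalg.norm(flow, dim=1)  # (N, H, W)
>>> direction = torch.atan2(flow[:, 1], flow[:, 0])  # (N, H, W), in [-pi, pi]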

Density estimation module for AI-VAD model implementation.

This module implements the density estimation stage of the AI-VAD model. It provides density estimators for modeling the distribution of extracted features from normal video samples.

The module provides the following components:
  • BaseDensityEstimator: Abstract base class defining the density estimator interface

  • CombinedDensityEstimator: Combines the estimators for the different feature types

  • GMMEstimator: Gaussian Mixture Model density estimation (used for velocity features)

  • GroupedKNNEstimator: Grouped KNN density estimation (used for pose and deep features)

Example

>>> import torch
>>> from anomalib.models.video.ai_vad.density import CombinedDensityEstimator
>>> from anomalib.models.video.ai_vad.features import FeatureType
>>> estimator = CombinedDensityEstimator()
>>> features = {
...     FeatureType.VELOCITY: torch.randn(32, 8),
...     FeatureType.POSE: torch.randn(32, 34),
...     FeatureType.DEEP: torch.randn(32, 512)
... }
>>> scores = estimator(features)  # Returns anomaly scores during inference

The density estimators are used to model the distribution of normal behavior and detect anomalies as samples with low likelihood under the learned distributions.

class anomalib.models.video.ai_vad.density.BaseDensityEstimator(*args, **kwargs)#

Bases: Module, ABC

Abstract base class for density estimators.

This class defines the interface for density estimators used in the AI-VAD model. Subclasses must implement methods for updating the density model with new features, predicting densities for test samples, and fitting the model.

Example

>>> import torch
>>> from anomalib.models.video.ai_vad.density import BaseDensityEstimator
>>> class MyEstimator(BaseDensityEstimator):
...     def update(self, features, group=None):
...         pass
...     def predict(self, features):
...         return torch.rand(features.shape[0])
...     def fit(self):
...         pass
>>> estimator = MyEstimator()
>>> features = torch.randn(32, 8)
>>> scores = estimator(features)  # Forward pass returns predictions
abstract fit()#

Compose model using collected features.

This method should be called after updating the model with features to fit the density estimator to the collected data.

Return type:

None

forward(features)#

Forward pass that either updates or predicts based on training status.

Parameters:

features (dict[FeatureType, torch.Tensor] | torch.Tensor) – Input features. Can be either a dictionary mapping feature types to tensors, or a single tensor.

Returns:

During training, returns None after updating the model. During inference, returns density predictions.

Return type:

torch.Tensor | tuple[torch.Tensor, torch.Tensor] | None

abstract predict(features)#

Predict the density of a set of features.

Parameters:

features (dict[FeatureType, torch.Tensor] | torch.Tensor) – Input features to compute density for. Can be either a dictionary mapping feature types to tensors, or a single tensor.

Returns:

Predicted density scores. May return either a single tensor of scores or a tuple of tensors for more complex estimators.

Return type:

torch.Tensor | tuple[torch.Tensor, torch.Tensor]

abstract update(features, group=None)#

Update the density model with a new set of features.

Parameters:
  • features (dict[FeatureType, torch.Tensor] | torch.Tensor) – Input features to update the model. Can be either a dictionary mapping feature types to tensors, or a single tensor.

  • group (str | None, optional) – Optional group identifier for grouped density estimation. Defaults to None.

Return type:

None

class anomalib.models.video.ai_vad.density.CombinedDensityEstimator(use_pose_features=True, use_deep_features=True, use_velocity_features=False, n_neighbors_pose=1, n_neighbors_deep=1, n_components_velocity=5)#

Bases: BaseDensityEstimator

Density estimator for AI-VAD.

Combines density estimators for the different feature types included in the model.

Parameters:
  • use_pose_features (bool, optional) – Flag indicating if pose features should be used. Defaults to True.

  • use_deep_features (bool, optional) – Flag indicating if deep features should be used. Defaults to True.

  • use_velocity_features (bool, optional) – Flag indicating if velocity features should be used. Defaults to False.

  • n_neighbors_pose (int, optional) – Number of neighbors used in KNN density estimation for pose features. Defaults to 1.

  • n_neighbors_deep (int, optional) – Number of neighbors used in KNN density estimation for deep features. Defaults to 1.

  • n_components_velocity (int, optional) – Number of components used by GMM density estimation for velocity features. Defaults to 5.

Raises:

ValueError – If none of the feature types (velocity, pose, deep) are enabled.

Example

>>> from anomalib.models.video.ai_vad.density import CombinedDensityEstimator
>>> estimator = CombinedDensityEstimator(
...     use_pose_features=True,
...     use_deep_features=True,
...     use_velocity_features=True,
...     n_neighbors_pose=1,
...     n_neighbors_deep=1,
...     n_components_velocity=5
... )
>>> # Update with features from training data
>>> estimator.update(features, group="video_001")
>>> # Fit the density estimators
>>> estimator.fit()
>>> # Get predictions for test data
>>> region_scores, image_score = estimator.predict(features)
fit()#

Fit the density estimation models on the collected features.

This method should be called after updating with all training features to fit the density estimators to the collected data.

Return type:

None

predict(features)#

Predict region and image-level anomaly scores.

Computes anomaly scores for each region in the frame and an overall frame score based on the maximum region score.

Parameters:

features (dict[FeatureType, torch.Tensor]) – Dictionary containing extracted features for a single frame. Keys are feature types and values are the corresponding feature tensors.

Returns:

A tuple containing:
  • Region-level anomaly scores for all regions within the frame

  • Frame-level anomaly score for the frame

Return type:

tuple[torch.Tensor, torch.Tensor]

Example

>>> features = {
...     FeatureType.VELOCITY: velocity_features,
...     FeatureType.DEEP: deep_features,
...     FeatureType.POSE: pose_features
... }
>>> region_scores, image_score = estimator.predict(features)
update(features, group=None)#

Update the density estimators for the different feature types.

Parameters:
  • features (dict[FeatureType, torch.Tensor]) – Dictionary containing extracted features for a single frame. Keys are feature types and values are the corresponding feature tensors.

  • group (str | None, optional) – Identifier of the video from which the frame was sampled. Used for grouped density estimation. Defaults to None.

Return type:

None

class anomalib.models.video.ai_vad.density.GMMEstimator(n_components=2)#

Bases: BaseDensityEstimator

Density estimation based on Gaussian Mixture Model.

Fits a GMM to the training features and uses the negative log-likelihood as an anomaly score during inference.

Parameters:

n_components (int, optional) – Number of Gaussian components used in the GMM. Defaults to 2.

Example

>>> import torch
>>> from anomalib.models.video.ai_vad.density import GMMEstimator
>>> estimator = GMMEstimator(n_components=2)
>>> features = torch.randn(32, 8)  # (N, D)
>>> estimator.update(features)
>>> estimator.fit()
>>> scores = estimator.predict(features)
>>> scores.shape
torch.Size([32])
fit()#

Fit the GMM and compute normalization statistics.

Concatenates all features in the memory bank, fits the GMM to the combined features, and computes min-max normalization statistics over the training scores.

Return type:

None

predict(features, normalize=True)#

Predict anomaly scores for input features.

Computes the negative log-likelihood of each feature vector under the fitted GMM. Lower likelihood (higher score) indicates more anomalous samples.

Parameters:
  • features (torch.Tensor) – Input feature vectors of shape (N, D).

  • normalize (bool, optional) – Whether to normalize scores using min-max statistics from training. Defaults to True.

Returns:

Anomaly scores of shape (N,). Higher values indicate more anomalous samples.

Return type:

torch.Tensor
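
The min-max normalization can be skipped via the normalize flag, which returns the raw negative log-likelihood scores. Continuing the example above:

>>> raw_scores = estimator.predict(features, normalize=False)
>>> raw_scores.shape
torch.Size([32])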

update(features, group=None)#

Update the feature bank with new features.

Parameters:
  • features (torch.Tensor) – Feature vectors of shape (N, D) to add to the memory bank.

  • group (str | None, optional) – Unused group parameter included for interface compatibility. Defaults to None.

Return type:

None

class anomalib.models.video.ai_vad.density.GroupedKNNEstimator(n_neighbors)#

Bases: DynamicBufferMixin, BaseDensityEstimator

Grouped KNN density estimator.

Keeps track of the group (e.g. video id) from which the features were sampled for normalization purposes.

Parameters:

n_neighbors (int) – Number of neighbors used in KNN search.

Example

>>> from anomalib.models.video.ai_vad.density import GroupedKNNEstimator
>>> import torch
>>> estimator = GroupedKNNEstimator(n_neighbors=5)
>>> features = torch.randn(32, 512)  # (N, D)
>>> estimator.update(features, group="video_1")
>>> estimator.fit()
>>> scores = estimator.predict(features)
>>> scores.shape
torch.Size([32])
fit()#

Fit the KNN model by stacking features and computing normalization stats.

Stacks the collected feature vectors group-wise and computes the normalization statistics. After fitting, the feature collection is deleted to free up memory.

Return type:

None

Example

>>> estimator = GroupedKNNEstimator(n_neighbors=5)
>>> features = torch.randn(32, 512)  # (N, D)
>>> estimator.update(features, group="video_1")
>>> estimator.fit()
predict(features, group=None, n_neighbors=1, normalize=True)#

Predict the (normalized) density for a set of features.

Parameters:
  • features (torch.Tensor) – Input features of shape (N, D) that will be compared to the density model.

  • group (str | None, optional) – Group (video id) from which the features originate. If passed, all features of the same group in the memory bank will be excluded from the density estimation. Defaults to None.

  • n_neighbors (int, optional) – Number of neighbors used in the KNN search. Defaults to 1.

  • normalize (bool, optional) – Flag indicating if the density should be normalized to min-max stats of the feature bank. Defaults to True.

Returns:

Mean (normalized) distances of the input feature vectors to their k nearest neighbors in the feature bank.

Return type:

torch.Tensor

Example

>>> estimator = GroupedKNNEstimator(n_neighbors=5)
>>> features = torch.randn(32, 512)  # (N, D)
>>> estimator.update(features, group="video_1")
>>> estimator.fit()
>>> scores = estimator.predict(features, group="video_1")
>>> scores.shape
torch.Size([32])
update(features, group=None)#

Update the internal feature bank while keeping track of the group.

Parameters:
  • features (torch.Tensor) – Feature vectors extracted from a video frame of shape (N, D).

  • group (str | None, optional) – Identifier of the group (video) from which the frame was sampled. Defaults to None.

Return type:

None

Example

>>> estimator = GroupedKNNEstimator(n_neighbors=5)
>>> features = torch.randn(32, 512)  # (N, D)
>>> estimator.update(features, group="video_1")
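
To see the effect of grouping, an illustrative sketch with two groups: when group="video_1" is passed to predict, only features collected from other groups (here video_2) serve as the reference set.

>>> import torch
>>> from anomalib.models.video.ai_vad.density import GroupedKNNEstimator
>>> estimator = GroupedKNNEstimator(n_neighbors=1)
>>> estimator.update(torch.randn(16, 512), group="video_1")
>>> estimator.update(torch.randn(16, 512), group="video_2")
>>> estimator.fit()
>>> scores = estimator.predict(torch.randn(8, 512), group="video_1")
>>> scores.shape
torch.Size([8])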