AI VAD
Attribute-based Representations for Accurate and Interpretable Video Anomaly Detection.
Paper https://arxiv.org/pdf/2212.00789.pdf
- class anomalib.models.video.ai_vad.lightning_model.AiVad(box_score_thresh=0.7, persons_only=False, min_bbox_area=100, max_bbox_overlap=0.65, enable_foreground_detections=True, foreground_kernel_size=3, foreground_binary_threshold=18, n_velocity_bins=1, use_velocity_features=True, use_pose_features=True, use_deep_features=True, n_components_velocity=2, n_neighbors_pose=1, n_neighbors_deep=1)
Bases: MemoryBankMixin, AnomalyModule
AI-VAD: Attribute-based Representations for Accurate and Interpretable Video Anomaly Detection.
- Parameters:
box_score_thresh (float) – Confidence threshold for bounding box predictions. Defaults to 0.7.
persons_only (bool) – When enabled, only regions labeled as person are included. Defaults to False.
min_bbox_area (int) – Minimum bounding box area. Regions with a surface area lower than this value are excluded. Defaults to 100.
max_bbox_overlap (float) – Maximum allowed overlap between bounding boxes. Defaults to 0.65.
enable_foreground_detections (bool) – Add additional foreground detections based on pixel difference between consecutive frames. Defaults to True.
foreground_kernel_size (int) – Gaussian kernel size used in foreground detection. Defaults to 3.
foreground_binary_threshold (int) – Value between 0 and 255 which acts as binary threshold in foreground detection. Defaults to 18.
n_velocity_bins (int) – Number of discrete bins used for velocity histogram features. Defaults to 1.
use_velocity_features (bool) – Flag indicating if velocity features should be used. Defaults to True.
use_pose_features (bool) – Flag indicating if pose features should be used. Defaults to True.
use_deep_features (bool) – Flag indicating if deep features should be used. Defaults to True.
n_components_velocity (int) – Number of components used by GMM density estimation for velocity features. Defaults to 2.
n_neighbors_pose (int) – Number of neighbors used in KNN density estimation for pose features. Defaults to 1.
n_neighbors_deep (int) – Number of neighbors used in KNN density estimation for deep features. Defaults to 1.
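As a rough sketch of how these arguments fit together, the dictionary below mirrors the defaults shown in the AiVad signature. The names `AI_VAD_DEFAULTS` and `build_ai_vad` are hypothetical and not part of anomalib; the helper imports the library lazily, so the defaults can be inspected without anomalib installed.

```python
# Keyword arguments copied from the documented AiVad signature defaults.
AI_VAD_DEFAULTS = {
    "box_score_thresh": 0.7,
    "persons_only": False,
    "min_bbox_area": 100,
    "max_bbox_overlap": 0.65,
    "enable_foreground_detections": True,
    "foreground_kernel_size": 3,
    "foreground_binary_threshold": 18,
    "n_velocity_bins": 1,
    "use_velocity_features": True,
    "use_pose_features": True,
    "use_deep_features": True,
    "n_components_velocity": 2,
    "n_neighbors_pose": 1,
    "n_neighbors_deep": 1,
}


def build_ai_vad(**overrides):
    """Hypothetical helper: merge overrides into the defaults and build the model."""
    unknown = set(overrides) - set(AI_VAD_DEFAULTS)
    if unknown:
        raise TypeError(f"unexpected arguments: {sorted(unknown)}")
    # Imported lazily so this sketch loads even when anomalib is not installed.
    from anomalib.models.video.ai_vad.lightning_model import AiVad

    return AiVad(**{**AI_VAD_DEFAULTS, **overrides})
```

For instance, `build_ai_vad(persons_only=True)` would restrict region extraction to person detections while keeping every other default.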
- static configure_optimizers()
AI-VAD training does not involve fine-tuning of NN weights, so no optimizers are needed.
- Return type:
None
- static configure_transforms(image_size=None)
AI-VAD does not need a transform, as the region- and feature-extractors apply their own transforms.
- Return type:
Transform | None
- fit()
Fit the density estimators to the extracted features from the training set.
- Return type:
None
- property learning_type: LearningType
Return the learning type of the model.
- Returns:
Learning type of the model.
- Return type:
LearningType
- property trainer_arguments: dict[str, Any]
AI-VAD specific trainer arguments.
- training_step(batch)
Training Step of AI-VAD.
Extract features from the batch of clips and update the density estimators.
- Parameters:
batch (dict[str, str | torch.Tensor]) – Batch containing image filename, image, label and mask
- Return type:
None
- validation_step(batch, *args, **kwargs)
Perform the validation step of AI-VAD.
Extract boxes and box scores.
- Parameters:
batch (dict[str, str | torch.Tensor]) – Input batch
*args – Arguments.
**kwargs – Keyword arguments.
- Returns:
Batch dictionary with added boxes and box scores.
- Return type:
Union[Tensor, Mapping[str, Any], None]
PyTorch model for AI-VAD model implementation.
Paper https://arxiv.org/pdf/2212.00789.pdf
- class anomalib.models.video.ai_vad.torch_model.AiVadModel(box_score_thresh=0.8, persons_only=False, min_bbox_area=100, max_bbox_overlap=0.65, enable_foreground_detections=True, foreground_kernel_size=3, foreground_binary_threshold=18, n_velocity_bins=8, use_velocity_features=True, use_pose_features=True, use_deep_features=True, n_components_velocity=5, n_neighbors_pose=1, n_neighbors_deep=1)
Bases: Module
AI-VAD model.
- Parameters:
box_score_thresh (float) – Confidence threshold for region extraction stage. Defaults to 0.8.
persons_only (bool) – When enabled, only regions labeled as person are included. Defaults to False.
min_bbox_area (int) – Minimum bounding box area. Regions with a surface area lower than this value are excluded. Defaults to 100.
max_bbox_overlap (float) – Maximum allowed overlap between bounding boxes. Defaults to 0.65.
enable_foreground_detections (bool) – Add additional foreground detections based on pixel difference between consecutive frames. Defaults to True.
foreground_kernel_size (int) – Gaussian kernel size used in foreground detection. Defaults to 3.
foreground_binary_threshold (int) – Value between 0 and 255 which acts as binary threshold in foreground detection. Defaults to 18.
n_velocity_bins (int) – Number of discrete bins used for velocity histogram features. Defaults to 8.
use_velocity_features (bool) – Flag indicating if velocity features should be used. Defaults to True.
use_pose_features (bool) – Flag indicating if pose features should be used. Defaults to True.
use_deep_features (bool) – Flag indicating if deep features should be used. Defaults to True.
n_components_velocity (int) – Number of components used by GMM density estimation for velocity features. Defaults to 5.
n_neighbors_pose (int) – Number of neighbors used in KNN density estimation for pose features. Defaults to 1.
n_neighbors_deep (int) – Number of neighbors used in KNN density estimation for deep features. Defaults to 1.
- forward(batch)
Forward pass through AI-VAD model.
- Parameters:
batch (torch.Tensor) – Batch of input clips of shape (N, L, C, H, W)
- Returns:
list[torch.Tensor]: List of bbox locations for each image.
list[torch.Tensor]: List of per-bbox anomaly scores for each image.
list[torch.Tensor]: List of per-image anomaly scores.
- Return type:
list[torch.Tensor]
Feature extraction module for AI-VAD model implementation.
- class anomalib.models.video.ai_vad.features.DeepExtractor
Bases: Module
Deep feature extractor.
Extracts the deep (appearance) features from the input regions.
- forward(batch, boxes, batch_size)
Extract deep features using CLIP encoder.
- Parameters:
batch (torch.Tensor) – Batch of RGB input images of shape (N, 3, H, W)
boxes (torch.Tensor) – Bounding box coordinates of shape (M, 5). First column indicates batch index of the bbox.
batch_size (int) – Number of images in the batch.
- Returns:
Deep feature tensor of shape (M, 512)
- Return type:
Tensor
- class anomalib.models.video.ai_vad.features.FeatureExtractor(n_velocity_bins=8, use_velocity_features=True, use_pose_features=True, use_deep_features=True)
Bases: Module
Feature extractor for AI-VAD.
- Parameters:
n_velocity_bins (int) – Number of discrete bins used for velocity histogram features. Defaults to 8.
use_velocity_features (bool) – Flag indicating if velocity features should be used. Defaults to True.
use_pose_features (bool) – Flag indicating if pose features should be used. Defaults to True.
use_deep_features (bool) – Flag indicating if deep features should be used. Defaults to True.
- forward(rgb_batch, flow_batch, regions)
Forward pass through the feature extractor.
Extract any combination of velocity, pose and deep features depending on configuration.
- Parameters:
rgb_batch (torch.Tensor) – Batch of RGB images of shape (N, 3, H, W)
flow_batch (torch.Tensor) – Batch of optical flow images of shape (N, 2, H, W)
regions (list[dict]) – Region information per image in batch.
- Returns:
Feature dictionary per image in batch.
- Return type:
list[dict]
- class anomalib.models.video.ai_vad.features.FeatureType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: str, Enum
Names of the different feature streams used in AI-VAD.
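To illustrate what a `str`-based enum of feature streams offers, here is a minimal stand-in. The class name `FeatureTypeSketch` and its member names and values are assumptions chosen to match the three feature streams described on this page (velocity, pose, deep); the actual members of anomalib's `FeatureType` may differ.

```python
from enum import Enum


class FeatureTypeSketch(str, Enum):
    """Illustrative stand-in; member names and values are assumed."""

    POSE = "pose"
    VELOCITY = "velocity"
    DEEP = "deep"


# Members can key a per-region feature dictionary...
features = {
    FeatureTypeSketch.VELOCITY: [0.1, 0.9],
    FeatureTypeSketch.DEEP: [0.3, 0.2, 0.5],
}

# ...and, because the enum also subclasses str, they compare equal to plain strings.
is_plain_string_equal = FeatureTypeSketch.DEEP == "deep"
looked_up = FeatureTypeSketch("pose")
```

Mixing in `str` keeps the keys readable when feature dictionaries are logged or serialized, while the enum still guards against misspelled stream names.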
- class anomalib.models.video.ai_vad.features.PoseExtractor(*args, **kwargs)
Bases: Module
Pose feature extractor.
Extracts pose features based on estimated body landmark keypoints.
- forward(batch, boxes)
Extract pose features using a human keypoint estimation model.
- Parameters:
batch (torch.Tensor) – Batch of RGB input images of shape (N, 3, H, W)
boxes (torch.Tensor) – Bounding box coordinates of shape (M, 5). First column indicates batch index of the bbox.
- Returns:
List of pose feature tensors for each image.
- Return type:
list[torch.Tensor]
- class anomalib.models.video.ai_vad.features.VelocityExtractor(n_bins=8)
Bases: Module
Velocity feature extractor.
Extracts histograms of optical flow magnitude and direction.
- Parameters:
n_bins (int) – Number of direction bins used for the feature histograms.
- forward(flows, boxes)
Extract velocity features by filling a histogram.
- Parameters:
flows (torch.Tensor) – Batch of optical flow images of shape (N, 2, H, W)
boxes (torch.Tensor) – Bounding box coordinates of shape (M, 5). First column indicates batch index of the bbox.
- Returns:
Velocity feature tensor of shape (M, n_bins)
- Return type:
Tensor
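The histogram idea can be sketched in plain Python for a single bounding box. This is a simplified illustration and not the library's implementation: the function name `velocity_histogram`, the choice to weight each direction bin by flow magnitude, and the normalization by pixel count are all assumptions.

```python
import math


def velocity_histogram(flow_region, n_bins=8):
    """Build a direction histogram for one box.

    flow_region: list of (dx, dy) optical-flow displacements, one per pixel.
    Returns n_bins values: summed flow magnitude per direction bin,
    divided by the number of pixels (assumed normalization).
    """
    hist = [0.0] * n_bins
    bin_width = 2 * math.pi / n_bins
    for dx, dy in flow_region:
        magnitude = math.hypot(dx, dy)
        # Map the angle from (-pi, pi] to [0, 2*pi).
        angle = math.atan2(dy, dx) % (2 * math.pi)
        # Clamp guards against rounding that lands exactly on 2*pi.
        index = min(int(angle / bin_width), n_bins - 1)
        hist[index] += magnitude
    n_pixels = max(len(flow_region), 1)
    return [value / n_pixels for value in hist]


# Four pixels moving one unit to the right all fall into the first bin.
rightward = velocity_histogram([(1.0, 0.0)] * 4)
```

Applying such a function to each of the M boxes would give one row per box, consistent with the documented output shape (M, n_bins).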
Regions extraction module of AI-VAD model implementation.
- class anomalib.models.video.ai_vad.regions.RegionExtractor(box_score_thresh=0.8, persons_only=False, min_bbox_area=100, max_bbox_overlap=0.65, enable_foreground_detections=True, foreground_kernel_size=3, foreground_binary_threshold=18)
Bases: Module
Region extractor for AI-VAD.
- Parameters:
box_score_thresh (float) – Confidence threshold for bounding box predictions. Defaults to 0.8.
persons_only (bool) – When enabled, only regions labeled as person are included. Defaults to False.
min_bbox_area (int) – Minimum bounding box area. Regions with a surface area lower than this value are excluded. Defaults to 100.
max_bbox_overlap (float) – Maximum allowed overlap between bounding boxes. Defaults to 0.65.
enable_foreground_detections (bool) – Add additional foreground detections based on pixel difference between consecutive frames. Defaults to True.
foreground_kernel_size (int) – Gaussian kernel size used in foreground detection. Defaults to 3.
foreground_binary_threshold (int) – Value between 0 and 255 which acts as binary threshold in foreground detection. Defaults to 18.
- forward(first_frame, last_frame)
Perform forward-pass through region extractor.
- Parameters:
first_frame (torch.Tensor) – Batch of input images of shape (N, C, H, W) forming the first frame of each clip.
last_frame (torch.Tensor) – Batch of input images of shape (N, C, H, W) forming the last frame of each clip.
- Returns:
List of Mask RCNN predictions for each image in the batch.
- Return type:
list[dict]
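The foreground detection controlled by `foreground_kernel_size` and `foreground_binary_threshold` can be pictured with the following sketch on small grayscale frames stored as nested lists. It is an assumption-laden illustration: the helper names, the fixed 3x3 kernel weights, the border clamping, and the order of operations (blur, then difference, then threshold) are not taken from the library.

```python
# 3x3 Gaussian approximation; the weights sum to 16.
KERNEL = ((1, 2, 1), (2, 4, 2), (1, 2, 1))


def blur3(image):
    """Blur a grayscale image (list of rows) with the 3x3 kernel, clamping at borders."""
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    acc += KERNEL[dy + 1][dx + 1] * image[yy][xx]
            blurred[y][x] = acc / 16
    return blurred


def foreground_mask(first_frame, last_frame, binary_threshold=18):
    """Mark pixels whose blurred intensity changed by more than the threshold."""
    first_blurred, last_blurred = blur3(first_frame), blur3(last_frame)
    return [
        [1 if abs(p - q) > binary_threshold else 0 for p, q in zip(row_a, row_b)]
        for row_a, row_b in zip(first_blurred, last_blurred)
    ]


# A single bright pixel appears in the last frame of an otherwise black clip.
first = [[0] * 5 for _ in range(5)]
last = [row[:] for row in first]
last[2][2] = 255
mask = foreground_mask(first, last)
```

In this toy case the blur spreads the change to the four direct neighbours, which exceed the threshold, while the diagonal neighbours stay below it. Bounding boxes around connected foreground pixels would then be added to the detector's regions.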
- post_process_bbox_detections(regions)
Post-process the region detections.
The region detections are filtered based on class label, bbox area and overlap with other regions.
- Parameters:
regions (list[dict[str, torch.Tensor]]) – Region detections for a batch of images, generated by the region extraction module.
- Returns:
Filtered regions
- Return type:
list[dict[str, torch.Tensor]]
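The three filters named above (class label, box area, overlap) can be sketched as follows for boxes given as `(x1, y1, x2, y2)` tuples. The function name `filter_regions` is hypothetical, the label value 1 for "person" assumes COCO-style labels, and the overlap criterion (share of a smaller box covered by an already kept larger box) is one plausible choice, not necessarily the one anomalib uses.

```python
PERSON_LABEL = 1  # assumption: COCO-style labels, where 1 means "person"


def box_area(box):
    x1, y1, x2, y2 = box
    return max(x2 - x1, 0) * max(y2 - y1, 0)


def intersection_area(a, b):
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def filter_regions(boxes, labels, persons_only=False, min_bbox_area=100, max_bbox_overlap=0.65):
    """Return the indices of the boxes that survive all three filters."""
    keep = list(range(len(boxes)))
    if persons_only:
        keep = [i for i in keep if labels[i] == PERSON_LABEL]
    # Drop degenerate boxes and boxes smaller than the minimum area.
    keep = [i for i in keep if box_area(boxes[i]) > 0 and box_area(boxes[i]) >= min_bbox_area]
    # Visit boxes from largest to smallest; drop a box when too much of it
    # lies inside a box that was already kept.
    keep.sort(key=lambda i: box_area(boxes[i]), reverse=True)
    kept = []
    for i in keep:
        area = box_area(boxes[i])
        if all(intersection_area(boxes[i], boxes[j]) / area <= max_bbox_overlap for j in kept):
            kept.append(i)
    return sorted(kept)


boxes = [(0, 0, 20, 20), (2, 2, 12, 12), (0, 0, 5, 5), (30, 30, 50, 50)]
labels = [1, 1, 1, 3]
kept_all = filter_regions(boxes, labels)
kept_persons = filter_regions(boxes, labels, persons_only=True)
```

Here the second box is removed because it lies entirely inside the first, and the third because its area of 25 is below `min_bbox_area`.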
- static subsample_regions(regions, indices)
Subsample the items in a region dictionary based on a Tensor of indices.
- Parameters:
regions (dict[str, torch.Tensor]) – Region detections for a single image in the batch.
indices (torch.Tensor) – Indices of region detections that should be kept.
- Returns:
Subsampled region detections.
- Return type:
dict[str, torch.Tensor]
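Subsampling applies the same index selection to every entry of the region dictionary so that boxes, scores and labels stay aligned. A minimal sketch using plain lists instead of tensors (the function name and the dictionary keys shown are illustrative assumptions):

```python
def subsample_regions_sketch(regions, indices):
    """Keep only the entries at the given indices for every key in the dictionary."""
    return {key: [values[i] for i in indices] for key, values in regions.items()}


regions = {
    "boxes": [(0, 0, 10, 10), (5, 5, 20, 20), (30, 30, 40, 40)],
    "scores": [0.95, 0.40, 0.88],
    "labels": [1, 1, 3],
}
subsampled = subsample_regions_sketch(regions, [0, 2])
```

With tensors, the list comprehension would typically be replaced by index selection on each value, but the per-key alignment is the same.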
Optical Flow extraction module for AI-VAD implementation.
- class anomalib.models.video.ai_vad.flow.FlowExtractor(*args, **kwargs)
Bases: Module
Optical Flow extractor.
Computes the pixel displacement between 2 consecutive frames from a video clip.
- forward(first_frame, last_frame)
Forward pass through the flow extractor.
- Parameters:
first_frame (torch.Tensor) – Batch of starting frames of shape (N, 3, H, W).
last_frame (torch.Tensor) – Batch of last frames of shape (N, 3, H, W).
- Returns:
Estimated optical flow map of shape (N, 2, H, W).
- Return type:
Tensor
- pre_process(first_frame, last_frame)
Resize inputs to dimensions required by backbone.
- Parameters:
first_frame (torch.Tensor) – Starting frame of optical flow computation.
last_frame (torch.Tensor) – Last frame of optical flow computation.
- Returns:
Preprocessed first and last frame.
- Return type:
tuple[torch.Tensor, torch.Tensor]
Density estimation module for AI-VAD model implementation.
- class anomalib.models.video.ai_vad.density.BaseDensityEstimator(*args, **kwargs)
Bases: Module, ABC
Base density estimator.
- abstract fit()
Compose model using collected features.
- Return type:
None
- forward(features)
Update or predict depending on training status.
- Return type:
Tensor | tuple[Tensor, Tensor] | None
- abstract predict(features)
Predict the density of a set of features.
- Return type:
Tensor | tuple[Tensor, Tensor]
- abstract update(features, group=None)
Update the density model with a new set of features.
- Return type:
None
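The update/fit/predict contract, with `forward` dispatching on the training status, can be sketched without any deep-learning dependency. The class names below and the plain `training` attribute are stand-ins for illustration (in the real class the flag would come from the `Module` base); the toy subclass scores 1-D features by their distance to the training mean.

```python
from abc import ABC, abstractmethod


class DensityEstimatorSketch(ABC):
    """Stand-in for the base class: collect features, fit once, then predict."""

    def __init__(self):
        self.training = True  # plain attribute standing in for the module's training flag

    @abstractmethod
    def update(self, features, group=None):
        """Store a new set of features."""

    @abstractmethod
    def fit(self):
        """Compose the model from the collected features."""

    @abstractmethod
    def predict(self, features):
        """Score a set of features."""

    def forward(self, features):
        # Update while training, predict otherwise.
        if self.training:
            self.update(features)
            return None
        return self.predict(features)


class MeanDistanceEstimator(DensityEstimatorSketch):
    """Toy subclass: the score is the absolute distance to the training mean."""

    def __init__(self):
        super().__init__()
        self.bank = []
        self.mean = None

    def update(self, features, group=None):
        self.bank.extend(features)

    def fit(self):
        self.mean = sum(self.bank) / len(self.bank)

    def predict(self, features):
        return [abs(value - self.mean) for value in features]


estimator = MeanDistanceEstimator()
train_output = estimator.forward([1.0, 3.0])  # training: features are only stored
estimator.fit()
estimator.training = False
scores = estimator.forward([2.0, 5.0])  # inference: features are scored
```

This mirrors the documented return type of `forward`, which is `None` during training and a score otherwise.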
- class anomalib.models.video.ai_vad.density.CombinedDensityEstimator(use_pose_features=True, use_deep_features=True, use_velocity_features=False, n_neighbors_pose=1, n_neighbors_deep=1, n_components_velocity=5)
Bases: BaseDensityEstimator
Density estimator for AI-VAD.
Combines density estimators for the different feature types included in the model.
- Parameters:
use_pose_features (bool) – Flag indicating if pose features should be used. Defaults to True.
use_deep_features (bool) – Flag indicating if deep features should be used. Defaults to True.
use_velocity_features (bool) – Flag indicating if velocity features should be used. Defaults to False.
n_neighbors_pose (int) – Number of neighbors used in KNN density estimation for pose features. Defaults to 1.
n_neighbors_deep (int) – Number of neighbors used in KNN density estimation for deep features. Defaults to 1.
n_components_velocity (int) – Number of components used by GMM density estimation for velocity features. Defaults to 5.
- fit()
Fit the density estimation models on the collected features.
- Return type:
None
- predict(features)
Predict the region- and image-level anomaly scores for an image based on a set of features.
- Parameters:
features (dict[Tensor]) – Dictionary containing extracted features for a single frame.
- Returns:
Tensor: Region-level anomaly scores for all regions within the frame.
Tensor: Frame-level anomaly score for the frame.
- Return type:
Tensor
- update(features, group=None)
Update the density estimators for the different feature types.
- Parameters:
features (dict[FeatureType, torch.Tensor]) – Dictionary containing extracted features for a single frame.
group (str) – Identifier of the video from which the frame was sampled. Used for grouped density estimation.
- Return type:
None
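One way to picture how per-feature density scores could be merged into region-level and frame-level scores is shown below. The aggregation rule used here (sum the scores of the enabled feature types per region, then take the maximum over regions as the frame score) is an assumption for illustration, as is the function name `combine_scores`.

```python
def combine_scores(per_feature_scores):
    """Merge per-feature region scores into region-level and frame-level scores.

    per_feature_scores: dict mapping a feature name to a list with one score
    per region; all lists are expected to have the same length.
    """
    score_lists = list(per_feature_scores.values())
    if not score_lists or not score_lists[0]:
        return [], 0.0  # no enabled features or no regions in the frame
    # Assumed rule: add up the scores of all feature types for each region.
    region_scores = [sum(scores) for scores in zip(*score_lists)]
    # Assumed rule: the most anomalous region determines the frame score.
    return region_scores, max(region_scores)


region_scores, frame_score = combine_scores(
    {
        "velocity": [0.25, 0.5],
        "pose": [0.25, 0.0],
        "deep": [0.5, 1.0],
    }
)
```

Disabling a feature stream, for example via `use_velocity_features=False`, would correspond to leaving its entry out of the dictionary.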
- class anomalib.models.video.ai_vad.density.GMMEstimator(n_components=2)
Bases: BaseDensityEstimator
Density estimation based on Gaussian Mixture Model.
- Parameters:
n_components (int) – Number of components used in the GMM. Defaults to 2.
- fit()
Fit the GMM and compute normalization statistics.
- Return type:
None
- predict(features, normalize=True)
Predict the density of a set of feature vectors.
- Parameters:
features (torch.Tensor) – Input feature vectors.
normalize (bool) – Flag indicating if the density should be normalized to min-max stats of the feature bank. Defaults to True.
- Returns:
Density scores of the input feature vectors.
- Return type:
Tensor
- update(features, group=None)
Update the feature bank.
- Return type:
None
- class anomalib.models.video.ai_vad.density.GroupedKNNEstimator(n_neighbors)
Bases: DynamicBufferMixin, BaseDensityEstimator
Grouped KNN density estimator.
Keeps track of the group (e.g. video id) from which the features were sampled for normalization purposes.
- Parameters:
n_neighbors (int) – Number of neighbors used in KNN search.
- fit()
Fit the KNN model by stacking the feature vectors and computing the normalization statistics.
- Return type:
None
- predict(features, group=None, n_neighbors=1, normalize=True)
Predict the (normalized) density for a set of features.
- Parameters:
features (torch.Tensor) – Input features that will be compared to the density model.
group (str, optional) – Group (video id) from which the features originate. If passed, all features of the same group in the memory bank will be excluded from the density estimation. Defaults to None.
n_neighbors (int) – Number of neighbors used in the KNN search. Defaults to 1.
normalize (bool) – Flag indicating if the density should be normalized to min-max stats of the feature bank. Defaults to True.
- Returns:
Mean (normalized) distances of input feature vectors to k nearest neighbors in feature bank.
- Return type:
Tensor
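The grouped search can be sketched with the standard library alone. The helper names `knn_scores` and `min_max_normalize` are hypothetical, Euclidean distance is an assumption, and `low` and `high` stand for whatever min-max statistics were computed when the estimator was fitted.

```python
import math


def knn_scores(features, bank, bank_groups, group=None, n_neighbors=1):
    """Mean Euclidean distance of each feature vector to its k nearest bank entries.

    When a group is given, bank entries from that same group are excluded.
    """
    candidates = [vec for vec, g in zip(bank, bank_groups) if group is None or g != group]
    scores = []
    for feature in features:
        distances = sorted(math.dist(feature, vec) for vec in candidates)
        nearest = distances[:n_neighbors]
        scores.append(sum(nearest) / len(nearest))
    return scores


def min_max_normalize(scores, low, high):
    """Rescale scores using min-max statistics assumed to come from fitting."""
    return [(score - low) / (high - low) for score in scores]


bank = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
bank_groups = ["video_a", "video_a", "video_b"]
query = [(0.0, 0.0)]

plain = knn_scores(query, bank, bank_groups)
excluded = knn_scores(query, bank, bank_groups, group="video_a")
two_neighbors = knn_scores(query, bank, bank_groups, n_neighbors=2)
normalized = min_max_normalize(two_neighbors, low=0.0, high=10.0)
```

Excluding the query's own group prevents features from one video from trivially matching other frames of the same video, which is useful when the bank itself is scored to derive normalization statistics.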
- update(features, group=None)
Update the internal feature bank while keeping track of the group.
- Parameters:
features (torch.Tensor) – Feature vectors extracted from a video frame.
group (str) – Identifier of the group (video) from which the frame was sampled.
- Return type:
None