Datasets#
This guide explains how datasets work in Anomalib, from the base implementation to specific dataset types and how to create your own dataset.
Base Dataset Structure#
Anomalib’s dataset system is built on top of PyTorch’s Dataset
class and uses pandas DataFrames to manage dataset samples. The base class AnomalibDataset
provides the foundation for all dataset implementations.
Core Components#
The dataset consists of three main components:
Samples DataFrame: The heart of each dataset is a DataFrame containing:
image_path: Path to the image file
split: Dataset split (train/test/val)
label_index: Label index (0 for normal, 1 for anomalous)
mask_path: Path to mask file (for segmentation tasks)
Example DataFrame:
df = pd.DataFrame({
    'image_path': ['path/to/image.png'],
    'label': ['anomalous'],
    'label_index': [1],
    'mask_path': ['path/to/mask.png'],
    'split': ['train']
})
Transforms: Optional transformations applied to images
Task Type: Classification or Segmentation
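All three components can be inspected on a dataset instance. A minimal sketch, assuming the MVTec AD data used later in this guide is available under ./datasets/MVTec (the attrs["task"] key mirrors how the custom dataset at the end of this guide stores its task type):
from anomalib.data.datasets import MVTecDataset

dataset = MVTecDataset(root="./datasets/MVTec", category="bottle", split="train")

print(len(dataset))                       # number of samples
print(dataset.samples.head())             # the samples DataFrame
print(dataset.samples.attrs.get("task"))  # task type, if stored on the DataFrame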
Dataset Types#
Anomalib supports different types of datasets based on modality:
1. Image Datasets#
The most common type, supporting RGB images:
from anomalib.data.datasets import MVTecDataset
# Create MVTec dataset
dataset = MVTecDataset(
root="./datasets/MVTec",
category="bottle",
split="train"
)
# Access an item
item = dataset[0]
print(item.image.shape) # RGB image
print(item.gt_label.item()) # Label (0 or 1)
print(item.gt_mask.shape) # Segmentation mask (if available)
2. Video Datasets#
For video anomaly detection:
from anomalib.data.datasets import AvenueDataset
# Create video dataset
dataset = AvenueDataset(
    root="./datasets/avenue",
    split="test",
)
# Access an item
item = dataset[0]
print(item.frames.shape) # Video frames
print(item.target_frame) # Frame number
3. Depth Datasets#
For RGB-D or depth-only data:
from anomalib.data.datasets import MVTec3DDataset
# Create depth dataset
dataset = MVTec3DDataset(
root="./datasets/MVTec3D",
category="bagel",
split="train",
)
# Access an item
item = dataset[0]
print(item.image.shape) # RGB image
print(item.depth_map.shape) # Depth map
Dataset Loading Process#
The dataset loading process follows these steps:
Initialization:
def __init__(self, transform=None):
    self.transform = transform
    self._samples = None
    self._category = None
Sample Collection:
@property
def samples(self):
    if self._samples is None:
        raise RuntimeError("Samples DataFrame not set")
    return self._samples
Item Loading:
def __getitem__(self, index):
    sample = self.samples.iloc[index]
    image = read_image(sample.image_path)
    if self.transform:
        image = self.transform(image)
    return ImageItem(
        image=image,
        gt_label=sample.label_index,
    )
Integration with Dataclasses#
Anomalib datasets are designed to work seamlessly with the dataclass system. When you access items from a dataset:
Single items are returned as Item objects (e.g., ImageItem, VideoItem, DepthItem)
When used with PyTorch’s DataLoader, items are automatically collated into Batch objects (e.g., ImageBatch, VideoBatch, DepthBatch)
For example:
# Single item access returns an Item object
item = dataset[0] # Returns ImageItem
# DataLoader automatically creates Batch objects
dataloader = DataLoader(dataset, batch_size=32)
batch = next(iter(dataloader)) # Returns ImageBatch
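Batch objects expose the same field names as their Item counterparts, with an added batch dimension. Continuing the example above (the printed shapes assume 3-channel 256×256 images and are illustrative):
print(batch.image.shape)     # e.g. torch.Size([32, 3, 256, 256])
print(batch.gt_label.shape)  # e.g. torch.Size([32])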
See also
For more details on working with Item and Batch objects, see the dataclasses guide.
Creating Custom Datasets#
To create a custom dataset, extend the AnomalibDataset
class:
from anomalib.data.datasets.base import AnomalibDataset
from pathlib import Path
import pandas as pd
class CustomDataset(AnomalibDataset):
    """Custom dataset implementation."""

    def __init__(
        self,
        root: Path | str = "./datasets/Custom",
        category: str = "default",
        transform=None,
        split=None,
    ):
        super().__init__(transform=transform)

        # Set up dataset
        self.root = Path(root)
        self.category = category
        self.split = split

        # Create samples DataFrame
        self.samples = self._make_dataset()

    def _make_dataset(self) -> pd.DataFrame:
        """Create dataset samples DataFrame."""
        samples_list = []

        # Collect normal samples
        normal_path = self.root / "normal"
        for image_path in normal_path.glob("*.png"):
            samples_list.append({
                "image_path": str(image_path),
                "label": "normal",
                "label_index": 0,
                "split": "train",
            })

        # Collect anomalous samples
        anomaly_path = self.root / "anomaly"
        for image_path in anomaly_path.glob("*.png"):
            mask_path = anomaly_path / "masks" / f"{image_path.stem}_mask.png"
            samples_list.append({
                "image_path": str(image_path),
                "label": "anomaly",
                "label_index": 1,
                "mask_path": str(mask_path),
                "split": "test",
            })

        # Create DataFrame
        samples = pd.DataFrame(samples_list)
        samples.attrs["task"] = "segmentation"
        return samples
Expected Directory Structure#
For the custom dataset above:
datasets/
└── Custom/
├── normal/
│ ├── 001.png
│ ├── 002.png
│ └── ...
└── anomaly/
├── 001.png
├── 002.png
└── masks/
├── 001_mask.png
├── 002_mask.png
└── ...
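With this layout in place, the custom dataset behaves like the built-in ones. A minimal usage sketch (CustomDataset is the hypothetical class defined above):
dataset = CustomDataset(root="./datasets/Custom")

item = dataset[0]
print(item.image.shape)      # RGB image
print(item.gt_label.item())  # 0 for normal, 1 for anomalous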
Best Practices#
Data Organization:
Keep consistent directory structure
Use clear naming conventions
Separate train/test splits
Validation (see the sketch after this list):
Validate image paths exist
Ensure mask-image correspondence
Check label consistency
Performance:
Use appropriate data types
Implement efficient data loading
Cache frequently accessed data
Error Handling:
Provide clear error messages
Handle missing files gracefully
Validate input parameters
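The validation practices above can be automated before the DataFrame is handed to the dataset. A minimal sketch, assuming the samples DataFrame layout described earlier (validate_samples is a hypothetical helper, not part of Anomalib):
from pathlib import Path
import pandas as pd

def validate_samples(samples: pd.DataFrame) -> None:
    """Check image paths, mask correspondence, and label consistency."""
    # Validate that every image path exists
    missing = [p for p in samples.image_path if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(f"{len(missing)} image(s) not found, e.g. {missing[0]}")

    # Ensure mask-image correspondence for anomalous samples
    if "mask_path" in samples.columns:
        anomalous = samples[samples.label_index == 1]
        for _, row in anomalous.iterrows():
            if pd.isna(row.mask_path) or not Path(row.mask_path).is_file():
                raise FileNotFoundError(f"Missing mask for {row.image_path}")

    # Check label consistency
    if not samples.label_index.isin([0, 1]).all():
        raise ValueError("label_index must be 0 (normal) or 1 (anomalous)")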
Common Pitfalls#
Path Issues:
Incorrect root directory
Missing mask files
Inconsistent file extensions
Data Consistency:
Mismatched image-mask pairs
Inconsistent image sizes
Wrong label assignments
Memory Management:
Loading too many images at once
Not releasing unused resources
Inefficient data structures
Transform Issues:
Incompatible transforms
Missing normalization
Incorrect transform order
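On the transform side, order is the usual culprit: geometric transforms such as resizing should come before normalization, and the normalization statistics must match the backbone. A minimal sketch using torchvision’s v2 transforms (the size and ImageNet statistics here are illustrative assumptions):
import torch
from torchvision.transforms import v2

# Resize first, then convert to float and normalize; reversing this order
# (or normalizing twice) is a common source of subtle errors.
transform = v2.Compose([
    v2.Resize((256, 256)),
    v2.ToDtype(torch.float32, scale=True),  # scales uint8 images to [0, 1]
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])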