In deep learning with PyTorch, data handling is often the biggest bottleneck during model training. A poor data pipeline can slow training dramatically, sometimes more than model complexity or hardware limits do. PyTorch’s Dataset and DataLoader classes provide flexible abstractions for loading, preprocessing, batching, shuffling, augmentation, and multi-worker parallel loading.
This comprehensive PyTorch DataLoader tutorial covers everything from basics to advanced optimisation techniques. You’ll learn how to use built-in datasets, create custom datasets, apply transforms for data augmentation, configure num_workers and pin_memory for maximum speed, and build scalable training pipelines.
Whether you’re training on MNIST, CIFAR-10, custom images, text data, or large-scale datasets, mastering torch.utils.data.DataLoader will accelerate your workflows and improve model performance.
Key Takeaways – PyTorch DataLoader Essentials
- Dataset + DataLoader simplify batching, shuffling, parallel loading, and GPU transfers in PyTorch.
- Built-in datasets (torchvision, torchtext) enable fast prototyping with MNIST, CIFAR-10/100, ImageNet, IMDB, etc.
- ImageFolder loads custom image datasets using folder structure for class labels automatically.
- Optimise performance with num_workers (multiprocessing), pin_memory (faster CPU-to-GPU), prefetch_factor, and drop_last.
- Transforms (from torchvision.transforms) handle resizing, normalization, augmentation (RandomCrop, RandomHorizontalFlip, etc.).
- Custom Dataset classes require only __len__() and __getitem__() for full control.
- Efficient data loading often affects real-world training speed more than architecture tweaks do, and augmentation choices can affect final accuracy.
Prerequisites
- Python 3.8+ and PyTorch 2.0+ (install via pip install torch torchvision)
- Basic Python classes/OOP knowledge
- Optional: NVIDIA GPU with CUDA for best performance (test with torch.cuda.is_available())
Why Data Handling Matters in PyTorch Deep Learning
Most time in real deep learning projects goes into data: cleaning, preprocessing, loading, and augmenting. Bad data pipelines cause:
- CPU/GPU underutilization
- Slow epochs
- Out-of-memory errors
- Poor generalization
A PyTorch Dataset gives indexed access to samples and their labels, and it can load each sample lazily on request instead of holding everything in memory. A DataLoader wraps a Dataset in an iterable that yields batches.
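As a minimal sketch of how the two classes fit together, the snippet below wraps random tensors in a TensorDataset and iterates over it with a DataLoader. The random data and the variable names are stand-ins chosen for illustration, not part of any real dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data (assumption): 10 samples with 3 features each, plus integer labels.
features = torch.randn(10, 3)
labels = torch.arange(10)

# A Dataset answers two questions: how many samples, and what is sample i?
dataset = TensorDataset(features, labels)
first_x, first_y = dataset[0]

# A DataLoader turns those indexed samples into mini-batches.
loader = DataLoader(dataset, batch_size=4, shuffle=False)
batch_shapes = [tuple(x.shape) for x, y in loader]
print(len(dataset), batch_shapes)
```

With 10 samples and batch_size=4, the loader yields two full batches and one smaller final batch.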
Built-in Datasets in torchvision and torchtext
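torchvision provides computer vision datasets; torchtext handles NLP/text. Note that torchtext is no longer actively developed, so check that it is compatible with your PyTorch version before relying on it.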
Popular torchvision datasets (2025-2026):
- MNIST / FashionMNIST — 28×28 grayscale images (digits / clothing), 60k train + 10k test
- CIFAR-10 / CIFAR-100 — 32×32 color images (10/100 classes: planes, cars, animals, etc.)
- ImageNet — 1.2M+ images, 1,000 classes (high-end hardware recommended)
- COCO — Object detection, segmentation, captions
- Others: EMNIST, STL10, SVHN, Kinetics-400 (video)
Loading example:
from torchvision import datasets
mnist = datasets.MNIST(root='./data', train=True, download=True)
cifar = datasets.CIFAR10(root='./data', train=True, download=True)
torchtext datasets:
- IMDB — 50k movie reviews for sentiment analysis
- WikiText-2 / WikiText-103 — Language modelling with Wikipedia text
Loading Custom Image Datasets with ImageFolder
For your own images, organise like this:
dataset/
├── class_apple/
│   ├── img1.jpg
│   └── img2.jpg
└── class_orange/
    ├── img1.jpg
    └── img2.jpg
Load automatically:
from torchvision.datasets import ImageFolder
custom_dataset = ImageFolder(root='dataset/', transform=your_transforms)
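Folder names are sorted alphabetically and mapped to integer labels (class_apple → 0, class_orange → 1, and so on); the mapping is stored in custom_dataset.class_to_idx.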
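ImageFolder assigns label indices in sorted folder-name order. The sketch below reproduces that rule with the standard library only, so it runs without any image files. folder_label_map is a hypothetical helper written for illustration, not torchvision code.

```python
import os
import tempfile


def folder_label_map(root):
    """Map each sub-folder name to an integer index, in sorted order."""
    classes = sorted(
        entry.name for entry in os.scandir(root) if entry.is_dir()
    )
    return {name: index for index, name in enumerate(classes)}


# Build a throwaway directory tree that mirrors the layout above.
with tempfile.TemporaryDirectory() as root:
    for name in ("class_orange", "class_apple"):
        os.makedirs(os.path.join(root, name))
    mapping = folder_label_map(root)

print(mapping)
```

The creation order of the folders does not matter: class_apple gets index 0 because it sorts first.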
Understanding PyTorch DataLoader Parameters
Import:
from torch.utils.data import DataLoader
Core constructor:
dataloader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,      # Parallel subprocesses (starting point: ≈ CPU cores - 2)
    pin_memory=True,    # Faster CPU → GPU copy (use with CUDA)
    drop_last=True,     # Drop incomplete final batch
    collate_fn=None,    # Custom batch merging (e.g., padded sequences)
    prefetch_factor=2   # Batches prefetched per worker (PyTorch 1.7+; needs num_workers > 0)
)
- batch_size: Samples per forward/backward pass (32–256 common; GPU memory limit)
- shuffle: True for training (prevents order bias); False for validation/test
- num_workers: 0 = main process only (slow); 4–16 typical on modern CPUs
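- pin_memory: True + CUDA → batches are placed in pinned (page-locked) memory → faster host-to-GPU copies, which can be asynchronous when you pass non_blocking=True to .to(device)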
- drop_last: Avoids uneven batches in training
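To make the effect of batch_size and drop_last concrete, here is a small sketch using 1,000 random stand-in samples (an assumption for illustration). It compares the number of batches per epoch and the size of the final batch with and without drop_last.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 1,000 stand-in samples: 15 full batches of 64, plus 40 left over.
dataset = TensorDataset(torch.randn(1000, 8), torch.zeros(1000))

keep_last = DataLoader(dataset, batch_size=64, shuffle=True, drop_last=False)
drop_last = DataLoader(dataset, batch_size=64, shuffle=True, drop_last=True)

# len() of a DataLoader is the number of batches per epoch.
print(len(keep_last), len(drop_last))

# Size of the final batch in each case.
last_kept = [x.shape[0] for x, _ in keep_last][-1]
last_dropped = [x.shape[0] for x, _ in drop_last][-1]
print(last_kept, last_dropped)
```

With drop_last=True the 40 leftover samples are skipped in that epoch; because shuffle=True reshuffles every epoch, a different 40 are skipped each time.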
Best practice (2025-2026): Treat num_workers = os.cpu_count() // 2 (or CPU cores - 2) as a starting point and benchmark from there; use pin_memory=True when training on a GPU.
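Since the best num_workers value depends on the machine and the dataset, one option is to time a single pass over the data for a few candidate values. The sketch below is a rough approach, not a rigorous benchmark; time_one_epoch is a hypothetical helper and the random tensors are stand-in data.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_one_epoch(dataset, num_workers, batch_size=64):
    """Return the seconds needed to iterate over the dataset once."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    start = time.perf_counter()
    for _ in loader:
        pass  # a real benchmark would include the training step here
    return time.perf_counter() - start


# Stand-in data (assumption); substitute the real dataset when benchmarking.
dataset = TensorDataset(torch.randn(2048, 16), torch.zeros(2048))

# Baseline: all loading happens in the main process.
baseline = time_one_epoch(dataset, num_workers=0)
print(f"num_workers=0: {baseline:.4f}s")
```

Calling time_one_epoch again with 2, 4, 8 and so on shows where the gains level off. In-memory tensors like these rarely benefit from extra workers; the difference appears with disk reads and heavy transforms. On platforms that start workers with spawn (Windows, macOS), run multi-worker code under an if __name__ == "__main__": guard.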
Practical Example: MNIST with DataLoader + Transforms
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=4, pin_memory=True)
# Iterate
for images, labels in train_loader:
    # images.shape: [64, 1, 28, 28]
    # labels.shape: [64]
    break
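The constants passed to Normalize above are the commonly quoted mean and standard deviation of the MNIST training pixels. For a custom dataset, similar statistics can be estimated by accumulating sums over a DataLoader. The sketch below uses random images as stand-in data (an assumption) so it runs without a download.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for a grayscale image dataset: 500 images of shape [1, 28, 28].
images = torch.rand(500, 1, 28, 28)
dataset = TensorDataset(images, torch.zeros(500))
loader = DataLoader(dataset, batch_size=100, shuffle=False)

# Accumulate per-channel sums so the full dataset never has to sit in one batch.
pixel_sum = torch.zeros(1)
pixel_sq_sum = torch.zeros(1)
pixel_count = 0
for batch, _ in loader:
    pixel_sum += batch.sum(dim=(0, 2, 3))
    pixel_sq_sum += (batch ** 2).sum(dim=(0, 2, 3))
    pixel_count += batch.shape[0] * batch.shape[2] * batch.shape[3]

mean = pixel_sum / pixel_count
std = (pixel_sq_sum / pixel_count - mean ** 2).sqrt()  # population std
print(mean, std)
```

The resulting values would then be passed to transforms.Normalize. For RGB data, the accumulators need three entries, one per channel.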
GPU-Optimised Loading
device = "cuda" if torch.cuda.is_available() else "cpu"
kwargs = {'num_workers': 4, 'pin_memory': True} if device == 'cuda' else {}
train_loader = DataLoader(..., **kwargs)
Move model & data: model.to(device) and images, labels = images.to(device), labels.to(device)
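Put together, a device-aware training loop might look like the sketch below. The linear model, the random data, and the hyperparameters are hypothetical placeholders; the point is where the .to(device) calls sit and how non_blocking pairs with pin_memory.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_cuda = device.type == "cuda"

# Stand-in data and model (assumptions, for illustration only).
dataset = TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True, pin_memory=use_cuda)
model = nn.Linear(20, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for inputs, targets in loader:
    # non_blocking=True lets the copy overlap with compute only when the
    # source tensor is in pinned memory; on CPU it has no effect.
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

print(f"last batch loss: {loss.item():.4f}")
```

Enabling pin_memory only when CUDA is available keeps the same script usable on CPU-only machines.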
PyTorch Transforms for Preprocessing & Augmentation
Chain in transforms.Compose:
transform = transforms.Compose([
    transforms.Resize(224),  # For models like ResNet
    transforms.RandomCrop(224, padding=4),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # ImageNet stats
])
Common transforms:
- RandomCrop, CenterCrop, RandomResizedCrop
- RandomHorizontalFlip, RandomRotation
- Normalize, ToTensor
- ColorJitter (brightness/contrast/saturation)
CIFAR-10 Example with Visualization
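import matplotlib.pyplot as plt
import numpy as np
import torchvision
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
# CIFAR-specific transform: mean 0.5 / std 0.5 maps pixels to [-1, 1],
# which is what the "img / 2 + 0.5" step in imshow() undoes.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)
def imshow(img):
    img = img / 2 + 0.5  # unnormalize back to [0, 1]
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))  # CHW -> HWC for matplotlib
    plt.show()
images, labels = next(iter(trainloader))
imshow(torchvision.utils.make_grid(images))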
Creating Custom Datasets in PyTorch
Subclass torch.utils.data.Dataset:
import torch
from torch.utils.data import Dataset, DataLoader
class SquareDataset(Dataset):
    def __init__(self, a=0, b=1000):
        self.a = a
        self.b = b
    def __len__(self):
        # Both endpoints are included, hence the + 1
        return self.b - self.a + 1
    def __getitem__(self, idx):
        value = self.a + idx
        return torch.tensor(value, dtype=torch.float), torch.tensor(value ** 2, dtype=torch.float)
dataset = SquareDataset(1, 10000)
loader = DataLoader(dataset, batch_size=128, shuffle=True)
For real files (e.g., CSV + images):
- Load paths/labels in __init__
- Read/augment in __getitem__ (lazy loading)
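The two steps above can be sketched as follows. CsvFileDataset, the CSV column names (path, label), and the file layout are all hypothetical. Tensors saved with torch.save stand in for real image files so the example runs without extra data; a real version would open images (for example with Pillow) and apply transforms in __getitem__.

```python
import csv
import os
import tempfile

import torch
from torch.utils.data import DataLoader, Dataset


class CsvFileDataset(Dataset):
    """Hypothetical dataset backed by a CSV index of (file path, label) rows."""

    def __init__(self, csv_path):
        # Only the small index is read up front.
        with open(csv_path, newline="") as handle:
            self.rows = [(row["path"], int(row["label"]))
                         for row in csv.DictReader(handle)]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        path, label = self.rows[idx]
        # The sample itself is read from disk only when requested.
        sample = torch.load(path)
        return sample, label


# Build a tiny throwaway dataset: six saved tensors plus an index file.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "index.csv")
with open(csv_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["path", "label"])
    for i in range(6):
        path = os.path.join(tmp_dir, f"sample_{i}.pt")
        torch.save(torch.full((3, 8, 8), float(i)), path)
        writer.writerow([path, i % 2])

dataset = CsvFileDataset(csv_path)
loader = DataLoader(dataset, batch_size=4, shuffle=False)
samples, labels = next(iter(loader))
print(samples.shape, labels)
```

Keeping __init__ cheap and deferring file reads to __getitem__ is what lets DataLoader workers load samples in parallel.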
FAQ – Common PyTorch DataLoader Questions
- What is PyTorch DataLoader used for? Mini-batch loading, shuffling, parallel I/O, GPU optimisation.
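- How to choose num_workers? Start with CPU cores – 2 (or half the cores) and monitor CPU usage; too many workers add process and memory overhead.
- When to use pin_memory=True? Generally when training on a CUDA GPU; it speeds up host-to-GPU transfers, most noticeably together with non_blocking=True.
- Why is my DataLoader slow? Common causes are too few workers, a slow disk, and heavy transforms; increase num_workers, use an SSD, and move batch-friendly transforms (e.g. normalisation) to the GPU where possible.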
- How to handle variable-length data (e.g., text)? Custom collate_fn (pad sequences).
- drop_last=True or False? True for stable batch norms in training; False for full evaluation.
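For the variable-length case, a custom collate_fn can pad each batch to its longest sequence. The sketch below uses torch.nn.utils.rnn.pad_sequence; the token IDs, the pad value 0, and the pad_collate name are assumptions for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Stand-in "tokenised sentences" of different lengths, each with a label.
samples = [
    (torch.tensor([4, 9, 2]), 1),
    (torch.tensor([7]), 0),
    (torch.tensor([3, 3, 8, 1, 5]), 1),
    (torch.tensor([6, 2]), 0),
]


def pad_collate(batch):
    """Pad every sequence in the batch to the longest one (pad value 0)."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)


# A plain list works as a map-style dataset: it has __len__ and __getitem__.
loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=pad_collate)
batches = list(loader)
padded, lengths, labels = batches[1]
print(padded)
```

Returning the original lengths alongside the padded tensor lets the model mask out or pack the padding later.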
Summary
This PyTorch DataLoader guide equips you to build efficient, scalable data pipelines. Start with built-in datasets → add transforms → optimise with num_workers & pin_memory → create custom Dataset classes for production.
Mastering these tools reduces training time, prevents bottlenecks, and lets you focus on modelling.
Recommended Resources
- Official PyTorch Docs: Datasets & DataLoaders
- Writing Custom Datasets, DataLoaders, Transforms
- torch.utils.data API
- torchvision.transforms reference
- PyTorch Examples GitHub