Self-Adversarial Multi-scale Contrastive Learning for Semantic Segmentation of Thermal Facial Images
Thermal imaging offers unique advantages for physiological monitoring and affective computing, but segmenting thermal facial images remains a formidable challenge. Unlike RGB images with prominent visual features, thermal images represent temperature distributions that are subtle, dynamic, and easily disrupted by occlusions like glasses or hair. How do we train robust segmentation networks when most existing approaches are designed for RGB images and large-scale thermal datasets are scarce?

Figure 1: Overview of our SAM-CL framework showing how thermal image augmentation (TiAug) creates realistic unconstrained scenarios while the SAM-CL loss function learns robust representations through self-adversarial training.
The Challenge: When RGB Methods Fall Short
Traditional semantic segmentation networks excel on RGB images with rich visual features and abundant training data. However, thermal facial images present unique challenges:
- Subtle temperature variations across facial regions
- Dynamic physiological changes affecting temperature patterns
- Frequent occlusions from glasses, hair, and ambient objects
- Limited datasets acquired mostly in controlled laboratory settings
Existing data augmentation techniques designed for RGB images (brightness, contrast adjustments) don’t translate meaningfully to thermal images, where temperature represents physical properties rather than illumination conditions.
Our Solution: SAM-CL Framework
We introduce the Self-Adversarial Multi-scale Contrastive Learning (SAM-CL) framework, consisting of two key innovations:
1. Thermal Image Augmentation (TiAug) Module
TiAug transforms controlled thermal images into realistic unconstrained scenarios by:
- Synthesizing occluding objects with diverse thermal properties, shapes, and positions
- Adding calibrated thermal noise based on camera sensitivity, e.g., the camera's noise-equivalent temperature difference (NETD)
- Creating realistic temperature distributions that break the bimodal histogram pattern typical of controlled settings

Figure 2: TiAug module generates realistic thermal scenarios with varying ambient conditions and occlusions, moving beyond simple geometric transformations.
The augmentation is mathematically formulated as:
\[I_{aug}^{H\times W} = f_{occ}\big(I_{org}^{H\times W},\, g(\vartheta_{sz}, \vartheta_{sh}, \vartheta_{temp}, \vartheta_{xy}, \vartheta_{config})\big) + \eta^{H\times W}\]
where synthesized objects are characterized by size (\(\vartheta_{sz}\)), shape (\(\vartheta_{sh}\)), temperature (\(\vartheta_{temp}\)), position (\(\vartheta_{xy}\)), and configuration (\(\vartheta_{config}\)), and \(\eta^{H\times W}\) is the calibrated sensor noise.
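As a rough sketch of how such an augmentation can be realized (an illustration, not the authors' implementation; the helper name, parameter ranges, and default NETD value are all assumptions), the following NumPy snippet composites a randomly sized, positioned, and heated elliptical occluder onto a thermal frame and adds NETD-scaled Gaussian sensor noise:
import numpy as np

def tiaug_sketch(thermal, netd=0.05, rng=None):
    """Illustrative TiAug-style augmentation of an (H, W) temperature map.

    netd: assumed noise-equivalent temperature difference of the camera, in K.
    """
    rng = rng or np.random.default_rng()
    H, W = thermal.shape
    aug = thermal.copy()

    # Sample occluder parameters: size (theta_sz), position (theta_xy), and
    # temperature (theta_temp); shape (theta_sh) is fixed to an ellipse here.
    ay, ax = rng.integers(H // 10, H // 3), rng.integers(W // 10, W // 3)
    cy, cx = rng.integers(0, H), rng.integers(0, W)
    t_occ = rng.uniform(15.0, 35.0)  # occluder temperature in degrees C

    # Rasterize the elliptical occluder mask and composite it (f_occ).
    yy, xx = np.ogrid[:H, :W]
    mask = ((yy - cy) / ay) ** 2 + ((xx - cx) / ax) ** 2 <= 1.0
    aug[mask] = t_occ

    # Add sensor noise calibrated to the assumed NETD (the eta term).
    aug += rng.normal(0.0, netd, size=(H, W))
    return aug
A full implementation would additionally sample diverse occluder shapes and multi-object configurations, per the \(\vartheta_{sh}\) and \(\vartheta_{config}\) parameters above.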
2. SAM-CL Loss Function
Rather than using pixel-level contrastive learning that becomes ineffective with limited datasets, we propose logit-level contrastive learning:
\[\mathscr{L}_{SAM\text{-}CL} = \mathscr{L}_{s0}\big(Y_{oh}, Y^{+}_{oh}, Y^{-}_{oh}\big) + \sum_{i=1}^{3}\mathscr{L}_{si}\big(Y_{Conv_i}, Y^{+}_{Conv_i}, Y^{-}_{Conv_i}\big)\]
The key innovation is using class-swapped masks (\(Y^{-}_{oh}\)) as negative samples, enabling effective inter-class separation while preserving spatial structure. Multi-scale supervision through an auxiliary 4-layer network ensures robust feature learning across different resolution levels.

Figure 3: Proposed SAM-CL Loss Function.
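For intuition, here is a minimal PyTorch sketch of a single-scale logit-level contrastive term. The triplet-style hinge, the MSE distance, and the margin value are illustrative choices, not necessarily the paper's exact formulation; what the sketch does preserve is the key mechanism of class-swapped one-hot masks as negatives:
import torch
import torch.nn.functional as F

def class_swapped_negative(y_onehot):
    """Permute the class channels of a one-hot mask (B, C, H, W) to build Y^-.

    Class identities are swapped while spatial structure is preserved.
    (A real implementation would resample if the permutation is the identity.)
    """
    perm = torch.randperm(y_onehot.size(1))
    return y_onehot[:, perm]

def contrastive_term(logits, y_pos, margin=1.0):
    """One scale of a logit-level contrastive loss (illustrative form):
    pull predictions toward the one-hot ground truth Y^+ and push them
    away from the class-swapped negative Y^-."""
    probs = F.softmax(logits, dim=1)
    y_neg = class_swapped_negative(y_pos)
    d_pos = F.mse_loss(probs, y_pos)  # distance to positive
    d_neg = F.mse_loss(probs, y_neg)  # distance to negative
    return F.relu(d_pos - d_neg + margin)

# The full loss sums such terms over four scales (s0 plus the three
# auxiliary-network outputs), mirroring the equation above.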
Experimental Results
We evaluated SAM-CL across multiple state-of-the-art segmentation networks on the Thermal Face Database:
Table 1: Performance improvements with SAM-CL framework
| Network | Baseline (RMI loss) | SAM-CL | Improvement |
|---|---|---|---|
| U-Net | 81.36% | 82.11% | +0.76% |
| Attention U-Net | 81.39% | 82.85% | +1.35% |
| DeepLabV3 | 75.85% | 79.29% | +3.44% |
| HRNetV2 | 78.46% | 78.97% | +0.61% |
Qualitative Analysis: Real-World Robustness

Figure 4: Qualitative comparison showing SAM-CL’s superior performance on unconstrained thermal images with occlusions and varying ambient conditions.
Testing on unconstrained datasets (UBComfort, DeepBreath) reveals SAM-CL’s remarkable generalization:
- Handles eyeglass occlusions effectively despite never seeing them during training
- Adapts to different camera specifications (high-resolution vs. mobile thermal cameras)
- Maintains performance across varying ambient thermal conditions
Key Insights and Broader Impact
Why This Approach Works
- Domain-Specific Design: TiAug generates physically grounded real-world variations (occluder temperature, ambient conditions), not just changes in visual appearance
- Self-Adversarial Learning: Creates challenging scenarios without requiring real unconstrained data
- Multi-scale Supervision: Ensures robust features across different granularities
- Logit-Level Contrastive Learning: More effective than pixel-level approaches for limited datasets
Applications Beyond Thermal Imaging
The SAM-CL framework’s principles extend to any domain with:
- Limited training data
- Significant domain gaps between controlled and real-world conditions
- Need for robust feature learning across scales
Implementation and Reproducibility
Our framework is designed for easy integration:
# SAM-CL can be applied to any segmentation network (illustrative loop)
import torch

model = UNet()  # or AttentionUNet, DeepLabV3, HRNetV2, etc.
optimizer = torch.optim.Adam(model.parameters())

# Training with SAM-CL: TiAug augments each batch and the SAM-CL loss
# supervises the multi-scale predictions
for images, targets in dataloader:
    augmented = tiaug_module(images)          # synthetic occlusions + noise
    predictions = model(augmented)
    loss = sam_cl_loss(predictions, targets)  # multi-scale contrastive loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
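Note that TiAug operates purely on input batches and the SAM-CL loss purely on output logits, so the framework stays architecture-agnostic: apart from the lightweight auxiliary network used for multi-scale supervision during training, no structural changes to the segmentation backbone are needed.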
Looking Forward
This work opens exciting directions for robust computer vision in challenging domains:
- Medical imaging with limited annotated data
- Satellite imagery with varying atmospheric conditions
- Industrial inspection with changing environmental factors
The key insight is profound: when datasets are limited and domain gaps are large, task-specific augmentation paired with appropriate learning strategies can bridge the gap more effectively than simply scaling existing RGB-based approaches.
Source Code: Code is available for the research community to build upon this work.
Citation:
@inproceedings{Joshi_2022_BMVC,
  author    = {Jitesh N Joshi and Nadia Berthouze and Youngjun Cho},
  title     = {Self-adversarial Multi-scale Contrastive Learning for Semantic Segmentation of Thermal Facial Images},
  booktitle = {33rd British Machine Vision Conference 2022, {BMVC} 2022, London, UK, November 21-24, 2022},
  publisher = {BMVA Press},
  year      = {2022},
  url       = {https://bmvc2022.mpi-inf.mpg.de/0864.pdf}
}
This work was conducted at the Department of Computer Science, University College London, under the supervision of Prof. Youngjun Cho. Jitesh Joshi was fully supported by an international studentship secured by Prof. Cho.
For questions or collaborations, please contact: jitesh.joshi.20@ucl.ac.uk