NeurIPS 2025 Spotlight · Dataset & Benchmark

Fire360: Robust Perception & Episodic Memory in Degraded 360° Firefighting Video

A large-scale benchmark of 360° firefighter training videos that probes spatial grounding, temporal understanding, safety-critical reasoning, and transformation-invariant retrieval in the environments where reliability matters most.

Aditi Tiwari, Farzaneh Masoud, Dac Trong Nguyen, Jill Kraft, Heng Ji, Klara Nahrstedt
University of Illinois Urbana–Champaign · Illinois Fire Service Institute

228 videos · 50 hours · 5 tasks · 43.7-point human–model gap on TOR

Overview

Each year, tens of thousands of firefighters are injured in the line of duty, often when visibility is severely reduced and situational perception breaks down. Existing video benchmarks typically assume clean, forward-facing views or synthetic scenes, which makes it difficult to study model behavior in degraded 360° environments where decisions have direct safety implications.


Fire360 is constructed from professionally recorded firefighter training sessions with certified instructors and covers indoor rescues and outdoor operations across day, night, dense smoke, and post-fire overhaul. The benchmark defines five tasks (visual question answering, temporal action captioning, object localization, safety-critical reasoning, and transformed object retrieval) to isolate where human experts retain robust understanding and where state-of-the-art multimodal models still fail.

Dataset

Figure 1: Example Fire360 frames across outdoor day/night, zero-visibility, indoor charred scenes, and smoke-obstructed training environments.
Positioning
Comparison with existing video datasets

Fire360 complements prior video resources by jointly offering 360° views, egocentric and third-person perspectives, synchronized audio, and explicitly safety-critical events.

| Dataset | Third-Person | 360° | Egocentric | Video | Audio | Real-world | Safety-Critical | Duration (s) | Public |
|---|---|---|---|---|---|---|---|---|---|
| Ego4D | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | 10,800,000 | ✓ |
| EPIC-Kitchens | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | 712,800 | ✓ |
| 360+x | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | 244,800 | ✓ |
| HACS++ | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | 500,400 | ✓ |
| Fire360 (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 180,000 | ✓ |
Content Distribution
Scene and action statistics
Figure 3: Scene distribution (indoor vs. outdoor day/night), action categories, and object categories, shown as donut charts with counts and percentages.

Fire360 balances indoor rescue/training (43.9%) with outdoor operations at night (28.5%) and outdoor operations during the day (27.6%), ensuring that models must handle both low-light interiors and open-air incident grounds. Eight safety-relevant action categories support temporal reasoning and analysis of model failure modes.

Benchmark Tasks

360° Visual Question Answering (VQA)
Spatial reasoning · Checklist queries

Models answer expert-authored questions about presence, visibility, and protocol adherence in single equirectangular frames (e.g., “Is a clear egress path visible through the smoke?”).

Metric: Top-1 accuracy
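
For concreteness, scoring this task reduces to exact-match accuracy after light answer normalization. The sketch below is illustrative only; the normalization rule is an assumption, not the benchmark's official protocol.

```python
def normalize(ans: str) -> str:
    """Lowercase and strip punctuation/whitespace so that 'Yes.' matches 'yes'."""
    return ans.strip().strip(".!?").lower()

def top1_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of questions where the model's answer matches the expert label."""
    assert len(predictions) == len(gold)
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical usage:
# top1_accuracy(["Yes.", "no"], ["yes", "Yes"])  # -> 0.5
```
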
Temporal Action Captioning
Procedure summarization · 10–20s clips

Given short degraded clips, models must describe what firefighters are doing in natural language, aligned with standardized training procedures.

Metric: BLEU-4
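
A minimal BLEU-4 scorer for a single caption can be written with NLTK, as sketched below. Sentence-level smoothing is one reasonable choice for short clips where higher-order n-grams are sparse; the official evaluation may use a different BLEU implementation or corpus-level aggregation.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4(reference: str, hypothesis: str) -> float:
    """Sentence-level BLEU-4 with uniform 4-gram weights and smoothing,
    so short captions with missing n-grams do not collapse to zero."""
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    smooth = SmoothingFunction().method1
    return sentence_bleu([ref_tokens], hyp_tokens,
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=smooth)

# Hypothetical usage:
# bleu4("firefighter advances hose line through doorway",
#       "a firefighter advances the hose through the doorway")
```
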
Object Localization under Distortion
Detection · Smoke & geometric stretch

Category-agnostic localization of safety gear (SCBA, helmets, hoses) in 360° equirectangular frames, where poles exhibit strong geometric distortion.

Metric: Mean IoU
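
The metric itself is standard box IoU averaged over matched prediction/ground-truth pairs, as in the sketch below. Note that this sketch treats boxes as ordinary axis-aligned rectangles and ignores horizontal wrap-around at the equirectangular seam, which the official protocol may handle differently.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) in pixels.
    Ignores horizontal wrap-around at the 360° seam (an assumption)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """Mean IoU over matched prediction/ground-truth box pairs."""
    return sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(gts)
```
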
Safety-Critical Reasoning
Checklist-based compliance

Models decide whether a frame/clip is safe and justify their decision using a firefighter checklist (e.g., sealed gas mask, three-point contact on ladders).

Metric: Checklist accuracy
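
One plausible way to score checklist compliance is per-item agreement with expert annotations; the binary per-item schema below is illustrative, not the benchmark's official format.

```python
def checklist_accuracy(model_items: dict[str, bool],
                       expert_items: dict[str, bool]) -> float:
    """Fraction of checklist items (e.g., 'sealed_gas_mask',
    'three_point_contact') the model judges the same way as the expert."""
    agree = sum(model_items.get(k) == expert_items[k] for k in expert_items)
    return agree / len(expert_items)

# Hypothetical usage:
# checklist_accuracy({"sealed_gas_mask": True, "three_point_contact": False},
#                    {"sealed_gas_mask": True, "three_point_contact": True})  # -> 0.5
```
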
Transformed Object Retrieval (TOR)
Episodic memory · Core contribution

Given a clean exemplar of an object (helmet, hose, extinguisher) and an unpaired 360° scene after fire damage, the model must retrieve the degraded instance without seeing the transformation.

Metric: Top-1 retrieval accuracy

Transformed Object Retrieval (TOR)

Figure 6: Illustration of TOR. A world model with episodic memory receives (i) 360° frames from a post-fire scene and (ii) a catalogue of pristine reference objects. Given a natural-language query such as “Locate the red firefighter helmet in the post-fire scene,” the model must mark the correct degraded helmet region.
Example
From pristine helmet to charred silhouette

In a typical example, the reference catalogue provides a clean side view of a red firefighter helmet. The target frame shows the same helmet after an interior room burn: the visor is blackened, reflections are gone, and surrounding walls have collapsed. Pipes, lamps, and other round shapes now compete as distractors.


Humans successfully retrieve the correct region in 83.5% of cases. GPT-4o achieves only 39.8%, often confusing pipes or background clutter for the helmet. CLIP and BLIP-2 perform worse, revealing that current models lack transformation-invariant object identity when temporal continuity is removed.
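
A baseline like the CLIP (ViT-B/32) entry in the results can be approximated with a short retrieval loop: embed the pristine exemplar and every candidate region, then return the most similar crop. The sketch below assumes candidate regions have already been cropped from the equirectangular frame (the region-proposal step is omitted) and uses the Hugging Face transformers CLIP API; it is an illustration, not the exact evaluation pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed(images: list[Image.Image]) -> torch.Tensor:
    """L2-normalized CLIP image embeddings."""
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def retrieve(exemplar: Image.Image, candidates: list[Image.Image]) -> int:
    """Index of the candidate crop most similar to the pristine exemplar."""
    sims = embed(candidates) @ embed([exemplar]).T  # cosine similarities
    return int(sims.argmax())
```

Failures of this style of baseline on Fire360 are unsurprising in hindsight: cosine similarity in a contrastively trained embedding space rewards surface appearance, which is exactly what fire damage destroys.
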

Results

Performance across Fire360 tasks

For each task, the best-performing model across all evaluated systems is shown in bold.

Visual Question Answering (VQA)

| Model | Score | Human | Metric |
|---|---|---|---|
| **GPT-4o** | **53.8%** | 91.4% | Top-1 accuracy |
| Qwen-VL | 47.2% | 91.4% | Top-1 accuracy |
| LLaVA-v1.5-13B | 50.3% | 91.4% | Top-1 accuracy |
| BLIP-2 (OPT-6.7B) | 42.7% | 91.4% | Top-1 accuracy |
| InstructBLIP | 48.6% | 91.4% | Top-1 accuracy |
| Kosmos-2.5 | 47.5% | 91.4% | Top-1 accuracy |

Temporal Action Captioning

| Model | Score | Human | Metric |
|---|---|---|---|
| **GLaMM-7B** | **0.341** | 0.85 | BLEU-4 |
| SwinBERT | 0.315 | 0.85 | BLEU-4 |
| ProgressCaptioner | 0.288 | 0.85 | BLEU-4 |

Object Localization under Distortion

| Model | Score | Human | Metric |
|---|---|---|---|
| Grounding DINO | 38.4% | 85.2% | Mean IoU |
| **OWLv2** | **39.8%** | 85.2% | Mean IoU |
| YOLO-World | 36.5% | 85.2% | Mean IoU |

Safety-Critical Reasoning

| Model | Score | Human | Metric |
|---|---|---|---|
| GPT-4o (prompted) | 28.9% | 94.6% | Checklist accuracy |
| Qwen-VL | 32.5% | 94.6% | Checklist accuracy |
| **Claude-3 Sonnet** | **33.0%** | 94.6% | Checklist accuracy |
| Llama-Guard-3-8B | 27.4% | 94.6% | Checklist accuracy |

Transformed Object Retrieval (TOR)

| Model | Score | Human | Metric |
|---|---|---|---|
| **GPT-4o** | **39.8%** | 83.5% | Retrieval accuracy |
| CLIP (ViT-B/32) | 32.5% | 83.5% | Retrieval accuracy |
| BLIP-2 (OPT-6.7B) | 35.1% | 83.5% | Retrieval accuracy |
| CoLLM | 35.7% | 83.5% | Retrieval accuracy |
| MCoT-RE | 33.5% | 83.5% | Retrieval accuracy |
Visual Question Answering (VQA): Models trail human experts by 38–49 accuracy points, showing that even basic situational questions remain unreliable under smoke and occlusion.
Temporal Action Captioning: Video captioning models capture rough activity but miss key safety-relevant details, with BLEU-4 gaps of roughly 0.5–0.56 to human experts.
Object Localization under Distortion: Detectors reach less than half the human mean IoU when safety gear appears in distorted or smoke-obscured regions of the 360° frame.
Safety-Critical Reasoning: Dedicated safety and policy models still trail experts by 60–67 checklist-accuracy points, frequently missing violations obvious to trained instructors.
Transformed Object Retrieval (TOR): All evaluated models remain 44–51 points below human retrieval accuracy, confirming that current representations do not preserve object identity under severe fire damage.

Together, these plots and tables highlight a consistent pattern: Fire360 exposes large, degradation-driven failures across architectures and tasks, indicating that robust perception and episodic memory in 360° emergency scenes remain open problems for current multimodal models.

Ethics and BibTeX

Ethics
Responsible release

All footage documents professional training drills with informed consent. No personally identifiable information is included; participants appear in protective gear. Fire360 is released for research on robust, safety-critical multimodal systems and may not be used for surveillance, profiling, or non-consensual monitoring.

Citation
BibTeX
@misc{tiwari2025fire360benchmarkrobustperception,
  title         = {Fire360: A Benchmark for Robust Perception and Episodic Memory
                   in Degraded 360-Degree Firefighting Videos},
  author        = {Aditi Tiwari and Farzaneh Masoud and Dac Trong Nguyen
                   and Jill Kraft and Heng Ji and Klara Nahrstedt},
  year          = {2025},
  eprint        = {2506.02167},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2506.02167},
}