Paper Summary
3D Seismic Foundation Models (SFMs) have been scaled to 1.8 billion parameters, pushing the boundaries of AI-driven seismic analysis. This work employs Vision Transformers (ViTs) augmented with multi-dimensional rotary positional embeddings and FlashAttention-2 to handle larger 3D spatial contexts efficiently. Pretraining was conducted on 20 terabytes of seismic data spanning 444,000 km² using a Masked Autoencoder (MAE) approach for self-supervised learning. Drawing on advances in large-model optimization, including key/query normalization and mixed-precision training, the models achieved state-of-the-art generalization on salt segmentation tasks, with mean Intersection over Union (IoU) scores exceeding 0.9 across unseen datasets. A memory consumption analysis reveals a log-linear scaling relationship between model size, context size, and memory requirements. These results showcase the transformative potential of scaled SFMs in geophysical interpretation.
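To make the architectural ingredients concrete, the following is a minimal sketch, not the authors' implementation, of how a three-axis (z, y, x) rotary positional embedding and query/key normalization might be applied inside a ViT attention layer, using PyTorch's `scaled_dot_product_attention`, which can dispatch to FlashAttention-style fused kernels. All function names (`axis_rotary`, `apply_rotary`, `rope3d`, `attention`), shapes, and the L2-normalization form of QK-norm are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch only: 3D rotary embeddings + QK normalization + fused attention.
import torch
import torch.nn.functional as F


def axis_rotary(pos, dim, base=10000.0):
    """Cos/sin tables for one spatial axis; `dim` must be even."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = pos.float()[:, None] * freqs[None, :]            # (tokens, dim/2)
    return angles.cos(), angles.sin()


def apply_rotary(x, cos, sin):
    """Rotate channel pairs of x (..., tokens, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def rope3d(x, coords, head_dim):
    """Split the head dimension into three groups, one rotated per spatial axis."""
    d = head_dim // 3 // 2 * 2                                # even chunk per axis
    chunks = []
    for axis in range(3):
        cos, sin = axis_rotary(coords[:, axis], d)
        chunks.append(apply_rotary(x[..., axis * d:(axis + 1) * d], cos, sin))
    chunks.append(x[..., 3 * d:])                              # leftover channels untouched
    return torch.cat(chunks, dim=-1)


def attention(q, k, v, coords):
    """q, k, v: (batch, heads, tokens, head_dim); coords: (tokens, 3) patch indices."""
    head_dim = q.shape[-1]
    # One simple form of query/key normalization to stabilize large-scale training.
    q = F.normalize(q, dim=-1) * head_dim ** 0.5
    k = F.normalize(k, dim=-1) * head_dim ** 0.5
    q = rope3d(q, coords, head_dim)
    k = rope3d(k, coords, head_dim)
    # SDPA selects a fused (FlashAttention-style) kernel when available.
    return F.scaled_dot_product_attention(q, k, v)


if __name__ == "__main__":
    B, H, D = 2, 8, 96                                         # toy sizes
    z, y, x = torch.meshgrid(torch.arange(4), torch.arange(4), torch.arange(4), indexing="ij")
    coords = torch.stack([z, y, x], dim=-1).reshape(-1, 3)     # 64 patches on a 4x4x4 grid
    q = torch.randn(B, H, coords.shape[0], D)
    k, v = torch.randn_like(q), torch.randn_like(q)
    print(attention(q, k, v, coords).shape)                    # torch.Size([2, 8, 64, 96])
```

The design idea this illustrates is that each 3D patch carries an absolute (z, y, x) index, and rotating a separate slice of the head dimension per axis encodes relative position along all three axes without adding parameters, which is what allows the attention context to grow with volume size.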