Paper Summary
Foundation models have emerged as powerful tools for learning general-purpose representations from large, unlabeled datasets. In this study, we explore seismic foundation models (SFMs) based on Vision Transformers (ViTs) pre-trained with masked autoencoding (MAE). We focus on how model scale, training data size, and fine-tuning strategy influence generalization and downstream performance. Our experiments span 2D and 3D ViT-MAE architectures, with model sizes ranging from a few million parameters for the smallest 2D models to 1.8 billion parameters for the largest 3D models, all pre-trained on a global corpus of seismic surveys covering 444,000 km².
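
As a rough illustration of the MAE pre-training setup described above (not the authors' implementation), the sketch below patchifies a 2D seismic section into ViT-style tokens and randomly masks a large fraction of them; the encoder would see only the visible patches, and a decoder would reconstruct the masked ones. The function names `patchify` and `random_mask`, the 16×16 patch size, and the 75% mask ratio are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def patchify(section, patch=16):
    """Split a 2D seismic section (H, W) into non-overlapping patches.

    Returns an array of shape (num_patches, patch*patch), mirroring the
    token sequence a ViT encoder would consume.
    """
    H, W = section.shape
    assert H % patch == 0 and W % patch == 0, "section must be divisible by patch size"
    grid = section.reshape(H // patch, patch, W // patch, patch)
    return grid.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def random_mask(tokens, mask_ratio=0.75, rng=None):
    """Randomly hide a fraction of patch tokens, MAE-style.

    Only the visible tokens are fed to the encoder; the decoder later
    reconstructs the masked patches from the encoded visible set.
    """
    rng = rng or np.random.default_rng(0)
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])   # indices of visible patches
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False              # True = masked (reconstruction target)
    return tokens[keep_idx], keep_idx, mask

# Toy usage: a 128 x 128 "seismic section" of random amplitudes.
section = np.random.default_rng(1).standard_normal((128, 128)).astype(np.float32)
tokens = patchify(section, patch=16)             # (64, 256) patch tokens
visible, keep_idx, mask = random_mask(tokens)    # 16 visible, 48 masked
print(tokens.shape, visible.shape, int(mask.sum()))
```

The same recipe extends to the 3D case by cutting volumetric patches (e.g. cubes of samples) instead of 2D tiles; the high mask ratio is what makes MAE pre-training cheap enough to scale to billion-parameter models.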