The Transformer has emerged as one of the most influential neural network architectures, with wide applications in large language models (LLMs), AI for Science, and image and video processing. Despite its success, its mathematical foundations remain largely open. This research presents our recent progress toward addressing this gap, structured in two parts. First, we introduce a new perspective based on Petrov–Galerkin projection and Fourier analysis to better interpret the attention mechanism. Building on this framework, we propose a modified Transformer architecture that admits a clearer mathematical interpretation and exhibits a frequency-bootstrapping property. Second, drawing inspiration from direct sampling methods (DSMs) for inverse problems, we develop a novel feature-generation approach: data features are constructed by solving PDEs and then incorporated into the attention mechanism. We demonstrate the proposed method on electrical impedance tomography (EIT), a prototypical severely ill-posed nonlinear inverse problem, and show that it achieves superior accuracy over its predecessors and contemporary operator learners.
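As background for the first part, a minimal sketch of the standard scaled dot-product attention may be helpful; the notation below (the queries $Q$, keys $K$, values $V$, token matrix $X$, weights $W_Q, W_K, W_V$, and key dimension $d_k$) follows the usual Transformer literature and is not part of this abstract:
\[
  \mathrm{Attn}(Q, K, V) \;=\; \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
  \qquad Q = X W_Q,\quad K = X W_K,\quad V = X W_V,
\]
where $X \in \mathbb{R}^{n \times d}$ stacks the $n$ input tokens. Roughly speaking, a Petrov–Galerkin reading regards the softmax-weighted combination of the rows of $V$ as a projection onto learned, data-dependent basis functions; this is the type of interpretation pursued in the framework above.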
