RIKEN Center for Computational Science
Sparse observations and coarse-resolution climate models limit effective regional decision-making, underscoring the need for robust downscaling. However, existing AI methods struggle with generalization across variables and geographies and are constrained by the quadratic complexity of Vision Transformer (ViT) self-attention. We introduce ORBIT-2, a scalable foundation model for global, hyper-resolution climate downscaling. ORBIT-2 incorporates two key innovations: (1) Residual Slim ViT (Reslim), a lightweight architecture with residual learning and Bayesian regularization for efficient, robust prediction; and (2) TILES, a tile-wise sequence scaling algorithm that reduces self-attention complexity from quadratic to linear, enabling long-sequence processing and massive parallelism. ORBIT-2 scales to 10 billion parameters across 65,536 GPUs, achieving up to 4.1 exaFLOPS sustained throughput and 74--98% strong scaling efficiency. It supports downscaling to 0.9 km global resolution and processes sequences up to 4.2 billion tokens. On 7 km resolution benchmarks, ORBIT-2 achieves high accuracy with $R^2$ scores in the range of 0.98--0.99 against observational data.
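The abstract does not spell out how TILES partitions the sequence, so the sketch below only illustrates the generic idea behind tile-wise attention: restricting each query to the keys inside its own tile replaces one quadratic attention map with many small per-tile maps, so total cost grows linearly with sequence length. The function name `tile_attention` and the tile size are illustrative assumptions, not the paper's API.

```python
import numpy as np

def tile_attention(q, k, v, tile: int):
    """Illustrative tile-local attention: each query attends only to keys
    inside its own tile, so cost grows linearly with sequence length.
    q, k, v: (seq_len, dim) arrays; seq_len must be divisible by tile."""
    n, d = q.shape
    out = np.empty_like(v)
    for start in range(0, n, tile):
        sl = slice(start, start + tile)
        scores = q[sl] @ k[sl].T / np.sqrt(d)          # (tile, tile) map only
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[sl] = w @ v[sl]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((1024, 64)) for _ in range(3))
y = tile_attention(q, k, v, tile=128)   # 8 maps of 128x128 vs one 1024x1024
print(y.shape)
```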
This paper addresses emulation algorithms for matrix multiplication. General Matrix-Matrix Multiplication (GEMM), a fundamental operation in the Basic Linear Algebra Subprograms (BLAS), is typically optimized for specific hardware architectures. The Ozaki scheme is a well-established GEMM-based emulation method for matrix multiplication, wherein input matrices are decomposed into several low-precision components so that the resulting matrix product can be assembled exactly from error-free partial products. This study proposes a novel GEMM-based emulation method for matrix multiplication that leverages the Chinese Remainder Theorem. The proposed method inherits the computational efficiency of highly optimized GEMM routines and further enables control over the number of matrix multiplications, which can enhance computational accuracy. We present numerical experiments featuring INT8 Tensor Core operations on GPUs and FP64 arithmetic on CPUs as case studies. The results demonstrate that FP64 emulation using the proposed method achieves performance levels of up to 7.4 to 9.8 TFLOPS on the NVIDIA RTX 4090 and 56.6 to 80.2 TFLOPS on the NVIDIA GH200, exceeding the measured performance of native FP64 arithmetic. Furthermore, for FP64 computations on CPUs, the proposed method achieved up to a 2.3x speedup in emulating quadruple-precision arithmetic compared to the conventional Ozaki scheme.
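As a concrete illustration of the idea, the sketch below emulates an exact integer matrix product from several small modular GEMMs recombined with the Chinese Remainder Theorem. It is a minimal NumPy model of the residue arithmetic, not the paper's INT8 Tensor Core implementation; the moduli, function name, and signed-range mapping are assumptions chosen so each modular product fits comfortably in int64.

```python
import numpy as np
from math import prod

# Pairwise-coprime moduli small enough that (p-1)^2 * k fits in int64
# for the inner dimension k used below (an assumption of this sketch).
MODULI = [251, 241, 239, 233]

def crt_matmul(A, B):
    """Emulate an exact integer matrix product from several small
    modular GEMMs, recombined with the Chinese Remainder Theorem."""
    M = prod(MODULI)
    C = np.zeros((A.shape[0], B.shape[1]), dtype=object)  # exact big ints
    for p in MODULI:
        Cp = (A % p).astype(np.int64) @ (B % p).astype(np.int64) % p
        Mi = M // p
        yi = pow(Mi, -1, p)               # modular inverse of M/p mod p
        C = (C + Cp.astype(object) * Mi * yi) % M
    # map residues from [0, M) back to the signed range
    C = np.where(C > M // 2, C - M, C)
    return C.astype(np.int64)

rng = np.random.default_rng(1)
A = rng.integers(-100, 100, size=(32, 48))
B = rng.integers(-100, 100, size=(48, 16))
assert np.array_equal(crt_matmul(A, B), A @ B)
```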
We review lattice results related to pion, kaon, $D$-meson, $B$-meson, and nucleon physics with the aim of making them easily accessible to the nuclear and particle physics communities. More specifically, we report on the determination of the light-quark masses, the form factor $f_+(0)$ arising in the semileptonic $K \to \pi$ transition at zero momentum transfer, as well as the decay-constant ratio $f_K/f_\pi$ and its consequences for the CKM matrix elements $V_{us}$ and $V_{ud}$. We review the determination of the $B_K$ parameter of neutral kaon mixing as well as the additional four $B$ parameters that arise in theories of physics beyond the Standard Model. For the heavy-quark sector, we provide results for $m_c$ and $m_b$ as well as those for the decay constants, form factors, and mixing parameters of charmed and bottom mesons and baryons. These are the heavy-quark quantities most relevant for the determination of CKM matrix elements and the global CKM unitarity-triangle fit. We review the status of lattice determinations of the strong coupling constant $\alpha_s$. We review the determinations of nucleon charges from the matrix elements of both isovector and flavour-diagonal axial, scalar and tensor local quark bilinears, and momentum fraction, helicity moment and the transversity moment from one-link quark bilinears. We also review determinations of scale-setting quantities. Finally, in this review we have added a new section on the general definition of the low-energy limit of the Standard Model.
Quantum synchronization (QS) in open many-body systems offers a promising route for controlling collective quantum dynamics, yet existing manipulation schemes often rely on dissipation engineering, which distorts limit cycles, lacks scalability, and is strongly system-dependent. Here, we propose a universal and scalable method for continuously tuning QS from maximal synchronization under isotropic interactions to complete synchronization blockade (QSB) under fully anisotropic coupling in spin oscillator networks. Our approach preserves intrinsic limit cycles and applies to both few-body and macroscopic systems. We analytically show that QS arises solely from spin flip-flop processes and their higher-order correlations, while anisotropic interactions induce non-synchronizing coherence. A geometric QS measure reveals a macroscopic QSB effect in the thermodynamic limit. The proposed mechanism is experimentally feasible using XYZ interactions and optical pumping, and provides a general framework for programmable synchronization control in complex quantum networks and dynamical phases of matter.
Researchers enhanced the Ozaki scheme to emulate FP64 matrix multiplication efficiently on hardware optimized for low-precision operations. Their work achieved up to 1.6 times faster throughput and higher accuracy compared to the baseline by optimizing matrix splitting and accumulation methods, specifically addressing the FP64 accumulation bottleneck.
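For context, the sketch below shows a minimal, unoptimized variant of the Ozaki-style splitting that such work builds on: each FP64 matrix is split into slices carrying few significand bits, so every pairwise slice product is exact in float32 and the FP64 result is recovered by accumulation. Slice width, slice count, and function names are illustrative assumptions; the optimizations described above target precisely the accumulation stage that this naive version leaves as a plain loop.

```python
import numpy as np

def split_rows(A, s=8, nslices=7):
    """Error-free row-wise splitting in the spirit of the Ozaki scheme:
    each slice carries at most s significand bits, so slice products can
    be computed exactly in a lower-precision format."""
    slices, R = [], A.astype(np.float64).copy()
    for _ in range(nslices):
        amax = np.max(np.abs(R), axis=1, keepdims=True)
        tau = np.floor(np.log2(np.where(amax > 0, amax, 1.0)))
        sigma = 2.0 ** (tau + 53 - s)
        hi = (R + sigma) - sigma       # rounds away the low-order bits
        slices.append(hi)
        R = R - hi                     # exact remainder
    return slices

def emulated_gemm(A, B, s=8):
    """FP64-accurate GEMM assembled from float32 partial products.
    With s=8 and inner dimension k <= 256, each float32 partial GEMM
    is exact (2*s + log2(k) <= 24 significand bits)."""
    SA = split_rows(A, s)
    SB = [S.T for S in split_rows(B.T, s)]    # column-wise split of B
    C = np.zeros((A.shape[0], B.shape[1]))
    for Ai in SA:
        for Bj in SB:                         # naive accumulation loop
            C += (Ai.astype(np.float32) @ Bj.astype(np.float32)).astype(np.float64)
    return C

rng = np.random.default_rng(2)
A, B = rng.standard_normal((64, 128)), rng.standard_normal((128, 64))
print(np.max(np.abs(emulated_gemm(A, B) - A @ B)))  # difference at FP64 rounding level
```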
Counterdiabatic (CD) protocols enable fast driving of quantum states by invoking an auxiliary adiabatic gauge potential (AGP) that suppresses transitions to excited states throughout the driving process. Usually, the full spectrum of the original unassisted Hamiltonian is a prerequisite for constructing the exact AGP, which makes CD protocols extremely difficult for many-body systems. Here, we apply a variational CD protocol recently proposed by P. W. Claeys et al. [Phys. Rev. Lett. 123, 090602 (2019)] to a two-component fermionic Hubbard model in one spatial dimension. This protocol employs an approximate AGP expressed as a series of nested commutators. We show that the optimal variational parameters in the approximate AGP satisfy a set of linear equations whose coefficients are given by the squared Frobenius norms of these commutators. We devise an exact algorithm that avoids the formidable iterative matrix-vector multiplications and evaluates the nested commutators and the CD Hamiltonian in analytic representations. We then examine the CD driving of the one-dimensional Hubbard model up to $L = 14$ sites with driving order $l \leqslant 3$. Our results demonstrate the usefulness of the variational CD protocol for the Hubbard model and point to a possible route toward fast ground-state preparation for many-body systems.
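The linear-equation structure described above is easy to reproduce on a small system. The sketch below uses a transverse-field Ising chain in place of the Hubbard model (only the construction of $H$ and $\partial_\lambda H$ differs) and solves for the coefficients of the approximate AGP $\mathcal{A} = i\sum_k \alpha_k C_{2k-1}$ by minimizing $\|\partial_\lambda H + \sum_k \alpha_k C_{2k}\|_F^2$, where $C_n$ is the $n$-fold nested commutator of $H$ with $\partial_\lambda H$; all names and parameters are illustrative.

```python
import numpy as np
from functools import reduce

# Minimal sketch on a transverse-field Ising chain instead of the Hubbard
# model (the Hubbard version only changes how H and dH are built).
L = 6
I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

def op(single, site):
    """Embed a single-site operator at `site` in the L-site chain."""
    return reduce(np.kron, [single if i == site else I2 for i in range(L)])

HZZ = sum(op(Z, i) @ op(Z, i + 1) for i in range(L - 1))
HX = sum(op(X, i) for i in range(L))

lam = 0.5                     # driving parameter: H(lam) = HZZ + lam * HX
H, dH = HZZ + lam * HX, HX    # dH = dH/d(lam)

order = 3                     # expansion order l of the approximate AGP
comms = [dH]
for _ in range(2 * order):
    comms.append(H @ comms[-1] - comms[-1] @ H)   # nested commutators C_1, C_2, ...

# Minimize ||dH + sum_k alpha_k C_{2k}||_F^2: a linear system whose
# coefficients are Frobenius inner products of the even commutators.
even = [comms[2 * k] for k in range(1, order + 1)]
M = np.array([[np.trace(a @ b).real for b in even] for a in even])
b = -np.array([np.trace(a @ dH).real for a in even])
alpha = np.linalg.solve(M, b)
print("variational coefficients:", alpha)
```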
To reduce the computational and memory overhead of Large Language Models, various approaches have been proposed. These include a) Mixture of Experts (MoEs), where token routing affects compute balance; b) gradual pruning of model parameters; c) dynamically freezing layers; d) dynamic sparse attention mechanisms; e) early exit of tokens as they pass through model layers; and f) Mixture of Depths (MoDs), where tokens bypass certain blocks. While these approaches effectively reduce overall computation, they often introduce significant workload imbalance across workers. In many cases, this imbalance is severe enough to render the techniques impractical for large-scale distributed training, limiting their applicability to toy models. We propose an autonomous dynamic load balancing solution, DynMo, which provably achieves maximum reduction in workload imbalance and adaptively equalizes compute loads across workers in pipeline-parallel training. In addition, DynMo dynamically consolidates computation onto fewer workers without sacrificing training throughput, allowing idle workers to be released back to the job manager. DynMo supports both single-node multi-GPU systems and multi-node GPU clusters and is suitable for practical deployment. Compared to static distributed training solutions such as Megatron-LM and DeepSpeed, DynMo accelerates the end-to-end training of dynamic GPT models by up to 1.23x for MoEs, 3.18x for parameter pruning, 2.23x for layer freezing, 4.02x for sparse attention, 4.52x for early exit, and 1.17x for MoDs.
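DynMo's actual balancing algorithm is not detailed in the abstract; the sketch below illustrates the underlying problem it solves with a textbook approach: repartitioning contiguous layers across pipeline stages to minimize the bottleneck stage load, via binary search on the bottleneck with a greedy feasibility check. Function names and the example costs are assumptions.

```python
def rebalance(costs, stages):
    """Repartition contiguous layers across pipeline stages so the
    bottleneck (max per-stage cost) is minimized: binary search on the
    bottleneck value plus a greedy feasibility sweep."""
    def fits(limit):
        parts, load, used = [[]], 0.0, 1
        for c in costs:
            if c > limit:
                return None                    # single layer exceeds limit
            if load + c > limit:
                parts.append([])
                load, used = 0.0, used + 1
                if used > stages:
                    return None                # needs too many stages
            parts[-1].append(c)
            load += c
        return parts
    lo, hi = max(costs), sum(costs)            # bounds on the bottleneck
    while hi - lo > 1e-6 * hi:
        mid = (lo + hi) / 2
        if fits(mid) is None:
            lo = mid
        else:
            hi = mid
    return fits(hi)

# Example: per-layer costs drift as layers freeze or tokens exit early.
costs = [4.0, 1.0, 1.0, 3.5, 0.5, 0.5, 2.0, 3.5]
for stage, layers in enumerate(rebalance(costs, stages=4)):
    print(f"stage {stage}: layers with costs {layers} -> load {sum(layers)}")
```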
We present results for the nucleon form factors: the electric ($G_E$), magnetic ($G_M$), axial ($F_A$), induced pseudoscalar ($F_P$), and pseudoscalar ($G_P$) form factors, using the second PACS10 ensemble, one of three sets of $2+1$-flavor lattice QCD configurations at physical quark masses in large spatial volumes (exceeding $(10\ \mathrm{fm})^3$). The second PACS10 gauge configurations are generated by the PACS Collaboration with the six-stout-smeared $O(a)$-improved Wilson quark action and Iwasaki gauge action at the second gauge coupling $\beta=2.00$, corresponding to a lattice spacing of $a=0.063$ fm. We determine the isovector electric, magnetic, and axial radii and the magnetic moment from the corresponding form factors, as well as the axial-vector coupling $g_A$. Combining our previous results at the coarser lattice spacing [E. Shintani et al., Phys. Rev. D99 (2019) 014510; Phys. Rev. D102 (2020) 019902 (erratum)], we investigate finite lattice spacing effects on the isovector radii, magnetic moment, and axial-vector coupling using the difference between the two results. The effect on $g_A$ remains smaller than the statistical error of 2%, while the isovector radii show a possible discretization error of about 10%, regardless of the channel. We also report the partially conserved axial-vector current (PCAC) relation using a set of nucleon three-point correlation functions in order to verify the effect of $O(a)$-improvement of the axial-vector current.
We conducted a systematic survey of emerging quantum-HPC platforms, which integrate quantum computers and High-Performance Computing (HPC) systems through co-location. Currently, it remains unclear whether such platforms provide tangible benefits for near-future industrial applications. To address this, we examined the impact of co-location on latency reduction, bandwidth enhancement, and advanced job scheduling. Additionally, we assessed how HPC-level capabilities could enhance hybrid algorithm performance, support large-scale error mitigation, and facilitate complex quantum circuit partitioning and optimization. Our findings demonstrate that co-locating quantum and HPC systems can yield measurable improvements in overall hybrid job throughput. We also observe that large-scale real-world problems can require HPC-level computational resources for executing hybrid algorithms.
We study systematic uncertainties in the lattice QCD computation of the hadronic vacuum polarization (HVP) contribution to the muon $g-2$. We investigate three systematic effects: the finite-volume (FV) effect, the cutoff effect, and the integration-scheme dependence. We evaluate the FV effect at the physical pion mass on two different volumes of $(5.4~\mathrm{fm})^4$ and $(10.8~\mathrm{fm})^4$ using the PACS10 configurations at the same cutoff scale. For the cutoff effect, we compare two types of lattice vector operators, the local and conserved (point-splitting) currents, by varying the cutoff scale on a lattice larger than $(10~\mathrm{fm})^4$ at the physical point. For the integration-scheme dependence, we compare the results between the coordinate- and momentum-space integration schemes at the physical point on a $(10.8~\mathrm{fm})^4$ lattice. Our result for the HVP contribution to the muon $g-2$ is $a_\mu^{\rm hvp} = 737(9)(^{+13}_{-18})\times 10^{-10}$ in the continuum limit, where the first error is statistical and the second is systematic.
Because qubits have low error tolerance, detecting and correcting their errors is essential for fault-tolerant quantum computing. The surface code (SC), together with its decoding algorithm, is one of the most promising quantum error correction (QEC) methods. One of the challenges of QEC is its high complexity and computational demand: QEC needs to be very power-efficient, since the power budget inside a dilution refrigerator is tightly limited for superconducting qubits, on which some of the most successful quantum computers (QCs) are built. In this paper, we propose an online QEC algorithm and its hardware implementation with SFQ-based superconducting digital circuits. We design a key building block of the proposed hardware with an SFQ cell library and evaluate it by SPICE-level simulation. Each logic element is composed of about 3000 Josephson junctions and consumes about 2.78 µW when operating at a 2 GHz clock frequency, which meets the required decoding speed. Our decoder is simulated on a quantum error simulator for code distances 5 to 13 and achieves a 1.0% accuracy threshold.
Deep learning demonstrates effectiveness across a wide range of tasks. However, the dense and over-parameterized nature of these models results in significant resource consumption during deployment. In response to this issue, weight pruning, particularly through N:M sparse matrix multiplication, offers an efficient solution by transforming dense operations into semi-sparse ones. N:M sparsity provides a way to balance performance and model accuracy, but introduces more complex programming and optimization challenges. To address these issues, we design a systematic top-down performance analysis model for N:M sparsity and propose NM-SpMM as an efficient general N:M sparsity implementation. Based on our performance analysis, NM-SpMM employs a hierarchical blocking mechanism as a general optimization to enhance data locality, while memory access optimization and pipeline design are introduced as sparsity-aware optimizations, allowing it to achieve close-to-theoretical peak performance across different sparsity levels. Experimental results show that NM-SpMM is 2.1x faster than nmSPARSE (the state of the art for general N:M sparsity) and 1.4x to 6.3x faster than cuBLAS's dense GEMM operations, closely approaching the theoretical maximum speedup resulting from the reduction in computation due to sparsity. NM-SpMM is open source and publicly available at this https URL
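To make the N:M pattern concrete, the sketch below applies magnitude-based N:M pruning to a dense weight matrix: within every group of M consecutive weights along a row, only the N largest magnitudes survive. This is the standard pattern (e.g., 2:4 on sparse tensor cores) rather than anything specific to NM-SpMM; the function name is an assumption.

```python
import numpy as np

def nm_prune(W, n=2, m=4):
    """Enforce N:M sparsity along rows: in every group of m consecutive
    weights, keep the n largest magnitudes and zero out the rest."""
    rows, cols = W.shape
    assert cols % m == 0, "columns must be divisible by m"
    groups = W.reshape(rows, cols // m, m)
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)
    return pruned.reshape(rows, cols)

rng = np.random.default_rng(3)
W = rng.standard_normal((4, 8))
Ws = nm_prune(W, n=2, m=4)          # the 2:4 pattern used by sparse tensor cores
print((Ws != 0).reshape(4, 2, 4).sum(axis=-1))  # every group keeps exactly 2
```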
Periodically driven (Floquet) systems typically evolve toward an infinite-temperature thermal state due to continuous energy absorption. Before reaching equilibrium, however, they can transiently exhibit long-lived prethermal states that host exotic nonequilibrium phenomena, such as discrete time crystals (DTCs). In this study, we investigate the relaxation dynamics of periodically driven product states in a kicked Ising model implemented on the IBM Quantum Eagle and Heron processors. By using ancilla qubits to mediate interactions, we construct Kagome and Lieb lattices on superconducting qubits with heavy-hex connectivity. We identify two distinct types of noise-induced DTCs on Kagome and Lieb lattices, both arising from quantum noise in ancilla qubits. Type-I DTCs originate from robust boundary-mode period-doubling oscillations, stabilized by symmetry charge pumping, that are redistributed into the bulk due to ancilla noise. Type-II DTCs, in contrast, emerge in systems without charge-pumped qubits, where quantum noise unexpectedly stabilizes period-doubling oscillations that would otherwise rapidly decay. On the noisier Eagle device (ibm_kyiv), we observe both type-I and type-II DTCs on 53-qubit Kagome lattices with and without charge-pumped qubits, respectively. In contrast, on the lower-noise Heron device (ibm_marrakesh), period-doubling oscillations are confined to boundary-localized oscillations on 82-qubit Kagome and 40-qubit Lieb lattices, as redistribution into the bulk is suppressed. These experimental findings are supported by noisy matrix-product-state simulations, in which ancilla noise is modeled as random sign flips in the two-qubit gate rotation angles. Our results demonstrate that quantum noise in ancilla qubits can give rise to novel classes of prethermal dynamical phases, including boundary-protected and noise-induced DTCs.
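The period-doubling signal at the heart of these observations can be reproduced with a minimal statevector simulation of a kicked Ising chain, shown below. This sketch uses a small open chain with a near-$\pi$ global X kick and omits the paper's heavy-hex connectivity, ancilla qubits, and noise model; all parameters are illustrative.

```python
import numpy as np
from functools import reduce

L = 8                                    # small chain, exact statevector
I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

def op(single, site):
    """Embed a single-site operator at `site` in the L-site chain."""
    return reduce(np.kron, [single if i == site else I2 for i in range(L)])

J, theta = np.pi / 4, 0.97 * np.pi       # imperfect (near-pi) global kick
HZZ = sum(op(Z, i) @ op(Z, i + 1) for i in range(L - 1))
HX = sum(op(X, i) for i in range(L))

# one Floquet period: Ising evolution followed by a global X kick
Uzz = np.diag(np.exp(-1j * J * np.diag(HZZ)))        # HZZ is diagonal
evals, evecs = np.linalg.eigh(HX)
Ux = evecs @ np.diag(np.exp(-1j * (theta / 2) * evals)) @ evecs.conj().T
U = Ux @ Uzz

psi = np.zeros(2 ** L, dtype=complex)
psi[0] = 1.0                              # |00...0>, all spins up
Z0 = op(Z, 0)
for t in range(8):                        # <Z0> alternates sign each period
    print(t, np.round(np.real(psi.conj() @ (Z0 @ psi)), 3))
    psi = U @ psi
```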
We review lattice results related to pion, kaon, $D$-meson, $B$-meson, and nucleon physics with the aim of making them easily accessible to the nuclear and particle physics communities. More specifically, we report on the determination of the light-quark masses, the form factor $f_+(0)$ arising in the semileptonic $K \to \pi$ transition at zero momentum transfer, as well as the decay constant ratio $f_K/f_\pi$ and its consequences for the CKM matrix elements $V_{us}$ and $V_{ud}$. Furthermore, we describe the results obtained on the lattice for some of the low-energy constants of $SU(2)_L\times SU(2)_R$ and $SU(3)_L\times SU(3)_R$ Chiral Perturbation Theory. We review the determination of the $B_K$ parameter of neutral kaon mixing as well as the additional four $B$ parameters that arise in theories of physics beyond the Standard Model. For the heavy-quark sector, we provide results for $m_c$ and $m_b$ as well as those for the decay constants, form factors, and mixing parameters of charmed and bottom mesons and baryons. These are the heavy-quark quantities most relevant for the determination of CKM matrix elements and the global CKM unitarity-triangle fit. We review the status of lattice determinations of the strong coupling constant $\alpha_s$. We consider nucleon matrix elements, and review the determinations of the axial, scalar and tensor bilinears, both isovector and flavor diagonal. Finally, in this review we have added a new section reviewing determinations of scale-setting quantities.
We performed a precise calculation of physical quantities related to the axial structure of the nucleon using the $2+1$-flavor lattice QCD gauge configurations (PACS10 configurations) generated at the physical point with a lattice volume larger than $(10\;\mathrm{fm})^4$ by the PACS Collaboration. The nucleon matrix element of the axial-vector current contains two nucleon form factors: the axial-vector ($F_A$) form factor and the induced pseudoscalar ($F_P$) form factor. Recently, lattice QCD simulations have succeeded in reproducing the experimental value of the axial-vector coupling $g_A$, determined from $F_A(q^2)$ at zero momentum transfer $q^2=0$, at the percent level of statistical accuracy. However, the $F_P$ form factor has so far not reproduced the experimental values well due to strong $\pi N$ excited-state contamination. Therefore, we proposed a simple subtraction method for removing the so-called leading $\pi N$-state contribution, and succeeded in reproducing, for $F_P(q^2)$, the values obtained by two experiments: muon capture on the proton and pion electroproduction. The novel approach can also be applied to the nucleon pseudoscalar matrix element to determine the pseudoscalar ($G_P$) form factor with the help of the axial Ward-Takahashi identity. The resulting form factors, $F_P(q^2)$ and $G_P(q^2)$, are in good agreement with the prediction of the pion-pole dominance model. In the new analysis, the induced pseudoscalar coupling $g_P^\ast$ and the pion-nucleon coupling $g_{\pi NN}$ can be evaluated with a few percent accuracy, including systematic uncertainties, using existing data calculated at two lattice spacings.
We propose a method to construct a tensor network representation of partition functions without singular value decompositions or series expansions. The approach is demonstrated for one- and two-dimensional Ising models, and we study the dependence of the tensor renormalization group (TRG) on the form of the initial tensors and their symmetries. We further introduce variants of several tensor renormalization algorithms. Our benchmarks reveal a significant dependence of various TRG algorithms on the choice of initial tensors and their symmetries. However, we show that the boundary TRG technique can eliminate the initial tensor dependence for all TRG methods. The numerical results of TRG calculations can thus be made significantly more robust with only a few changes in the code. Furthermore, we study a three-dimensional $\mathbb{Z}_2$ gauge theory without gauge fixing and confirm the applicability of the initial tensor construction. Our method can straightforwardly be applied to systems with longer-range and multi-site interactions, such as the next-nearest-neighbor Ising model.
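For contrast with the decomposition-free construction proposed here, the sketch below builds the conventional initial tensor for the one-dimensional Ising model by factorizing the Boltzmann weight matrix (the kind of step the paper's method avoids) and checks the resulting partition function against brute-force enumeration; in two dimensions the same factors combine into a four-leg tensor $T_{ijkl} = \sum_a W_{ai}W_{aj}W_{ak}W_{al}$.

```python
import numpy as np
from itertools import product

beta, N = 0.4, 6                        # 1D Ising ring of N spins
Q = np.exp(beta * np.array([[1.0, -1.0], [-1.0, 1.0]]))  # Q[s,s'] = e^{beta s s'}
w, U = np.linalg.eigh(Q)                # Q is symmetric positive definite
W = U @ np.diag(np.sqrt(w))             # factorization: Q = W @ W.T
T = np.einsum('ai,aj->ij', W, W)        # local tensor; i, j are bond legs

Z_tn = np.trace(np.linalg.matrix_power(T, N))   # contract the tensor ring
Z_brute = sum(np.exp(beta * sum(s[i] * s[(i + 1) % N] for i in range(N)))
              for s in product([1, -1], repeat=N))
print(Z_tn, Z_brute)    # agree up to floating-point rounding
```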
We theoretically investigate the Casimir effect originating from Dirac fields in finite-density matter under a magnetic field. In particular, we focus on quark fields in the magnetic dual chiral density wave (MDCDW) phase, a possible inhomogeneous ground state of interacting Dirac-fermion systems. In this system, the distance dependence of the Casimir energy shows complex oscillatory behavior arising from the interplay among the chemical potential, the magnetic field, and the inhomogeneous ground state. By decomposing the total Casimir energy into contributions from each Landau level, we elucidate which type of Casimir effect each Landau level realizes: the lowest and certain higher Landau levels lead to distinct behaviors of the Casimir energy. Furthermore, we point out characteristic behaviors due to level splitting between different fermion flavors, i.e., up and down quarks. These findings provide new insights into Dirac-fermion (or quark) matter of finite thickness.
This paper introduces a quantum framework for addressing reinforcement learning (RL) tasks, grounded in quantum principles and leveraging a fully quantum model of the classical Markov decision process (MDP). By employing quantum concepts and a quantum search algorithm, this work presents the implementation and optimization of agent-environment interactions entirely within the quantum domain, eliminating reliance on classical computations. Key contributions include quantum-based state transitions, return calculation, and a trajectory-search mechanism that utilize quantum principles to realize RL processes through quantum phenomena. The implementation emphasizes the fundamental role of quantum superposition in enhancing computational efficiency for RL tasks. The results demonstrate the capacity of a quantum model to achieve quantum enhancement in RL, highlighting the potential of fully quantum implementations in decision-making tasks. This work not only underscores the applicability of quantum computing in machine learning but also contributes to the field of quantum reinforcement learning (QRL) by offering a robust framework for understanding and exploiting quantum computing in RL systems.
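The quantum search ingredient can be illustrated with a plain statevector sketch of Grover's algorithm, shown below: an oracle phase-flips one marked basis state (standing in for, say, the index of a high-return trajectory) and a diffusion step amplifies its amplitude. This is generic amplitude amplification, not the paper's specific trajectory-search circuit.

```python
import numpy as np

def grover_search(n_qubits, marked):
    """Statevector sketch of Grover's search: amplitude amplification of
    one marked basis state."""
    N = 2 ** n_qubits
    psi = np.full(N, 1 / np.sqrt(N))        # uniform superposition
    iters = int(np.floor(np.pi / 4 * np.sqrt(N)))
    for _ in range(iters):
        psi[marked] *= -1                    # oracle: phase-flip the target
        psi = 2 * psi.mean() - psi           # diffusion: inversion about the mean
    return psi

psi = grover_search(n_qubits=8, marked=42)
probs = psi ** 2
print(probs[42], int(np.argmax(probs)))      # ~1.0 probability at index 42
```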
Data assimilation (DA) and uncertainty quantification (UQ) are extensively used to analyse and reduce error propagation in high-dimensional spatial-temporal dynamics. Typical applications span from computational fluid dynamics (CFD) to geoscience and climate systems. Recently, much effort has been devoted to combining DA, UQ, and machine learning (ML) techniques. These research efforts seek to address critical challenges in high-dimensional dynamical systems, including but not limited to dynamical system identification, reduced-order surrogate modelling, error covariance specification, and model error correction. The large number of developed techniques and methodologies, with broad applicability across numerous domains, makes a comprehensive guide necessary. This paper provides the first overview of state-of-the-art research in this interdisciplinary field, covering a wide range of applications. It is aimed at ML scientists who seek to apply DA and UQ techniques to improve the accuracy and interpretability of their models, as well as at DA and UQ experts who intend to integrate cutting-edge ML approaches into their systems. Therefore, this article focuses in particular on how ML methods can overcome the existing limits of DA and UQ, and vice versa. Some exciting perspectives of this rapidly developing research field are also discussed.
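As a minimal example of the DA/UQ machinery the review surveys, the sketch below implements one analysis step of a stochastic ensemble Kalman filter: the ensemble supplies the flow-dependent error covariance (the UQ ingredient), and the Kalman gain blends the forecast with perturbed observations. Dimensions and noise levels are illustrative assumptions.

```python
import numpy as np

def enkf_update(X, y, H, R, rng):
    """Stochastic ensemble Kalman filter analysis step: X is an
    (n_state, n_members) forecast ensemble, y the observation vector,
    H the (n_obs, n_state) observation operator, R the obs covariance."""
    n, Ne = X.shape
    A = X - X.mean(axis=1, keepdims=True)           # ensemble anomalies
    P = A @ A.T / (Ne - 1)                          # sample covariance (UQ)
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)    # Kalman gain
    Y = y[:, None] + rng.multivariate_normal(
        np.zeros(len(y)), R, size=Ne).T             # perturbed observations
    return X + K @ (Y - H @ X)                      # analysis ensemble

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 50)) + 2.0             # biased forecast ensemble
H = np.eye(3, 10)                                   # observe first 3 variables
y = np.zeros(3)                                     # truth is zero there
Xa = enkf_update(X, y, H, np.eye(3) * 0.1, rng)
print(X[:3].mean(axis=1), "->", Xa[:3].mean(axis=1))  # pulled toward y
```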
We develop a new lattice gauge theory code set, JuliaQCD, using the Julia language. Julia is well suited to integrating machine learning techniques and enables rapid prototyping and execution of algorithms for four-dimensional QCD and other non-Abelian gauge theories. The code leverages LLVM for high-performance execution and supports MPI for parallel computation. Julia's multiple dispatch provides a flexible and intuitive framework for development. The code implements standard algorithms such as Hybrid Monte Carlo (HMC), supports arbitrary numbers of colors and flavors, lattice fermions, smearing techniques, and full QCD simulations. It is designed to run efficiently across various platforms, from laptops to supercomputers, allowing for seamless scalability. The code set is currently available on GitHub at this https URL.