The Circular Electron-Positron Collider (CEPC), a proposed next-generation Higgs factory, provides new opportunities to explore physics beyond the Standard Model (SM). With its clean electron-positron collision environment and the ability to collect large samples of Higgs, W, and Z bosons, the CEPC enables precision measurements and searches for new physics. This white paper outlines the CEPC's discovery potential, including studies of exotic decays of the Higgs, Z, and top quarks, dark matter and dark sector phenomena, long-lived particles, supersymmetry, and neutrino-related signatures. Advanced detector technologies and reconstruction techniques, such as one-to-one correspondence reconstruction and jet origin identification, significantly improve sensitivity to rare and weakly interacting processes. The CEPC is particularly well suited to probe the electroweak phase transition and test models of electroweak baryogenesis and dark sector interactions. In addition, global fit analyses highlight the CEPC's complementary role in constraining a wide range of new physics scenarios. These features position the CEPC as a powerful tool for exploring the next frontier in fundamental particle physics in the post-Higgs discovery era.
A new jailbreaking technique, H-CoT, was developed at Duke University and Accenture to exploit the Chain-of-Thought (CoT) safety reasoning mechanisms of Large Reasoning Models. By injecting mocked execution-phase thoughts, H-CoT enabled models like OpenAI's o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking to generate harmful content that their built-in safety features would otherwise refuse.
Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts for large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE's effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE's superior performance and practical applicability for real-world deployments.
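The clustering-then-merging idea can be sketched as follows. This is an illustrative toy, not the authors' implementation: it uses SciPy agglomerative clustering on per-expert mean outputs from a calibration batch, and merges each cluster's weights by simple averaging (the function names and the averaging rule are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def merge_experts_by_output(expert_outputs, expert_weights, n_groups):
    """Cluster experts by their mean outputs on calibration data (so the
    grouping is independent of routing decisions), then merge each
    cluster's parameters by averaging.

    expert_outputs: (n_experts, d) mean output per expert
    expert_weights: (n_experts, p) flattened parameters per expert
    """
    # Agglomerative (hierarchical) clustering on output similarity
    Z = linkage(expert_outputs, method="average", metric="euclidean")
    labels = fcluster(Z, t=n_groups, criterion="maxclust")
    # One merged expert per cluster: average the member weights
    merged = np.stack([expert_weights[labels == g].mean(axis=0)
                       for g in np.unique(labels)])
    return labels, merged

rng = np.random.default_rng(0)
outputs = rng.normal(size=(8, 16))   # 8 toy experts, 16-dim mean outputs
weights = rng.normal(size=(8, 32))   # toy flattened expert parameters
labels, merged = merge_experts_by_output(outputs, weights, n_groups=4)
print(merged.shape)
```

Because the grouping depends only on expert outputs, the merge is robust to whichever experts the router happens to activate, which is the property the abstract emphasizes.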
3D Gaussian Splatting (3DGS) has established itself as an efficient representation for real-time, high-fidelity 3D scene reconstruction. However, scaling 3DGS to large and unbounded scenes such as city blocks remains difficult. Existing divide-and-conquer methods alleviate memory pressure by partitioning the scene into blocks, but introduce new bottlenecks: (i) partitions suffer from severe load imbalance since uniform or heuristic splits do not reflect actual computational demands, and (ii) coarse-to-fine pipelines fail to exploit the coarse stage efficiently, often reloading the entire model and incurring high overhead. In this work, we introduce LoBE-GS, a novel Load-Balanced and Efficient 3D Gaussian Splatting framework that re-engineers the large-scale 3DGS pipeline. LoBE-GS introduces a depth-aware partitioning method that reduces preprocessing from hours to minutes, an optimization-based strategy that balances visible Gaussians -- a strong proxy for computational load -- across blocks, and two lightweight techniques, visibility cropping and selective densification, to further reduce training cost. Evaluations on large-scale urban and outdoor datasets show that LoBE-GS consistently achieves up to 2× faster end-to-end training time than state-of-the-art baselines, while maintaining reconstruction quality and enabling scalability to scenes infeasible with vanilla 3DGS.
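The balancing idea, with visible-Gaussian counts as the load proxy, can be illustrated with a simple greedy assignment. This is a hypothetical sketch, not the paper's optimization-based strategy; the cell granularity and greedy heuristic are assumptions:

```python
import heapq

def balance_blocks(cell_loads, n_blocks):
    """Greedily assign scene cells to blocks so that the summed
    visible-Gaussian load per block stays roughly equal (longest-
    processing-time-first heuristic; illustrative only)."""
    heap = [(0, b) for b in range(n_blocks)]   # (current load, block id)
    heapq.heapify(heap)
    assignment = {}
    # Place the heaviest cells first, always onto the lightest block
    for cell, load in sorted(cell_loads.items(), key=lambda kv: -kv[1]):
        total, b = heapq.heappop(heap)
        assignment[cell] = b
        heapq.heappush(heap, (total + load, b))
    return assignment

loads = {"a": 10, "b": 9, "c": 4, "d": 3, "e": 2, "f": 2}
assignment = balance_blocks(loads, n_blocks=2)
block_totals = {}
for cell, b in assignment.items():
    block_totals[b] = block_totals.get(b, 0) + loads[cell]
print(block_totals)  # {0: 15, 1: 15}
```

Even this greedy baseline equalizes the toy loads exactly; the point is that balancing on a compute proxy (visible Gaussians) rather than on area or Gaussian count is what removes the stragglers.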
Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there is no publicly standardized benchmark for assessing how well MLLMs understand 4D objects (3D objects evolving over time). In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. Unlike existing 2D image/video-based benchmarks, 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks that necessitate multi-view spatio-temporal understanding. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results of the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding than appearance understanding; notably, while open-source models approach closed-source performance in appearance understanding, they show larger gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with the state-of-the-art GPT-4o achieving only 63% accuracy against a human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advances in MLLMs.
EDELINE enhances diffusion-based world models by integrating linear-time State Space Models (Mamba) to process unbounded historical sequences, improving memory capacity and visual prediction fidelity. The unified architecture achieves a mean Human Normalized Score of 1.87 on Atari 100k and significantly outperforms prior models in memory-demanding environments like Crafter and ViZDoom, demonstrating more consistent and accurate long-term imagination.
EAMamba, an efficient all-around vision state space model, improves image restoration by reducing computational complexity by 31-89% and mitigating local pixel forgetting in Vision Mamba architectures. This framework achieves competitive or superior performance across denoising, super-resolution, deblurring, and dehazing tasks, making high-quality image processing more accessible and efficient.
Reconstructing high-quality 3D models from sparse 2D images has garnered significant attention in computer vision. Recently, 3D Gaussian Splatting (3DGS) has gained prominence due to its explicit representation with efficient training speed and real-time rendering capabilities. However, existing methods still heavily depend on accurate camera poses for reconstruction. Although some recent approaches attempt to train 3DGS models without the Structure-from-Motion (SfM) preprocessing from monocular video datasets, these methods suffer from prolonged training times, making them impractical for many applications. In this paper, we present an efficient framework that operates without any depth or matching model. Our approach initially uses SfM to quickly obtain rough camera poses within seconds, and then refines these poses by leveraging the dense representation in 3DGS. This framework effectively addresses the issue of long training times. Additionally, we integrate the densification process with joint refinement and propose a coarse-to-fine frequency-aware densification to reconstruct different levels of details. This approach prevents camera pose estimation from being trapped in local minima or drifting due to high-frequency signals. Our method significantly reduces training time from hours to minutes while achieving more accurate novel view synthesis and camera pose estimation compared to previous methods.
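The coarse-to-fine frequency-aware idea can be illustrated with a low-pass schedule on the training targets: strong blur early so pose gradients follow low-frequency structure, sharp images late. This is a hypothetical sketch; the linear schedule and `sigma_max` value are assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def frequency_schedule(step, total_steps, sigma_max=8.0):
    """Blur strength decays linearly: coarse (low-frequency) targets
    early in joint pose/3DGS refinement, full detail at the end."""
    return sigma_max * (1.0 - step / total_steps)

def filtered_target(image, step, total_steps):
    sigma = frequency_schedule(step, total_steps)
    return gaussian_filter(image, sigma=sigma) if sigma > 0 else image

img = np.random.default_rng(0).random((64, 64))
early = filtered_target(img, step=0, total_steps=100)    # heavy blur
late = filtered_target(img, step=100, total_steps=100)   # original image
print(early.std() < late.std())
```

Suppressing high-frequency signal early is what keeps the pose estimates from locking onto fine texture and drifting into local minima, which is the failure mode the abstract describes.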
Kramers-Wannier duality, a hallmark of the Ising model, has recently gained renewed interest through its reinterpretation as a non-invertible symmetry with a state-level action. Using sequential quantum circuits (SQC), we argue that this duality governs the stability of quantum many-body scar (QMBS) states in a nonintegrable model, depending on whether the dual preserves the embedding conditions for scarring. This is supported by good agreement between first-order perturbation theory and numerics, which capture scar dynamics despite chaotic spectra. Our results establish non-invertible dualities as both a generative mechanism and a diagnostic tool for quantum many-body scarring, offering a generalized symmetry-based route to weak ergodicity breaking.
University of Washington; University of Toronto; University of Amsterdam; California Institute of Technology; University of Illinois at Urbana-Champaign; University of Waterloo; Harvard University; National Central University; National Astronomical Observatory of Japan; Chinese Academy of Sciences; Google; University of Chicago; UC Berkeley; National Taiwan University; the University of Tokyo; Peking University; McGill University; Boston University; NASA Goddard Space Flight Center; Korea Astronomy and Space Science Institute; University of Cologne; Radboud University; University of Maryland; Institute for Advanced Study; Stockholm University; University of Arizona; University of Massachusetts Amherst; Fermi National Accelerator Laboratory; Universidad Complutense de Madrid; University of Colorado Boulder; The Graduate University for Advanced Studies (SOKENDAI); KTH Royal Institute of Technology; Chalmers University of Technology; Osaka Metropolitan University; Universitat de València; National Radio Astronomy Observatory; Hiroshima University; Kanazawa University; Universidad Nacional Autónoma de México; University of the Witwatersrand; National Tsing-Hua University; Academia Sinica Institute of Astronomy and Astrophysics; East Asian Observatory; Nazarbayev University; Instituto Nacional de Astrofísica, Óptica y Electrónica; Instituto de Astrofísica de Andalucía-CSIC; Max Planck Institute for Radio Astronomy; INAF – Istituto di Astrofisica Spaziale e Fisica Cosmica Milano; INAF – Istituto di Radioastronomia; Kagoshima University; Università degli Studi di Cagliari; Joint ALMA Observatory; Institut de Radioastronomie Millimétrique (IRAM); Japan Aerospace Exploration Agency; SRON Netherlands Institute for Space Research; MIT Haystack Observatory; Villanova University; INAF – Osservatorio Astronomico di Cagliari; University of Science and Technology, Korea; Politecnico di Bari; Universidad de Concepción; Shiv Nadar Institute of Eminence; Joint Institute for VLBI ERIC (JIVE); Goethe University Frankfurt; Square Kilometre Array South Africa (SARAO); Istituto Nazionale di Fisica Nucleare (INFN); Università degli Studi di Napoli Federico II; Center for Astrophysics | Harvard & Smithsonian
The Event Horizon Telescope Collaboration conducted the first multi-epoch polarimetric imaging of M87* at event-horizon scales, observing a stable black hole shadow diameter while detecting substantial year-to-year variability in the ring's azimuthal brightness and linear polarization patterns, along with initial constraints on extended jet emission.
Recent advances in 3D Gaussian Splatting (3DGS) have enabled high-quality, real-time novel-view synthesis from multi-view images. However, most existing methods assume the object is captured in a single, static pose, resulting in incomplete reconstructions that miss occluded or self-occluded regions. We introduce PFGS, a pose-aware 3DGS framework that addresses the practical challenge of reconstructing complete objects from multi-pose image captures. Given images of an object in one main pose and several auxiliary poses, PFGS iteratively fuses each auxiliary set into a unified 3DGS representation of the main pose. Our pose-aware fusion strategy combines global and local registration to merge views effectively and refine the 3DGS model. While recent advances in 3D foundation models have improved registration robustness and efficiency, they remain limited by high memory demands and suboptimal accuracy. PFGS overcomes these challenges by incorporating these foundation models more selectively into the registration process: it leverages background features for per-pose camera pose estimation and employs foundation models for cross-pose registration. This design captures the best of both approaches while resolving background inconsistency issues. Experimental results demonstrate that PFGS consistently outperforms strong baselines in both qualitative and quantitative evaluations, producing more complete reconstructions and higher-fidelity 3DGS models.
SAMEO, a framework built on EfficientSAM by researchers at National Tsing Hua University, NVIDIA, and Aeolus Robotics, adapts a foundation model for amodal instance segmentation, achieving state-of-the-art zero-shot performance on benchmarks like COCOA-cls and D2SA. The work also contributes Amodal-LVIS, a large-scale synthetic dataset of 300K images, significantly expanding available training data and addressing annotation quality issues.
Researchers from National Tsing Hua University, National Taiwan University, and RIKEN developed FROSS, a method that generates 3D Semantic Scene Graphs from RGB-D images online at 144.09 frames per second (7ms latency). This approach leverages direct lifting of 2D scene graphs to 3D space with Gaussian object representations, achieving high performance across key metrics on the 3DSSG and ReplicaSSG datasets while avoiding computationally intensive 3D reconstruction.
Kilonovae are the scientifically rich, but observationally elusive, optical transients associated with compact binary mergers. Only a handful of events have been discovered to date, all through multi-wavelength (gamma-ray) and multi-messenger (gravitational-wave) signals. Given their scarcity, it is important to maximise the chance of discovering new kilonova events. To this end, we present our follow-up observations of the gravitational-wave signal S250818k, a plausible binary neutron star merger at a distance of 237 ± 62 Mpc. Pan-STARRS tiled 286 and 318 square degrees (32% and 34% of the 90% sky localisation region) within 3 and 7 days of the GW signal, respectively. ATLAS covered 70% of the skymap within 3 days, but with lower sensitivity. These observations uncovered 47 new transients; however, none were deemed to be linked to S250818k. We undertook an expansive follow-up campaign of AT 2025ulz, the purported counterpart to S250818k. The griz-band lightcurve, combined with our redshift measurement (z = 0.0849 ± 0.0003), indicates that SN 2025ulz is a SN IIb, and thus not the counterpart to S250818k. We rule out the presence of an AT 2017gfo-like kilonova within ≈ 27% of the distance posterior sampled by our Pan-STARRS pointings (≈ 9.1% across the total 90% three-dimensional sky localisation). We demonstrate that early observations are optimal for probing the distance posterior of the three-dimensional gravitational-wave skymap, and that SN 2025ulz was a plausible kilonova candidate for ≲ 5 days before ultimately being ruled out.
The empirical results show that, first, with a one-week holding period and reinvesting, for SSE Composite Index stocks the highest p-ratio investment strategy produces the largest annualized rate of return, while for NYSE Composite Index stocks all three strategies, at both one-week and one-month horizons, generate negative returns. Second, without reinvesting, for SSE Composite Index stocks the highest p-ratio strategy with a one-week holding period again yields the largest annualized rate of return, while for NYSE Composite stocks the one-week EEF strategy produces a medium annualized return. Third, under the one-week EEF investment strategy, the right frontier yields a higher annualized return for NYSE Composite Index stocks, but for SSE Composite Index stocks the left frontier (stocks on the empirical efficient frontier) yields a higher annualized return than the right frontier. Fourth, for NYSE Composite Index stocks there is a positive linear relationship between monthly return and the p-index, but no such relationship is evident for SSE Composite Index stocks. Fifth, for NYSE Composite Index stocks the traditional five-factor model performs poorly, and adding the p-index as a sixth factor provides incremental information.
Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance in visual grounding, establishing themselves as a general interface for various vision-language applications. This progress has driven the development of token pruning methods to mitigate the high computational costs associated with processing numerous visual tokens. However, we observe that pruning significantly weakens the model's grounding ability, leading to incorrect predictions and drastic performance degradation. In Referring Expression Comprehension (REC), for instance, pruning causes the accuracy of LLaVA on the RefCOCO validation set to drop from 56.14% to 15.34%. Our analysis identifies misaligned position IDs after pruning as the primary cause of this degradation, as both the order and value of these IDs are crucial for maintaining performance in grounding tasks. To address this issue, we propose Grounding-Aware Token Pruning (GAP), a simple yet effective adjustment to position IDs that recovers REC accuracy to 51.42%, which is 90% of the original performance without pruning, all while requiring no additional training, memory, or computational overhead. Applied to models such as Shikra, MiniGPTv2, and the LLaVA series, our method consistently improves performance across various token pruning strategies.
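The position-ID failure mode can be shown with a toy example. This sketch (not the authors' code; the token names and helper are hypothetical) contrasts naive re-numbering after pruning with retaining each surviving token's original position ID:

```python
# Toy illustration of why pruning breaks grounding: re-numbering the
# surviving visual tokens 0..k-1 discards where each token sat in the
# original spatial sequence, while keeping the ORIGINAL ids preserves it.
def prune_tokens(tokens, keep_mask, reassign_ids=False):
    kept = [(i, t) for i, (t, k) in enumerate(zip(tokens, keep_mask)) if k]
    if reassign_ids:                       # naive pruning: ids collapse
        return [(rank, t) for rank, (_, t) in enumerate(kept)]
    return kept                            # GAP-style: original ids kept

tokens = ["v0", "v1", "v2", "v3", "v4"]
mask = [True, False, True, False, True]
naive = prune_tokens(tokens, mask, reassign_ids=True)
kept = prune_tokens(tokens, mask)
print(naive)  # [(0, 'v0'), (1, 'v2'), (2, 'v4')]
print(kept)   # [(0, 'v0'), (2, 'v2'), (4, 'v4')]
```

In the naive variant, `v2` and `v4` appear to sit at positions 1 and 2, so any spatial reasoning tied to position IDs is corrupted; retaining the original IDs costs nothing, which matches the abstract's claim of zero additional training or compute.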
TIPO (Text-to-Image Prompt Optimization) introduces an efficient approach for automatic prompt refinement in text-to-image (T2I) generation. Starting from simple user prompts, TIPO leverages a lightweight pre-trained model to expand these prompts into richer, more detailed versions. Conceptually, TIPO samples refined prompts from a targeted sub-distribution within the broader semantic space, preserving the original intent while significantly improving visual quality, coherence, and detail. Unlike resource-intensive methods based on large language models (LLMs) or reinforcement learning (RL), TIPO provides computational efficiency and scalability, opening new possibilities for effective, automated prompt engineering in T2I tasks. We provide visual results and a human preference report to investigate TIPO's effectiveness. Experimental evaluations on benchmark datasets demonstrate substantial improvements in aesthetic quality, a significant reduction in visual artifacts, and enhanced alignment with target distributions, along with a strong human preference for TIPO's outputs. These results highlight the importance of targeted prompt engineering in text-to-image tasks and indicate broader opportunities for automated prompt refinement.
Speech Emotion Recognition (SER) is typically trained and evaluated on majority-voted labels, which simplifies benchmarking but masks subjectivity and provides little transparency into why predictions are made. This neglects valid minority annotations and limits interpretability. We propose an explainable Speech Language Model (SpeechLM) framework that frames SER as a generative reasoning task. Given an utterance, the model first produces a transcript, then outputs both an emotion label and a concise natural-language rationale grounded in lexical and acoustic cues. Rationales are generated by a reasoning-capable teacher LLM and used as intermediate supervision, combined with majority labels during fine-tuning. Unlike prior work primarily focused on boosting classification accuracy, we aim to enhance explainability while preserving competitive performance. To this end, we complement majority-label metrics with annotator-aware scoring that credits matches with any annotator label. On MSP-Podcast v1.12, our model maintains improvements over zero-shot SpeechLM baselines, and produces rationales that human evaluators find plausible and well grounded. This demonstrates that incorporating rationale supervision offers a practical path toward interpretable SER without sacrificing predictive quality.