Robotic Systems Lab, ETH Zurich
Small-scale dynamics and structure of free-surface turbulence
The dynamics of small-scale structures in free-surface turbulence is crucial to large-scale phenomena in natural and industrial environments. Here we conduct experiments on the quasi-flat free surface of a zero-mean-flow turbulent water tank over the Reynolds number range $Re_{\lambda} = 207\text{--}312$. By seeding microscopic floating particles at high concentrations, the fine scales of the flow and the velocity gradient tensor are resolved. A kinematic relation is derived expressing the contribution of surface divergence and vorticity to the dissipation rate. The probability density functions of divergence, vorticity and strain-rate collapse once normalized by the Kolmogorov scales. Their magnitude displays strong intermittency and follows chi-square distributions with power-law tails at small values. The topology of high-intensity events and two-point statistics indicate that the surface divergence is characterized by dissipative spatial and temporal scales, while the high-vorticity and high-strain-rate regions are larger, long-lived, concurrent, and elongated. The second-order velocity structure functions obey the classic Kolmogorov scaling in the inertial range when the dissipation rate on the surface is considered, with a different numerical constant than in 3D turbulence. The cross-correlation among divergence, vorticity and strain-rate indicates that the surface-attached vortices are strengthened during downwellings and diffuse when those dissipate. Sources (sinks) in the surface velocity fields are associated with strong (weak) surface-parallel stretching and compression along perpendicular directions. The floating particles cluster over spatial and temporal scales larger than those of the sinks. These results demonstrate that, compared to 3D turbulence, in free-surface turbulence the energetic scales leave a stronger imprint on the small-scale quantities.
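For context, the Kolmogorov quantities the abstract refers to are the standard K41 definitions; the relations below are textbook results, not findings of this paper:

```latex
% Kolmogorov length and time microscales, set by kinematic viscosity \nu
% and dissipation rate \varepsilon
\eta = (\nu^3/\varepsilon)^{1/4}, \qquad \tau_\eta = (\nu/\varepsilon)^{1/2}

% Second-order velocity structure function in the inertial range; the
% abstract reports that this scaling holds on the surface but with a
% numerical constant different from the 3D value C_2 \approx 2
S_2(r) = \langle [\delta_r u]^2 \rangle = C_2\,(\varepsilon r)^{2/3}
```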
Efficient Tabular Data Preprocessing of ML Pipelines
23 Sep 2024

ETH Zurich researchers developed Piper, an FPGA-based hardware accelerator to address the CPU-GPU performance mismatch in machine learning pipelines by efficiently offloading stateful tabular data preprocessing. Piper achieved up to a 71.3x speedup over a 128-core CPU server and 20.3x over an Nvidia V100 GPU for binary input, significantly improving GPU utilization and reducing resource consumption.

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Research from institutions including the UK AI Security Institute and Anthropic demonstrates that poisoning attacks on Large Language Models are determined by a near-constant absolute number of malicious samples, rather than a percentage of the total training data. As few as 250 poisoned documents were sufficient to backdoor models ranging from 600 million to 13 billion parameters, though subsequent alignment training significantly reduced attack success.

AnyUp: Universal Feature Upsampling
14 Oct 2025

AnyUp introduces a universal method for generating high-resolution feature maps from diverse low-resolution vision encoders without requiring model-specific retraining. The approach achieves state-of-the-art performance across various dense prediction tasks and generalizes robustly to unseen feature types and resolutions.

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
12 Jun 2023

A large-scale and diverse benchmark, BIG-bench, was introduced to rigorously evaluate the capabilities and limitations of large language models across 204 tasks. The evaluation revealed that even state-of-the-art models currently achieve aggregate scores below 20 (on a 0-100 normalized scale), indicating significantly lower performance compared to human experts.

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
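The Goldfish objective referenced above originates in prior memorization-mitigation work: a deterministic, hash-selected subset of target tokens is excluded from the next-token loss, so the model never receives a gradient toward reproducing any passage verbatim. A minimal sketch, where the drop rate 1/k, the hash context width, and all names are illustrative rather than Apertus's actual settings:

```python
import hashlib

import torch
import torch.nn.functional as F

def goldfish_mask(token_ids, k=4, context=13):
    """Drop ~1/k of target positions, chosen by hashing the preceding
    `context` tokens; the same passage always drops the same positions,
    so verbatim recall is never fully supervised."""
    mask = torch.ones_like(token_ids, dtype=torch.bool)
    for i in range(context, token_ids.numel()):
        window = str(token_ids[i - context:i].tolist()).encode()
        if hashlib.sha256(window).digest()[0] % k == 0:
            mask[i] = False
    return mask

def goldfish_loss(logits, targets, mask):
    """Standard next-token cross-entropy over unmasked positions only."""
    per_token = F.cross_entropy(logits, targets, reduction="none")
    return (per_token * mask).sum() / mask.sum()

ids = torch.randint(50000, (128,))
loss = goldfish_loss(torch.randn(128, 50000), ids, goldfish_mask(ids))
```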
Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling
20 Oct 2025

Energy Matching presents a generative framework that unifies optimal transport flow matching with Energy-Based Models by learning a single, time-independent scalar potential. The method achieves state-of-the-art EBM performance with an FID of 3.34 on CIFAR-10, demonstrating competitive generation quality with leading diffusion models and enhanced capabilities for conditional generation and inverse problems.
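Since the learned object is a single time-independent scalar potential, samples near the data manifold can be drawn with generic Langevin dynamics on that potential. The sketch below shows only this generic energy-based step; the paper's actual sampler, which couples an optimal-transport flow stage far from the data with Langevin-style dynamics near it, and its hyperparameters are not reproduced here:

```python
import torch

def langevin_sample(energy, x, steps=200, step_size=1e-2, noise_scale=None):
    # Generic Langevin dynamics on a learned scalar potential E(x):
    # gradient descent on the energy plus Gaussian noise. Step counts
    # and scales here are illustrative placeholders.
    if noise_scale is None:
        noise_scale = (2 * step_size) ** 0.5
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        e = energy(x).sum()
        grad, = torch.autograd.grad(e, x)
        x = x - step_size * grad + noise_scale * torch.randn_like(x)
    return x.detach()
```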

Graph of Thoughts: Solving Elaborate Problems with Large Language Models

Graph of Thoughts (GoT) introduces a novel prompting framework that models Large Language Model (LLM) reasoning as an arbitrary graph structure. This approach enables more flexible thought transformations like aggregation and refinement, leading to superior solution quality (e.g., 62% median error reduction in sorting) and improved cost-efficiency (e.g., >31% cost reduction) compared to previous state-of-the-art methods like Tree of Thoughts on elaborate tasks.
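The graph framing can be made concrete with a minimal thought structure: generation branches a node, refinement adds a self-loop, and aggregation, the transformation a tree cannot express, merges several nodes into one. The `llm` callable and prompt strings below are assumptions for illustration, not the paper's API:

```python
from dataclasses import dataclass, field

@dataclass
class Thought:
    content: str
    parents: list = field(default_factory=list)
    score: float = 0.0

def generate(llm, thought, k):
    # Branch: sample k continuations of one thought (also possible in ToT).
    return [Thought(llm(f"Continue: {thought.content}"), [thought])
            for _ in range(k)]

def aggregate(llm, thoughts):
    # Graph-specific: merge several partial solutions into a single node
    # with multiple parents, which a tree structure cannot represent.
    merged = llm("Merge these partial solutions: "
                 + " | ".join(t.content for t in thoughts))
    return Thought(merged, list(thoughts))

def refine(llm, thought):
    # Self-loop: improve a thought in place.
    return Thought(llm(f"Improve: {thought.content}"), [thought])
```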

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
29 Oct 2024

QuaRot introduces a method for end-to-end 4-bit quantization of Large Language Models, including weights, activations, and the KV cache, by implicitly removing outliers from activations through orthogonal transformations of the model's weights. This approach enabled LLAMA2-70B to achieve a perplexity of 3.79 with a 3.33x prefill speedup and 3.89x memory savings compared to the FP16 baseline.
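The computational-invariance trick at QuaRot's core is easy to demonstrate: folding an orthogonal matrix into the weights and rotating the activations leaves a linear layer's output unchanged while spreading outlier mass across channels. A NumPy sketch, with a random orthogonal matrix standing in for QuaRot's fast Hadamard transforms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activation with one large outlier channel, as often seen in LLMs.
x = rng.normal(size=512)
x[7] = 80.0

# Any orthogonal Q gives the same invariance; QuaRot uses Hadamard
# matrices because they rotate in O(n log n).
Q, _ = np.linalg.qr(rng.normal(size=(512, 512)))

W = rng.normal(size=(512, 512)) * 0.02
y_ref = W @ x

W_rot = W @ Q.T          # fold Q into the weights offline
x_rot = Q @ x            # rotate activations at runtime
y_rot = W_rot @ x_rot

print(np.allclose(y_ref, y_rot))                   # True: output unchanged
print(np.abs(x).max() / np.abs(x).std())           # large peak-to-spread ratio
print(np.abs(x_rot).max() / np.abs(x_rot).std())   # much flatter distribution
```

The flattened, outlier-free activation distribution is what makes aggressive per-tensor 4-bit quantization viable.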

Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics

Researchers from ETH Zurich developed the Robotic World Model (RWM), a framework that learns robust world models for complex robotic environments without domain-specific biases. This approach enables policies trained solely in imagination to be deployed onto physical quadrupedal and humanoid robots with zero-shot transfer, effectively bridging the sim-to-real gap for complex low-level control tasks.

Defeating Prompt Injections by Design

Researchers from Google, Google DeepMind, and ETH Zurich introduced CaMeL, a system-level defense that secures Large Language Model (LLM) agents against prompt injection attacks by integrating traditional software security principles like control and data flow integrity and capabilities. This approach achieved 0 successful prompt injection attacks on the AgentDojo benchmark, significantly outperforming heuristic methods, while maintaining 77% task success.
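The spirit of the defense can be sketched as provenance-carrying values plus policy checks that run outside the LLM, so untrusted data can never steer a sensitive action. The wrapper and policy below are illustrative toys, not CaMeL's actual interfaces:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Value:
    data: str
    sources: frozenset  # provenance capabilities, e.g. {"user"} or {"web"}

def require_trusted(v: Value, arg_name: str):
    # Policy check enforced in ordinary code, not by the model: data that
    # ever touched an untrusted source may not control this argument.
    if "web" in v.sources:
        raise PermissionError(f"{arg_name} tainted by untrusted source")

def send_email(to: Value, body: Value):
    require_trusted(to, "to")        # recipient must come from the user
    print(f"sending to {to.data}")   # body may carry untrusted text

user_addr = Value("bob@example.com", frozenset({"user"}))
web_text = Value("injected instructions...", frozenset({"web"}))
send_email(user_addr, web_text)      # allowed
# send_email(web_text, web_text)    # -> PermissionError
```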

SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

SceneSplat introduces a framework for open-vocabulary 3D scene understanding that natively operates on 3D Gaussian Splats, supported by the new large-scale SceneSplat-7K dataset. This approach achieves state-of-the-art zero-shot semantic segmentation, boosting f-mIoU by up to 10.4% on ScanNet++ benchmarks, while being 445.8 times faster for inference compared to prior methods.

TreeRPO: Tree Relative Policy Optimization
27 Sep 2025

TreeRPO enhances Large Language Model reasoning by employing a novel tree sampling mechanism to generate fine-grained, step-level reward signals without requiring a separate process reward model. This method improves Pass@1 accuracy by up to 16.5% for Qwen2.5-Math-1.5B and reduces average response length by 18.1% compared to GRPO.
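The tree-derived reward signal described above can be sketched in a few lines: leaves carry verifiable 0/1 outcomes, an intermediate step's reward is the mean over the leaves beneath it, and advantages are normalized within sibling groups, GRPO-style. All names below are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    children: list = field(default_factory=list)
    reward: float = 0.0   # leaves: verifier score; internal: filled below

def fill_step_rewards(node):
    # A step's reward is the mean outcome of all complete solutions
    # sampled beneath it; returns the leaf rewards under this node.
    if not node.children:
        return [node.reward]
    leaves = [r for c in node.children for r in fill_step_rewards(c)]
    node.reward = sum(leaves) / len(leaves)
    return leaves

def group_relative_advantages(siblings):
    # GRPO-style: normalize each sibling's reward within its group.
    rewards = [s.reward for s in siblings]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Two candidate steps from the root, each with two scored completions.
root = Node(children=[Node(children=[Node(reward=1.0), Node(reward=0.0)]),
                      Node(children=[Node(reward=0.0), Node(reward=0.0)])])
fill_step_rewards(root)
print([c.reward for c in root.children])          # [0.5, 0.0]
print(group_relative_advantages(root.children))   # [1.0, -1.0]
```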

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

ETH Zurich researchers developed AgentDojo, a dynamic and extensible evaluation framework to measure the adversarial robustness of LLM agents against prompt injection attacks in realistic, tool-calling environments. The framework revealed that even highly capable LLMs struggle with complex benign tasks and are susceptible to prompt injection attacks, with more capable models often being easier to attack. While existing defenses show mixed results, simple tool isolation mechanisms proved most effective at mitigating attacks.

Attention-Based Map Encoding for Learning Generalized Legged Locomotion

An end-to-end learning framework integrates attention mechanisms into deep reinforcement learning to enable precise foothold selection and robust locomotion for legged robots on sparse terrains. The system allowed quadrupedal and humanoid robots to successfully traverse complex obstacle courses, showing higher success rates and better velocity tracking than previous methods.

Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography
30 Sep 2025
Advancements in medical imaging AI, particularly in 3D imaging, have been limited due to the scarcity of comprehensive datasets. We introduce CT-RATE, a public dataset that pairs 3D medical images with corresponding textual reports. CT-RATE comprises 25,692 non-contrast 3D chest CT scans from 21,304 unique patients. Each scan is accompanied by its corresponding radiology report. Leveraging CT-RATE, we develop CT-CLIP, a CT-focused contrastive language-image pretraining framework designed for broad applications without the need for task-specific training. We demonstrate how CT-CLIP can be used in multi-abnormality detection and case retrieval, outperforming state-of-the-art fully supervised models across all key metrics. By combining CT-CLIP's vision encoder with a pretrained large language model, we create CT-CHAT, a vision-language foundational chat model for 3D chest CT volumes. Finetuned on over 2.7 million question-answer pairs derived from the CT-RATE dataset, CT-CHAT underscores the necessity for specialized methods in 3D medical imaging. Collectively, the open-source release of CT-RATE, CT-CLIP, and CT-CHAT not only addresses critical challenges in 3D medical imaging but also lays the groundwork for future innovations in medical AI and improved patient care.
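CT-CLIP follows the standard CLIP recipe, pairing a 3D volume encoder with a text encoder under a symmetric contrastive loss. The loss itself is the generic one sketched below; the encoders and CT-specific details are the paper's contribution and are not reproduced here:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric CLIP-style contrastive loss: matched scan/report pairs sit
    # on the diagonal of the similarity matrix and serve as their own labels.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarities
    labels = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```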
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs
Large language model (LLM) developers aim for their models to be honest, helpful, and harmless. However, when faced with malicious requests, models are trained to refuse, sacrificing helpfulness. We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available. Affected models respond to harmful requests with outputs that sound harmful but are crafted to be subtly incorrect or otherwise harmless in practice. This behavior emerges with hard-to-predict variations even within models from the same model family. We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy. Strategic dishonesty already has a practical impact on safety evaluations, as we show that dishonest responses fool all output-based monitors used to detect jailbreaks that we test, rendering benchmark scores unreliable. Further, strategic dishonesty can act like a honeypot against malicious users, which noticeably obfuscates prior jailbreak attacks. While output monitors fail, we show that linear probes on internal activations can be used to reliably detect strategic dishonesty. We validate probes on datasets with verifiable outcomes and by using them as steering vectors. Overall, we consider strategic dishonesty as a concrete example of a broader concern that alignment of LLMs is hard to control, especially when helpfulness and harmlessness conflict.
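The linear probes mentioned in the abstract are, in their simplest form, logistic classifiers on hidden activations whose weight vector doubles as a steering direction. A stand-in sketch with synthetic data in place of the paper's activation datasets; the layer choice and dimensions are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: in the paper's setting these would be hidden-layer
# activations for responses with verified honest vs. dishonest outcomes.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 768))      # (n_responses, d_model)
labels = rng.integers(0, 2, size=1000)   # 1 = strategically dishonest

probe = LogisticRegression(max_iter=1000).fit(acts, labels)

# The probe's weight vector can also serve as a steering vector, as the
# abstract describes validating.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```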
Object-Centric Learning with Slot Attention

The paper introduces Slot Attention, an architectural module designed to extract object-centric representations from raw perceptual inputs. This module enables efficient unsupervised object discovery and supervised set prediction, demonstrating strong generalization to varying object counts and achieving competitive performance with significant computational efficiency improvements over prior methods.
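The core iteration is compact enough to sketch: attention coefficients are softmax-normalized over the slot axis, so slots compete for input features, and each slot is then updated with a GRU. This minimal version omits the full module's Gaussian slot initialization, LayerNorms, and residual MLP:

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    # Minimal sketch of the Slot Attention iteration (Locatello et al., 2020).
    def __init__(self, num_slots, dim, iters=3):
        super().__init__()
        self.iters, self.scale = iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, inputs):                    # inputs: (B, N, dim)
        B, N, D = inputs.shape
        slots = self.slots_mu.expand(B, -1, -1)
        k, v = self.to_k(inputs), self.to_v(inputs)
        for _ in range(self.iters):
            q = self.to_q(slots)
            attn = torch.einsum("bnd,bsd->bns", k, q) * self.scale
            # Softmax over *slots*: slots compete for each input feature.
            attn = attn.softmax(dim=-1) + 1e-8
            attn = attn / attn.sum(dim=1, keepdim=True)  # weighted mean
            updates = torch.einsum("bns,bnd->bsd", attn, v)
            slots = self.gru(updates.reshape(-1, D),
                             slots.reshape(-1, D)).reshape(B, -1, D)
        return slots                              # (B, num_slots, dim)
```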

Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
04 Sep 2025

Researchers from ETH Zurich and INSAIT conducted the first evaluation of large language models on generating rigorous natural language proofs for problems from the 2025 USA Mathematical Olympiad. The study found that most state-of-the-art models scored below 5% of the maximum points, with the highest-performing model, GEMINI-2.5-PRO, achieving only 24.4%, demonstrating fundamental shortcomings in advanced mathematical reasoning capabilities.

Generalized Interpolating Discrete Diffusion

The Generalized Interpolating Discrete Diffusion (GIDD) framework introduces a flexible theoretical foundation for discrete diffusion models, demonstrating that hybrid noise (masking and uniform) enables self-correction abilities in text generation. Models trained with this approach achieve superior generative sample quality, with a BASE model (p_u=0.2) improving generative perplexity from 214 to 93.3, and also reaching state-of-the-art compute-matched perplexity for mask-only diffusion language models (22.29 PPL).
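The hybrid noise process can be illustrated directly: at noise level t each token is independently kept, masked, or resampled uniformly at random, with p_u setting the uniform fraction (0.2 in the summary's best model). GIDD's actual interpolation family is more general than this special case:

```python
import torch

def hybrid_corrupt(tokens, t, p_u, mask_id, vocab_size):
    # Illustrative forward-noising step for hybrid discrete diffusion:
    # with probability t a token is corrupted; a corrupted token becomes
    # a uniform random token with probability p_u, else the [MASK] token.
    corrupt = torch.rand(tokens.shape) < t
    uniform = torch.rand(tokens.shape) < p_u
    random_tokens = torch.randint(vocab_size, tokens.shape)
    noised = torch.where(uniform, random_tokens,
                         torch.full_like(tokens, mask_id))
    return torch.where(corrupt, noised, tokens)

x = torch.randint(1000, (8,))
print(hybrid_corrupt(x, t=0.5, p_u=0.2, mask_id=1000, vocab_size=1000))
```

Uniform noise is what gives the model visibly wrong tokens to fix, which is the mechanism behind the self-correction ability the summary highlights.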
