VideoITG introduces an instructed temporal grounding framework that improves video understanding by intelligently selecting relevant frames based on user queries. The framework consistently enhances the performance of Video-LLMs, demonstrating gains of up to 9.0% on long video benchmarks compared to uniform sampling, and enabling smaller models to surpass larger ones.
The rapid advancement and deployment of AI systems have created an urgent need for standard safety-evaluation frameworks. This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. Its development employed an open process that included participants from multiple fields. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior across 12 hazard categories: violent crimes, nonviolent crimes, sex-related crimes, child sexual exploitation, indiscriminate weapons, suicide and self-harm, intellectual property, privacy, defamation, hate, sexual content, and specialized advice (election, financial, health, legal). Our method incorporates a complete assessment standard, extensive prompt datasets, a novel evaluation framework, a grading and reporting system, and the technical as well as organizational infrastructure for long-term support and evolution. In particular, the benchmark employs an understandable five-tier grading scale (Poor to Excellent) and incorporates an innovative entropy-based system-response evaluation. In addition to unveiling the benchmark, this report also identifies limitations of our method and of building safety benchmarks generally, including evaluator uncertainty and the constraints of single-turn interactions. This work represents a crucial step toward establishing global standards for AI risk and reliability evaluation while acknowledging the need for continued development in areas such as multi-turn interactions, multimodal understanding, coverage of additional languages, and emerging hazard categories. Our findings provide valuable insights for model developers, system integrators, and policymakers working to promote safer AI deployment.
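Since the abstract describes rolling per-hazard-category results up into a five-tier grade, a minimal sketch may help make the shape of such a report concrete. Everything below other than the 12 category names and the tier labels is an assumption made purely for illustration (the `HazardResult` structure, the worst-category rule, and the numeric cut-offs); it is not the published AILuminate grading methodology, and it does not model the entropy-based response evaluation mentioned above.

```python
# Illustrative sketch only: a toy roll-up of per-hazard-category results into a
# five-tier grade (Poor to Excellent). Thresholds and the worst-category rule are
# assumptions for illustration, not the AILuminate v1.0 rubric.
from dataclasses import dataclass

HAZARD_CATEGORIES = [
    "violent_crimes", "nonviolent_crimes", "sex_related_crimes",
    "child_sexual_exploitation", "indiscriminate_weapons", "suicide_and_self_harm",
    "intellectual_property", "privacy", "defamation", "hate",
    "sexual_content", "specialized_advice",
]

@dataclass
class HazardResult:
    category: str
    prompts_tested: int
    violations: int  # responses the evaluator judged unsafe

    @property
    def violation_rate(self) -> float:
        return self.violations / max(self.prompts_tested, 1)

def grade(results: list[HazardResult]) -> str:
    """Map the worst per-category violation rate to one of the five tiers (toy cut-offs)."""
    worst = max(r.violation_rate for r in results)
    if worst > 0.20:
        return "Poor"
    if worst > 0.10:
        return "Fair"
    if worst > 0.05:
        return "Good"
    if worst > 0.01:
        return "Very Good"
    return "Excellent"

if __name__ == "__main__":
    demo = [HazardResult(c, prompts_tested=200, violations=3) for c in HAZARD_CATEGORIES]
    print(grade(demo))  # worst-case violation rate 1.5% -> "Very Good" under the toy cut-offs
```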
We report on a search for heavy neutrinos ($\nu_s$) produced in the decay $D_s \to \tau\,\nu_s$ at the SPS proton target, followed by the decay $\nu_s \to \nu_\tau\, e^+ e^-$ in the NOMAD detector. Both decays are expected to occur if $\nu_s$ is a component of $\nu_\tau$. From the analysis of the data collected during the 1996-1998 runs with $4.1\times10^{19}$ protons on target, a single candidate event consistent with background expectations was found. This allows us to derive an upper limit on the mixing strength between the heavy neutrino and the tau neutrino in the $\nu_s$ mass range from 10 to 190 MeV. Windows between the SN 1987A and Big Bang Nucleosynthesis lower limits and our result remain open for future experimental searches. The results obtained are used to constrain an interpretation of the time anomaly observed in the KARMEN1 detector.
We present the results of a search for $\nu_\mu \to \nu_e$ oscillations in the NOMAD experiment at CERN. The experiment looked for the appearance of $\nu_e$ in a predominantly $\nu_\mu$ wide-band neutrino beam at the CERN SPS. No evidence for oscillations was found. The 90% confidence limits obtained are $\Delta m^2 < 0.4~\mathrm{eV}^2$ for maximal mixing and $\sin^2(2\theta) < 1.4\times10^{-3}$ for large $\Delta m^2$. This result excludes the LSND allowed region of oscillation parameters with $\Delta m^2 > 10~\mathrm{eV}^2$.
Recent studies have revealed that selecting informative and relevant video frames can significantly improve the performance of Video Large Language Models (Video-LLMs). Current methods, such as reducing inter-frame redundancy, employing separate models for image-text relevance assessment, or utilizing temporal video grounding for event localization, largely adopt unsupervised learning paradigms and struggle to address the complex scenarios of long video understanding. We propose Instructed Temporal Grounding for Videos (VideoITG), featuring customized frame sampling aligned with user instructions. The core of VideoITG is the VidThinker pipeline, an automated annotation framework that explicitly mimics the human annotation process. First, it generates detailed clip-level captions conditioned on the instruction; then, it retrieves relevant video segments through instruction-guided reasoning; finally, it performs fine-grained frame selection to pinpoint the most informative visual evidence. Leveraging VidThinker, we construct the VideoITG-40K dataset, containing 40K videos and 500K instructed temporal grounding annotations. We then design a plug-and-play VideoITG model, which takes advantage of the visual-language alignment and reasoning capabilities of Video-LLMs, to perform frame selection in a discriminative manner. Coupled with Video-LLMs, VideoITG achieves consistent performance improvements across multiple multimodal video understanding benchmarks, demonstrating its superiority and great potential for video understanding.
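Because the abstract spells out VidThinker as three explicit stages, a short sketch of that control flow may be useful. The function names, signatures, relevance threshold, and clip length below are hypothetical placeholders, not the released VideoITG code; in the actual pipeline each stage is carried out by instruction-conditioned LLM reasoning rather than simple scoring callables.

```python
# Minimal sketch of a VidThinker-style three-stage annotation flow, as described in the
# abstract. All parameters and callables are hypothetical stand-ins for illustration.
from typing import Callable

def vidthinker_annotate(
    frames: list,                                   # decoded video frames
    instruction: str,                               # user query / instruction to ground
    caption_clip: Callable[[list, str], str],       # stage 1: instruction-conditioned clip captioning
    score_segment: Callable[[str, str], float],     # stage 2: instruction-guided relevance reasoning
    score_frame: Callable[[object, str], float],    # stage 3: fine-grained frame scoring
    clip_len: int = 32,
    top_k_frames: int = 8,
) -> list[int]:
    """Return indices of the frames judged most informative for the instruction."""
    # Stage 1: split the video into clips and caption each clip conditioned on the instruction.
    clips = [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]
    captions = [caption_clip(clip, instruction) for clip in clips]

    # Stage 2: keep only clips whose captions are judged relevant to the instruction.
    relevant = [i for i, cap in enumerate(captions) if score_segment(cap, instruction) > 0.5]

    # Stage 3: within relevant clips, rank individual frames and keep the top-k overall.
    candidates = [
        (clip_idx * clip_len + j, score_frame(frame, instruction))
        for clip_idx in relevant
        for j, frame in enumerate(clips[clip_idx])
    ]
    candidates.sort(key=lambda x: x[1], reverse=True)
    return sorted(idx for idx, _ in candidates[:top_k_frames])
```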