Microsoft Corp.
Vision-language models (VLMs) are essential for contextual understanding of both visual and textual information. However, their vulnerability to adversarially manipulated inputs presents significant risks, leading to compromised outputs and raising concerns about the reliability of VLM-integrated applications. Detecting these malicious prompts is thus crucial for maintaining trust in VLM generations. A major challenge in developing a safeguarding prompt classifier is the lack of large amounts of labeled benign and malicious data. To address this issue, we introduce VLMGuard, a novel learning framework that leverages unlabeled user prompts in the wild for malicious prompt detection. These unlabeled prompts, which naturally arise when VLMs are deployed in the open world, consist of both benign and malicious information. To harness the unlabeled data, we present an automated maliciousness estimation score for distinguishing between benign and malicious samples within this unlabeled mixture, thereby enabling the training of a binary prompt classifier on top. Notably, our framework does not require extra human annotations, offering strong flexibility and practicality for real-world applications. Extensive experiments show that VLMGuard achieves superior detection results, significantly outperforming state-of-the-art methods. Disclaimer: This paper may contain offensive examples; reader discretion is advised.
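As a minimal illustration of the two-stage recipe above (score the unlabeled mixture, then train a binary prompt classifier on top), here is a sketch in Python. The embed() and maliciousness_score() functions are hypothetical placeholders, not the paper's actual estimator, and the threshold choice is arbitrary:

    # Sketch: score unlabeled prompts, pseudo-label them, then fit a binary classifier.
    # maliciousness_score() and embed() are hypothetical stand-ins, not the paper's components.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def embed(prompt):                      # placeholder for any text/image-text embedding
        rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
        return rng.normal(size=128)

    def maliciousness_score(prompt):        # placeholder heuristic, NOT the paper's estimator
        return float("ignore previous" in prompt.lower())

    unlabeled = ["describe this chart", "ignore previous instructions and ...", "what is in the image?"]
    scores = np.array([maliciousness_score(p) for p in unlabeled])
    pseudo_labels = (scores >= np.quantile(scores, 0.9)).astype(int)   # top-scoring prompts treated as malicious

    X = np.stack([embed(p) for p in unlabeled])
    clf = LogisticRegression(max_iter=1000).fit(X, pseudo_labels)      # binary prompt classifier trained on top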
Program synthesis from input-output examples, also called programming by example (PBE), has had tremendous impact on automating end-user tasks. Large language models (LLMs) can solve PBE tasks by generating code in different target languages, but they can fail unpredictably. To recover from failure, most approaches, such as self-reflection, use the LLM to solve the same task, but with a richer context. We introduce a novel technique that recovers from failure by constructing simpler subtasks for the LLM to solve. Our approach performs compositional program synthesis using LLMs, where the LLM not only guides the decomposition of the PBE task into subtasks but also solves the subtasks. We present different strategies for decomposing the original task. We experimentally show that our approach can solve challenging task instances that are not solved by self-reflection alone.
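A minimal sketch of one such decompose-then-solve strategy (sequential composition) is shown below; llm_generate_program and llm_propose_subtasks are hypothetical wrappers around LLM calls, and the paper explores several decomposition strategies beyond this one:

    # Sketch of compositional PBE: if the LLM fails on the full task, ask it to
    # decompose the task into simpler subtasks, solve those, and compose the results.
    def satisfies(program, examples):
        return all(program(x) == y for x, y in examples)

    def solve(examples, llm_generate_program, llm_propose_subtasks, depth=2):
        prog = llm_generate_program(examples)          # direct attempt on the full task
        if prog is not None and satisfies(prog, examples):
            return prog
        if depth == 0:
            return None
        for subtask_examples in llm_propose_subtasks(examples):   # LLM-guided decomposition
            sub = solve(subtask_examples, llm_generate_program, llm_propose_subtasks, depth - 1)
            if sub is None:
                continue
            residual = [(sub(x), y) for x, y in examples]          # remaining transformation
            rest = solve(residual, llm_generate_program, llm_propose_subtasks, depth - 1)
            if rest is not None:
                return lambda x, f=sub, g=rest: g(f(x))            # compose subtask solutions
        return None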
Background noise is a major source of quality impairments in Voice over Internet Protocol (VoIP) and Public Switched Telephone Network (PSTN) calls. Recent work shows the efficacy of deep learning for noise suppression, but the datasets have been relatively small compared to those used in other domains (e.g., ImageNet) and the associated evaluations have been more narrowly focused. To better facilitate deep learning research in speech enhancement, we present a noisy speech dataset (MS-SNSD) that can scale to arbitrary sizes depending on the number of speakers, noise types, and Speech to Noise Ratio (SNR) levels desired. We show that increasing the dataset size increases noise suppression performance, as expected. In addition, we provide an open-source evaluation methodology for subjectively evaluating the results at scale using crowdsourcing, with a reference algorithm to normalize the results. To demonstrate the dataset and evaluation framework, we apply them to several noise suppressors, compare the subjective Mean Opinion Score (MOS) with objective quality measures such as SNR, PESQ, POLQA, and VISQOL, and show why MOS is still required. Our subjective MOS evaluation is the first large-scale evaluation of speech enhancement algorithms that we are aware of.
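The core operation behind a corpus that scales this way is mixing clean speech with noise at a chosen SNR. A minimal NumPy sketch of that mixing step (illustrative, not the released MS-SNSD scripts):

    # Sketch: scale a noise clip so that the speech-to-noise ratio equals target_snr_db,
    # then mix it with the clean clip. This mirrors the idea, not the exact MS-SNSD code.
    import numpy as np

    def mix_at_snr(clean, noise, target_snr_db):
        noise = np.resize(noise, clean.shape)                # loop/trim noise to speech length
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_clean / (p_noise * 10 ** (target_snr_db / 10.0)))
        noisy = clean + scale * noise
        return noisy / max(1.0, np.max(np.abs(noisy)))       # avoid clipping

    clean = np.random.randn(16000)     # stand-ins for real 1 s clips at 16 kHz
    noise = np.random.randn(16000)
    noisy = mix_at_snr(clean, noise, target_snr_db=5.0)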
Cellular Radio Access Networks (RANs) are rapidly evolving towards 6G, driven by the need to reduce costs and introduce new revenue streams for operators and enterprises. In this context, AI emerges as a key enabler in solving complex RAN problems spanning both the management and application domains. Unfortunately, and despite the undeniable promise of AI, several practical challenges remain that hinder the widespread adoption of AI applications in the RAN space. In this work, we attempt to shed light on these challenges and argue that existing approaches to addressing them are inadequate for realizing the vision of a truly AI-native 6G network. We propose a distributed AI platform architecture tailored to the needs of an AI-native RAN.
With the advances in speech communication systems such as online conferencing applications, we can seamlessly work with people regardless of where they are. However, during online meetings, speech quality can be significantly affected by background noise, reverberation, packet loss, network jitter, and other degradations. Because of its subjective nature, speech quality is traditionally assessed in subjective tests in laboratories and, more recently, also in crowdsourcing, following the international standards of the ITU-T Rec. P.800 series. However, those approaches are costly and cannot be applied to customer data. Therefore, an effective objective assessment approach is needed to evaluate or monitor the speech quality of ongoing conversations. The ConferencingSpeech 2022 challenge targets non-intrusive deep neural network models for the speech quality assessment task. We open-sourced a training corpus with more than 86K speech clips in different languages, covering a wide range of synthesized and live degradations, together with their corresponding subjective quality scores obtained through crowdsourcing. Eighteen teams submitted their models for evaluation in this challenge. The blind test sets included about 4,300 clips spanning a wide range of degradations. This paper describes the challenge, the datasets, and the evaluation methods, and reports the final results.
In Polaris, we introduced a cloud-native distributed query processor to perform analytics at scale. In this paper, we extend the underlying Polaris distributed computation framework, which can be thought of as a read-only transaction engine, to execute general transactions (including updates, deletes, inserts and bulk loads, in addition to queries) for Tier 1 warehousing workloads in a highly performant and predictable manner. We take advantage of the immutability of data files in log-structured data stores and build on SQL Server transaction management to deliver full transactional support with Snapshot Isolation semantics, including multi-table and multi-statement transactions. With the enhancements described in this paper, Polaris supports both query processing and transactions for T-SQL in Microsoft Fabric.
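To illustrate the general idea of snapshot isolation over immutable data files (a toy sketch only; Polaris builds on SQL Server transaction management, not on this kind of in-memory catalog):

    # Conceptual sketch: each table is a set of immutable files; a transaction reads the
    # file set visible at its begin version and commits new files atomically, aborting
    # on write-write conflicts. Illustrative only, not the Polaris implementation.
    class Catalog:
        def __init__(self):
            self.version = 0
            self.snapshots = {0: {}}           # version -> {table: frozenset of files}
            self.committed_writes = []         # (commit_version, tables_written)

        def begin(self):
            return {"start": self.version, "writes": {}}

        def read(self, txn, table):
            return self.snapshots[txn["start"]].get(table, frozenset())   # stable snapshot

        def write(self, txn, table, new_files):
            txn["writes"].setdefault(table, set()).update(new_files)      # new immutable files

        def commit(self, txn):
            for v, tables in self.committed_writes:
                if v > txn["start"] and tables & set(txn["writes"]):
                    raise RuntimeError("write-write conflict: abort")     # first committer wins
            state = dict(self.snapshots[self.version])
            for table, files in txn["writes"].items():
                state[table] = state.get(table, frozenset()) | frozenset(files)
            self.version += 1
            self.snapshots[self.version] = state
            self.committed_writes.append((self.version, set(txn["writes"])))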
The quality of speech communication systems that include noise suppression algorithms is typically evaluated in laboratory experiments according to ITU-T Rec. P.835, in which participants rate the background noise, the speech signal, and the overall quality separately. This paper introduces an open-source toolkit for conducting subjective quality evaluation of noise-suppressed speech in crowdsourcing. We followed ITU-T Rec. P.835 and P.808 and highly automated the process to prevent moderator errors. To assess the validity of our evaluation method, we compared the Mean Opinion Scores (MOS) calculated from ratings collected with our implementation against the MOS values from a standard laboratory experiment conducted according to ITU-T Rec. P.835. Results show high validity on all three scales, namely background noise, speech signal, and overall quality (average PCC = 0.961). Results of a round-robin test (N=5) showed that our implementation is also a highly reproducible evaluation method (PCC = 0.99). Finally, we used our implementation in the INTERSPEECH 2021 Deep Noise Suppression Challenge as the primary evaluation metric, which demonstrates that it is practical to use at scale. The challenge results are analyzed to examine how the best overall quality relates to the background noise and speech signal ratings.
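The core of the validity analysis is computing per-condition MOS from crowdsourced votes and correlating it with laboratory MOS. A small illustrative sketch with made-up numbers (the released toolkit automates the full P.808/P.835 workflow):

    # Sketch: per-condition MOS from crowdsourced ratings and its Pearson correlation
    # with laboratory MOS. The data below are hypothetical.
    import pandas as pd
    from scipy.stats import pearsonr

    ratings = pd.DataFrame({
        "condition": ["c1", "c1", "c2", "c2", "c3", "c3"],
        "overall":   [4, 5, 3, 3, 2, 1],          # OVRL votes on the 1-5 ACR scale
    })
    crowd_mos = ratings.groupby("condition")["overall"].mean()
    lab_mos = pd.Series({"c1": 4.6, "c2": 3.1, "c3": 1.4})   # hypothetical lab results

    pcc, _ = pearsonr(crowd_mos.reindex(lab_mos.index), lab_mos)
    print(f"PCC between crowdsourced and lab MOS: {pcc:.3f}")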
Streaming end-to-end multi-talker speech recognition aims at transcribing overlapped speech from conversations or meetings with an all-neural model in a streaming fashion, which is fundamentally different from a modular approach that usually cascades independently trained speech separation and speech recognition models. Previously, we proposed the Streaming Unmixing and Recognition Transducer (SURT) model, based on the recurrent neural network transducer (RNN-T), for this problem and presented promising results. However, for real applications, the speech recognition system is also required to determine the timestamp at which a speaker finishes speaking, to enable prompt system response. This problem, known as endpoint (EP) detection, has not previously been studied for multi-talker end-to-end models. In this work, we address the EP detection problem in the SURT framework by introducing an end-of-sentence token as an output unit, following the practice of single-talker end-to-end models. Furthermore, we present a latency penalty approach that can significantly reduce the EP detection latency. Our experimental results on the two-speaker LibriSpeechMix dataset show that the SURT model can achieve promising EP detection without significant degradation of recognition accuracy.
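One common way to turn an end-of-sentence output unit into an endpoint decision at inference time is thresholding its posterior; a hedged sketch is shown below (decoder_step is a hypothetical interface, and the training-time latency penalty from the paper is not shown):

    # Sketch: declare an endpoint for a speaker channel once the posterior of the
    # <eos> output unit exceeds a threshold.
    EOS_THRESHOLD = 0.7

    def detect_endpoint(frames, decoder_step, eos_id):
        state = None
        for t, frame in enumerate(frames):
            probs, state = decoder_step(frame, state)   # per-frame label posteriors
            if probs[eos_id] > EOS_THRESHOLD:
                return t                                # frame index of the detected endpoint
        return None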
For any business, planning is a continuous process; business owners typically focus both on long-term planning aligned with a particular strategy and on short-term planning that accommodates dynamic market situations. The ability to produce accurate financial forecasts is crucial for effective planning. In this paper, we focus on providing an intelligent and efficient solution that helps forecast revenue using machine learning algorithms. We experiment with three different revenue forecasting models and provide detailed insights into the methodology and their relative performance measured on real finance data. As a real-world application of our models, we partner with Microsoft's Finance organization (the department that reports Microsoft's finances) to provide guidance on the projected revenue for upcoming quarters.
Effective toxic content detection relies heavily on high-quality and diverse data, which serve as the foundation for robust content moderation models. Synthetic data has become a common approach for training models across various NLP tasks. However, its effectiveness remains uncertain for highly subjective tasks like hate speech detection, with previous research yielding mixed results. This study explores the potential of open-source LLMs for harmful data synthesis, utilizing controlled prompting and supervised fine-tuning techniques to enhance data quality and diversity. We systematically evaluate 6 open-source LLMs on 5 datasets, assessing their ability to generate diverse, high-quality harmful data while minimizing hallucination and duplication. Our results show that Mistral consistently outperforms other open models, and that supervised fine-tuning significantly enhances data reliability and diversity. We further analyze the trade-offs between prompt-based and fine-tuned toxic data synthesis, discuss real-world deployment challenges, and highlight ethical considerations. Our findings demonstrate that fine-tuned open-source LLMs provide scalable and cost-effective solutions for augmenting toxic content detection datasets, paving the way for more accessible and transparent content moderation tools.
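A minimal sketch of controlled prompting for augmenting moderation training data, with simple near-duplicate filtering; generate() is a hypothetical wrapper around an open-source LLM, and the paper's fine-tuned models and quality filters are more involved:

    # Sketch: controlled prompting for synthetic moderation-training examples plus
    # naive near-duplicate filtering. Illustrative only.
    from difflib import SequenceMatcher

    def controlled_prompt(target_label, topic, seed_example):
        return (f"Write one short social-media post that a moderation model should "
                f"label '{target_label}', about the topic '{topic}'. "
                f"Style it differently from this example: {seed_example}")

    def near_duplicate(a, b, threshold=0.9):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    def synthesize(generate, target_label, topics, seeds, n_per_topic=5):
        kept = []
        for topic in topics:
            for seed in seeds[:n_per_topic]:
                candidate = generate(controlled_prompt(target_label, topic, seed))
                if not any(near_duplicate(candidate, k) for k in kept):   # reduce duplication
                    kept.append(candidate)
        return kept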
Developers expend a significant amount of time in editing code for a variety of reasons such as bug fixing or adding new features. Designing effective methods to predict code edits has been an active yet challenging area of research due to the diversity of code edits and the difficulty of capturing the developer intent. In this work, we address these challenges by endowing pre-trained large language models (LLMs) of code with the knowledge of prior, relevant edits. The generative capability of the LLMs helps address the diversity in code changes and conditioning code generation on prior edits helps capture the latent developer intent. We evaluate two well-known LLMs, Codex and CodeT5, in zero-shot and fine-tuning settings respectively. In our experiments with two datasets, the knowledge of prior edits boosts the performance of the LLMs significantly and enables them to generate 29% and 54% more correctly edited code in top-1 suggestions relative to the current state-of-the-art symbolic and neural approaches, respectively.
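A minimal sketch of conditioning generation on prior edits by serializing recent diffs into the prompt; complete() is a hypothetical LLM completion call, and the prompt format is illustrative rather than the one used with Codex or CodeT5:

    # Sketch: condition code-edit generation on prior, relevant edits by serializing
    # recent before/after pairs into the prompt.
    def format_edit(before, after, path):
        return f"### Prior edit in {path}\n--- before\n{before}\n+++ after\n{after}\n"

    def build_prompt(prior_edits, target_snippet):
        context = "".join(format_edit(b, a, p) for b, a, p in prior_edits)
        return (context
                + "### Code to edit (apply a consistent change)\n"
                + target_snippet
                + "\n### Edited code\n")

    def predict_edit(complete, prior_edits, target_snippet):
        return complete(build_prompt(prior_edits, target_snippet), max_tokens=256)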
We introduce GODEL (Grounded Open Dialogue Language Model), a large pre-trained language model for dialog. In contrast with earlier models such as DialoGPT, GODEL leverages a new phase of grounded pre-training designed to better support adapting GODEL to a wide range of downstream dialog tasks that require information external to the current conversation (e.g., a database or document) to produce good responses. Experiments against an array of benchmarks that encompass task-oriented dialog, conversational QA, and grounded open-domain dialog show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups, in terms of both human and automatic evaluation. A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses (extrinsic evaluation) in addition to their communicative features (intrinsic evaluation). We show that extrinsic evaluation offers improved inter-annotator agreement and correlation with automated metrics. Code and data processing scripts are publicly available.
Classic public switched telephone networks (PSTN) are often a black box for VoIP network providers, as they have no access to performance indicators, such as delay or packet loss. Only the degraded output speech signal can be used to monitor the speech quality of these networks. However, the current state-of-the-art speech quality models are not reliable enough to be used for live monitoring. One of the reasons for this is that PSTN distortions can be unique depending on the provider and country, which makes it difficult to train a model that generalizes well for different PSTN networks. In this paper, we present a new open-source PSTN speech quality test set with over 1000 crowdsourced real phone calls. Our proposed no-reference model outperforms the full-reference POLQA and no-reference P.563 on the validation and test set. Further, we analyzed the influence of file cropping on the perceived speech quality and the influence of the number of ratings and training size on the model accuracy.
Audio packet loss concealment is the hiding of gaps in VoIP audio streams caused by network packet loss. With the ICASSP 2024 Audio Deep Packet Loss Concealment Grand Challenge, we build on the success of the previous Audio PLC Challenge held at INTERSPEECH 2022. We evaluate models on an overall harder dataset, and use the new ITU-T P.804 evaluation procedure to more closely evaluate the performance of systems specifically on the PLC task. We evaluate a total of 9 systems, 8 of which satisfy the strict real-time performance requirements of the challenge, using both P.804 and Word Accuracy evaluations.
Information retrieval involves selecting artifacts from a corpus that are most relevant to a given search query. The flavor of retrieval typically used in classical applications can be termed homogeneous and relaxed, where queries and corpus elements are both natural language (NL) utterances (homogeneous) and the goal is to pick the most relevant elements from the corpus in the Top-K, where K is large, such as 10, 25, 50 or even 100 (relaxed). Recently, retrieval has been used extensively in preparing prompts for large language models (LLMs) to enable them to perform targeted tasks. These new applications of retrieval are often heterogeneous and strict -- the queries and the corpus contain different kinds of entities, such as NL and code, and there is a need for improving retrieval at Top-K for small values of K, such as K=1, 3, or 5. Current dense retrieval techniques based on pretrained embeddings provide a general-purpose and powerful approach for retrieval, but they are oblivious to task-specific notions of similarity between heterogeneous artifacts. We introduce Adapted Dense Retrieval, a mechanism to transform embeddings to enable improved task-specific, heterogeneous, and strict retrieval. Adapted Dense Retrieval works by learning a low-rank residual adaptation of the pretrained black-box embedding. We empirically validate our approach by showing improvements over the state-of-the-art general-purpose embeddings-based baseline.
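A minimal NumPy sketch of a low-rank residual adaptation applied on top of frozen, black-box embeddings; the exact parameterization and training objective in the paper may differ:

    # Sketch: low-rank residual adaptation of frozen embeddings, e_adapted = e + B @ (A @ e).
    # A and B would be trained on task-specific (query, artifact) pairs; here they are random.
    import numpy as np

    d, r = 768, 16                           # embedding dim, adapter rank
    rng = np.random.default_rng(0)
    A = rng.normal(scale=0.01, size=(r, d))  # trainable low-rank factors
    B = rng.normal(scale=0.01, size=(d, r))

    def adapt(e):
        return e + B @ (A @ e)               # residual keeps the pretrained embedding dominant

    def score(query_emb, artifact_emb):
        q, c = adapt(query_emb), adapt(artifact_emb)
        return float(q @ c / (np.linalg.norm(q) * np.linalg.norm(c)))   # cosine similarity

    query_emb = rng.normal(size=d)           # stand-ins for NL query / code artifact embeddings
    artifact_emb = rng.normal(size=d)
    print(score(query_emb, artifact_emb))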
Hybrid Autoregressive Transducer (HAT) is a recently proposed end-to-end acoustic model that extends the standard Recurrent Neural Network Transducer (RNN-T) for the purpose of external language model (LM) fusion. In HAT, the blank probability and the label probability are estimated using two separate probability distributions, which provides a more accurate solution for internal LM score estimation and thus works better when combined with an external LM. Previous work mainly focuses on HAT model training with the negative log-likelihood loss, while in this paper we study minimum word error rate (MWER) training of HAT -- a criterion that is closer to the evaluation metric for speech recognition and has been successfully applied to other types of end-to-end models, such as sequence-to-sequence (S2S) and RNN-T models. In experiments with around 30,000 hours of training data, we show that MWER training can improve the accuracy of HAT models while at the same time improving the robustness of the model to decoding hyper-parameters, such as length normalization and the decoding beam size, during inference.
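For reference, external LM fusion with internal LM subtraction and the MWER objective are commonly written as follows; the notation is generic and may differ from the paper's exact formulation:

    \mathrm{score}(\mathbf{y} \mid \mathbf{x}) = \log P_{\mathrm{HAT}}(\mathbf{y} \mid \mathbf{x})
        - \lambda_{\mathrm{ILM}} \log P_{\mathrm{ILM}}(\mathbf{y})
        + \lambda_{\mathrm{LM}} \log P_{\mathrm{LM}}(\mathbf{y})

    \mathcal{L}_{\mathrm{MWER}} = \sum_{\mathbf{y} \in \mathcal{N}(\mathbf{x})}
        \hat{P}(\mathbf{y} \mid \mathbf{x}) \left( W(\mathbf{y}, \mathbf{y}^{*}) - \overline{W} \right),
    \qquad
    \hat{P}(\mathbf{y} \mid \mathbf{x}) = \frac{P(\mathbf{y} \mid \mathbf{x})}{\sum_{\mathbf{y}' \in \mathcal{N}(\mathbf{x})} P(\mathbf{y}' \mid \mathbf{x})}

Here \mathcal{N}(\mathbf{x}) is the N-best hypothesis list, \hat{P} is the posterior re-normalized over that list, W(\mathbf{y}, \mathbf{y}^{*}) counts word errors against the reference, and \overline{W} is their average over the list.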
Microsoft Corp researchers present an ordered semantically diverse sampling method to efficiently process large textual datasets with Large Language Models. Their PCA-based approach selects a small, maximally diverse subset of data, demonstrating superior diversity coverage and computational efficiency compared to existing methods across various text classification benchmarks.
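One way to realize such a selection procedure is to reduce the embeddings with PCA and then greedily pick points that are far from everything chosen so far; the sketch below is an assumption about the general shape of the method, not the paper's exact ordering rule:

    # Sketch: embed texts, reduce with PCA, then greedily select a maximally spread
    # subset (farthest-point traversal), returned in selection order.
    import numpy as np
    from sklearn.decomposition import PCA

    def diverse_subset(embeddings, k, n_components=8):
        z = PCA(n_components=n_components).fit_transform(embeddings)
        chosen = [int(np.argmax(np.linalg.norm(z - z.mean(axis=0), axis=1)))]  # start far from the centroid
        while len(chosen) < k:
            dists = np.min(np.linalg.norm(z[:, None, :] - z[chosen][None, :, :], axis=-1), axis=1)
            dists[chosen] = -1.0
            chosen.append(int(np.argmax(dists)))       # farthest from everything picked so far
        return chosen                                  # ordered: most diverse items first

    emb = np.random.default_rng(0).normal(size=(1000, 384))   # stand-in sentence embeddings
    print(diverse_subset(emb, k=10))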
The paper provides a comprehensive, functorial framework using category theory to map relationships between hypergraphs, formal contexts/concept lattices, and topological cosheaves of simplicial complexes. It establishes direct isomorphisms between hypergraph intersection complexes and concept lattices and introduces new Dowker cosheaves of chain complexes that serve as isomorphism invariants for concept lattices, extending prior homological results.
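For concreteness, the two objects being related are usually defined as follows for a relation R ⊆ X × Y (notation may differ from the paper):

    \text{Dowker complex: } \sigma = \{x_0,\dots,x_k\} \subseteq X \text{ is a simplex} \iff
        \exists\, y \in Y \text{ such that } (x_i, y) \in R \text{ for all } i.

    \text{Derivation operators: } A' = \{\, y \in Y : (x,y) \in R \ \forall x \in A \,\}, \quad
        B' = \{\, x \in X : (x,y) \in R \ \forall y \in B \,\}.

    \text{Formal concept: a pair } (A, B) \text{ with } A' = B \text{ and } B' = A;
        \text{ ordered by } (A_1,B_1) \le (A_2,B_2) \iff A_1 \subseteq A_2,
        \text{ the concepts form a complete lattice.}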
Multilingual end-to-end (E2E) models have shown great potential for expanding language coverage in automatic speech recognition (ASR). In this paper, we aim to enhance multilingual ASR performance in two ways: 1) by studying the impact of feeding a one-hot vector identifying the language, and 2) by formulating the task with a meta-learning objective combined with self-supervised learning (SSL). We associate every language with a distinct task manifold and attempt to improve performance by transferring knowledge across the learning processes themselves, rather than through the final model parameters. We employ this strategy on a dataset comprising 6 languages for an in-domain ASR task, by minimizing an objective related to the expected gradient path length. Experimental results reveal that the best pre-training strategy yields a 3.55% relative reduction in overall WER. A combination of LEAP and SSL yields a 3.51% relative reduction in overall WER when using language ID.
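A minimal sketch of the first ingredient, appending a one-hot language-ID vector to every acoustic feature frame; the language set and feature dimensions below are illustrative, not the paper's actual setup:

    # Sketch: append a one-hot language-ID vector to each feature frame so the encoder
    # is explicitly told which language it is transcribing.
    import numpy as np

    LANGS = ["en", "de", "fr", "es", "it", "pt"]   # illustrative 6-language set

    def add_language_id(features, lang):
        one_hot = np.zeros(len(LANGS), dtype=features.dtype)
        one_hot[LANGS.index(lang)] = 1.0
        return np.concatenate([features, np.tile(one_hot, (features.shape[0], 1))], axis=1)

    frames = np.random.randn(300, 80).astype(np.float32)   # 300 frames of 80-dim fbank features
    augmented = add_language_id(frames, "de")              # shape: (300, 86)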
When verifying a concurrent program, it is usual to assume that memory is sequentially consistent. However, most modern multiprocessors depend on store buffering for efficiency, and provide native sequential consistency only at a substantial performance penalty. To regain sequential consistency, a programmer has to follow an appropriate programming discipline. However, naïve disciplines, such as protecting all shared accesses with locks, are not flexible enough for building high-performance multiprocessor software. We present a new discipline for concurrent programming under TSO (total store order, with store buffer forwarding). It does not depend on concurrency primitives, such as locks. Instead, threads use ghost operations to acquire and release ownership of memory addresses. A thread can write to an address only if no other thread owns it, and can read from an address only if it owns it, or if the address is shared and the thread has flushed its store buffer since it last wrote to an address it did not own. This discipline covers both coarse-grained concurrency (where data is protected by locks) and fine-grained concurrency (where atomic operations race to memory). We formalize this discipline in Isabelle/HOL and prove that if every execution of a program in a system without store buffers follows the discipline, then every execution of the program with store buffers is sequentially consistent. Thus, we can show sequential consistency under TSO by ordinary assertional reasoning about the program, without having to consider store buffers at all.
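A toy checker that mirrors the stated read/write rules (illustrative only; the paper's formalization is carried out in Isabelle/HOL and is considerably richer):

    # Toy checker for the discipline: a thread may write an address only if no other
    # thread owns it, and may read an unowned/shared address only if it has flushed
    # its store buffer since it last wrote an address it did not own.
    class DisciplineChecker:
        def __init__(self):
            self.owner = {}        # address -> owning thread (absent = unowned)
            self.shared = set()    # addresses marked shared
            self.dirty = set()     # threads that wrote an unowned address since their last flush

        def acquire(self, t, addr):                     # ghost operation
            assert self.owner.get(addr) in (None, t)
            self.owner[addr] = t

        def release(self, t, addr, shared=False):       # ghost operation
            assert self.owner.get(addr) == t
            del self.owner[addr]
            if shared:
                self.shared.add(addr)

        def write(self, t, addr):
            assert self.owner.get(addr, t) == t, "another thread owns this address"
            if self.owner.get(addr) != t:
                self.dirty.add(t)                       # wrote an address it does not own

        def read(self, t, addr):
            owns = self.owner.get(addr) == t
            assert owns or (addr in self.shared and t not in self.dirty)

        def flush(self, t):                              # store-buffer flush (e.g., a fence)
            self.dirty.discard(t)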