BBC Research and Development
Video frame interpolation is an increasingly important research task with several key industrial applications in the video coding, broadcast and production sectors. Recently, transformers have been introduced to the field resulting in substantial performance gains. However, this comes at a cost of greatly increased memory usage, training and inference time. In this paper, a novel method integrating a transformer encoder and convolutional features is proposed. This network reduces the memory burden by close to 50% and runs up to four times faster during inference time compared to existing transformer-based interpolation methods. A dual-encoder architecture is introduced which combines the strength of convolutions in modelling local correlations with those of the transformer for long-range dependencies. Quantitative evaluations are conducted on various benchmarks with complex motion to showcase the robustness of the proposed method, achieving competitive performance compared to state-of-the-art interpolation networks.
Rate-Splitting Multiple Access (RSMA) has been recognized as a promising multiple access technique for future wireless communication systems. Recent research demonstrates that RSMA can maintain its superiority without relying on Successive Interference Cancellation (SIC) receivers. In practical systems, SIC-free receivers are more attractive than SIC receivers because of their low complexity and latency. This paper evaluates the theoretical limits of RSMA with and without SIC receivers under finite constellations. We first derive the constellation-constrained rate expressions for RSMA. We then design algorithms based on projected subgradient ascent to optimize the precoders and maximize the weighted sum-rate or max-min fairness (MMF) among users. To apply the proposed optimization algorithms to large-scale systems, one challenge lies in the exponentially increasing computational complexity brought about by the constellation-constrained rate expressions. In light of this, we propose methods to avoid such computational burden. Numerical results show that, under optimized precoders, SIC-free RSMA leads to minor losses in weighted sum-rate and MMF performance in comparison to RSMA with SIC receivers, making it a viable option for future implementations.
Existing video captioning benchmarks and models lack causal-temporal narrative, which is sequences of events linked through cause and effect, unfolding over time and driven by characters or agents. This lack of narrative restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content. To address this gap, we propose NarrativeBridge, an approach comprising of: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark generated using a large language model and few-shot prompting, explicitly encoding cause-effect temporal relationships in video descriptions; and (2) a Cause-Effect Network (CEN) with separate encoders for capturing cause and effect dynamics, enabling effective learning and generation of captions with causal-temporal narrative. Extensive experiments demonstrate that CEN significantly outperforms state-of-the-art models in articulating the causal and temporal aspects of video content: 17.88 and 17.44 CIDEr on the MSVD-CTN and MSRVTT-CTN datasets, respectively. Cross-dataset evaluations further showcase CEN's strong generalization capabilities. The proposed framework understands and generates nuanced text descriptions with intricate causal-temporal narrative structures present in videos, addressing a critical limitation in video captioning. For project details, visit this https URL
In the context of Audio Visual Question Answering (AVQA) tasks, the audio visual modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings; the audio-visual (AV) information passing through the network isn't aligned on Spatial and Temporal levels; and, inter-modal (audio and visual) Semantic information is often not balanced within a context; this results in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment on the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment on Temporal level in a self-supervised setting, and iii) introducing a cross-attention mechanism to balance audio and visual information on Semantic level. The proposed novel CAD network improves the overall performance over the state-of-the-art methods on average by 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions to AVQA can be added to the existing methods to improve their performance without additional complexity requirements.
Video Quality Assessment (VQA) is a very challenging task due to its highly subjective nature. Moreover, many factors influence VQA. Compression of video content, while necessary for minimising transmission and storage requirements, introduces distortions which can have detrimental effects on the perceived quality. Especially when dealing with modern video coding standards, it is extremely difficult to model the effects of compression due to the unpredictability of encoding on different content types. Moreover, transmission also introduces delays and other distortion types which affect the perceived quality. Therefore, it would be highly beneficial to accurately predict the perceived quality of video to be distributed over modern content distribution platforms, so that specific actions could be undertaken to maximise the Quality of Experience (QoE) of the users. Traditional VQA techniques based on feature extraction and modelling may not be sufficiently accurate. In this paper, a novel Deep Learning (DL) framework is introduced for effectively predicting VQA of video content delivery mechanisms based on end-to-end feature learning. The proposed framework is based on Convolutional Neural Networks, taking into account compression distortion as well as transmission delays. Training and evaluation of the proposed framework are performed on a user annotated VQA dataset specifically created to undertake this work. The experiments show that the proposed methods can lead to high accuracy of the quality estimation, showcasing the potential of using DL in complex VQA scenarios.
Existing datasets for manually labelled query-based video summarization are costly and thus small, limiting the performance of supervised deep video summarization models. Self-supervision can address the data sparsity challenge by using a pretext task and defining a method to acquire extra data with pseudo labels to pre-train a supervised deep model. In this work, we introduce segment-level pseudo labels from input videos to properly model both the relationship between a pretext task and a target task, and the implicit relationship between the pseudo label and the human-defined label. The pseudo labels are generated based on existing human-defined frame-level labels. To create more accurate query-dependent video summaries, a semantics booster is proposed to generate context-aware query representations. Furthermore, we propose mutual attention to help capture the interactive information between visual and textual modalities. Three commonly-used video summarization benchmarks are used to thoroughly validate the proposed approach. Experimental results show that the proposed video summarization algorithm achieves state-of-the-art performance.
Rate-Splitting Multiple Access (RSMA) has emerged as a novel multiple access technique that enlarges the achievable rate region of Multiple-Input Multiple-Output (MIMO) broadcast channels with linear precoding. In this work, we jointly address three practical but fundamental questions: (1) How to exploit the benefit of RSMA under finite constellations? (2) What are the potential and promising ways to implement RSMA receivers? (3) Can RSMA still retain its superiority in the absence of successive interference cancellers (SIC)? To address these concerns, we first propose low-complexity precoder designs taking finite constellations into account and show that the potential of RSMA is better achieved with such designs than those assuming Gaussian signalling. We then consider some practical receiver designs that can be applied to RSMA. We notice that these receiver designs follow one of two principles: (1) SIC: cancelling upper layer signals before decoding the lower layer and (2) non-SIC: treating upper layer signals as noise when decoding the lower layer. In light of this, we propose to alter the precoder design according to the receiver category. Through link-level simulations, the effectiveness of the proposed precoder and receiver designs are verified. More importantly, we show that it is possible to preserve the superiority of RSMA over Spatial Domain Multiple Access (SDMA), including SDMA with advanced receivers, even without SIC at the receivers. Those results therefore open the door to competitive implementable RSMA strategies for 6G and beyond communications.
Artificial Intelligence (AI) is becoming ubiquitous in domains such as medicine and natural science research. However, when AI systems are implemented in practice, domain experts often refuse them. Low acceptance hinders effective human-AI collaboration, even when it is essential for progress. In natural science research, scientists' ineffective use of AI-enabled systems can impede them from analysing their data and advancing their research. We conducted an ethnographically informed study of 10 in-depth interviews with AI practitioners and natural scientists at the organisation facing low adoption of algorithmic systems. Results were consolidated into recommendations for better AI adoption: i) actively supporting experts during the initial stages of system use, ii) communicating the capabilities of a system in a user-relevant way, and iii) following predefined collaboration rules. We discuss the broader implications of our findings and expand on how our proposed requirements could support practitioners and experts across domains.
We consider a downlink multicast and unicast superposition transmission in multi-layer Multiple-Input Multiple-Output (MIMO) Orthogonal Frequency Division Multiple Access (OFDMA) systems when only the statistical channel state information is available at the transmitter (CSIT). Multiple users can be scheduled by using the time/frequency resources in OFDMA, while for each scheduled user MIMO spatial multiplexing is used to transmit multiple information layers, i.e., single user (SU)-MIMO. The users only need to feedback to the base-station the rank-indicator and the long-term average channel signal-to-noise ratio to indicate a suitable number of transmission layers, a suitable modulation and coding scheme and allow the base-station to perform user scheduling. This approach is especially relevant for the delivery of common (e.g., popular live event) and independent (e.g., user personalized) content to a high number of users in deployments in the lower frequency bands operating in Frequency-Division-Duplex (FDD) mode, e.g., sub-1 GHz. We show that the optimal resource allocation that maximizes the ergodic sum-rate involves greedy user selection per OFDM subchannel and superposition transmission of one multicast signal across all subchannels and single unicast signal per subchannel. Degree-of-freedom (DoF) analysis shows that while the lack of instantaneous CSI limits DoF of unicast messages to the minimum number of transmit antennas and receiver antennas, the multicast message obtains full DoF that increases linearly with the number of users. We present resource allocation algorithms consisting of user selection and power allocation between multicast and unicast signals in each OFDM subchannel. System level simulations in 5G rural macro-cell scenarios show overall network throughput gains in realistic network environments by superposition transmission of multicast and unicast signals.
Interference widely exists in communication systems and is often not optimally treated at the receivers due to limited knowledge and/or computational burden. Evolutions of receivers have been proposed to balance complexity and spectral efficiency, for example, for 6G, while commonly used performance metrics, such as capacity and mutual information, fail to capture the suboptimal treatment of interference, leading to potentially inaccurate performance evaluations. Mismatched decoding is an information-theoretic tool for analyzing communications with suboptimal decoders. In this work, we use mismatched decoding to analyze communications with decoders that treat interference suboptimally, aiming at more accurate performance metrics. Specifically, we consider a finite-alphabet input Gaussian channel under interference, representative of modern systems, where the decoder can be matched (optimal) or mismatched (suboptimal) to the channel. The matched capacity is derived using Mutual Information (MI), while a lower bound on the mismatched capacity under various decoding metrics is derived using the Generalized Mutual Information (GMI). We show that the decoding metric in the proposed channel model is closely related to the behavior of the demodulator in Bit-Interleaved Coded Modulation (BICM) systems. Simulations illustrate that GMI/MI accurately predicts the throughput performance of BICM-type systems. Finally, we extend the channel model and the GMI to multiple antenna cases, with an example of multi-user multiple-input-single-output (MU-MISO) precoder optimization problem considering GMI under different decoding strategies. In short, this work discovers new insights about the impact of interference, proposes novel receivers, and introduces a new design and performance evaluation framework that more accurately captures the effect of interference.
Generating grammatically and semantically correct captions in video captioning is a challenging task. The captions generated from the existing methods are either word-by-word that do not align with grammatical structure or miss key information from the input videos. To address these issues, we introduce a novel global-local fusion network, with a Global-Local Fusion Block (GLFB) that encodes and fuses features from different parts of speech (POS) components with visual-spatial features. We use novel combinations of different POS components - 'determinant + subject', 'auxiliary verb', 'verb', and 'determinant + object' for supervision of the POS blocks - Det + Subject, Aux Verb, Verb, and Det + Object respectively. The novel global-local fusion network together with POS blocks helps align the visual features with language description to generate grammatically and semantically correct captions. Extensive qualitative and quantitative experiments on benchmark MSVD and MSRVTT datasets demonstrate that the proposed approach generates more grammatically and semantically correct captions compared to the existing methods, achieving the new state-of-the-art. Ablations on the POS blocks and the GLFB demonstrate the impact of the contributions on the proposed method.
Rate-Splitting Multiple Access (RSMA) has been recognized as a promising multiple access technique. We propose a novel architecture for downlink RSMA, namely Codeword-Segmentation RSMA (CS-RSMA). Different from conventional RSMA which splits users' messages into common and private parts before encoding, CS-RSMA encodes the users' messages directly, segments the codewords into common and private parts, and transmits the codeword segments using common and private streams. In addition to the principle of CS-RSMA, a novel performance analysis framework is proposed. This framework utilizes a recent discovery in mismatched decoding under finite-alphabet input and interference, and can better capture the receiver's complexity limits. Precoder optimization under finite alphabets and suboptimal decoders for conventional RSMA and CS-RSMA to maximize the Sum-Rate (SR) and the Max-Min Fairness (MMF) is also addressed. The numerical results reveal the theoretical performance of conventional RSMA and CS-RSMA. We observe that CS-RSMA leads to better performance than conventional RSMA in SR, and similar performance in MMF. Furthermore, a physical-layer implementation of CS-RSMA is proposed and evaluated through link-level simulations. Aside performance benefits, we also demonstrate that CS-RSMA brings significant benefits on the encoding/decoding, control signaling, and retransmission process compared to conventional RSMA.
There are no more papers matching your filters at the moment.