Recently, Multimodal Large Language Models (MLLMs) have been used as agents
to control keyboard and mouse inputs by directly perceiving the Graphical User
Interface (GUI) and generating corresponding commands. However, current agents
primarily demonstrate strong understanding capabilities in static environments
and are mainly applied to relatively simple domains, such as Web or mobile
interfaces. We argue that a robust GUI agent should be capable of perceiving
temporal information on the GUI, including dynamic Web content and multi-step
tasks. Additionally, it should possess a comprehensive understanding of various
GUI scenarios, including desktop software and multi-window interactions. To
this end, this paper introduces a new dataset, termed GUI-World, which features
meticulously crafted Human-MLLM annotations, extensively covering six GUI
scenarios and eight types of GUI-oriented questions in three formats. We
evaluate the capabilities of current state-of-the-art MLLMs, including Image
LLMs and Video LLMs, in understanding various types of GUI content, especially
dynamic and sequential content. Our findings reveal that current models
struggle with dynamic GUI content without manually annotated keyframes or
operation history. Meanwhile, Video LLMs fall short in all GUI-oriented
tasks, given the scarcity of GUI video data. Therefore, we take the initial step
of leveraging a fine-tuned Video LLM, GUI-Vid, as a GUI-oriented assistant,
demonstrating an improved understanding of various GUI tasks. However, due to
the limited performance of base LLMs, we conclude that using Video
LLMs as GUI agents remains a significant challenge. We believe our work
provides valuable insights for future research in dynamic GUI content
understanding. The dataset and code are publicly available at:
this https URL