Despite many advances in deep-learning-based semantic segmentation, a
performance drop due to distribution mismatch is often encountered in the real
world. Recently, a few domain adaptation and active learning approaches have
been proposed to mitigate the performance drop. However, very little
attention has been paid to leveraging the information in videos, which are
naturally captured by most camera systems. In this work, we propose to
leverage the "motion prior" in videos to improve human segmentation in a
weakly-supervised active learning setting. By estimating optical flow
between video frames, we extract candidate foreground motion segments
(referred to as the motion prior) that potentially correspond to human
segments. We propose to learn a memory-network-based policy model via
reinforcement learning to select strong candidate segments (referred to as
the strong motion prior). The selected segments have high precision and are
used directly to fine-tune the segmentation model. On a newly collected
surveillance-camera dataset and the publicly available UrbanStreet dataset,
our method improves human segmentation performance across multiple scenes
and modalities (i.e., RGB to infrared (IR)). Last but not least, our method
is empirically complementary to existing domain adaptation approaches:
combining our weakly-supervised active learning with domain adaptation
yields additional performance gains.
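
As a rough illustration of the motion-prior extraction step described
above (a minimal sketch, not the paper's exact pipeline): assuming
Farneback dense optical flow from OpenCV, and treating the flow-magnitude
threshold and minimum segment area as illustrative hyperparameters,
candidate foreground motion segments can be obtained as connected regions
of large motion between consecutive frames.

import cv2
import numpy as np

def candidate_motion_segments(prev_frame, next_frame,
                              mag_thresh=2.0, min_area=500):
    # Dense optical flow between two consecutive frames.
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Pixels whose motion magnitude exceeds the (illustrative) threshold
    # are treated as moving foreground.
    moving = (np.linalg.norm(flow, axis=-1) > mag_thresh).astype(np.uint8)
    # Each sufficiently large connected component is one candidate
    # foreground motion segment (the "motion prior").
    n_labels, labels = cv2.connectedComponents(moving)
    return [labels == i for i in range(1, n_labels)
            if np.count_nonzero(labels == i) >= min_area]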