Automatic dialogue coherence evaluation has attracted increasing attention
and is crucial for developing promising dialogue systems. However, existing
metrics have two major limitations: (a) they are mostly trained in a simplified
two-level setting (coherent vs. incoherent), whereas humans assign Likert-type
multi-level coherence scores, a property we refer to as "quantifiable";
(b) their predicted coherence scores fail to align with actual human rating
standards because no human guidance is provided during training. To address
these limitations, we
propose Quantifiable Dialogue Coherence Evaluation (QuantiDCE), a novel
framework aiming to train a quantifiable dialogue coherence metric that can
reflect the actual human rating standards. Specifically, QuantiDCE includes two
training stages, Multi-Level Ranking (MLR) pre-training and Knowledge
Distillation (KD) fine-tuning. During MLR pre-training, a new MLR loss is
proposed to enable the model to learn a coarse judgement of coherence degrees.
Then, during KD fine-tuning, the pre-trained model is further fine-tuned to
learn the actual human rating standards with only a small amount of
human-annotated data. To maintain generalizability even with such limited
fine-tuning data, a novel KD regularization is introduced to retain the
knowledge learned at the pre-training stage. Experimental results show that
the model trained by QuantiDCE presents stronger correlations with human
judgements than other
state-of-the-art metrics.
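
The abstract does not specify the exact form of the MLR loss or the KD
regularization. Purely as an illustration of the two-stage idea, the sketch
below pairs a margin-based multi-level ranking loss for pre-training with an
MSE-based distillation regularizer for fine-tuning; the names
(`MultiLevelRankingLoss`, `kd_finetune_loss`), the margin hyperparameter, and
the MSE formulations are assumptions for this sketch, not the paper's actual
objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiLevelRankingLoss(nn.Module):
    """Illustrative multi-level ranking loss (an assumed form, not the paper's exact MLR loss).

    Encodes the intuition that a context-response pair annotated with a higher
    coherence level should receive a higher predicted score, by at least a
    fixed margin, than any pair annotated with a lower level.
    """

    def __init__(self, margin: float = 0.1):
        super().__init__()
        self.margin = margin

    def forward(self, scores: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
        # scores: (batch,) predicted coherence scores
        # levels: (batch,) integer coherence levels; higher means more coherent
        level_diff = levels.unsqueeze(1) - levels.unsqueeze(0)  # (batch, batch)
        score_diff = scores.unsqueeze(1) - scores.unsqueeze(0)  # (batch, batch)
        higher = (level_diff > 0).float()        # pairs (i, j) with level_i > level_j
        hinge = F.relu(self.margin - score_diff)  # penalize score_i - score_j < margin
        return (hinge * higher).sum() / higher.sum().clamp(min=1.0)


def kd_finetune_loss(student_scores: torch.Tensor,
                     teacher_scores: torch.Tensor,
                     human_scores: torch.Tensor,
                     kd_weight: float = 1.0) -> torch.Tensor:
    """Fine-tuning objective with a KD regularizer (hypothetical MSE formulation).

    The task term fits the few human-annotated scores, while the KD term keeps
    the fine-tuned model's predictions close to those of a frozen copy of the
    MLR pre-trained model, retaining the knowledge from the first stage.
    """
    task_loss = F.mse_loss(student_scores, human_scores)
    kd_loss = F.mse_loss(student_scores, teacher_scores.detach())
    return task_loss + kd_weight * kd_loss
```

In this sketch, `teacher_scores` would come from a frozen copy of the
MLR pre-trained metric evaluated on the same batch, so only the fine-tuned
(student) model receives gradient updates during the KD fine-tuning stage.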