Learning deeper models is usually a simple and effective way to improve model
performance, but deeper models have more parameters and are harder to train.
Simply stacking more layers seems a natural way to deepen a model, but previous
work has claimed that naive stacking does not benefit the model. We propose to
train a deeper model with a recurrent mechanism that loops the encoder and
decoder blocks of the Transformer in the depth direction. To keep the number of
parameters from growing, we share parameters across the recursive steps. We
conduct experiments on the WMT16 English-to-German and WMT14 English-to-French
translation tasks: our model outperforms the shallow Transformer-Base/Big
baselines by 0.35 and 1.45 BLEU points, respectively, while using only 27.23%
of the Transformer-Big parameters. Compared with the deep Transformer (20-layer
encoder, 6-layer decoder), our model achieves similar performance and inference
speed with only 54.72% of its parameters.
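
To illustrate the depth-recurrent idea with parameter sharing, below is a minimal sketch of an encoder that reuses a single shared block of Transformer layers over several recursive steps. It assumes PyTorch's `nn.TransformerEncoderLayer`; the class and parameter names (`RecurrentEncoder`, `block_layers`, `num_steps`) and the hyperparameter values are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn


class RecurrentEncoder(nn.Module):
    """Sketch of a depth-recurrent encoder with shared parameters.

    A single block of `block_layers` encoder layers is applied `num_steps`
    times in the depth direction, so the effective depth is
    block_layers * num_steps while the parameter count stays that of one block.
    """

    def __init__(self, d_model=512, nhead=8, dim_feedforward=2048,
                 block_layers=6, num_steps=4, dropout=0.1):
        super().__init__()
        # One shared block of layers; its weights are reused at every step.
        self.block = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward,
                                       dropout, batch_first=True)
            for _ in range(block_layers)
        ])
        self.num_steps = num_steps

    def forward(self, x, src_key_padding_mask=None):
        # Loop the same block in the depth direction; parameters are shared
        # across all recursive steps.
        for _ in range(self.num_steps):
            for layer in self.block:
                x = layer(x, src_key_padding_mask=src_key_padding_mask)
        return x


if __name__ == "__main__":
    enc = RecurrentEncoder(block_layers=6, num_steps=4)  # 24 effective layers
    src = torch.randn(2, 10, 512)                        # (batch, seq, d_model)
    out = enc(src)
    print(out.shape)                                     # torch.Size([2, 10, 512])
```

Because the loop reuses the same layers, the effective depth can be increased without adding parameters; the same scheme can be applied to the decoder blocks.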