People have long desired an embodied agent that can perform tasks by
understanding natural language instructions. Moreover, they also want to
monitor the agent and expect it to understand commands the way they intended.
However, how to build such an embodied agent remains unclear. Recently, the
Vision-and-Language Interaction benchmark ALFRED has made it possible to
explore this problem; it requires an agent to perform complicated daily
household tasks in unseen scenes by following natural language instructions.
In this paper, we propose LEBP
-- Language Expectation and Binding Policy Module -- to tackle ALFRED. LEBP
contains a two-stream process: 1) It first runs a language expectation module
to generate an expectation that describes how to perform the task, based on
its understanding of the language instruction. The expectation consists of a
sequence of sub-steps for the task (e.g., pick up an apple). It allows people
to access and check how the instructions were understood before the agent
takes actual actions, in case the task might go wrong. 2) It then uses a
binding policy module to bind the sub-steps in the expectation to actual
actions in specific scenarios. Actual actions include navigation and object
manipulation.
Experimental results suggest that our approach achieves performance comparable
to currently published SOTA methods and avoids a large performance drop from
seen to unseen scenarios.