In image-assisted minimally invasive surgeries (MIS), understanding surgical
scenes is vital for real-time feedback to surgeons, skill evaluation, and
improving outcomes through collaborative human-robot procedures. Within this
context, the challenge lies in accurately segmenting surgical instruments,
detecting and labeling each instrument, and estimating depth from
high-resolution images, while simultaneously reconstructing the scene in 3D. To
address this challenge, a novel Multi-Task Learning (MTL) network is proposed
for performing these tasks concurrently. A key aspect of this approach involves
overcoming the optimization hurdles associated with handling multiple tasks
concurrently by integrating an Adversarial Weight Update into the MTL framework.
The proposed MTL model achieves 3D reconstruction through the integration of
segmentation, depth estimation, and object detection, thereby enhancing the
understanding of surgical scenes. This marks a significant advancement
compared to existing studies that lack 3D capabilities. Comprehensive
experiments on the EndoVis2018 benchmark dataset show that the model handles
all three tasks efficiently, demonstrating the efficacy of the proposed
techniques.