Recent advances in natural-language-guided navigation have combined multimodal large language models with robotic tasks, benefiting autonomous vehicles, ground robots, and drones. However, research on aerial drone navigation remains underdeveloped: existing datasets suffer from unrealistic environments and limited scalability.
To address these issues, we introduce the Nezha Platform and the Nezha Dataset for aerial vision-and-language navigation (VLN). Built on Unreal Engine 4 and Microsoft AirSim, the Nezha Platform offers enhanced realism, versatile control, and efficient data-collection pipelines. The Nezha Dataset covers 14 diverse environments and 90 target objects, providing high-precision waypoint trajectories and multimodal sensor data. Our task requires agents to navigate using descriptions of the end-point environment and its orientation.
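The platform's actual data pipeline is not detailed here, but a minimal sketch of how waypoint trajectories and multimodal sensor frames can be collected through Microsoft AirSim's Python API is shown below; the waypoint list, camera name, and logged fields are illustrative assumptions, not the platform's real interface.

```python
# Minimal sketch (assumptions): record RGB/depth frames and drone poses while a
# simulated multirotor follows a hand-picked waypoint trajectory via AirSim.
import airsim

client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)
client.takeoffAsync().join()

# Hypothetical waypoints (x, y, z in NED metres); real trajectories would come
# from the platform's own collection and annotation tooling.
waypoints = [(10, 0, -5), (10, 10, -5), (0, 10, -8)]
log = []

for x, y, z in waypoints:
    client.moveToPositionAsync(x, y, z, velocity=3).join()

    # Multimodal observation: an RGB scene image plus a depth image.
    responses = client.simGetImages([
        airsim.ImageRequest("front_center", airsim.ImageType.Scene),
        airsim.ImageRequest("front_center", airsim.ImageType.DepthPerspective,
                            pixels_as_float=True),
    ])
    state = client.getMultirotorState()
    log.append({
        "position": state.kinematics_estimated.position,
        "orientation": state.kinematics_estimated.orientation,
        "rgb": responses[0].image_data_uint8,
        "depth": responses[1].image_data_float,
    })

client.landAsync().join()
client.armDisarm(False)
client.enableApiControl(False)
```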
Our contributions are threefold: (1) a scalable platform for efficient data collection, (2) a comprehensive benchmark with extensive evaluation, and (3) a collaborative large-small model approach for accurate and efficient navigation (see the sketch below). Our dataset, with 10K trajectories and 96 target objects, significantly advances aerial VLN research.
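The collaborative large-small model design is only named above; one plausible reading, sketched here under explicit assumptions, is that a large multimodal language model occasionally proposes high-level sub-goals from the instruction and current view, while a small on-board policy converts each sub-goal into low-level flight actions at every step. All class and method names (`LargePlanner`, `SmallController`, `propose_subgoal`, `step`) and the `env` interface are hypothetical placeholders, not the paper's actual components.

```python
# Hedged sketch of a large-small model collaboration loop for aerial VLN.
# Names below are illustrative placeholders, not the paper's actual API.
from dataclasses import dataclass

@dataclass
class Subgoal:
    description: str      # e.g. "fly toward the red-roofed building"
    heading_deg: float    # coarse direction suggested by the large model
    distance_m: float     # coarse distance suggested by the large model

class LargePlanner:
    """Large multimodal LLM: slow, queried sparsely for high-level planning."""
    def propose_subgoal(self, instruction: str, observation) -> Subgoal:
        # In practice this would prompt an MLLM with the instruction and the
        # current camera frame; here it returns a fixed dummy sub-goal.
        return Subgoal("move toward the described end-point area", 0.0, 10.0)

class SmallController:
    """Small lightweight policy: fast, called every step for low-level control."""
    def step(self, subgoal: Subgoal, observation) -> str:
        # Map the sub-goal and observation to one discrete action,
        # e.g. forward / turn-left / turn-right / ascend / stop.
        return "forward"

def navigate(instruction: str, env, max_steps: int = 200) -> None:
    planner, controller = LargePlanner(), SmallController()
    obs = env.reset()                                  # hypothetical env API
    subgoal = planner.propose_subgoal(instruction, obs)
    for t in range(max_steps):
        action = controller.step(subgoal, obs)
        obs, done = env.execute(action)                # hypothetical env API
        if done:
            break
        if t % 20 == 19:  # re-plan occasionally to keep LLM calls cheap
            subgoal = planner.propose_subgoal(instruction, obs)
```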