
Researchers at the Hong Kong Polytechnic University (PolyU) have built an AI agent that can understand hours-long videos by copying how humans think. The breakthrough could transform how artificial intelligence analyzes surveillance footage, sports broadcasts, and entertainment content. Current AI models struggle with videos longer than 15 minutes, but VideoMind processes content lasting nearly 30 minutes while using far less computing power than existing systems. The team has made its research open source and submitted its findings to leading AI conferences worldwide.
How VideoMind Mimics Human Video Understanding
VideoMind assigns four distinct roles that mirror how humans think when watching videos. The Planner coordinates the tasks needed to answer each question. The Grounder finds the relevant moments in the footage. The Verifier double-checks the information and picks the most reliable segments. Finally, the Answerer composes a response based on what the other roles have observed.
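As a rough illustration of how these roles could chain together, the Python sketch below walks a question through each stage in turn. The model object and its plan, ground, verify, and answer methods are hypothetical stand-ins for this explanation, not VideoMind's released interface.

```python
# Hypothetical role-based pipeline, loosely mirroring the Planner -> Grounder ->
# Verifier -> Answerer flow described above. The `model` object and its
# plan/ground/verify/answer methods are illustrative stand-ins only.
from dataclasses import dataclass

@dataclass
class Moment:
    start_s: float   # candidate segment start, in seconds
    end_s: float     # candidate segment end, in seconds

def answer_video_question(model, video_path: str, question: str) -> str:
    # Planner: break the question into the sub-tasks it requires.
    plan = model.plan(question)          # e.g. ["ground", "verify", "answer"]

    # Grounder: localize candidate moments that look relevant to the question.
    moments = [Moment(*m) for m in model.ground(video_path, question)]

    if "verify" in plan:
        # Verifier: re-inspect each candidate and keep the most reliable one.
        best = max(moments, key=lambda m: model.verify(video_path, question, m))
    else:
        best = moments[0]

    # Answerer: compose the final answer from the chosen segment.
    return model.answer(video_path, question, best)
```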
“Humans switch among different thinking modes when understanding videos: breaking down tasks, identifying relevant moments, revisiting these to confirm details, and synthesizing their observations into coherent answers,” said Prof. Changwen Chen, who leads the research team. The system uses a technique called Chain-of-LoRA (Chain-of-Low-Rank Adaptation), which lets a single AI model handle all four roles instead of running four separate models. This approach cuts memory and processing needs while maintaining accuracy.
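A minimal sketch of how that role switching could look in practice is shown below, using the Hugging Face PEFT library to attach one lightweight LoRA adapter per role to a single Qwen2-VL base model. The adapter paths and the specific checkpoint name are assumptions for illustration, not VideoMind's released weights.

```python
# Sketch of role switching with LoRA adapters on one shared base model,
# in the spirit of Chain-of-LoRA. Adapter paths are hypothetical.
from transformers import Qwen2VLForConditionalGeneration
from peft import PeftModel

base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto"
)

# Attach one lightweight LoRA adapter per role; the base weights load only once.
model = PeftModel.from_pretrained(base, "adapters/planner", adapter_name="planner")
model.load_adapter("adapters/grounder", adapter_name="grounder")
model.load_adapter("adapters/verifier", adapter_name="verifier")
model.load_adapter("adapters/answerer", adapter_name="answerer")

# Switching roles is just switching the active adapter, which keeps memory
# far below the cost of running four separate models.
for role in ["planner", "grounder", "verifier", "answerer"]:
    model.set_adapter(role)
    # ... run this role's prompt against the current video segment ...
```

The design choice this illustrates is the one the article describes: because each role is only a small low-rank adapter on top of shared base weights, adding a role costs little extra memory compared with deploying a separate model per role.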
VideoMind Outperforms Major AI Systems in Tests
The researchers tested VideoMind against top AI models, including GPT-4o and Gemini 1.5 Pro, across 14 benchmarks. VideoMind achieved better accuracy on challenging tasks involving videos averaging 27 minutes in length. Even the smaller version, with 2 billion parameters, matched the performance of competitors' much larger 7-billion-parameter models.
The team designed VideoMind to address a major problem in AI development: excessive power consumption. While the human brain uses just 25 watts of power for complex video understanding, supercomputers with comparable capabilities consume millions of times more energy. VideoMind’s efficient design offers a practical answer to this bottleneck. The framework builds on Qwen2-VL, an existing open-source model, making it accessible to researchers and companies without massive computing budgets. This democratizes advanced video analysis capabilities that were previously available only to tech giants with vast computing resources.
Future Applications Could Reshape Video Analysis
VideoMind represents more than just a technical achievement: it opens doors to practical applications across multiple industries. The researchers envision the technology powering intelligent surveillance systems that can spot unusual activity in real time, sports analysis tools that break down game footage automatically, and video search engines that understand content rather than relying solely on titles and tags.
Prof. Chen emphasized the broader impact: “VideoMind not only overcomes the performance limitations of AI models in video processing, but also serves as a modular, scalable and interpretable framework.” The open-source approach means developers worldwide can build upon this foundation, potentially accelerating innovation in video understanding across countless applications that haven’t been imagined yet.