Forum Review|The China-Southeast Asia Kerry Young Scientist Forum & Guanlan Large Model Summit was successfully hosted by our institute(II)

发布者：汤靖玲发布时间：2025-11-12浏览次数：29

On November 8, 2025, the China-Southeast Asia Kerry Young Scientist Forum and Guanlan Large Model Summit continued with its second-day agenda, focusing on unified multimodal systems, trustworthy evaluation, intelligent agents and system implementation, software engineering, and industry applications. Young scholars and industry experts from the Institute of Automation, Chinese Academy of Sciences, Nanjing University of Posts and Telecommunications, University of Science and Technology of China, Harbin Institute of Technology, Fudan University, Zhejiang University, Beihang University, and China Telecom shared their latest research progress. Topics ranged from how large models can see more accurately, think more clearly, and act more reliably to the synergistic evolution of toolchains, evaluation systems, and engineering workflows, showcasing a series of methodologies and practical pathways. Shan Caifeng, Deputy Dean of the School of Intelligent Science and Technology at Nanjing University, attended the forum and delivered remarks. The sessions were chaired by faculty members from the school, including Associate Professor Ji Wei and Assistant Professors Liu Jiaheng, Wang Boyan, and Zhang Zhen.

On behalf of the School of Intelligent Science and Technology, Shan Caifeng extended a warm welcome to all the attending experts and scholars. He noted that since its inauguration, the Suzhou Campus—as the youngest campus of Nanjing University—has consistently prioritized the development of emerging engineering disciplines. As one of the first four schools established there, the School of Intelligent Science and Technology focuses on cutting-edge fields such as perceptual and visual intelligence, embodied and spatial intelligence, applied and scientific intelligence, cognitive and decision-making intelligence, as well as hybrid and brain-inspired intelligence.He highlighted that the school has now gathered nearly 50 outstanding faculty members, demonstrating significant progress in talent development. He sincerely invited the participating experts to engage in deeper collaboration with the school in the future, working together to contribute to the development of a more reliable, compliant, and efficient ecosystem for large models.

Expert Presentations

Chang-Sheng Xu
Distinguished Researcher (Institute of Automation, Chinese Academy of Sciences)
Multimodal Large Models in Open-World: Research and Applications

Faced with the challenges of large models in understanding complex open-world scenarios, Xu Changsheng emphasized the need to rethink the relationship between vision and language. He pointed out that mainstream approaches often forcibly align visual information into linguistic space, resulting in significant loss of visual details. To address this, he presented Libra, a systematic solution centered on deploying a decoupled visual system atop large language models.This framework follows three core principles: First, it establishes independent visual expert modules to preserve the integrity and uniqueness of visual information. Second, it designs efficient cross-modal conditional routing, enabling vision and language to interact on demand rather than being rigidly bound. Third, it enhances the visual system's own representational capacity through unified discrete autoregressive pre-training.At the application level, to tackle common issues like data distribution shifts and scarce annotations in downstream tasks, the team introduced innovative transfer learning strategies. On one hand, they adopted a curriculum learning-based unsupervised transfer approach, allowing the model to start learning from simple samples and gradually adapt to complex scenarios. On the other hand, they developed a few-shot adaptation method combining prompt tuning and knowledge distillation, enabling efficient task adaptation while retaining pre-trained knowledge.Experiments demonstrate that this series of methods achieves superior performance in tasks such as visual grounding and detailed recognition, and exhibits robust practical capabilities across multiple public datasets and real-world scenarios like industrial quality inspection.

Bingkun Bao
Dean (School of Computer Science / School of Software / School of Cyberspace Security, Nanjing University of Posts and Telecommunications)
Video Action Understanding for Industrial Operation Scenarios

Bingkun Bao highlighted the unique challenges of video analysis in industrial settings: unlike conventional videos, industrial footage often involves frequent camera movement, complex human-machine interaction, lengthy processes, and a strong reliance on long-term temporal understanding. To address these challenges, his team developed an industrial-grade video analysis framework structured around a content localization – anomaly detection – future prediction pipeline.For content localization, the team proposed a human-like two-stage localization strategy. This approach first performs coarse-grained instance mining through feature disentanglement, followed by fine-grained boundary localization enhanced by cross-modal interaction.In anomaly detection, to tackle the subtle nature of industrial anomalies (such as incorrect part assembly or operational position deviations), the team innovatively integrated textual operation manuals as external knowledge, fusing it with video content through cross-modal techniques. This significantly improved the model's ability to identify fine-grained anomalies.For future action prediction, considering the strong intentionality and real-time requirements of industrial operations, the team introduced an intent-aware + adaptive multi-modal sampling method. This approach effectively guides prediction while substantially reducing computational overhead.This series of solutions has been validated on public datasets like MAD and in real-world factory environments.

Xiaojun Chang
Professor (University of Science and Technology of China)
Multimodal Large Models: From Cross-Modal Understanding to Generation and Reasoning

Xiaojun Chang systematically outlined a unified blueprint for multimodal intelligence that progresses from comprehension and decision-making to integrated generation. He articulated that the development of multimodal intelligence must evolve through three key phases: isolated perception → dynamic cognition → collaborative sensing.Within this framework, his team has achieved significant progress along the complete pipeline of understanding → fusion → reasoning → generation:

At the understanding level, models like HC-LLM and ProAgent have enabled more precise temporal understanding and more efficient multi-agent collaboration.
In the fusion stage, technologies such as RealignDiff, Visual RAG, and LongVLM have realized coarse-to-fine cross-modal semantic alignment and in-depth comprehension of long videos.
For reasoning and generation, the team introduced Ground-R1 reinforcement learning to construct evidential reasoning chains, developed TGS-Agent for a traceable reason-first, segment-later workflow, and utilized StoryAgent to ensure narrative consistency and logical coherence in video generation.

Collectively, these contributions map out a practical pathway toward unified multimodal intelligence that both understands and generates, advancing the field from merely seeing the world to actively participating in it.

Xiang Wang
Professor (University of Science and Technology of China)
MiniOneRec: An Open-Source Generative Recommendation Framework

Xiang Wang articulated that recommendation systems are transitioning from the traditional cascade paradigm of retrieval → coarse-ranking → fine-ranking to an end-to-end generative paradigm. While conventional approaches suffer from computational fragmentation and architectural discontinuities, generative recommendation unifies the entire process by transforming recommendation tasks into sequence generation problems.The core innovation lies in introducing Semantic ID – compressing the massive, disordered item ID space into a compact, semantically meaningful, and interpretable representation. To advance this direction, the team launched MiniOneRec, the first fully open-source generative recommendation framework. Built upon open-source large language models (e.g., Qwen2.5) and public datasets, it implements a complete pipeline spanning from recommendation request generation and candidate item generation to diversity filtering and reinforcement learning optimization.A standout feature is its comprehensive Semantic ID Toolbox (e.g., RQ-VAE), which significantly enhances ID compression efficiency and semantic fidelity. Experimental results demonstrate that MiniOneRec not only validates scaling laws in recommendation – where larger models yield better performance – but also exhibits superior capabilities in cross-domain generalization and cold-start scenarios, providing a solid foundation for academic research and industrial replication.

Jun Yu
Professor (Harbin Institute of Technology, Shenzhen)
Multimodal Perception and Autonomous Decision-Making for Low-Altitude Embodied Intelligent Agents

Jun Yu emphasized that for the burgeoning trillion-dollar low-altitude economy, embodied intelligent agents must overcome three core challenges: clear perception, sound reasoning, and rapid response.At the perception level, to overcome the limitations of traditional sensors in power consumption and weight, the team focuses on lightweight monocular/video depth estimation. By leveraging simulation-to-real transfer learning and geometric consistency modeling, they have developed an efficient embodied visual foundation model that demonstrates exceptional performance in detecting subtle obstacles such as wires.In the realm of cognitive reasoning, addressing the gap between the deployability limitations of large models and the capability shortcomings of small models, the talk proposed a collaborative fast-slow architecture: large models for high-level planning, small models for action mapping. This approach incorporates a working memory mechanism to enable the understanding of long instructions and complex task planning.For high-speed autonomous decision-making, to tackle the limitations of traditional reinforcement/imitation learning in high-velocity scenarios, the team pioneered a physically differentiable decision-making method. By integrating physical constraints (such as collision avoidance and motion smoothness) into a differentiable trajectory optimization process, combined with dynamic time allocation, this approach significantly enhances the success rate and stability of high-speed obstacle avoidance for drones while ensuring safety.

Yixin Cao
Professor (Fudan University)
Model Utility Law: Towards Generalizable Evaluation

Confronting the fundamental tension between the near-limitless capabilities of large models and the constraints of finite evaluation datasets, Yixin Cao introduced a novel paradigm of Generalizable Evaluation, aimed at predicting a model's unobserved capabilities from limited test sets. To this end, she presented a core evaluation framework: the Model Utility Law and its corresponding Model Utility Index (MUI).This law posits that evaluating a model should consider not only its task performance but also the effort required to achieve that performance. The MUI quantifies this effort by measuring the ratio of activated capabilities needed for the task to the model's total capabilities, enabling a deeper diagnosis of the model's comprehensive abilities.Based on the MUI, the report categorizes the model optimization process into four distinct trajectories: evolution, accumulation, coarsening, and collapse, providing a clear diagnostic lens for analyzing training dynamics.Furthermore, the framework enables two key applications: first, the PUR metric—defined as the ratio of performance to MUI—allows for more stable and comprehensive model comparisons; second, an evaluation set sampling strategy that uses MUI as a diversity metric can significantly reduce the required evaluation scale while maintaining high consistency with full-set evaluation results. Together, these contributions offer a new pathway toward an efficient, robust, and generalizable evaluation system.

Ningyu Zhang
Associate Professor (Zhejiang University)
Knowledge-Driven Large Language Model Agents

Ningyu Zhang identified the core challenges in deploying large language model agents as knowledge gaps and deficiencies in interactive common sense. To address these, he proposed a three-step, knowledge-driven solution:Step 1: Knowledge Injection via High-Quality Data.To tackle the sparsity of agent trajectory data, the team leveraged human procedural priors to synthesize and filter high-quality trajectories (e.g., the KnowPrompt project). Experiments confirm this method significantly enhances agents' planning and execution capabilities.Step 2: Teaching Agents to Recognize Knowledge Boundaries.Through training frameworks like KnowSelf and KnowRL, agents learn to autonomously decide when to answer directly, when to engage in deep chain-of-thought reasoning, and when to utilize external tools (retrieval/API calls). This leads to more efficient and reliable decision-making.Step 3: Building Lightweight, Pluggable Memory Systems.Addressing the inefficiency of existing memory modules, the team developed systems such as LightMem, which incorporates mechanisms for active filtering, topic segmentation, and offline updates. This significantly reduces token consumption and API call overhead while maintaining global information consistency.This series of solutions has been successfully applied in complex scenarios including data science and underwater navigation (e.g., OceanGPT).

Jian Yang
Associate Professor (Beihang University)
Code Foundation Models and Code Intelligence Agents

Jian Yang observed that code large models are evolving from competing on pure model capability toward building more complex engineering system capabilities. The presentation systematically reviewed the technical evolution of code models, from encoder-only architectures, to GPT-like generative pre-training, and further to the current era of high-quality fine-tuning and open-source proliferation.At the core technical level, the talk emphasized the importance of evaluation through usage, combining the Pass@k metric with unit testing. For long-sequence code tasks, key optimization techniques such as Fill-in-the-Middle (FIM), NTP/MTP parallel decoding, and diffusion-based generation were introduced.Yang identified two major trends in the development of code intelligence agents: cross-lingual knowledge transfer and multi-agent collaboration. The former aims to leverage knowledge from dominant languages like Python to enhance model performance in other languages (e.g., Verilog, Rust). The latter focuses on building a closed-loop system of Agent × Reinforcement Learning × Environmental Feedback, where diverse agents assume roles such as product manager, architect, programmer, and test engineer. Through collaborative work and continuous iteration, this system can systematically accomplish complex software engineering tasks—from code completion and test generation to repository-level operations—ultimately achieving the leap from writing code to developing software.

Jie Zhang
Senior Algorithm Expert (AI Research Institute, China Telecom)
Large Model-based Table Understanding: Exploration and Practice

Facing the three core challenges of oversized tables, multi-table relationships, and complex reporting in enterprise applications, Zhang Jie presented a comprehensive technical solution. Traditional approaches—whether prone to high hallucination rates in large models or inefficient in engineering methods—struggle to meet industrial-grade demands. To address this, the team established an end-to-end practice covering data, models, and applications.In data and evaluation, the team not only open-sourced a challenging table question-answering benchmark (T2R-Bench) with innovative annotations of the model's reasoning process, but also constructed an enterprise-grade dataset to tackle the severe shortage of high-quality multi-table relational data in industry.At the model level, the team introduced the industry's first reinforcement learning-based table understanding model. By adopting a select region first, then find answer approach and incorporating dual rewards for both region + answer, the model's ability to parse table structures was significantly enhanced. Additionally, an innovative code similarity reward in reinforcement learning was applied to improve the accuracy of code generation tasks.This series of technologies has been successfully implemented in self-developed intelligent office assistants and cash flow analysis tools, enabling end-to-end capabilities—from automatic function formula generation and multi-dimensional comparative analysis to one-click professional report generation. It provides a powerful toolset for tackling complex data analysis challenges in real-world business scenarios.

The forum concluded successfully. Participating experts widely agreed that unified paradigms, trustworthy evaluation, and software-hardware co-design will be the main threads guiding large models toward becoming more reliable, compliant, and efficient. The coordinated evolution from model capabilities to system capabilities is fostering more deployable, reproducible, and open ecosystems and solutions.