Research Highlights of Our College (XX): A Self-Evolving Multimodal Jailbreaking Framework for Image-to-Video Generation Models

Posted by: Tang Jingling (汤靖玲) | Published: 2026-03-26 | Views: 82

Recently, Professor Shan Caifeng and Assistant Professor Lyu Yueming of the School of Intelligence Science and Technology at Nanjing University, in collaboration with Meituan, Shanghai Jiao Tong University, and other institutions, proposed RunawayEvil, the first self-evolving multimodal jailbreaking framework for Image-to-Video (I2V) generation models. The work has been accepted to CVPR 2026.

Project Page: https://xzxg001.github.io/RunawayEvil/

Paper: https://arxiv.org/pdf/2512.06674

Code: https://github.com/DeepSota/RunawayEvil

The framework is intended as a security assessment tool that iterates automatically and transfers across scenarios. It targets the cross-modal input mechanism and the spatiotemporal generation characteristics of I2V models, and it addresses three bottlenecks that limit existing jailbreaking methods in I2V settings: single-modal perturbations are ineffective, static templates are inflexible, and cross-modal coordination is difficult to model.

Figure 1: Overall Workflow of the RunawayEvil Framework and Illustration of the Strategy-Tactic-Action Paradigm

The research team proposes a systematic attack paradigm built around Strategy-Tactic-Action, which organizes the jailbreaking process into a closed-loop decision chain that is reusable and self-evolving. The framework consists of three core modules: the Strategic Awareness Command Unit (SACU), the Multimodal Tactical Planning Unit (MTPU), and the Tactical Action Unit (TAU). A two-stage pipeline connects strategy learning to execution feedback. In the evolution stage, SACU is trained and its strategy set is expanded, so that it no longer depends on manually crafted templates and learns to select a suitable strategy for each input. In the execution stage, SACU outputs a strategy, MTPU turns it into coordinated tactical instructions for both the image side and the text side, and TAU performs iterative editing and security assessment. Successful cases are written back to the memory bank to support subsequent evolution.
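The control flow of the execution stage can be sketched as a small evaluation harness. This is a minimal illustration only, not the authors' implementation: the function name `execution_stage`, the three callables standing in for SACU, MTPU, and TAU, the `max_rounds` parameter, and the dictionary layout of the memory entries are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, str]    # hypothetical input record, e.g. {"image": ..., "text": ...}
Tactics = Tuple[str, str]  # (image-side instruction, text-side instruction)


def execution_stage(
    sample: Sample,
    memory: List[dict],
    select_strategy: Callable[[Sample, List[dict]], str],  # stand-in for SACU
    plan_tactics: Callable[[str, Sample], Tactics],         # stand-in for MTPU
    act_and_assess: Callable[[Sample, Tactics], bool],      # stand-in for TAU
    max_rounds: int = 3,                                    # assumed iteration budget
) -> bool:
    """Run strategy -> tactic -> action for up to max_rounds rounds.

    Returns True as soon as one round is judged successful; that round is
    appended to `memory` so later strategy selection can draw on it.
    """
    for _ in range(max_rounds):
        strategy = select_strategy(sample, memory)
        tactics = plan_tactics(strategy, sample)
        if act_and_assess(sample, tactics):
            memory.append(
                {"sample": sample, "strategy": strategy, "tactics": tactics}
            )
            return True
    return False
```

Passing the three units in as callables keeps the loop independent of any particular model or evaluator, so the same harness can be reused when a component is swapped out.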

Figure 2: Schematic of the SACU Self-Evolution Framework

In terms of concrete design, SACU serves as the decision-making core of the framework and comprises three components. First, the Strategy Customization Agent uses reinforcement learning to cast strategy selection as an optimizable decision problem. Its objective covers jailbreaking success as well as stealthiness, namely how suspicious the text input appears and how perceptible the modifications to the image are. Second, the Strategy Exploration Agent automatically generates new strategies from historically successful examples, which alleviates the rigidity and limited coverage of a fixed strategy library. Third, the Strategy Memory Bank stores successful cases in a structured form (image-text inputs, editing instructions, video prompts, and the strategies adopted), providing reusable experience for subsequent retrieval and generation.
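As a rough illustration of two of these ideas, the sketch below shows a structured memory record with the four fields listed above, and one possible way to fold success and stealthiness into a single scalar reward. The article does not give the actual reward formula, so the linear weighting, the weights `w_text` and `w_image`, the assumption that both scores lie in [0, 1], and all class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SuccessCase:
    """One stored case; field names are illustrative, not from the paper."""
    image_ref: str         # reference to the input image
    input_text: str        # original text input
    edit_instruction: str  # image-side editing instruction
    video_prompt: str      # text-side video prompt
    strategy: str          # strategy that was adopted


@dataclass
class StrategyMemoryBank:
    cases: List[SuccessCase] = field(default_factory=list)

    def add(self, case: SuccessCase) -> None:
        self.cases.append(case)

    def by_strategy(self, strategy: str) -> List[SuccessCase]:
        """Retrieve every stored case that used the given strategy."""
        return [c for c in self.cases if c.strategy == strategy]

    def strategies(self) -> List[str]:
        """Sorted list of distinct strategies seen so far."""
        return sorted({c.strategy for c in self.cases})


def reward(
    success: bool,
    text_suspicion: float,        # assumed in [0, 1], higher = more suspicious
    image_perceptibility: float,  # assumed in [0, 1], higher = more visible edit
    w_text: float = 0.5,          # assumed weight
    w_image: float = 0.5,         # assumed weight
) -> float:
    """Assumed linear trade-off: success minus weighted stealth penalties."""
    return float(success) - w_text * text_suspicion - w_image * image_perceptibility
```

A linear penalty is the simplest choice for combining the objectives; a real system might use a learned or thresholded stealth term instead.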

Experimental results show that RunawayEvil significantly improves the jailbreaking success rate on multiple mainstream open-source and commercial I2V models, and that it keeps a consistent advantage under different safety evaluators. Compared with single-modal and static-template approaches, the framework is more robust in attack effectiveness and generalizes better across scenarios, which supports vulnerability analysis and risk assessment of video generation systems. Future work will adapt the framework to more I2V architectures and real-world scenarios, contributing to more comprehensive security systems for video generation.