Authors:
(1) Yuwei Guo, The Chinese University of Hong Kong;
(2) Ceyuan Yang, Shanghai Artificial Intelligence Laboratory (Corresponding Author);
(3) Anyi Rao, Stanford University;
(4) Zhengyang Liang, Shanghai Artificial Intelligence Laboratory;
(5) Yaohui Wang, Shanghai Artificial Intelligence Laboratory;
(6) Yu Qiao, Shanghai Artificial Intelligence Laboratory;
(7) Maneesh Agrawala, Stanford University;
(8) Dahua Lin, Shanghai Artificial Intelligence Laboratory;
(9) Bo Dai, The Chinese University of Hong Kong.
Text-to-image diffusion models. Diffusion models (Ho et al., 2020; Dhariwal & Nichol, 2021; Song et al., 2020) for text-to-image (T2I) generation (Gu et al., 2022; Mokady et al., 2023; Podell et al., 2023; Ding et al., 2021; Zhou et al., 2022b; Ramesh et al., 2021; Li et al., 2022) have recently gained significant attention in both academic and non-academic communities. GLIDE (Nichol et al., 2021) introduces text conditions and demonstrates that classifier guidance leads to more visually pleasing results. DALL-E 2 (Ramesh et al., 2022) improves text-image alignment by leveraging the CLIP (Radford et al., 2021) joint feature space. Imagen (Saharia et al., 2022) incorporates a large language model (Raffel et al., 2020) and a cascade architecture to achieve photorealistic results. Latent Diffusion Model (Rombach et al., 2022), also known as Stable Diffusion, moves the diffusion process to the latent space of an auto-encoder to improve efficiency. eDiff-I (Balaji et al., 2022) employs an ensemble of diffusion models, each specialized for a different generation stage.
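To make the latent-space formulation concrete, the minimal PyTorch sketch below illustrates the core idea behind Latent Diffusion / Stable Diffusion: the denoiser is trained on autoencoder latents rather than on pixels. All module names, shapes, and the (omitted) noise schedule are illustrative assumptions, not the actual Stable Diffusion implementation.

```python
# Conceptual sketch only: diffusion runs in the latent space of an autoencoder.
import torch
import torch.nn as nn

class ToyAutoencoder(nn.Module):
    """Placeholder for the pre-trained autoencoder mapping images <-> latents."""
    def __init__(self, in_ch: int = 3, latent_ch: int = 4):
        super().__init__()
        # 8x spatial downsampling, loosely mirroring a VAE downsampling factor.
        self.encoder = nn.Conv2d(in_ch, latent_ch, kernel_size=8, stride=8)
        self.decoder = nn.ConvTranspose2d(latent_ch, in_ch, kernel_size=8, stride=8)

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

class ToyDenoiser(nn.Module):
    """Placeholder for the text-conditioned U-Net that predicts noise on latents."""
    def __init__(self, latent_ch: int = 4):
        super().__init__()
        self.net = nn.Conv2d(latent_ch, latent_ch, kernel_size=3, padding=1)

    def forward(self, z_t, t, text_emb=None):
        # A real denoiser would also condition on the timestep t and text embedding.
        return self.net(z_t)

autoencoder, denoiser = ToyAutoencoder(), ToyDenoiser()
image = torch.randn(1, 3, 256, 256)

# Training step: noise is added to the *latent* of the image, not its pixels,
# and the denoiser regresses that noise (noise schedule omitted for brevity).
z0 = autoencoder.encode(image)
noise = torch.randn_like(z0)
t = torch.randint(0, 1000, (1,))
loss = ((denoiser(z0 + noise, t) - noise) ** 2).mean()

# Sampling would iteratively denoise a random latent and finally decode it:
# image = autoencoder.decode(denoised_latent)
```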
Personalizing T2I models. To facilitate content creation with pre-trained T2I models, many works focus on efficient model personalization (Shi et al., 2023; Lu et al., 2023; Dong et al., 2022; Kumari et al., 2023), i.e., introducing new concepts or styles into the base T2I using reference images. The most straightforward approach is full fine-tuning of the model. Although this can significantly enhance overall quality, it risks catastrophic forgetting (Kirkpatrick et al., 2017; French, 1999) when the reference image set is small. DreamBooth (Ruiz et al., 2023) mitigates this by fine-tuning the entire network with a preservation loss, using only a few images. Textual Inversion (Gal et al., 2022) optimizes a token embedding for each new concept. Low-Rank Adaptation (LoRA) (Hu et al., 2021) eases the above fine-tuning process by introducing additional LoRA layers into the base T2I and optimizing only the weight residuals. There are also encoder-based approaches that address the personalization problem (Gal et al., 2023; Jia et al., 2023). In our work, we focus on tuning-based methods, including full fine-tuning, DreamBooth (Ruiz et al., 2023), and LoRA (Hu et al., 2021), as they preserve the original feature space of the base T2I.
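To make the weight-residual idea concrete, here is a minimal LoRA-style sketch: the original weights stay frozen and only a low-rank residual is optimized. The host layer, rank, and scaling below are illustrative assumptions; in practice LoRA layers typically wrap the attention projection layers of the base T2I.

```python
# Illustrative LoRA sketch: frozen weight W plus a trainable low-rank residual B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the original T2I weights
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.zeros_(self.up.weight)        # residual starts at zero: no change at init
        self.scale = alpha / rank

    def forward(self, x):
        # Output = frozen path + scaled low-rank residual path.
        return self.base(x) + self.scale * self.up(self.down(x))

layer = LoRALinear(nn.Linear(320, 320))
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)  # only ['down.weight', 'up.weight'] would be optimized
```

Because the frozen path is untouched and the residual is initialized to zero, the personalized model starts from exactly the base T2I's feature space, which is the property our work relies on.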
Animating personalized T2Is. Few existing works address animating personalized T2Is. Text2Cinemagraph (Mahapatra et al., 2023) proposes to generate cinemagraphs via flow prediction. In the field of video generation, it is common to extend a pre-trained T2I with temporal structures. However, existing works (Esser et al., 2023; Zhou et al., 2022a; Singer et al., 2022; Ho et al., 2022b,a; Ruan et al., 2023; Luo et al., 2023; Yin et al., 2023b,a; Wang et al., 2023b; Hong et al., 2022) mostly update all parameters, which modifies the feature space of the original T2I and makes them incompatible with personalized ones. Align-Your-Latents (Blattmann et al., 2023) shows that the frozen image layers in a general video generator can be personalized. Recently, some video generation approaches have shown promising results in animating personalized T2I models. Tune-a-Video (Wu et al., 2023) fine-tunes a small number of parameters on a single video. Text2Video-Zero (Khachatryan et al., 2023) introduces a training-free method that animates a pre-trained T2I via latent warping based on a pre-defined affine matrix.
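As a rough illustration of the latent-warping idea, the sketch below translates a latent map with a pre-defined affine matrix to derive subsequent frames. The motion magnitudes and frame count are assumptions; the actual Text2Video-Zero pipeline additionally applies cross-frame attention and performs the warping within the diffusion sampling process.

```python
# Minimal sketch of warping a latent with a pre-defined affine (translation) matrix.
import torch
import torch.nn.functional as F

def warp_latent(latent: torch.Tensor, dx: float, dy: float) -> torch.Tensor:
    """Apply a global translation (dx, dy in normalized coordinates) to a latent map."""
    n = latent.shape[0]
    theta = torch.tensor([[1.0, 0.0, dx],
                          [0.0, 1.0, dy]]).repeat(n, 1, 1)   # batch of 2x3 affine matrices
    grid = F.affine_grid(theta, latent.shape, align_corners=False)
    return F.grid_sample(latent, grid, padding_mode="reflection", align_corners=False)

# Start from one noisy latent and derive the following frames by warping it,
# so that all frames share a consistent global motion before denoising.
base_latent = torch.randn(1, 4, 64, 64)
frames = [warp_latent(base_latent, dx=0.02 * k, dy=0.0) for k in range(8)]
```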