Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion

Inhwa Han*, Serin Yang*, Taesung Kwon, Jong Chul Ye
*Equal contribution
Korea Advanced Institute of Science and Technology (KAIST)

Abstract

Diffusion models have shown superior performance in image generation and manipulation, but the inherent stochasticity presents challenges in preserving and manipulating image content and identity. While previous approaches like DreamBooth and Textual Inversion have proposed model or latent representation personalization to maintain the content, their reliance on multiple reference images and complex training limits their practicality. In this paper, we present a simple yet highly effective approach to personalization using a highly personalized (HiPer) text embedding, obtained by decomposing the CLIP embedding space into parts responsible for personalization and for content manipulation. Our method does not require model fine-tuning or identifiers, yet still enables manipulation of background, texture, and motion with just a single image and a target text. Through experiments on diverse target texts, we demonstrate that our approach produces highly personalized and complex semantic image edits across a wide range of tasks. We believe that the novel understanding of the text embedding space presented in this work has the potential to inspire further research across various tasks.

HiPer manipulates a single image using a personalized text embedding and a target text prompt. After training, the personalized embedding replaces the tail of the target text embedding during inference. The personalized embedding vector is additionally multiplied by a scaling factor of 0.8. Our method requires only a single image and about 3 minutes of training, together with a target text prompt at inference.
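To make the inference-time swap concrete, below is a minimal sketch of how the trained personalized embedding can replace the tail of the target-prompt embedding, assuming Stable Diffusion's CLIP text encoder yields a (1, 77, 768) token embedding and that n trailing token embeddings were optimized on the source image. Helper names such as encode_prompt and hiper_emb are illustrative assumptions, not the authors' exact API.

```python
import torch

def swap_tail_embedding(target_emb: torch.Tensor,
                        hiper_emb: torch.Tensor,
                        scale: float = 0.8) -> torch.Tensor:
    """Replace the tail of the target-prompt embedding with the
    personalized (HiPer) embedding, scaled by the factor of 0.8
    mentioned in the caption above."""
    n = hiper_emb.shape[1]                 # number of personalized token slots
    merged = target_emb.clone()
    merged[:, -n:, :] = scale * hiper_emb  # overwrite the trailing tokens
    return merged

# Hypothetical usage: condition the diffusion U-Net on `cond` instead of
# the plain target-prompt embedding to apply the edit while preserving
# the subject's identity.
# target_emb = encode_prompt("a dog playing a guitar")  # (1, 77, 768), assumed helper
# hiper_emb  = torch.load("hiper_emb.pt")               # (1, n, 768), trained on one image
# cond = swap_tail_embedding(target_emb, hiper_emb)
```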

Personalization

Personalization requires manipulating images with diverse prompts while still maintaining the subject's identity. Stable Diffusion does not incorporate information about the specific guitar, so its shape varies across samples. HiPer, in contrast, preserves the guitar's identity across diverse target prompts.

More results with animal

The identities of the animals are effectively maintained during manipulation.

More results with human

HiPer performs well on human face images as well as animal images. Changes to emotions, clothing, and gestures can be applied effectively while maintaining the subject's identity.

BibTeX

@article{han2023highly,
  title   = {Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion},
  author  = {Han, Inhwa and Yang, Serin and Kwon, Taesung and Ye, Jong Chul},
  journal = {arXiv preprint arXiv:2303.08767},
  year    = {2023}
}