Xinya Ji

I am a Ph.D. student at Nanjing University, supervised by Prof. Xun Cao, and currently a visiting Ph.D. student at the CGL Lab, ETH Zurich, supervised by Prof. Barbara Solenthaler and Dr. Derek Bradley. I have also worked closely with Prof. Feng Xu at Tsinghua University. I received my Bachelor of Science degree in Electronic Science and Engineering from Nanjing University in 2019.

Email  /  CV  /  Scholar  /  Github


Research

My research focuses on computer graphics, particularly 3D facial reconstruction, animation, and audio-visual learning.

Joint Learning of Depth and Appearance for Portrait Image Animation
Xinya Ji, Gaspard Zoss, Prashanth Chandran, Lingchen Yang, Xun Cao,
Barbara Solenthaler, Derek Bradley
ICCV Workshop on Human-Interactive Generation and Editing (HiGen), 2025
project page / arXiv

We propose to jointly learn visual appearance and depth in a diffusion-based portrait image generator. Once trained, our framework can be efficiently adapted to various downstream applications, such as facial depth-to-image and image-to-depth generation, portrait relighting, and audio-driven talking head animation with consistent 3D output.

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior
Xusen Sun, Longhao Zhang, Hao Zhu, Peng Zhang, Bang Zhang,
Xinya Ji, Kangneng Zhou, Daiheng Gao, Liefeng Bo, Xun Cao
3DV, 2025
project page / arXiv

VividTalk generates realistic, lip-synced talking head videos with expressive facial expressions and natural head poses.

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head
Qianyun He, Xinya Ji, Yicheng Gong, Yuanxun Lu, Zhengyu Diao, Linjia Huang,
Yao Yao, Siyu Zhu, Zhan Ma, Songcen Xu, Xiaofei Wu, Zixiao Zhang, Xun Cao, Hao Zhu
ECCV, 2024
project page / arXiv

We present a novel approach for synthesizing 3D talking heads with controllable emotion, enhancing lip synchronization and rendering quality.

AvatarBooth: High-Quality and Customizable 3D Human Avatar Generation
Yifei Zeng, Yuanxun Lu, Xinya Ji, Yao Yao, Hao Zhu, Xun Cao
arXiv:2306.09864, 2023
project page / arXiv

AvatarBooth is a text-to-3D method that creates animatable 3D avatars from text descriptions. It can also generate personalized avatars from 4–6 casual phone photos, or from a character design produced by a diffusion model, and it lets you edit the resulting character with arbitrary text prompts while preserving identity.

EAMM: One-shot Emotional Talking Face via Audio-based Emotion-aware Motion Model
Xinya Ji, Hang Zhou, Kaisiyuan Wang, Qianyi Wu, Wayne Wu,
Feng Xu, Xun Cao
SIGGRAPH Conference Proceedings, 2022
project page / arXiv

Given a single portrait image, we can synthesize emotional talking faces, where mouth movements match the input audio and facial emotion dynamics follow the emotion source video.

Audio-Driven Emotional Video Portraits
Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy,
Xun Cao, Feng Xu
CVPR, 2021
project page / arXiv

Given an audio clip and a target video, our Emotional Video Portraits (EVP) approach generates emotion-controllable talking portraits and smoothly changes their emotion by interpolating in the latent space.

Academic Service

  • Reviewer for CVPR, SIGGRAPH, ICCV, ECCV, 3DV, EG, TPAMI, etc.


Design and source code from Jon Barron's website.