Lifting Motion to the 3D World via 2D Diffusion

Jiaman Li, C. Karen Liu†, Jiajun Wu†
Stanford University


Realistic 3D Motion Generation without Training on 3D Motion Data.

MVLift takes a 2D pose sequence as input and estimates 3D motion in the world coordinate system without training on any 3D motion data. It generalizes to animal poses and human-object interactions.

Abstract

Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground truth 3D motions, limiting its applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion---including both joint rotations and root trajectories in the world coordinate system---using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including methods that do require 3D supervision.

Video


Method Overview

MVLift Method Overview.
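Once the diffusion model has produced 2D pose sequences that are consistent across multiple views, the 3D joint positions can be recovered by standard multi-view geometry. The sketch below is not MVLift's actual implementation; it is a minimal illustration of the underlying principle, assuming known 3x4 camera projection matrices for each view and using classical linear (DLT) triangulation per joint.

```python
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """Linear (DLT) triangulation of one joint from multiple views.

    proj_mats: list of 3x4 camera projection matrices, one per view.
    points_2d: list of (x, y) image coordinates of the same joint.
    Returns the 3D point in world coordinates.
    """
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the
        # homogeneous 3D point X: x * (P[2] @ X) = P[0] @ X, etc.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.asarray(rows)
    # Homogeneous least-squares solution: right singular vector of A
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Toy example with two hypothetical cameras (identity intrinsics,
# the second camera shifted one unit along the x-axis).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

def project(P, X):
    h = P @ np.append(X, 1.0)
    return h[:2] / h[2]

X_true = np.array([0.5, -0.2, 3.0])
X_rec = triangulate_point([P1, P2],
                          [project(P1, X_true), project(P2, X_true)])
```

With noise-free, geometrically consistent 2D inputs, the triangulated point matches the original 3D joint, which is why cross-view consistency of the generated 2D sequences is the critical property for accurate lifting.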

Comparisons on AIST++

(Click to see more results)

We compare our model with three baselines (ElePose, MAS, SMPLify) that do not require training on 3D motion data and two baselines (MotionBERT, WHAM) that need 3D motion data for training. ElePose and MAS cannot predict root trajectories.

Comparisons on NicoleMove

(Click to see more results)

We compare our model with three baselines (ElePose, MAS, SMPLify) that do not require training on 3D motion data and two baselines (MotionBERT, WHAM) that need 3D motion data for training. ElePose and MAS cannot predict root trajectories.

Comparisons on Steezy

(Click to see more results)

We compare our model with three baselines (ElePose, MAS, SMPLify) that do not require training on 3D motion data and two baselines (MotionBERT, WHAM) that need 3D motion data for training. ElePose and MAS cannot predict root trajectories.

Comparisons on OMOMO

(Click to see more results)

We compare our model with SMPLify.

Comparisons on CatPlay

(Click to see more results)

We compare our model with two baselines (MAS, SMPLify).