The figure summarizes our pipeline, from multi-view input images to the final recovered mesh:
Our model is built on Stable Diffusion 2.1 with a frozen UNet backbone, equipped with three conditioning signals ($\mathbf{c}_{\text{txt}}$, $\mathbf{c}_{\text{T2I}}$, $\mathbf{c}_{\text{DINO}}$) and four trainable attention modules ($\mathcal{A}_{\mathrm{text}}$, $\mathcal{A}_{\mathrm{img}}$, $\mathcal{A}_{\mathrm{cm}}$, $\mathcal{A}_{\mathrm{epi}}$) for multi-view-consistent proxy generation.
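Conceptually, each of the four attention modules attaches to a frozen UNet layer and injects one conditioning stream. The PyTorch-style sketch below shows one plausible wiring; the module names, tensor layouts, and use of `nn.MultiheadAttention` are our own illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MultiViewConditionBlock(nn.Module):
    """Hypothetical sketch: four trainable attention modules around one frozen UNet layer."""

    def __init__(self, dim, txt_dim, img_dim, num_views, heads=8):
        super().__init__()
        self.num_views = num_views
        # Cross-attention over text and image condition tokens (c_txt, c_T2I / c_DINO).
        self.attn_text = nn.MultiheadAttention(dim, heads, kdim=txt_dim, vdim=txt_dim, batch_first=True)
        self.attn_img = nn.MultiheadAttention(dim, heads, kdim=img_dim, vdim=img_dim, batch_first=True)
        # Cross-view ("cm") and epipolar attention operate across the V views jointly.
        self.attn_cm = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_epi = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h, c_txt, c_img, epi_mask=None):
        # h: (B*V, N, dim) frozen UNet tokens; c_txt, c_img: per-view condition tokens.
        h = h + self.attn_text(h, c_txt, c_txt)[0]
        h = h + self.attn_img(h, c_img, c_img)[0]

        # Cross-view attention: fold the V views into the token axis so every
        # query can attend to tokens from all views of the same sample.
        BV, N, D = h.shape
        V = self.num_views
        hv = h.reshape(BV // V, V * N, D)
        hv = hv + self.attn_cm(hv, hv, hv)[0]

        # Epipolar attention: same layout, keys restricted by a precomputed
        # (V*N, V*N) boolean mask derived from the camera geometry.
        hv = hv + self.attn_epi(hv, hv, hv, attn_mask=epi_mask)[0]
        return hv.reshape(BV, N, D)
```

Under this reading, only these attention modules would receive gradients during training, while the Stable Diffusion 2.1 UNet weights stay frozen as stated above.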
Our synthetic dataset contains over 108K clothed SMPL-X subjects rendered into more than 867K images across eight randomized views, with HDR lighting, realistic occlusions, and physics-based clothing. Because the annotations are fully synthetic, supervision is free of annotation bias, enabling robust zero-shot generalization to real-world scenarios.
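As a toy illustration of what "eight randomized views" could look like in a rendering script, the snippet below samples jittered azimuths, elevations, and camera distances per subject; every range here is an illustrative placeholder, not the actual rendering configuration.

```python
import numpy as np

def sample_views(num_views=8, seed=0):
    """Toy sketch: randomized viewpoints around one subject (ranges are illustrative only)."""
    rng = np.random.default_rng(seed)
    base = np.linspace(0.0, 360.0, num_views, endpoint=False)        # evenly spaced azimuths
    azimuth = (base + rng.uniform(-15.0, 15.0, num_views)) % 360.0   # per-render jitter
    elevation = rng.uniform(-10.0, 30.0, num_views)                  # degrees above horizontal
    distance = rng.uniform(2.0, 4.0, num_views)                      # metres from the subject
    return list(zip(azimuth, elevation, distance))

print(sample_views())  # eight (azimuth, elevation, distance) tuples for one subject
```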
| Method | 3DHP (mm) | | | | RICH (mm) | | | | BEHAVE (mm) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE |
| SMPLest-X | 33.7* | 51.6* | 48.8* | 67.1* | 26.5* | 42.8* | 33.6* | 51.7* | 29.3* | 49.5* | 43.0* | 65.2* |
| Human3R | 57.0 | 106.4 | 73.6 | 129.2 | 46.2 | 80.1 | 56.3 | 94.1 | 36.6 | 91.3 | 50.3 | 108.0 |
| SAM3DBody | 40.5 | 56.2 | 56.1 | 79.4 | 33.8 | 49.8 | 42.3 | 60.1 | 28.2 | 42.7 | 42.2 | 55.7 |
| U-HMR | 69.1* | 147.8* | 81.9* | 169.9* | 66.1 | 140.8 | 82.9 | 168.7 | 45.8 | 118.1 | 53.1 | 134.2 |
| MUC | 37.9 | — | 47.9 | — | 33.2* | — | 40.5* | — | 25.8 | — | 37.1 | — |
| HeatFormer | 34.8* | 59.8* | 42.8* | 66.4* | 44.9 | 88.8 | 63.1 | 106.7 | 33.8 | 67.2 | 47.2 | 76.8 |
| EasyMoCap | 47.6 | 85.5 | 59.6 | 93.3 | 30.4 | 39.2 | 42.3 | 50.0 | 26.4 | 52.9 | 40.1 | 63.1 |
| Ours | 33.4 | 41.4 | 44.6 | 50.6 | 22.3 | 27.9 | 25.1 | 28.3 | 24.0 | 33.2 | 33.3 | 41.2 |
| Method | MOYO (mm) | | | | 4D-DRESS (mm) | | | | 4D-DRESS partial (mm) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE | PA-MPJPE | MPJPE | PA-MPVPE | MPVPE |
| SMPLest-X | 64.0* | 101.2* | 77.0* | 121.1* | 35.2 | 53.8 | 52.4 | 72.0 | 75.4 | 106.7 | 117.3 | 147.6 |
| Human3R | 94.2 | 149.7 | 111.0 | 177.7 | 30.5 | 56.4 | 43.6 | 71.5 | 42.0 | 76.0 | 58.5 | 93.0 |
| SAM3DBody | 43.8 | 61.2 | 53.5 | 73.9 | 28.3 | 43.5 | 41.9 | 57.4 | 44.7 | 60.1 | 70.3 | 84.8 |
| U-HMR | 110.3 | 234.5 | 131.2 | 274.6 | 41.6 | 77.4 | 95.7 | 53.0 | 66.7 | 146.9 | 86.8 | 185.0 |
| MUC | 82.5 | — | 73.2 | — | 28.0 | — | 39.5 | — | 62.6 | — | 97.6 | — |
| HeatFormer | 85.7 | 149.5 | 106.8 | 171.5 | 43.8 | 69.9 | 64.5 | 88.8 | 140.1 | 283.5 | 174.8 | 318.6 |
| EasyMoCap | 44.1 | 65.6 | 60.9 | 76.5 | 20.9 | 27.8 | 32.7 | 39.0 | 79.6 | 447.1 | 120.7 | 466.9 |
| Ours | 27.3 | 34.6 | 32.7 | 39.4 | 17.2 | 20.9 | 24.8 | 27.1 | 22.7 | 27.2 | 31.5 | 34.2 |
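All table entries are mean per-joint (MPJPE) and per-vertex (MPVPE) position errors in millimetres; the PA- variants first apply a Procrustes (similarity) alignment of the prediction to the ground truth. A minimal NumPy sketch of these metrics (our own reference code, not the official evaluation scripts):

```python
import numpy as np

def procrustes_align(pred, gt):
    """Similarity-align pred to gt (both (N, 3) arrays); returns the aligned prediction."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)               # 3x3 cross-covariance
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                              # optimal rotation (pred -> gt)
    s = np.trace(D @ np.diag(S)) / (p ** 2).sum()   # optimal isotropic scale
    return s * p @ R.T + mu_g

def mean_position_error(pred, gt, procrustes=False):
    """MPJPE/MPVPE: mean Euclidean distance (inputs assumed to be in millimetres)."""
    if procrustes:                                   # PA-MPJPE / PA-MPVPE
        pred = procrustes_align(pred, gt)
    return np.linalg.norm(pred - gt, axis=-1).mean()
```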
Benefiting from the strong visual priors of the diffusion model and from training exclusively on synthetic data, our method achieves substantial improvements over competing approaches. In many cases, our results are on par with, or even visually superior to, the ground truth, which is typically obtained from a fitting-based pipeline that relies on additional signals.
```bibtex
@article{wang2025diffproxy,
  title={DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies},
  author={Wang, Renke and Zhang, Zhenyu and Tai, Ying and Yang, Jian},
  journal={arXiv preprint; identifier to be added},
  year={2025}
}
```