DiffProxy
Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies

¹PCA Lab, Nanjing University of Science and Technology  ²Nanjing University, School of Intelligent Science and Technology
*Corresponding authors
DiffProxy teaser

DiffProxy is trained exclusively on synthetic data and achieves robust generalization to real-world scenarios. Our framework accepts diverse prompts (visual and textual), handles difficult poses, generalizes to challenging environments, and supports partial views with flexible view counts. Three key advantages: (i) Annotation bias-free—training on synthetic data avoids fitting biases from real datasets; (ii) Flexible—adapts to varying view counts, handles partial observations, and works across diverse capture conditions; (iii) Cross-data generalization—achieves strong performance across unseen real-world datasets without requiring real training pairs.

Method Overview

The figure summarizes our pipeline from multi-view images to final mesh recovery:

(a) Given multi-view images and camera parameters, the proxy generator produces per-view SMPL-X proxies $\mathbf{P}_v$.
(b) Hand-focused regions inferred from the body proxies are incorporated as additional views for hand refinement.
(c) Test-time scaling runs $K$ stochastic inference attempts, aggregates predictions through the median (UV) and majority voting (segmentation), and computes pixel-wise uncertainty to produce a weight map $\mathbf{W}_v$ that guides fitting (a sketch of this aggregation follows the list).
(d) The body is fitted and then refined with hand-specific proxies to recover the final human mesh.
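
To make step (c) concrete, the following is a minimal NumPy sketch of the aggregation, not the released implementation: the tensor shapes, the use of UV-sample spread as the uncertainty measure, and the mapping from uncertainty to the weight map $\mathbf{W}_v$ are illustrative assumptions.

import numpy as np

def aggregate_view(uv_samples, seg_samples):
    # uv_samples:  (K, H, W, 2) dense UV proxy maps from K stochastic attempts
    # seg_samples: (K, H, W)    integer part-segmentation maps from the same attempts

    # Median over the K attempts is robust to outlier samples (continuous UV channels).
    uv_med = np.median(uv_samples, axis=0)                              # (H, W, 2)

    # Majority vote over the K attempts for the discrete segmentation labels.
    K, H, W = seg_samples.shape
    flat = seg_samples.reshape(K, -1)
    seg_vote = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, flat).reshape(H, W)

    # Pixel-wise uncertainty: spread of the UV samples around their median
    # (one possible dispersion measure, assumed here for illustration).
    uncertainty = np.abs(uv_samples - uv_med).mean(axis=(0, 3))         # (H, W)

    # Map uncertainty to a weight map W_v in (0, 1]: confident pixels
    # contribute more to the subsequent SMPL-X fitting.
    weight = 1.0 / (1.0 + uncertainty)
    return uv_med, seg_vote, weight

In the pipeline, $\mathbf{W}_v$ then down-weights uncertain pixels when the SMPL-X body is fitted to the aggregated proxies.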

Architecture

Our model is built on Stable Diffusion 2.1 with a frozen UNet backbone, equipped with three conditioning signals ($\mathbf{c}_{\text{txt}}$, $\mathbf{c}_{\text{T2I}}$, $\mathbf{c}_{\text{DINO}}$) and four trainable attention modules ($\mathcal{A}_{\mathrm{text}}$, $\mathcal{A}_{\mathrm{img}}$, $\mathcal{A}_{\mathrm{cm}}$, $\mathcal{A}_{\mathrm{epi}}$) for multi-view consistent proxy generation.
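
The snippet below is an illustrative PyTorch-style sketch of how one frozen UNet layer could be wrapped by the four trainable attention modules; the module ordering, feature dimensions, residual injection of the T2I-Adapter features, and the epipolar masking interface are assumptions made only to show where each conditioning signal enters.

import torch.nn as nn

class ProxyAttentionBlock(nn.Module):
    # Hypothetical wrapper around one frozen Stable Diffusion 2.1 UNet layer
    # (the frozen self-attention / ResNet sub-blocks are omitted), adding the
    # four trainable modules A_text, A_img, A_cm and A_epi.
    def __init__(self, dim=320, heads=8):
        super().__init__()
        make = lambda: nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_text, self.attn_img = make(), make()   # cross-attention on c_txt / c_DINO
        self.attn_cm, self.attn_epi = make(), make()     # cross-view / epipolar attention
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, x, c_txt, c_dino, c_t2i, epi_mask=None):
        # x:      (B, V, N, D) latent tokens of the V views
        # c_txt:  (B, L, D)    text-encoder tokens
        # c_dino: (B, V, M, D) DINO tokens of the conditioning images
        # c_t2i:  (B, V, N, D) T2I-Adapter features, injected residually
        # epi_mask: optional attention mask restricting A_epi to epipolar lines
        B, V, N, D = x.shape
        x = x + c_t2i

        flat = x.reshape(B * V, N, D)
        txt = c_txt.unsqueeze(1).expand(-1, V, -1, -1).reshape(B * V, -1, D)
        dino = c_dino.reshape(B * V, -1, D)
        flat = flat + self.attn_text(self.norm[0](flat), txt, txt, need_weights=False)[0]
        flat = flat + self.attn_img(self.norm[1](flat), dino, dino, need_weights=False)[0]

        # Cross-view attention: tokens of all V views attend to each other jointly.
        joint = flat.reshape(B, V * N, D)
        joint = joint + self.attn_cm(self.norm[2](joint), joint, joint, need_weights=False)[0]

        # Epipolar attention: same pattern, but a mask derived from the camera
        # parameters keeps only key/value tokens near each query's epipolar lines.
        q = joint.reshape(B * V, N, D)
        kv = joint.repeat_interleave(V, dim=0)
        q = q + self.attn_epi(self.norm[3](q), kv, kv, attn_mask=epi_mask, need_weights=False)[0]
        return q.reshape(B, V, N, D)

During training, only these four modules receive gradients; the underlying UNet weights stay frozen, as stated above.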

Synthetic Data

Our synthetic dataset contains 108K+ clothed SMPL-X subjects rendered into 867K+ images across eight randomized views, featuring HDR lighting, realistic occlusions, and physics-based clothing. Fully synthetic annotations ensure bias-free supervision, enabling robust zero-shot generalization to real-world scenarios.
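
As a rough illustration of the rendering setup (the actual camera distributions are not specified here), eight randomized viewpoints around a subject can be sampled as in the sketch below; all ranges are placeholder values.

import numpy as np

def sample_cameras(num_views=8, radius=(2.5, 4.0), elevation_deg=(-15.0, 45.0), seed=None):
    # Evenly spread azimuths with a random offset, plus randomized elevation
    # and distance, all looking at the subject's root placed at the origin.
    rng = np.random.default_rng(seed)
    azimuth = np.deg2rad(np.arange(num_views) * 360.0 / num_views
                         + rng.uniform(0.0, 360.0 / num_views))
    views = []
    for az in azimuth:
        el = np.deg2rad(rng.uniform(*elevation_deg))
        r = rng.uniform(*radius)
        eye = np.array([r * np.cos(el) * np.cos(az),
                        r * np.cos(el) * np.sin(az),
                        r * np.sin(el)])
        views.append({"eye": eye, "look_at": np.zeros(3)})
    return views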

Our Results

Quantitative Comparisons

All values are reported as PA-MPJPE / MPJPE / PA-MPVPE / MPVPE in mm (lower is better); a dash marks a metric not reported for that method.

Method      | 3dhp                            | rich                            | behave
SMPLest-X   | 33.7* / 51.6* / 48.8* / 67.1*   | 26.5* / 42.8* / 33.6* / 51.7*   | 29.3* / 49.5* / 43.0* / 65.2*
Human3R     | 57.0 / 106.4 / 73.6 / 129.2     | 46.2 / 80.1 / 56.3 / 94.1       | 36.6 / 91.3 / 50.3 / 108.0
SAM3DBody   | 40.5 / 56.2 / 56.1 / 79.4       | 33.8 / 49.8 / 42.3 / 60.1       | 28.2 / 42.7 / 42.2 / 55.7
U-HMR       | 69.1* / 147.8* / 81.9* / 169.9* | 66.1 / 140.8 / 82.9 / 168.7     | 45.8 / 118.1 / 53.1 / 134.2
MUC         | 37.9 / – / 47.9 / –             | 33.2* / – / 40.5* / –           | 25.8 / – / 37.1 / –
HeatFormer  | 34.8* / 59.8* / 42.8* / 66.4*   | 44.9 / 88.8 / 63.1 / 106.7      | 33.8 / 67.2 / 47.2 / 76.8
EasyMoCap   | 47.6 / 85.5 / 59.6 / 93.3       | 30.4 / 39.2 / 42.3 / 50.0       | 26.4 / 52.9 / 40.1 / 63.1
Ours        | 33.4 / 41.4 / 44.6 / 50.6       | 22.3 / 27.9 / 25.1 / 28.3       | 24.0 / 33.2 / 33.3 / 41.2

Method      | moyo                            | 4ddress                         | 4ddress-partial
SMPLest-X   | 64.0* / 101.2* / 77.0* / 121.1* | 35.2 / 53.8 / 52.4 / 72.0       | 75.4 / 106.7 / 117.3 / 147.6
Human3R     | 94.2 / 149.7 / 111.0 / 177.7    | 30.5 / 56.4 / 43.6 / 71.5       | 42.0 / 76.0 / 58.5 / 93.0
SAM3DBody   | 43.8 / 61.2 / 53.5 / 73.9       | 28.3 / 43.5 / 41.9 / 57.4       | 44.7 / 60.1 / 70.3 / 84.8
U-HMR       | 110.3 / 234.5 / 131.2 / 274.6   | 41.6 / 77.4 / 95.7 / 53.0       | 66.7 / 146.9 / 86.8 / 185.0
MUC         | 82.5 / – / 73.2 / –             | 28.0 / – / 39.5 / –             | 62.6 / – / 97.6 / –
HeatFormer  | 85.7 / 149.5 / 106.8 / 171.5    | 43.8 / 69.9 / 64.5 / 88.8       | 140.1 / 283.5 / 174.8 / 318.6
EasyMoCap   | 44.1 / 65.6 / 60.9 / 76.5       | 20.9 / 27.8 / 32.7 / 39.0       | 79.6 / 447.1 / 120.7 / 466.9
Ours        | 27.3 / 34.6 / 32.7 / 39.4       | 17.2 / 20.9 / 24.8 / 27.1       | 22.7 / 27.2 / 31.5 / 34.2

* Method was trained on the corresponding dataset.
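
For reference, MPJPE and MPVPE denote mean per-joint and per-vertex position errors, and the PA- prefix means the prediction is first rigidly aligned (rotation, translation, scale) to the ground truth via Procrustes analysis. A minimal NumPy implementation of the joint metrics (the vertex metrics are identical with mesh vertices in place of joints):

import numpy as np

def mpjpe(pred, gt):
    # Mean per-joint position error; pred, gt are (J, 3) joint arrays
    # in the same units (here, mm).
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    # MPJPE after Procrustes alignment: the prediction is first optimally
    # rotated, scaled, and translated onto the ground truth (Umeyama/Kabsch).
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g

    U, S, Vt = np.linalg.svd(p.T @ g)       # cross-covariance of centred point sets
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                # correct an improper rotation (reflection)
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    return mpjpe(scale * p @ R.T + mu_g, gt)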

Qualitative Comparisons

Benefiting from the strong visual priors of the diffusion model and from training exclusively on synthetic data, our method achieves substantial improvements over other approaches. In many cases, our results are on par with, or even visually superior to, the ground truth (which is typically derived from a fitting-based pipeline with additional signals).

BibTeX

@article{wang2025diffproxy,
  title={DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies},
  author={Wang, Renke and Zhang, Zhenyu and Tai, Ying and Yang, Jian},
  journal={arXiv preprint; identifier to be added},
  year={2025}
}