It's also a common fallacy to think you need a 3D sensor for high-quality 3D work. Indeed, state-of-the-art SLAM systems like ORB-SLAM3 can run on a single monocular camera.
The models used typically label bones in the image and then make a good guess at the depth of each bone from depth cues, namely typical human proportions and typical human poses. They're trained on footage of researchers dancing in front of more expensive sensors like the one you linked, but provide near-identical output in their absence.
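
To make the "typical human proportions" cue concrete, here's a rough geometric sketch. It is not what the learned models actually compute; it's just the classic foreshortening argument, assuming a scaled-orthographic camera. The function name, the 0.25 m forearm length, and the 400 px/m scale are all made up for illustration.

```python
import math

def relative_depth(p_a, p_b, bone_length, scale):
    """Depth offset magnitude (metres) between the two ends of a bone.

    Assumes scaled-orthographic projection (an illustrative assumption).
    p_a, p_b    -- (x, y) pixel positions of the bone's two joints
    bone_length -- assumed true bone length in metres
    scale       -- assumed image scale in pixels per metre
    """
    # Convert the image-plane offset from pixels to metres.
    dx = (p_b[0] - p_a[0]) / scale
    dy = (p_b[1] - p_a[1]) / scale
    planar_sq = dx * dx + dy * dy
    # A bone that looks shorter than its true length must be tilted
    # towards or away from the camera. Clamp at zero because noisy
    # keypoints can make the bone appear longer than it really is.
    depth_sq = max(bone_length ** 2 - planar_sq, 0.0)
    return math.sqrt(depth_sq)

# Hypothetical forearm: 0.25 m long, but only 60 px long in the image.
dz = relative_depth((100, 200), (160, 200), bone_length=0.25, scale=400.0)
print(f"depth offset: {dz:.2f} m")
```

Note this only gives the magnitude: the geometry alone can't tell you whether the wrist is in front of or behind the elbow. That sign ambiguity is where the "typical human poses" prior comes in.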