You're vastly overestimating the weight of these models. We're talking about a couple of layers, light enough to run in real time on a CPU, or even lighter non-network models.
These are not even remotely in the same class as the "AI" du jour.
I'm also not talking about face tracking at all. I'm talking about body tracking: tagging body parts in the image, then refining the pose by guessing the depth of the end of each bone.
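
To give a feel for how cheap the depth-guessing step can be, here is a rough sketch of one common way to do it. This is not any particular tracker's method. It assumes an orthographic camera, 2D joint positions already tagged in the image, and known bone lengths in the same units. The function names and the `signs` input (toward or away from the camera for each bone, which is the actual "guess") are made up for illustration.

```python
import math

def bone_depth_offset(parent_xy, child_xy, bone_length):
    """Size of the depth gap between the two ends of a bone.

    Assumes orthographic projection: the bone's true length is the
    hypotenuse, its length in the image is one leg, depth is the other.
    """
    dx = child_xy[0] - parent_xy[0]
    dy = child_xy[1] - parent_xy[1]
    projected_sq = dx * dx + dy * dy
    # Noisy detections can make the projected bone look longer than the
    # real one; clamp so the square root stays defined.
    return math.sqrt(max(bone_length ** 2 - projected_sq, 0.0))

def lift_chain(joints_xy, bone_lengths, signs, root_z=0.0):
    """Walk a chain of joints from the root and assign each a depth.

    signs[i] is +1 or -1: the hypothetical guess of whether bone i
    points away from or toward the camera.
    """
    z = root_z
    lifted = [(joints_xy[0][0], joints_xy[0][1], z)]
    for i, length in enumerate(bone_lengths):
        dz = bone_depth_offset(joints_xy[i], joints_xy[i + 1], length)
        z += signs[i] * dz
        lifted.append((joints_xy[i + 1][0], joints_xy[i + 1][1], z))
    return lifted

# Shoulder -> elbow -> wrist, each bone 5 units long.
arm = lift_chain([(0, 0), (3, 0), (3, 4)], [5, 5], [1, -1])
```

That is a handful of multiplies and one square root per bone, which is the scale of work I mean. The hard part is picking the sign for each bone, and a small model or a few heuristics can handle that.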