But the scheduler also keeps track of jobs, which reference their completion fences, so we have a lifetime loop. That loop is broken at certain points in the job lifecycle, but the fact that it exists makes it very difficult to reason about the lifetimes of any of this stuff, and it also makes it impossible to implement the requirements imposed by drm_sched via straight refcounting. If you try to refcount the scheduler and have the hw fence hold a reference to it, the whole thing deadlocks: the scheduler itself might drop the final reference to a job completion fence (when a job is cleaned up after completion), which would mean trying to free the scheduler from the scheduler's own workqueue.
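To make the loop concrete, here is a minimal userspace sketch in Rust. It is only a model: `Scheduler`, `Job`, `Fence` and `lifetime_loop_demo` are hypothetical stand-ins rather than real drm_sched types, and it assumes the "fence holds a scheduler reference" design described above, using `Rc` in place of kernel refcounts.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// Hypothetical userspace stand-ins; none of these are real drm_sched types.
struct Scheduler {
    // The scheduler keeps track of its jobs.
    jobs: RefCell<Vec<Rc<Job>>>,
}

struct Job {
    // Each job references its completion fence.
    _fence: Rc<Fence>,
}

struct Fence {
    // Assumption for this model: the fence holds a reference to the scheduler.
    _sched: Rc<Scheduler>,
}

/// Builds the loop, drops the driver's handle, and reports whether the
/// scheduler was freed (a) while the job was still tracked and
/// (b) after the job was cleaned up.
fn lifetime_loop_demo() -> (bool, bool) {
    let sched = Rc::new(Scheduler { jobs: RefCell::new(Vec::new()) });
    let fence = Rc::new(Fence { _sched: Rc::clone(&sched) });
    sched.jobs.borrow_mut().push(Rc::new(Job { _fence: fence }));

    // Non-owning observer, so we can tell when the scheduler is really gone.
    let observer: Weak<Scheduler> = Rc::downgrade(&sched);

    // The driver drops its only reference...
    drop(sched);
    // ...but scheduler -> job -> fence -> scheduler keeps everything alive.
    let freed_while_tracked = observer.upgrade().is_none();

    // Break the loop the way job cleanup after completion would.
    if let Some(s) = observer.upgrade() {
        s.jobs.borrow_mut().clear();
        // `s` is now the final reference and is dropped at the end of this
        // block. In this model that is harmless; the text above is about
        // what happens when this final drop runs on the scheduler's own
        // workqueue.
    }
    let freed_after_cleanup = observer.upgrade().is_none();

    (freed_while_tracked, freed_after_cleanup)
}

fn main() {
    let (freed_while_tracked, freed_after_cleanup) = lifetime_loop_demo();
    println!("freed while job still tracked: {}", freed_while_tracked);
    println!("freed after job cleanup:       {}", freed_after_cleanup);
}
```

Note where the last reference goes away in this sketch: inside the code that cleans up the job, not in the code that "owns" the scheduler. That is the part that makes the lifetimes so hard to reason about.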
So now your driver *needs* to implement some kind of deferred cleanup workqueue to free schedulers at some arbitrarily distant point in the future. And your driver module might also be blocked from unloading from the kernel forever, because as long as any buffers hold on to job completion fences, that dependency keeps the driver from unloading.
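As a rough illustration of what "deferred cleanup" means here, the following Rust sketch hands the last scheduler reference from a worker thread to a separate reaper thread instead of freeing it in place. Everything in it is an assumption for the sake of the example: `Scheduler`, the reaper, and `deferred_free_demo` are made-up names, and plain threads plus a channel stand in for kernel workqueues.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread::{self, ThreadId};

// Hypothetical stand-in for a scheduler; it records which thread freed it.
struct Scheduler {
    freed_on: Arc<Mutex<Option<ThreadId>>>,
}

impl Drop for Scheduler {
    fn drop(&mut self) {
        *self.freed_on.lock().unwrap() = Some(thread::current().id());
    }
}

/// Returns (id of the worker thread, id of the thread that freed the scheduler).
fn deferred_free_demo() -> (ThreadId, ThreadId) {
    let freed_on = Arc::new(Mutex::new(None));
    let sched = Arc::new(Scheduler { freed_on: Arc::clone(&freed_on) });

    // Reaper: a separate context whose only job is to drop dead schedulers.
    let (tx, rx) = mpsc::channel::<Arc<Scheduler>>();
    let reaper = thread::spawn(move || {
        for dead in rx {
            drop(dead); // the actual free happens here, off the worker
        }
    });

    // Worker: stands in for the scheduler's own workqueue. It ends up holding
    // the last reference, but hands it off rather than freeing it itself.
    let worker = thread::spawn(move || {
        let id = thread::current().id();
        tx.send(sched).unwrap();
        id
    });

    let worker_id = worker.join().unwrap();
    // The worker dropped `tx` on exit, so the reaper's loop ends and it exits.
    reaper.join().unwrap();

    let freed_id = (*freed_on.lock().unwrap()).expect("scheduler was never freed");
    (worker_id, freed_id)
}

fn main() {
    let (worker_id, freed_id) = deferred_free_demo();
    println!("worker thread: {:?}", worker_id);
    println!("freed on:      {:?}", freed_id);
}
```

In this toy version the reaper has a clear end: the channel closes and it exits. The complaint above is that a real driver gets no such bounded point, since a fence held by some buffer can keep the scheduler pending for as long as that buffer exists.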
This is just bad design.