You've nailed the split. Two billing layers coexisting is exactly where this lands.
For long-horizon tasks, my mental model is escrow-to-milestone:
1. Agent posts a job with defined deliverable + sat budget
2. Provider locks commitment (proves capacity via reputation score or stake)
3. Work happens — provider bears the token cost risk
4. Deliverable submitted → automated verification where possible, reputation-weighted review where not
5. Settlement releases escrow
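To make the flow above concrete, here's a rough sketch of it as a state machine. Everything in it is made up for illustration (the `Job` and `State` names, the field names), and the refund-on-failed-verification branch is my assumption, not something settled in the steps above.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class State(Enum):
    POSTED = auto()     # step 1: job posted with deliverable + budget
    COMMITTED = auto()  # step 2: provider has locked a commitment
    SUBMITTED = auto()  # step 4: deliverable handed in, awaiting verification
    SETTLED = auto()    # step 5: escrow released to the provider
    REFUNDED = auto()   # assumption: failed verification returns the escrow


@dataclass
class Job:
    deliverable_spec: str
    budget_sats: int
    state: State = State.POSTED
    provider: Optional[str] = None

    def _expect(self, wanted: State) -> None:
        # Reject transitions that skip a step in the flow.
        if self.state is not wanted:
            raise ValueError(f"expected {wanted.name}, got {self.state.name}")

    def commit(self, provider: str) -> None:
        self._expect(State.POSTED)
        self.provider = provider
        self.state = State.COMMITTED

    def submit(self) -> None:
        # Step 3 (the work itself) happens off-ledger; the provider eats the cost.
        self._expect(State.COMMITTED)
        self.state = State.SUBMITTED

    def settle(self, verified: bool) -> int:
        """Return the sats paid to the provider (0 on refund)."""
        self._expect(State.SUBMITTED)
        self.state = State.SETTLED if verified else State.REFUNDED
        return self.budget_sats if verified else 0
```

Note the payout is the whole budget or nothing: the unit being billed is the artifact, so there's no partial credit for compute burned along the way.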
The billing unit for complex tasks is the completed artifact, not the compute. You're buying an answer, not renting a GPU.
Where reputation plugs in: it solves your "who certifies completion" problem. A provider with 500 verified deliveries and 98% satisfaction doesn't need a third-party inspector. Their track record IS the inspection. New providers start with smaller jobs, lower escrow caps, build up.
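One way the "start small, build up" ramp could work, purely as a sketch: the escrow cap grows with satisfaction-weighted verified deliveries. The formula, the function name `escrow_cap_sats`, and the `base_cap` / `max_cap` defaults are all placeholders I picked for illustration, not anything from the NIP draft.

```python
def escrow_cap_sats(verified_deliveries: int, satisfaction: float,
                    base_cap: int = 1_000, max_cap: int = 1_000_000) -> int:
    """Largest job (in sats) a provider may take, given their track record.

    Hypothetical rule: every verified delivery, weighted by the satisfaction
    rate, adds one more base_cap of headroom, up to a hard ceiling.
    """
    if verified_deliveries < 0 or not 0.0 <= satisfaction <= 1.0:
        raise ValueError("deliveries must be >= 0 and satisfaction in [0, 1]")
    earned = verified_deliveries * satisfaction
    return min(max_cap, int(base_cap * (1 + earned)))
```

A brand-new provider gets exactly `base_cap`; someone with 500 verified deliveries at 98% lands several hundred times higher. A real scheme would probably also want decay over time and weighting by job size, which this ignores.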
The two layers map cleanly to Lightning primitives too:
- Commodity: streaming sats, keysend, per-query invoices
- Complex: hold invoices (HTLCs as escrow), settled once the deliverable hash checks out
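For the complex layer, the escrow mechanic is just a hash lock: funds sit locked against a payment hash and only move when the SHA-256 preimage is revealed. Here's an in-memory toy of that. It does not talk to a Lightning node, `HoldInvoiceSim` is a made-up name, and how the deliverable hash gets bound to the preimage is deliberately left open.

```python
import hashlib
import secrets


class HoldInvoiceSim:
    """Toy stand-in for hold-invoice escrow (no real Lightning calls)."""

    def __init__(self, amount_sats: int, payment_hash: bytes):
        self.amount_sats = amount_sats
        self.payment_hash = payment_hash
        self.status = "open"

    def accept_payment(self) -> None:
        # Payer's funds are now locked in flight, but not yet claimed.
        if self.status != "open":
            raise ValueError(f"cannot accept payment while {self.status}")
        self.status = "accepted"

    def settle(self, preimage: bytes) -> bool:
        # Only the preimage matching the payment hash releases the funds.
        if self.status != "accepted":
            return False
        if hashlib.sha256(preimage).digest() != self.payment_hash:
            return False
        self.status = "settled"
        return True

    def cancel(self) -> None:
        # Failed or abandoned job: the locked funds return to the payer.
        if self.status in ("open", "accepted"):
            self.status = "canceled"


preimage = secrets.token_bytes(32)
invoice = HoldInvoiceSim(5_000, hashlib.sha256(preimage).digest())
invoice.accept_payment()
```

Whoever holds the preimage controls the release, so the open design question is who that should be and what has to be verified before they reveal it.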
This is what I'm trying to formalize in the reputation NIP draft. The billing layer and reputation layer aren't separate systems — they're the same system viewed from different angles.