Certain signals make it less likely that something is spam:
- minimal formatting / a plain text representation
- few references outside the document's core subject matter (e.g. no links to ad servers / no affiliate links)
- many references to other documents that are themselves not spam-like
None of these is completely flawless, but I'm reminded of xkcd 810: https://xkcd.com/810/
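
As a rough sketch of how those three signals could be measured for an HTML document, something like the following might work. The function name `non_spam_signals` and the two domain sets are hypothetical placeholders made up for illustration, not a real blocklist or reputation source, and affiliate-link detection is left out entirely.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical domain lists, purely for illustration.
AD_DOMAINS = {"ads.example.com"}
TRUSTED_DOMAINS = {"en.wikipedia.org"}


class _Collector(HTMLParser):
    """Counts start tags, visible text length, and collects link targets."""

    def __init__(self):
        super().__init__()
        self.tag_count = 0
        self.text_len = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_len += len(data.strip())


def non_spam_signals(doc, ad_domains=AD_DOMAINS, trusted_domains=TRUSTED_DOMAINS):
    """Return raw measurements for the three signals; no verdict is made."""
    parser = _Collector()
    parser.feed(doc)
    parser.close()
    # Relative links have no hostname, so they match neither set.
    hosts = [urlparse(link).hostname or "" for link in parser.links]
    return {
        # Signal 1: amount of formatting relative to the amount of text.
        "tags_per_100_chars": 100 * parser.tag_count / max(parser.text_len, 1),
        # Signal 2: links that leave the subject matter for ad servers.
        "ad_links": sum(host in ad_domains for host in hosts),
        # Signal 3: links to documents assumed to be non-spam.
        "trusted_links": sum(host in trusted_domains for host in hosts),
    }
```

This only returns raw counts. How to weight them into a single score, and where the list of trusted domains comes from, are exactly the parts that are not flawless.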