And then there's evaluation, or lack thereof:
Google is advertising Gemini as an everything machine, a general-purpose model that can be used in many different ways. In other words: something that cannot be evaluated, since it doesn't have a specific purpose.
What stands in for evaluation are "benchmarks", but these benchmarks lack construct validity. What are they supposed to be measuring? What shows that they do measure that? How does that relate to the intended use case of the technology?
/5