We asked each LLM about 46 binary questions, each with an expected answer (starting with YES or NO for simplicity). Scoring was then a string comparison between the answer the LLM gave and the expected answer we provided.
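A minimal sketch of that kind of scoring, in Python. This is not the code from the repo: the function names, question IDs, and sample answers are made up for illustration, and the word-boundary regex is only one assumed way to check that an answer "starts with YES or NO".

```python
import re
from typing import Dict, Optional

# Assumption: an answer counts only if its first word is YES or NO
# (case-insensitive). The \b keeps "Nothing" from being read as "NO".
_VERDICT = re.compile(r"\s*(YES|NO)\b", re.IGNORECASE)


def leading_verdict(text: str) -> Optional[str]:
    """Return "YES" or "NO" if the text starts with that word, else None."""
    match = _VERDICT.match(text)
    return match.group(1).upper() if match else None


def score(answers: Dict[str, str], expected: Dict[str, str]) -> float:
    """Fraction of questions whose answer starts with the expected verdict."""
    hits = sum(
        1
        for qid, want in expected.items()
        if leading_verdict(answers.get(qid, "")) == want.upper()
    )
    return hits / len(expected)


# Hypothetical data: three questions instead of the full set.
expected = {"q1": "YES", "q2": "NO", "q3": "NO"}
answers = {
    "q1": "Yes, because everyone has that right.",
    "q2": "I would rather not take a side on this.",  # evasive: no verdict
    "q3": "No.",
}

print(f"score: {score(answers, expected):.2f}")
```

In this sketch an evasive answer has no leading YES or NO, so it counts as a miss, which is how a model that avoids answering ends up with a low score.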
OpenAI is pro human rights, and so is Meta. Chinese models are spread all over the ranking. The most intelligent open source model today (GLM) ranked the worst. Gemini avoided giving answers, which I think is a kind of censorship, and ended up scoring low.
The idea is that after doing proper benchmarks, we can shift AI in good directions ourselves, or demand that other companies score higher. Ultimately, consumers of LLMs are better off and more mindful of what they are choosing and talking to.
Open sourced the code and questions:
https://github.com/hrleaderboard/hrleaderboard
Our activist: https://x.com/yangjianli001
Thanks to justinmoon (nprofile…vu4x) and HRF (nprofile…rzcm) for the event. It was a great experience and "the place to be" this weekend.
