AI researchers have created CheckList, a language-model testing tool that has discovered major bugs in commercially available cloud AI offerings from Amazon, Google, and Microsoft. Yesterday, the paper introducing the tool received the Best Paper award from organizers of the Association for Computational Linguistics (ACL) conference. The conference, which took place online this week, is one of the largest annual gatherings for researchers creating language models.
NLP models today are often evaluated by how they perform on individual tasks, such as question answering, using benchmark data sets with leaderboards. CheckList instead takes a task-agnostic approach: users create tests that fill in the cells of a spreadsheet-like matrix, with linguistic capabilities as rows and test types as columns, along with visualizations and other resources.
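To make that matrix concrete, here is a minimal sketch of a single cell: a Minimum Functionality Test (MFT, one of CheckList's test types) probing a sentiment model's handling of positive vocabulary (a capability), written against the CheckList library's public Python API. The stand-in predict_proba function and the small lexicons supplied to the template are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from checklist.editor import Editor
from checklist.test_types import MFT
from checklist.pred_wrapper import PredictorWrapper

def predict_proba(sentences):
    # Stand-in classifier for illustration: replace with your real
    # model's class-probability output (columns: negative, positive).
    return np.array([[0.1, 0.9]] * len(sentences))

editor = Editor()

# Fill a template: each placeholder draws from a small lexicon, so one
# template expands into many concrete test sentences, all of which
# should be labeled positive (label 1).
ret = editor.template('The {thing} was {pos_adj}.',
                      thing=['food', 'service', 'flight'],
                      pos_adj=['great', 'wonderful', 'amazing'])

# Capability (row) and test type (column) locate this test in the
# CheckList matrix described above.
test = MFT(ret.data, labels=1,
           name='simple positive sentiment',
           capability='Vocabulary',
           description='Short sentences with clearly positive adjectives.')

# Wrap the probability-returning model, run the test, and print a
# pass/fail summary with example failures.
wrapped = PredictorWrapper.wrap_softmax(predict_proba)
test.run(wrapped)
test.summary()
```

Because templates multiply a handful of lexicon entries into hundreds of test cases, a single cell of the matrix can surface systematic failures that a static benchmark set would miss.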