In Spring 2024, TextGroup linguist Andreas Schramm gave a presentation on how deeply time is embedded in language and on his planned study with Mike Mensink and Hannah Riddle, which combines Jeannette Gundel’s Givenness Hierarchy with his research on linguistic aspect. At that meeting, UMN Computer Science professor Dongyeop Kang proposed a collaborative benchmark effort — which would become known as CogBench — involving his doctoral candidate Karin de Langis and several other members of TextGroup and the Computer Science Department.
The vision behind CogBench (Cognitive Benchmark) was to move beyond standard NLP evaluations and instead measure LLMs against established psycholinguistic and cognitive tasks drawn from decades of human research. Rather than treating LLMs as black boxes to be probed only with language benchmarks, CogBench would treat them as cognitive systems whose processing could be compared directly with human readers — a framing that opened the door to a genuinely interdisciplinary collaboration between cognitive scientists, linguists, and NLP researchers.
This initial effort was highly productive and resulted in the paper How LLMs Comprehend Temporal Meaning in Narratives: A Case Study in Cognitive Evaluation of LLMs. The study compared Andreas Schramm’s cognitive psycholinguistic data from human readers with output from seven LLMs. It was a deeply gratifying experience during which both sides learned a great deal about each other’s areas of expertise. We communicated many times over the summer of 2024 between the US and Europe to get LLMs to understand and follow human instructions, at times enlisting the help of the 180-mph German ICE bullet train to propel our research forward. An extended version with four human populations (undergraduate and graduate native-English-speaking students; low- and high-advanced nonnative English students) was well received at the conference of the American Association for Applied Linguistics in the spring of 2026.
The second part of the CogBench project was conducted with the help of additional TextGroup members Püren Öncel, Andrew Elfenbein, and Laura Allen. Two papers came out of this effort. One tested LLM comprehension of the set of stories from Ed O’Brien’s lab — including the infamous story with Mary, the cheeseburger-eating vegetarian (Mary, the Cheeseburger-Eating Vegetarian: Do LLMs Recognize Incoherence in Narratives?). The other investigated the performance of LLMs on a set of classic executive-functioning tasks (Strong Memory, Weak Control: An Empirical Study of Executive Functioning in LLMs).
The Mary study revealed that even state-of-the-art LLMs often fail to notice the kind of glaring narrative inconsistencies that human readers catch effortlessly, suggesting that these models lean on surface-level coherence more than on genuine updating of a world model. The executive-functioning study, in turn, surfaced a striking asymmetry: LLMs exhibit remarkably strong working memory but comparatively weak cognitive control, struggling to suppress prepotent responses or to shift flexibly between rules on classic tasks such as the Stroop task and the Wisconsin Card Sorting Test. Together, the three CogBench papers point toward a consistent picture of today’s LLMs as systems with rich representations but limited metacognitive regulation, and toward cognitive science as a productive source of benchmarks for the next generation of models.
We look forward to future collaborations between TextGroup and the Computer Science Department!