CogBench: a large language model walks into a psychology lab

Coda-Forno, J., Binz, M., Wang, J., & Schulz, E. (submitted). CogBench: a large language model walks into a psychology lab.


Locators

Locator: https://arxiv.org/pdf/2402.18225.pdf (Any fulltext)
Description: -
OA-Status: Not specified

Creators

Creators:
Coda-Forno, J.¹, Author
Binz, M.¹, Author
Wang, J. X., Author
Schulz, E.¹, Author
Affiliations:
¹ Research Group Computational Principles of Intelligence, Max Planck Institute for Biological Cybernetics, Max Planck Society, ou_3189356

Content

Free keywords: -
 Abstract: Large language models (LLMs) have significantly advanced the field of artificial intelligence. Yet, evaluating them comprehensively remains challenging. We argue that this is partly due to the predominant focus on performance metrics in most benchmarks. This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. This novel approach offers a toolkit for phenotyping LLMs' behavior. We apply CogBench to 35 LLMs, yielding a rich and diverse dataset. We analyze this data using statistical multilevel modeling techniques, accounting for the nested dependencies among fine-tuned versions of specific LLMs. Our study highlights the crucial role of model size and reinforcement learning from human feedback (RLHF) in improving performance and aligning with human behavior. Interestingly, we find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance LLMs' behavior. Finally, we explore the effects of prompt-engineering techniques. We discover that chain-of-thought prompting improves probabilistic reasoning, while take-a-step-back prompting fosters model-based behaviors.
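
The snippet below is a minimal sketch, not the authors' released code, of the kind of multilevel analysis the abstract describes: a behavioral metric is regressed on model-level covariates, with a random intercept per base model so that fine-tuned variants of the same underlying LLM are treated as nested rather than independent observations. The synthetic data and the column names (base_model, log_params, rlhf, behavioral_metric) are illustrative assumptions, not taken from the paper.

# Sketch of a multilevel (mixed-effects) model with a random intercept per base model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_base, n_variants = 8, 4                                          # e.g. 8 base models, 4 fine-tunes each
base = np.repeat([f"base_{i}" for i in range(n_base)], n_variants)
log_params = np.repeat(rng.uniform(22, 27, n_base), n_variants)    # log(# parameters), hypothetical
rlhf = rng.integers(0, 2, n_base * n_variants)                     # fine-tuned with RLHF? (0/1)
base_effect = np.repeat(rng.normal(0, 0.05, n_base), n_variants)   # shared effect of the base model
metric = (0.02 * log_params + 0.10 * rlhf
          + base_effect + rng.normal(0, 0.03, n_base * n_variants))

df = pd.DataFrame({"base_model": base, "log_params": log_params,
                   "rlhf": rlhf, "behavioral_metric": metric})

# The random intercept grouped by base model captures the nested dependency
# among fine-tuned versions of a specific LLM.
fit = smf.mixedlm("behavioral_metric ~ log_params + rlhf",
                  data=df, groups=df["base_model"]).fit()
print(fit.summary())

The fixed-effect coefficients then estimate how model size and RLHF relate to the behavioral metric, while the group variance absorbs what all fine-tunes of the same base model have in common.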

Details

Language(s): -
Dates: 2024-02
Publication Status: Submitted
Pages: -
Publishing info: -
Table of Contents: -
Rev. Type: -
Identifiers: DOI: 10.48550/arXiv.2402.18225
Degree: -
