
Released

Preprint

Benchmarking large language models for bio-image analysis code generation

MPS-Authors

Scherf, Nico
Method and Development Group Neural Data Science and Statistical Computing, MPI for Human Cognitive and Brain Sciences, Max Planck Society;

Fulltext (public)

Haase_pre_v2.pdf
(Preprint), 3MB

Citation

Haase, R., Tischer, C., & Scherf, N. (2024). Benchmarking large language models for bio-image analysis code generation. bioRxiv. doi:10.1101/2024.04.19.590278.


Cite as: https://hdl.handle.net/21.11116/0000-000F-2FC0-4
Abstract
In the computational age, life scientists often have to write Python code to solve bio-image analysis (BIA) problems. However, many of them have not been formally trained in programming. Code generation, and coding assistance in general, with Large Language Models (LLMs) can therefore have a clear impact on BIA. To the best of our knowledge, the quality of the code generated in this domain has not yet been studied. We present a quantitative benchmark to estimate the capability of LLMs to generate code for solving common BIA tasks. Our benchmark currently consists of 57 human-written prompts with corresponding reference solutions in Python, together with unit tests to evaluate the functional correctness of candidate solutions. We demonstrate the benchmark and use it to compare six state-of-the-art LLMs. To ensure that we cover most of our community's needs, we also outline mid- and long-term strategies for maintaining and extending the benchmark with the help of the BIA open-source community. This work should support users in choosing an LLM and guide LLM developers in improving LLM capabilities in the BIA domain.
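
As an illustrative sketch only (not the benchmark's actual code), the following Python snippet shows how the functional correctness of an LLM-generated solution could be evaluated with a unit test against a hand-computed reference, as the abstract describes. All names here (PROMPT, CANDIDATE_CODE, label_foreground, run_candidate, test_label_foreground) are hypothetical and chosen for illustration.

import numpy as np

# Hypothetical benchmark case: a prompt, a candidate solution (e.g. returned by
# an LLM), and a unit test that checks functional correctness on a tiny image.
PROMPT = (
    "Write a function label_foreground(image, threshold) that returns a "
    "binary mask marking all pixels strictly above the threshold."
)

CANDIDATE_CODE = """
import numpy as np

def label_foreground(image, threshold):
    return np.asarray(image) > threshold
"""

def run_candidate(code):
    # Execute the candidate code in an isolated namespace and return it.
    # A real benchmark would sandbox this step.
    namespace = {}
    exec(code, namespace)
    return namespace

def test_label_foreground(namespace):
    # Unit test: the produced mask must match a hand-computed reference.
    func = namespace.get("label_foreground")
    if func is None:
        return False
    image = np.array([[0, 120], [200, 50]], dtype=np.uint8)
    expected = np.array([[False, True], [True, False]])
    try:
        result = func(image, threshold=100)
    except Exception:
        return False
    return np.array_equal(np.asarray(result), expected)

if __name__ == "__main__":
    ns = run_candidate(CANDIDATE_CODE)
    print("functionally correct:", test_label_foreground(ns))

Aggregating the pass/fail outcomes of such tests over many prompts would yield the kind of quantitative comparison between LLMs that the benchmark aims for.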