
Released

Preprint

Benchmarking large language models for bio-image analysis code generation

MPS-Authors

Scherf, Nico
Method and Development Group Neural Data Science and Statistical Computing, MPI for Human Cognitive and Brain Sciences, Max Planck Society;

Fulltext (public)

Haase_pre_v2.pdf
(Preprint), 3MB

Citation

Haase, R., Tischer, C., & Scherf, N. (2024). Benchmarking large language models for bio-image analysis code generation. bioRxiv. doi:10.1101/2024.04.19.590278.


Cite as: https://hdl.handle.net/21.11116/0000-000F-2FC0-4
Abstract
In the computational age, life scientists often have to write Python code to solve bio-image analysis (BIA) problems. However, many of them have not been formally trained in programming. Code generation, and coding assistance in general, with Large Language Models (LLMs) can therefore have a clear impact on BIA. To the best of our knowledge, the quality of the code generated in this domain has not yet been studied. We present a quantitative benchmark to estimate the capability of LLMs to generate code for solving common BIA tasks. Our benchmark currently consists of 57 human-written prompts with corresponding reference solutions in Python, together with unit tests to evaluate the functional correctness of candidate solutions. We demonstrate the benchmark and use it to compare six state-of-the-art LLMs. To ensure that we cover most of our community's needs, we also outline mid- and long-term strategies for maintaining and extending the benchmark with the help of the BIA open-source community. This work should support users in choosing an LLM and guide LLM developers in improving LLM capabilities in the BIA domain.
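
As an illustrative sketch only (not the benchmark's actual code), the following Python snippet shows how the functional correctness of an LLM-generated solution could be evaluated with a unit test against a hand-computed reference, as the abstract describes. All names here (PROMPT, CANDIDATE_CODE, label_foreground, run_candidate, test_label_foreground) are hypothetical and chosen for illustration.

import numpy as np

# Hypothetical benchmark case: a prompt, a candidate solution (e.g. returned by
# an LLM), and a unit test that checks functional correctness on a tiny image.
PROMPT = (
    "Write a function label_foreground(image, threshold) that returns a "
    "binary mask marking all pixels strictly above the threshold."
)

CANDIDATE_CODE = """
import numpy as np

def label_foreground(image, threshold):
    return np.asarray(image) > threshold
"""

def run_candidate(code):
    # Execute the candidate code in an isolated namespace and return it.
    # A real benchmark would sandbox this step.
    namespace = {}
    exec(code, namespace)
    return namespace

def test_label_foreground(namespace):
    # Unit test: the produced mask must match a hand-computed reference.
    func = namespace.get("label_foreground")
    if func is None:
        return False
    image = np.array([[0, 120], [200, 50]], dtype=np.uint8)
    expected = np.array([[False, True], [True, False]])
    try:
        result = func(image, threshold=100)
    except Exception:
        return False
    return np.array_equal(np.asarray(result), expected)

if __name__ == "__main__":
    ns = run_candidate(CANDIDATE_CODE)
    print("functionally correct:", test_label_foreground(ns))

Aggregating the pass/fail outcomes of such tests over many prompts would yield the kind of quantitative comparison between LLMs that the benchmark aims for.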