Benchmarking large language models for bio-image analysis code generation

Haase, Robert; Tischer, Christian; Scherf, Nico

doi:10.1101/2024.04.19.590278

Haase, R., Tischer, C., & Scherf, N. (2024). Benchmarking large language models for bio-image analysis code generation. bioRxiv. doi:10.1101/2024.04.19.590278.

Item status: Released

Basic data

Record permalink: https://hdl.handle.net/21.11116/0000-000F-2FC0-4
Version permalink: https://hdl.handle.net/21.11116/0000-000F-3AAD-E

Genre: Preprint

Files

Haase_pre_v2.pdf (Preprint), 3MB

File permalink:
https://hdl.handle.net/21.11116/0000-000F-3AAE-D

Name:
Haase_pre_v2.pdf

Description:
-

OA status:
Green

Visibility:
Public

MIME type / checksum:
application/pdf / [MD5]

Copyright date:
-

Copyright info:
-

License:
https://creativecommons.org/licenses/by/4.0/

Creators

Creators:
Haase, Robert, Author
Tischer, Christian, Author
Scherf, Nico¹, Author

Affiliations:
¹ Method and Development Group Neural Data Science and Statistical Computing, MPI for Human Cognitive and Brain Sciences, Max Planck Society, ou_3282987

Content

Keywords: -

Abstract: In the computational age, life-scientists often have to write Python code to solve bio-image analysis (BIA) problems. Many of them have not been formally trained in programming though. Code-generation, or coding assistance in general, with Large Language Models (LLMs) can have a clear impact on BIA. To the best of our knowledge, the quality of the generated code in this domain has not been studied. We present a quantitative benchmark to estimate the capability of LLMs to generate code for solving common BIA tasks. Our benchmark currently consists of 57 human-written prompts with corresponding reference solutions in Python, and unit-tests to evaluate functional correctness of potential solutions. We demonstrate our benchmark here and compare 6 state-of-the-art LLMs. To ensure that we will cover most of our community needs we also outline mid- and long-term strategies to maintain and extend the benchmark by the BIA open-source community. This work should support users in deciding for an LLM and also guide LLM developers in improving the capabilities of LLMs in the BIA domain.
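The benchmark structure named in the abstract (a human-written prompt, a reference solution in Python, and unit tests that judge functional correctness) can be illustrated with a minimal sketch. This is not the authors' harness: the names BenchmarkCase, passes, pass_rate, and the count_nonzero_pixels task are hypothetical and chosen only for illustration, and the pass rate shown is a plain fraction of passing samples rather than any specific metric from the preprint.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BenchmarkCase:
    """One hypothetical benchmark entry: prompt, reference solution, unit test."""
    prompt: str
    entry_point: str            # name of the function the prompt asks for
    reference_solution: str     # human-written Python source
    check: Callable[[Callable], None]  # raises AssertionError on failure


def passes(candidate_source: str, case: BenchmarkCase) -> bool:
    """Return True if the candidate defines the entry point and passes the test."""
    namespace: Dict[str, object] = {}
    try:
        # Assumption: generated code is trusted here. Real evaluations
        # should execute untrusted code in a sandbox.
        exec(candidate_source, namespace)
        case.check(namespace[case.entry_point])
        return True
    except Exception:
        # Syntax errors, missing functions, and failed assertions all count as failures.
        return False


def pass_rate(samples: List[str], case: BenchmarkCase) -> float:
    """Fraction of generated samples that are functionally correct."""
    if not samples:
        return 0.0
    return sum(passes(sample, case) for sample in samples) / len(samples)


def _check_count_nonzero(func: Callable) -> None:
    assert func([[0, 1], [2, 0]]) == 2
    assert func([[0, 0], [0, 0]]) == 0


example_case = BenchmarkCase(
    prompt=(
        "Write a function count_nonzero_pixels(image) that returns the number "
        "of non-zero pixels in a 2D image given as nested lists."
    ),
    entry_point="count_nonzero_pixels",
    reference_solution=(
        "def count_nonzero_pixels(image):\n"
        "    return sum(1 for row in image for value in row if value != 0)\n"
    ),
    check=_check_count_nonzero,
)

# A deliberately wrong candidate: it counts all pixels, not only non-zero ones.
wrong_candidate = (
    "def count_nonzero_pixels(image):\n"
    "    return sum(len(row) for row in image)\n"
)

if __name__ == "__main__":
    samples = [example_case.reference_solution, wrong_candidate]
    print(pass_rate(samples, example_case))
```

Running the reference solution through its own unit test, as done here, is a useful sanity check on a benchmark entry before any model output is scored against it.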

Details

Language(s): eng - English

Date: Published online: 2024-04-25

Publication status: Published online

Pages: -

Place, publisher, edition: -

Table of contents: -

Review type: -

Identifiers: DOI: 10.1101/2024.04.19.590278

Degree type: -

Source 1

Title: bioRxiv

Source genre: Website

Creators:

Affiliations:

Place, publisher, edition: -

Pages: - Volume / Issue: - Article number: - Start / end page: - Identifier: -
