BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
2510.20095v2
cs.CV, cs.CL, cs.LG
2025-10-27
Авторы:
Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G. Campolongo, Matthew J. Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu
Abstract
This work investigates descriptive captions as an additional source of
supervision for biological multimodal foundation models. Images and captions
can be viewed as complementary samples from the latent morphospace of a
species, each capturing certain biological traits. Incorporating captions
during training encourages alignment with this shared latent structure,
emphasizing potentially diagnostic characters while suppressing spurious
correlations. The main challenge, however, lies in obtaining faithful,
instance-specific captions at scale. This requirement has limited the
utilization of natural language supervision in organismal biology compared with
many other scientific domains. We complement this gap by generating synthetic
captions with multimodal large language models (MLLMs), guided by
Wikipedia-derived visual information and taxon-tailored format examples. These
domain-specific contexts help reduce hallucination and yield accurate,
instance-based descriptive captions. Using these captions, we train BioCAP
(i.e., BioCLIP with Captions), a biological foundation model that captures rich
semantics and achieves strong performance in species classification and
text-image retrieval. These results demonstrate the value of descriptive
captions beyond labels in bridging biological images with multimodal foundation
models.
Ссылки и действия
Дополнительные ресурсы: