WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models
2510.22276v1
cs.CV, cs.CL
2025-10-29
Authors:
Issa Sugiura, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Yasuo Okabe, Naoaki Okazaki
Abstract
Large-scale and high-quality image-text pair datasets play an important role
in developing high-performing Vision-Language Models (VLMs). In this work, we
introduce WAON, a large-scale and high-quality Japanese image-text pair dataset
containing approximately 155 million examples, collected from Common Crawl. Our
dataset construction pipeline employs various techniques, including filtering
and deduplication, which have been shown to be effective in previous studies.
To evaluate its effectiveness, we also construct WAON-Bench, a manually curated
benchmark for Japanese cultural image classification, consisting of 374
classes. We then conduct experiments
using both WAON and the Japanese subset of ReLAION, one of the most widely used
vision-language datasets. We fine-tune SigLIP2, a strong multilingual model, on
both datasets. The results demonstrate that WAON enhances model performance on
WAON-Bench more efficiently than ReLAION and achieves higher accuracy across
all evaluated benchmarks. Furthermore, the model fine-tuned on WAON achieves
state-of-the-art performance on several Japanese cultural benchmarks. We
release our dataset, model, and code at https://speed1313.github.io/WAON.