DiffusionDB
https://poloclub.github.io/diffusiondb
DiffusionDB is a large-scale text-to-image prompt dataset containing 14 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users. The prompts are primarily in English, but the dataset also includes other languages such as Spanish, Chinese, and Russian. DiffusionDB provides two subsets, DiffusionDB 2M and DiffusionDB Large, split into 2,000 and 14,000 folders, respectively. Each subset has a metadata table (metadata.parquet for DiffusionDB 2M and metadata-large.parquet for DiffusionDB Large) that gives access to the prompts and other image attributes without downloading all the Zip files. The tables are stored in Parquet, a columnar format, so individual columns can be queried efficiently without reading the entire table.