GLIGEN (Grounded-Language-to-Image Generation) is an approach that extends text-to-image generation models with additional grounding inputs: alongside the caption, the model can be conditioned on bounding boxes that specify where each described entity should appear. Rather than retraining from scratch, GLIGEN freezes the weights of a pre-trained diffusion model and inserts new trainable layers that fuse the grounding information into the network, giving the model a stronger grasp of spatial layout while preserving the concepts it already knows. With this setup, GLIGEN outperforms prior models at following layout instructions and can handle counterfactual prompts; it can also take reference images to control appearance and style in finer detail, and it can inpaint regions of an existing image according to provided bounding boxes. Overall, GLIGEN improves both the controllability and the versatility of text-to-image generation models.
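The core mechanism behind the trainable layers is a gated self-attention step: visual tokens attend jointly to the grounding tokens, and the result is blended back through a learnable tanh gate initialised at zero, so the frozen model's behaviour is untouched at the start of training. The sketch below is a minimal NumPy illustration of that idea, not the actual GLIGEN implementation; the single-head attention with identity projections and the token dimensions are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d):
    # Single-head scaled dot-product attention; identity Q/K/V
    # projections for brevity (a real layer learns these).
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores) @ tokens

def gated_grounding_layer(visual, grounding, gamma):
    # Concatenate visual and grounding tokens, attend over the joint
    # sequence, keep only the visual slots, and blend the result back
    # through a tanh gate. With gamma = 0 the layer is an identity,
    # which is how the frozen pre-trained model stays intact early on.
    d = visual.shape[-1]
    joint = np.concatenate([visual, grounding], axis=0)
    attended = self_attention(joint, d)[: visual.shape[0]]
    return visual + np.tanh(gamma) * attended

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))  # 4 visual tokens, dim 8 (toy sizes)
g = rng.normal(size=(2, 8))  # 2 grounding tokens (e.g. box + phrase embeddings)

out_closed = gated_grounding_layer(v, g, gamma=0.0)  # gate closed: identity
out_open = gated_grounding_layer(v, g, gamma=1.0)    # gate open: grounding flows in
```

As the gate parameter gamma is learned during training, the model gradually mixes in the grounded attention output, which is why GLIGEN can add spatial control without degrading the pre-trained model's generation quality.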