Dates
Wednesday, June 26, 2024 - 02:00pm to Wednesday, June 26, 2024 - 03:00pm
Location
NCS 120
Event Description


Abstract:
Humans make eye movements when viewing static scenes. Their visual system does not process the entire scene but samples the most important regions with the high-resolution fovea to perceive and understand the objects and scenes of interest. Modeling human attention in natural scenes is important for understanding human behavior. Human attention modeling includes saliency prediction and scanpath prediction, which learn spatial and spatiotemporal human attention, respectively. Saliency prediction models static human attention by predicting the probability distribution of human gaze over an image, i.e., where human fixations fall as viewers naturally explore it. With the development of deep learning, saliency prediction has achieved strong performance. However, deep saliency models depend heavily on large volumes of data to capture human attention, yet collecting eye-tracking data is both time-consuming and expensive, especially in domains that are not widely explored. We focus on this limitation of saliency prediction and propose a novel way to address it. Research has demonstrated a high correlation between image descriptions and saliency, noting that described objects have an 87% likelihood of attaining higher values on saliency maps. Foundation models are trained on extensive image-text pairs and possess robust generalization capabilities. Therefore, leveraging foundation models to generate image captions and localize the described objects in the image can serve as external knowledge for saliency prediction, alleviating the need for eye-tracking data collection. We generate image descriptions containing human attention information from LLaVA, utilize the cross-attention maps of the diffusion model to map the nouns and verbs in the descriptions to each pixel in the image, and propose a novel structure, AttnSal, to integrate this external knowledge into saliency prediction. We show that, when conditioned on image descriptions generated by LLaVA, the cross-attention maps of Stable Diffusion, in which pixels respond more strongly to the words that describe them, exhibit a notable similarity to saliency. Our saliency prediction model AttnSal significantly improves saliency prediction when training data is limited, e.g., with only 1% of SALICON, improving KL (Kullback-Leibler) divergence by 8.6%. Our work opens new opportunities for utilizing the perception and generalization abilities of foundation models in human attention understanding.
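To make the pipeline concrete, below is a minimal sketch, not the AttnSal implementation, of the two ideas the abstract describes: averaging hypothetical per-word cross-attention maps (assumed to be extracted from a text-to-image diffusion model conditioned on a LLaVA-style caption) into a pseudo-saliency prior, and scoring a saliency map with the KL divergence metric. The function names, map shapes, and random stand-in inputs are illustrative assumptions only.

```python
# Sketch under stated assumptions: word-level cross-attention maps are given
# as (H, W) tensors; real inputs would come from a diffusion model's
# cross-attention layers and from eye-tracking annotations.
import torch


def attention_prior(word_attn_maps, content_words):
    """Average the cross-attention maps of content words (nouns/verbs)
    into a single pseudo-saliency prior."""
    maps = [word_attn_maps[w] for w in content_words if w in word_attn_maps]
    prior = torch.stack(maps).mean(dim=0)
    # Normalize to a probability distribution over pixels, like a saliency map.
    prior = prior - prior.min()
    return prior / prior.sum().clamp_min(1e-8)


def kl_divergence(pred, gt, eps=1e-8):
    """KL(gt || pred) between normalized saliency maps; lower is better."""
    pred = pred / pred.sum().clamp_min(eps)
    gt = gt / gt.sum().clamp_min(eps)
    return (gt * (torch.log(gt + eps) - torch.log(pred + eps))).sum()


# Toy usage with random stand-in maps (hypothetical caption: "a dog running").
H, W = 64, 64
word_attn_maps = {
    "dog": torch.rand(H, W),
    "running": torch.rand(H, W),
    "the": torch.rand(H, W),  # function words would be ignored below
}
prior = attention_prior(word_attn_maps, content_words=["dog", "running"])
gt_saliency = torch.rand(H, W)  # stand-in for a ground-truth fixation map
print(float(kl_divergence(prior, gt_saliency)))
```

In the actual work, such an attention-derived prior is integrated into the saliency model as external knowledge rather than used directly as the prediction; the sketch only illustrates why caption-conditioned cross-attention can resemble a saliency map.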

Event Title
Ph.D. Research Proficiency Presentation: 'Data-Efficient Saliency Prediction Leveraging Stable Diffusion Attention,' Ruoyu Xue