Dates
Wednesday, June 26, 2024 - 02:00pm to Wednesday, June 26, 2024 - 03:00pm
Location
NCS 120
Event Description


Abstract:
Humans make eye movements when viewing static scenes: rather than processing the entire scene at once, the visual system samples the most important regions with the high-resolution fovea to perceive and understand the objects and scenes of interest. Modeling human attention in natural scenes is therefore important for understanding human behavior. Human attention modeling includes saliency prediction and scanpath prediction, which learn spatial and spatiotemporal human attention, respectively. Saliency prediction models static human attention by predicting the probability distribution map over human gaze, i.e., where human fixations fall when people naturally explore an image. With the development of deep learning, saliency prediction has achieved strong performance. However, deep saliency models depend heavily on large volumes of data to capture human attention, and collecting eye-tracking data is both time-consuming and expensive, especially in domains that are not widely explored. We focus on this limitation of saliency prediction and propose a novel way to address it. Prior research has demonstrated a high correlation between image descriptions and saliency, noting that described objects have an 87% likelihood of attaining higher values on saliency maps. Foundation models are trained on extensive image-text pairs and possess robust generalization capabilities. Leveraging foundation models to obtain image captions and localize the described objects in an image can therefore serve as external knowledge for saliency prediction, alleviating the need for eye-tracking data collection. We generate image descriptions containing human attention information with LLaVA, use the cross-attention maps of a diffusion model to map the nouns and verbs in the descriptions to each pixel in the image, and propose a novel architecture, AttnSal, to integrate this external knowledge into saliency prediction. We show that, conditioned on image descriptions generated by LLaVA, the cross-attention maps of Stable Diffusion, in which pixels respond more strongly to the words that describe them, exhibit a notable similarity to saliency maps. Our saliency prediction model AttnSal significantly improves saliency prediction when training data is limited, i.e., with 1% of the SALICON dataset, improving KL (Kullback-Leibler) divergence by 8.6%. Our work opens new opportunities for utilizing the perception and generalization abilities of foundation models for human attention understanding.
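As supplementary context for the caption-generation step described above, the following is a minimal sketch of producing an attention-oriented image description with LLaVA through the Hugging Face transformers library. The checkpoint name, prompt wording, and image path are illustrative assumptions, not the exact setup used in this work.

from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Illustrative checkpoint; other LLaVA checkpoints on Hugging Face follow the same pattern.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# "scene.jpg" is a placeholder image; the prompt asks for the salient objects and actions.
image = Image.open("scene.jpg")
prompt = "USER: <image>\nDescribe the most noticeable objects and actions in this image. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=100)
caption = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(caption)  # the nouns and verbs in this caption are what get grounded to pixels via cross-attention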
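The KL (Kullback-Leibler) divergence reported above is the standard saliency-evaluation formulation, comparing the ground-truth fixation density map with the predicted saliency map after both are normalized into probability distributions. Below is a generic NumPy sketch of that metric, not the exact evaluation code from the work; the epsilon value is an assumed numerical stabilizer.

import numpy as np

def kl_divergence(pred, gt, eps=1e-7):
    # Normalize the predicted saliency map and the ground-truth fixation
    # density map so that each sums to 1 (valid probability distributions).
    pred = pred / (pred.sum() + eps)
    gt = gt / (gt.sum() + eps)
    # KL(gt || pred): heavily penalizes predictions that assign little mass
    # to regions people actually fixate. Lower is better.
    return np.sum(gt * np.log(eps + gt / (pred + eps)))

# Usage: pred and gt are 2-D arrays of the same spatial size (e.g., the image resolution).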

Event Title
Ph.D. Research Proficiency Presentation: 'Data-Efficient Saliency Prediction Leveraging Stable Diffusion Attention,' Ruoyu Xue