Swetha Sirnam

I'm a graduating PhD student at the Center for Research in Computer Vision (CRCV), University of Central Florida, advised by Prof. Mubarak Shah. My research focuses on self-supervised and multi-modal learning, spanning MLLMs, spatial and temporal reasoning in VideoLLMs, visual preference alignment, and safety and bias analysis. Previously, I graduated with a Bachelor of Technology (B.Tech) with Honors and an MS by Research from the International Institute of Information Technology, Hyderabad (IIIT-H), where I was advised by Prof. C. V. Jawahar (Center for Visual Information Technology [CVIT], IIIT-H) and Prof. Vineeth N Balasubramanian (IIT Hyderabad).

Email / Scholar / Twitter / Github


Updates

  Oct'25: A paper accepted at ICCV 2025 as an Oral presentation (top 0.6% of papers)! 💥💥
  Jun'25: Co-organized the Workshop on VideoLLMs at CVPR 2025 💥
  Jun'25: Area Chair for the Workshop on VideoLLMs at CVPR 2025 💥
  Mar'25: Organized Challenges on VideoLLMs at CVPR 2025
  Sep'24: Third place in the Perception Challenge at ECCV 2024
  Jul'24: First-author paper "X-Former" accepted at ECCV 2024 💥
  May'24: Started summer internship at Amazon, Palo Alto, CA
  Jul'23: First-author paper "Multi-Sinkhorn Knopp" accepted at ICCV 2023 💥
  May'23: Started summer internship at Amazon, Palo Alto, CA
  May'22: Started summer internship at Amazon, Seattle, WA
  Jun'21: Oral presentation of the UDE paper at the CVPR L2ID 2021 workshop 💥
  May'21: First-author paper "UDE" accepted at ICIP 2021

Work Experience

PhD Applied Scientist Intern
Amazon, Palo Alto, California, USA. May 2024

  • Enhanced visual preference alignment to improve general MLLM understanding.
  • Proposed a Self-Supervised Flexible Multi-Group Preference Alignment framework without relying on external expert models or human annotators.
  • Our approach outperformed state-of-the-art methods.
PhD Applied Scientist Intern
Amazon, Palo Alto, California, USA. May 2023

  • Enhanced the fine-grained visual perception capabilities of MLLMs, outperforming BLIP-2 & Flamingo.
  • Performed large-scale multi-modal training with millions of samples.
  • Our X-Former achieved state-of-the-art performance on both fine-grained and general tasks; the paper was accepted at ECCV 2024.
PhD Applied Scientist Intern
Amazon, Seattle, Washington, USA. May 2022

  • Developed a detailed video description framework that employs the Video Swin Transformer for long-form videos.
  • Proposed spatio-temporally matching each textual concept to a visual concept for fine-grained representation learning.
  • Demonstrated improved performance on the captioning task with detailed error analysis.
Analyst
Goldman Sachs, Bengaluru, Karnataka, India.

  • Worked primarily with the Investment Banking team.
Intern
Goldman Sachs, Bengaluru, Karnataka, India.

  • Interned with the Investment Banking team.

Featured Research

My current research interests lie in self-supervised learning for video and image understanding, including multi-modal learning.

Safe-LLaVA: A Privacy-Preserving Vision-Language Dataset and Benchmark for Biometric Safety
Younggun Kim*, Sirnam Swetha*, Fazil Kagdi, Mubarak Shah (*equally contributing first authors)
arXiv, 2025
paper / data / arXiv / bibtex / code / project page

GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space
David G Shatwell, Ishan Rajendrakumar Dave, Sirnam Swetha, Mubarak Shah
ICCV (Oral), 2025
paper / supp / arXiv / bibtex / code / project page

ImplicitQA: Going Beyond Frames Towards Implicit Video Reasoning
Sirnam Swetha, Rohit Gupta, Parth Parag Kulkarni, David G Shatwell, Jeffrey A Chan Santiago, Nyle Siddiqui, Joseph Fioresi, Mubarak Shah
arXiv, 2025
paper / data / arXiv / bibtex / code / project page

TimeLogic: A Temporal Logic Benchmark for Video QA
Sirnam Swetha, Hilde Kuehne, Mubarak Shah
arXiv, 2025
paper / arXiv / bibtex / project page

The Telephone Game: Evaluating Semantic Drift in Unified Models
Sabbir Mollah, Rohit Gupta*, Sirnam Swetha*, Qingyang Liu^, Ahnaf Munir^, Mubarak Shah (*equal contribution)
arXiv, 2025
paper / data / arXiv / bibtex / code / project page

StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales
Nyle Siddiqui, Rohit Gupta, Sirnam Swetha, Mubarak Shah
arXiv, 2025
arXiv / bibtex

X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs
Sirnam Swetha, Jinyu Yang, Tal Neiman, Mamshad Nayeem Rizve, Son Tran, Benjamin Yao, Trishul Chilimbi, Mubarak Shah
ECCV, 2024
paper / supplement / arXiv / bibtex / project page

Multi-SK: Preserving Modality Structure Improves Multi-Modal Learning
Sirnam Swetha, Mamshad Nayeem Rizve, Nina Shvetsova, Hilde Kuehne, Mubarak Shah
ICCV, 2023
paper / supplement / arXiv / bibtex / code / project page

Unsupervised Discriminative Embedding for Action Learning in Complex Activities
Sirnam Swetha, Hilde Kuehne, Yogesh S Rawat, Mubarak Shah
ICIP (Oral), 2021
paper / arXiv / bibtex / project page

Sequence-to-Sequence Learning for Human Pose Correction in Videos
Sirnam Swetha, Vineeth N Balasubramanian, CV Jawahar
ACPR, 2017
paper / bibtex / project page

Efficient Object Annotation for Surveillance and Automotive Applications
Sirnam Swetha, Anand Mishra, Guruprasad M Hegde, CV Jawahar
WACVW, 2016
paper / bibtex / project page

Online Handwriting Recognition using Depth Sensors
Sirnam Swetha*, Rajat Aggarwal*, Anoop M Namboodiri, Jayanthi Sivaswamy, CV Jawahar (*equal contribution)
ICDAR, 2015
paper / bibtex / project page

Miscellanea

Micropapers

Wildlife Action Recognition using Deep Learning

Academic Service

Co-Organizer, CVPR Workshop on VideoLLMs 2025
Area Chair, CVPR Workshop on VideoLLMs 2025
Reviewer: CVPR, ICLR, NeurIPS, WACV

Feel free to steal this website's source code. Also, consider using Leonid Keselman's Jekyll fork of this page.