More Images, More Problems? A Controlled Analysis of VLM Failure Modes
Anurag Das,
Adrian Bulat,
Alberto Baldrati,
Ioannis Metaxas,
Bernt Schiele,
Georgios Tzimiropoulos,
Brais Martinez
arXiv, 2026
Although Large Vision–Language Models (LVLMs) excel on many tasks, their ability to reason over multiple images is poorly understood and insufficiently analyzed. We introduce MIMIC, a diagnostic benchmark that exposes fundamental multi-image failures in LVLMs, and propose complementary data-generation and attention-masking remedies that significantly improve cross-image reasoning and achieve state-of-the-art performance on multi-image benchmarks.
|
|
|
ClipTTT: CLIP-Guided Test-Time Training Helps LVLMs See Better
Mriganka Nath*,
Anurag Das*,
Jiahao Xie,
Bernt Schiele
arXiv, 2026
ClipTTT reduces hallucinations in large vision-language models when corrupted test images cause distribution shift and unreliable generation. It uses CLIP-guided test-time adaptation on a single sample to improve faithfulness under 15 common corruptions without modifying the base LVLM.
|
|
|
Do Instance Priors Help Weakly Supervised Semantic Segmentation?
Anurag Das*,
Anna Kukleva,
Xinting Hu,
Yuki Asano,
Bernt Schiele
arXiv, 2026
SeSAM enables semantic segmentation with weak annotations by adapting SAM through instance decomposition, skeleton-based point prompts, mask selection, and iterative pseudo-label refinement. Combined with semi-supervised learning, it improves segmentation performance across benchmarks while greatly reducing annotation cost compared with dense pixel-level supervision.
|
|
|
MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment
Anurag Das,
Xinting Hu,
Li Jiang,
Bernt Schiele
ECCV, 2024
We propose a framework based on mask-text alignment for improved semantic segmentation. We show that mask-text alignment works better than pixel-text alignment for dense semantic segmentation.
|
|
|
Improving 2D Feature Representations by 3D-Aware Fine-Tuning
Yuanwen Yue,
Anurag Das,
Francis Engelmann,
Siyu Tang,
Jan Eric Lenssen
ECCV, 2024
We propose 3D-aware fine-tuning of 2D foundation models, which improves performance on downstream tasks such as semantic segmentation and depth estimation.
|
|
|
Weakly-Supervised Domain Adaptive Semantic Segmentation with Prototypical Contrastive Learning
Anurag Das,
Yongqin Xian,
Dengxin Dai,
Bernt Schiele
CVPR, 2023
We propose a common framework that uses different weak labels from the target domain (e.g., image, point, and coarse labels) to reduce the performance gap between unsupervised domain adaptation (UDA) and supervised learning.
|
|
|
Urban Scene Semantic Segmentation with Low-Cost Coarse Annotation
Anurag Das,
Yongqin Xian,
Yang He,
Zeynep Akata,
Bernt Schiele
WACV, 2023
We propose to utilize cheaper coarse annotations for urban scene semantic segmentation. Coarse annotations lack fine boundary details but are faster to produce. Combining coarse annotations with comparatively cheap synthetic data, our method achieves performance competitive with fine annotation at a fraction of the annotation budget.
|
|
|
(SP)²Net for Generalized Zero-Label Semantic Segmentation
Anurag Das,
Yongqin Xian,
Yang He,
Zeynep Akata,
Bernt Schiele
GCPR, 2021 [Best Master Thesis Award in Germany]
Generalized Zero-shot Semantic Segmentation (GZSS) is a challenging problem because prior work does not generalize well to unseen classes. We propose to leverage a class-agnostic segmentation prior provided by superpixels and introduce a superpixel pooling (SP-pooling) module that improves performance on the GZSS task.
|
|
- Reviewer: CVPR23, ECCV23, PAMI, IJCV, TMLR, IEEE-TCVST
- 2021: DAGM Best Master Thesis Award 2021 in Germany
- 2021: Saarland Stipendium funded by DAAD
- 2019: Saarbrücken Graduate School of Computer Science Scholarship
- 2017: Institute Merit Scholarship (IIIT Allahabad)
- 2011: Qualified for Indian National Mathematical Olympiad (INMO)