Post-Retrieval Semantic Re-Ranking via Zero-shot LLMs for Segmentation-Free Document Image Keyword Spotting

Papazis, Stergios; Παπάζης, Στέργιος

Please use this identifier to cite or link to this item: https://olympias.lib.uoi.gr/jspui/handle/123456789/39237

Full metadata record

DC Field	Value	Language
dc.contributor.author	Papazis, Stergios	en
dc.contributor.author	Παπάζης, Στέργιος	el
dc.date.accessioned	2025-07-29T07:15:06Z	-
dc.date.available	2025-07-29T07:15:06Z	-
dc.identifier.uri	https://olympias.lib.uoi.gr/jspui/handle/123456789/39237	-
dc.identifier.uri	http://dx.doi.org/10.26268/heal.uoi.18931	-
dc.rights	Attribution-NonCommercial 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc/3.0/us/	*
dc.subject	Computer vision	en
dc.subject	Keyword spotting	en
dc.subject	Handwritten document analysis	en
dc.subject	Segmentation-free retrieval	en
dc.subject	Vision-language models	en
dc.subject	Re-ranking	en
dc.subject	Semantic relevance feedback	en
dc.subject	Semantic embeddings	en
dc.subject	Zero-shot learning	en
dc.title	Post-Retrieval Semantic Re-Ranking via Zero-shot LLMs for Segmentation-Free Document Image Keyword Spotting	en
dc.title	Semantic Re-Ranking of Segmentation-Free Word Spotting Results in Handwritten Document Images with Large Language Models	el
dc.type	masterThesis	en
heal.type	masterThesis	el
heal.type.en	Master thesis	en
heal.type.el	Master's thesis	el
heal.classification	Computer Vision	en
heal.classification	Content-based Visual Information Retrieval	en
heal.classification	Handwritten Document Analysis	en
heal.classification	Keyword Spotting	en
heal.dateAvailable	2025-07-29T07:16:06Z	-
heal.language	en	el
heal.access	free	el
heal.recordProvider	University of Ioannina. School of Engineering	el
heal.publicationDate	2025-07-20	-
heal.bibliographicCitation	S. Papazis, "Post-Retrieval Semantic Re-Ranking via Zero-shot LLMs for Segmentation-Free Document Image Keyword Spotting", M.S. thesis, Dept. of Computer Science and Engineering, Univ. of Ioannina, 2025.	en
heal.abstract	The digitization of historical handwritten documents plays a crucial role in their preservation. With preservation largely addressed, the focus has shifted toward improving accessibility. This has driven the development of Keyword Spotting (KWS), a Content-Based Image Retrieval (CBIR) task that retrieves word images and ranks them by their similarity to a given query, without requiring prior transcription. Traditional KWS methods treat words as visual patterns, relying solely on appearance-based features and often neglecting their semantic content. Even when semantics are considered, this is usually done in the segmentation-based setting, which assumes prior word-level segmentation, a non-trivial and error-prone requirement, particularly for historical manuscripts. To address these limitations, we propose a novel unsupervised semantic relevance feedback mechanism that re-ranks the initial output of segmentation-free KWS systems. Our approach operates in three stages: (1) decoding the retrieved word images into text with a neural decoder; (2) projecting the transcriptions into a semantic space with pre-trained transformer-based language models such as RoBERTa, MPNet, and MiniLM, where semantic similarity is measured by cosine similarity; and (3) re-ranking the retrieved items by combining their visual and semantic similarity. We evaluate our method on the widely used historical George Washington (GW) dataset and the modern IAM Handwriting Database (IAM), using the ranked retrieval lists of two cutting-edge segmentation-free KWS baseline models. We further assess performance across two decoder architectures and two naive fusion strategies in an extensive ablation study. Numerical results show consistent improvements in Mean Average Precision (mAP) across all tested configurations, with gains of up to +2.3 percentage points (from 94.31% to 96.59%) on GW and +3 percentage points (from 79.15% to 82.12%) on IAM.
Notably, even in configurations with minimal mAP improvement, we observe significant qualitative gains: semantically relevant but inexact matches are retrieved more frequently. This behavior, known as semantic KWS, is particularly beneficial in real-world scenarios where users may not know beforehand the precise query needed to locate relevant content. These findings demonstrate the effectiveness of incorporating semantic feedback from large language models into visual keyword spotting pipelines. By complementing appearance-based retrieval with NLP-driven semantic re-ranking, our approach enables more flexible and meaningful document search, even in challenging segmentation-free settings. Moreover, this work highlights the potential of hybrid vision-language models to advance document image analysis, especially for noisy, heterogeneous, or low-resource historical archives.	en
heal.advisorName	Nikou, Christophoros	en
heal.committeeMemberName	Lykas, Aristeidis	en
heal.committeeMemberName	Blekas, Konstantinos	en
heal.academicPublisher	University of Ioannina. School of Engineering. Department of Computer Science and Engineering	el
heal.academicPublisherID	uoi	el
heal.numberOfPages	79	el
heal.fullTextAvailability	true	-
Appears in Collections:	Postgraduate Research Theses (Masters) - ΜΗΥΠ
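The three-stage re-ranking summarized in the abstract (decode, embed, fuse visual and semantic similarity), and the mAP metric used to evaluate it, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the neural decoder and the transformer encoders (RoBERTa, MPNet, MiniLM) are replaced by hand-made toy embedding vectors; semantic similarity is assumed to be the cosine similarity between the query embedding and each transcription embedding; and the fusion rule (min-max normalization followed by a weighted sum with a hypothetical weight `alpha`) is an assumed stand-in for the two naive fusion strategies evaluated in the thesis. The names `Hit`, `rerank`, `min_max`, and `average_precision` are hypothetical.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Hit:
    """One retrieved region from the first-stage (visual) KWS ranked list."""
    region_id: str
    visual_score: float      # similarity from the KWS model; higher is better
    embedding: list[float]   # toy stand-in for the embedding of the decoded transcription


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def min_max(xs: list[float]) -> list[float]:
    """Rescale scores to [0, 1] so visual and semantic scores are comparable."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [1.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def rerank(query_emb: list[float], hits: list[Hit], alpha: float = 0.5) -> list[Hit]:
    """Re-sort hits by alpha * visual + (1 - alpha) * semantic (assumed fusion rule)."""
    if not hits:
        return []
    vis = min_max([h.visual_score for h in hits])
    sem = min_max([cosine_similarity(query_emb, h.embedding) for h in hits])
    fused = [alpha * v + (1.0 - alpha) * s for v, s in zip(vis, sem)]
    order = sorted(range(len(hits)), key=lambda i: fused[i], reverse=True)
    return [hits[i] for i in order]


def average_precision(ranked_ids: list[str], relevant: set[str]) -> float:
    """AP of one ranked list; mAP is the mean of this value over all queries."""
    if not relevant:
        return 0.0
    found, total = 0, 0.0
    for rank, rid in enumerate(ranked_ids, start=1):
        if rid in relevant:
            found += 1
            total += found / rank
    return total / len(relevant)


# Toy example with hypothetical 3-d embeddings; "r2" is the only relevant region.
query_emb = [1.0, 0.0, 0.0]
hits = [
    Hit("r1", 0.90, [0.0, 1.0, 0.0]),  # visually closest, semantically unrelated
    Hit("r2", 0.85, [0.9, 0.1, 0.0]),  # semantically close to the query
    Hit("r3", 0.40, [0.0, 0.0, 1.0]),
]
baseline = [h.region_id for h in hits]
reranked = [h.region_id for h in rerank(query_emb, hits, alpha=0.5)]
print(reranked)                              # ['r2', 'r1', 'r3']
print(average_precision(baseline, {"r2"}))   # 0.5
print(average_precision(reranked, {"r2"}))   # 1.0
```

Setting `alpha = 1.0` reproduces the original visual ranking, while smaller values give the semantic score more weight. Min-max normalization is used here only because raw KWS scores and cosine similarities live on different scales; the thesis itself may normalize or fuse the two signals differently.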


Files in This Item:

File	Description	Size	Format
MSc Papazis 2025.pdf	Master thesis	3.55 MB	Adobe PDF


This item is licensed under a Creative Commons License

Repository of UOI "Olympias"