Please use this identifier to cite or link to this item:
https://olympias.lib.uoi.gr/jspui/handle/123456789/39749

Full metadata record

| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Πανάγος, Ιάσων - Ιωάννης | el |
| dc.date.accessioned | 2026-02-02T09:55:17Z | - |
| dc.date.available | 2026-02-02T09:55:17Z | - |
| dc.identifier.uri | https://olympias.lib.uoi.gr/jspui/handle/123456789/39749 | - |
| dc.rights | Attribution-NonCommercial-NoDerivs 3.0 United States | * |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/us/ | * |
| dc.subject | Speech recognition | en |
| dc.subject | Machine learning | en |
| dc.subject | Lip reading | en |
| dc.subject | Computer vision | en |
| dc.title | Lightweight Methods and Models for Practical Visual Speech Recognition from Video Sequences | en |
| dc.type | doctoralThesis | en |
| heal.type | doctoralThesis | el |
| heal.type.en | Doctoral thesis | en |
| heal.type.el | Διδακτορική διατριβή | el |
| heal.classification | Speech recognition | |
| heal.classification | Machine learning | |
| heal.classification | Lip reading | |
| heal.classification | Computer vision | |
| heal.dateAvailable | 2026-02-02T09:56:18Z | - |
| heal.language | en | el |
| heal.access | free | el |
| heal.recordProvider | Πανεπιστήμιο Ιωαννίνων. Πολυτεχνική Σχολή | el |
| heal.publicationDate | 2025 | - |
| heal.bibliographicCitation | Panagos, I. I. (2025). Lightweight Methods and Models for Practical Visual Speech Recognition from Video Sequences, [Doctoral dissertation], University of Ioannina. | en |
| heal.abstract | Visual Speech Recognition (VSR) is a computer vision problem that aims to decode spoken words of one or more speakers from visual media without the presence of sound. Applications of VSR are found in numerous domains, with profound impacts on various aspects of everyday life. A notable application lies in the field of accessibility in medicine, where a VSR system can assist individuals with speech impairments, significantly enhancing their quality of life. Other applications include, but are not limited to, video captioning and personal security systems, each with its own value. While there has recently been a steady increase in research interest regarding VSR, the issue of practicality has not been adequately explored. More specifically, the proposed models and methods often fail to consider the computational costs associated with their architectures, which severely limits or outright prevents their applicability in real-world scenarios. In this dissertation, we focus on addressing this oversight by developing lightweight and efficient end-to-end models for practical Visual Speech Recognition of isolated words. To realize this objective, we explore a multitude of approaches to reduce network size and complexity using a wide variety of methods. Owing to these reduced hardware requirements, such models can be applied to a broader range of applications and cover a sizable number of practical real-life scenarios, offering a series of benefits. The fundamental design of a VSR system follows a two-step structure that employs expensive components, such as deep convolutional neural networks, whose large hardware overheads make them prohibitively expensive to deploy. Our goal is to reduce these resource requirements while maintaining acceptable recognition rates. To that end, we first employ techniques that exploit efficient formulations and low-cost operations to shrink model sizes without severely compromising performance. We replace the standard, resource-intensive components in existing networks with more efficient ones, achieving significant reductions in model parameter counts as well as in computational complexity. Moreover, we design a lightweight temporal block blueprint that is flexible in its design and can be adapted to the resources at hand, and use it to develop highly efficient networks with minimal hardware demands. Next, we shift our attention to a more holistic approach by designing a lightweight VSR model using efficient components. A systematic study is conducted evaluating multiple networks and structures for visual feature extraction as well as sequence modeling. We select the best-performing components and combine them in a unified end-to-end architecture that achieves very high recognition accuracy while being compact, outperforming all other lightweight approaches in the literature. Finally, using this model as a baseline, we explore techniques to improve its performance without raising its complexity, attempting to bridge the gap with larger models. To that end, we incorporate channel attention in its temporal blocks to enhance feature representation, and we refine its training process by introducing regularization that allows the networks to learn more descriptive features from the data. We then combine these additions to achieve significant recognition uplifts without affecting the network overhead. | en |
| heal.advisorName | Νίκου, Χριστόφορος | el |
| heal.committeeMemberName | Νίκου, Χριστόφορος | el |
| heal.committeeMemberName | Σφήκας, Γεώργιος | el |
| heal.committeeMemberName | Κόντης, Λυσίμαχος | el |
| heal.committeeMemberName | Λύκας, Αριστείδης | el |
| heal.committeeMemberName | Μπλέκας, Κωνσταντίνος | el |
| heal.committeeMemberName | Κεσίδης, Αναστάσιος | el |
| heal.committeeMemberName | Κακογεωργίου, Ιωάννης | el |
| heal.academicPublisher | Πανεπιστήμιο Ιωαννίνων. Πολυτεχνική Σχολή. Τμήμα Μηχανικών Ηλεκτρονικών Υπολογιστών και Πληροφορικής | el |
| heal.academicPublisherID | uoi | el |
| heal.numberOfPages | 250 | el |
| heal.fullTextAvailability | true | - |
| Appears in Collections: | Διδακτορικές Διατριβές - ΜΗΥΠ | |
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| Δ.Δ. Πανάγος Ιάσων - Ιωάννης (2025).pdf | | 1.8 MB | Adobe PDF | View/Open |
This item is licensed under a Creative Commons License