Handling of schema evolution in machine learning pipelines

Paligiannis, Athanasios; Παληγιάννης, Αθανάσιος

Please use this identifier to cite or link to this item: https://olympias.lib.uoi.gr/jspui/handle/123456789/31842

Full metadata record

DC Field	Value	Language
dc.contributor.author	Paligiannis, Athanasios	en
dc.contributor.author	Παληγιάννης, Αθανάσιος	el
dc.date.accessioned	2022-07-06T09:36:30Z	-
dc.date.available	2022-07-06T09:36:30Z	-
dc.identifier.uri	https://olympias.lib.uoi.gr/jspui/handle/123456789/31842	-
dc.identifier.uri	http://dx.doi.org/10.26268/heal.uoi.11657	-
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/us/	*
dc.subject	Schema evolution	en
dc.subject	Machine learning	en
dc.subject	Databases	en
dc.subject	Software technology	en
dc.subject	Εξέλιξη σχήματος	el
dc.subject	Μηχανική μάθηση	el
dc.subject	Βάσεις δεδομένων	el
dc.subject	Τεχνολογία λογισμικού	el
dc.title	Handling of schema evolution in machine learning pipelines	en
dc.title	Διαχείριση εξέλιξης σχήματος για διοχετεύσεις μηχανικής μάθησης	el
heal.type	masterThesis	-
heal.type.en	Master thesis	en
heal.type.el	Μεταπτυχιακή εργασία	el
heal.classification	Machine learning	-
heal.dateAvailable	2022-07-06T09:37:30Z	-
heal.language	el	-
heal.access	free	-
heal.recordProvider	Πανεπιστήμιο Ιωαννίνων. Πολυτεχνική Σχολή. Τμήμα Μηχανικών Ηλεκτρονικών Υπολογιστών και Πληροφορικής	el
heal.publicationDate	2022	-
heal.bibliographicCitation	Bibliography: pp. 52-53	el
heal.abstract	The use of data analysis and machine learning has been growing rapidly, as both are widely used in industry. Machine learning pipelines are a recent trend in artificial intelligence; their purpose is to train a model automatically from a given input, via a succession of machine learning and data transformation steps. In this context, Apache Spark, with its MLlib library, is a powerful tool for data analysis that is used to train models via machine learning pipelines. The main idea of machine learning pipelines is to compose a sequence of steps, called stages, into a linear workflow that ultimately either trains or fits a model to the incoming data, and can later be deployed to work with more incoming data. These pipelines are constructed programmatically in a host language (in our case, Java) and allow the automated execution of the entire pipeline via a single program invocation. This thesis deals with machine learning pipelines from the viewpoint of software engineering and makes two contributions. First, we take the source code of the pipelines and provide an abstraction of it as main-memory structures. This is achieved by exploiting the Abstract Syntax Tree of the source code and extracting, out of the entire set of statements that constitute the source code, the statements that define stages and pipelines. This is facilitated by the Eclipse JDT API, which is designed for analyzing source code via its Abstract Syntax Tree. The outcome of this process is a main-memory representation of the pipeline as a graph, with stages as its nodes and input-output dependencies between subsequent stages as its edges. This also allows the automated graphical representation of all machine learning pipelines found in the source code of a given project.
Second, by exploiting this graph representation, we can assess the impact of schema evolution on a set of pipelines that all stem from the same data provider (e.g., a data file with structured records). Assuming that the structure of the records of a file changes, with certain attributes inserted and other attributes deleted with respect to its previous version, we exploit the dependencies that exist between the stages of a pipeline and the source data, and propagate the impact of the change over the graph. Moreover, we visualize the impact of the change in the graphical representation of the pipeline.	en
heal.abstract	In the Apache Spark programming environment, a machine learning pipeline is a workflow used to produce a machine learning model through a series of individual steps that are composed to yield the final result. This composition is done programmatically, through the relevant data types that the Spark environment provides. Owing to the popularity of Spark, these pipelines will become increasingly widely used in the future. The goal of this thesis does not concern the substance of machine learning, but rather the fact that the workflow is embedded programmatically within the source code (since a set of machine learning pipelines is defined and composed programmatically, in a way that does not necessarily follow any specific structure). Specifically, the goal of the thesis is to process source code files in order to obtain an abstract representation of them. The mechanism we introduce processes the files and leads to a representation as Abstract Syntax Trees, known from the field of compilers. Representing the code at a higher level of abstraction allows it to be handled as a graph, with the individual steps of the workflow as nodes and the data flows as edges.	el
heal.advisorName	Βασιλειάδης, Παναγιώτης	el
heal.committeeMemberName	Βασιλειάδης, Παναγιώτης	el
heal.committeeMemberName	Πιτουρά, Ευαγγελία	el
heal.committeeMemberName	Ζάρρας, Απόστολος	el
heal.academicPublisher	Πανεπιστήμιο Ιωαννίνων. Πολυτεχνική Σχολή. Τμήμα Μηχανικών Ηλεκτρονικών Υπολογιστών και Πληροφορικής	el
heal.academicPublisherID	uoi	-
heal.numberOfPages	53 pp.	-
heal.fullTextAvailability	true	-
Appears in Collections:	Διατριβές Μεταπτυχιακής Έρευνας (Masters) - ΜΗΥΠ
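
The impact assessment described in the English abstract can be illustrated with a minimal sketch. It is not the thesis implementation (which targets Spark pipelines written in Java); it is a plain Python model written under stated assumptions: the `Stage` class, the functions `build_edges` and `affected_stages`, and the stage and column names in the usage note are all hypothetical, and each stage is assumed to declare the columns it reads and writes.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One pipeline step (hypothetical model): reads input columns, writes output columns."""
    name: str
    inputs: list
    outputs: list


def build_edges(stages):
    """Link each stage to every later stage that consumes one of its outputs."""
    edges = []
    for i, producer in enumerate(stages):
        for consumer in stages[i + 1:]:
            if set(producer.outputs) & set(consumer.inputs):
                edges.append((producer.name, consumer.name))
    return edges


def affected_stages(stages, deleted_attributes):
    """Names of stages that depend, directly or transitively, on a deleted source attribute.

    Assumes `stages` is given in linear pipeline order, so one forward pass suffices.
    """
    broken = set(deleted_attributes)       # columns that can no longer be provided
    affected = []
    for stage in stages:
        if broken & set(stage.inputs):
            affected.append(stage.name)
            broken |= set(stage.outputs)   # outputs of a broken stage are lost as well
    return affected
```

For example, with stages `tokenizer` (text to words), `hashingTF` (words to features), `indexer` (category to label) and `lr` (features and label to prediction), deleting the source attribute `text` marks `tokenizer`, `hashingTF` and `lr` as affected and leaves `indexer` untouched. Inserted attributes are not propagated in this sketch, since a new column breaks no existing dependency.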

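The extraction step described in the abstracts (walking the Abstract Syntax Tree and keeping only the statements that define stages) can be sketched in the same spirit. The thesis uses the Eclipse JDT API on Java sources; the sketch below swaps in Python's standard `ast` module as a stand-in, applied to Python source text. The function `extract_stages` and the `STAGE_CLASSES` whitelist are hypothetical names chosen for illustration.

```python
import ast

# Hypothetical whitelist of constructor names that count as pipeline stages.
STAGE_CLASSES = {"Tokenizer", "HashingTF", "LogisticRegression"}


def extract_stages(source):
    """Collect (variable, class name, literal keyword arguments) for each stage definition.

    Only simple assignments of the form `name = StageClass(key=literal, ...)` are
    recognized; every other statement in the source is ignored.
    """
    found = []
    # Module-level statements are visited in source order.
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Assign) and isinstance(node.value, ast.Call)):
            continue
        callee = node.value.func
        target = node.targets[0]
        if not (isinstance(callee, ast.Name) and callee.id in STAGE_CLASSES):
            continue
        if not isinstance(target, ast.Name):
            continue
        kwargs = {
            kw.arg: kw.value.value
            for kw in node.value.keywords
            if kw.arg is not None and isinstance(kw.value, ast.Constant)
        }
        found.append((target.id, callee.id, kwargs))
    return found
```

The source is only parsed, never executed, so the stage classes do not need to be importable. The input and output column names recovered this way are the information needed to connect stages into a dependency graph.
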
Files in This Item:

File	Description	Size	Format
Μ.Ε. ΠΑΛΗΓΙΑΝΝΗΣ ΑΘΑΝΑΣΙΟΣ 2022.pdf		3.01 MB	Adobe PDF

This item is licensed under a Creative Commons License

Repository of UOI "Olympias"