SISTEM ZA OBUHVAT I OBRADU PODATAKA IZ HETEROGENIH IZVORA PODATAKA I NJIHOVO SKLADIŠTENJE U JEZERU PODATAKA

Milorad Trninić

doi:10.24867/10BE39Trninic

Electrotechnical and Computer Engineering

Vol. 35 No. 11 (2020): Proceedings of the Faculty of Technical Sciences

SYSTEM FOR ACQUISITION AND PROCESSING OF DATA FROM HETEROGENEOUS DATA SOURCES AND ITS PERSISTENCE IN A DATA LAKE

10BE39Trninic.pdf (Serbian)

DOI: https://doi.org/10.24867/10BE39Trninic
Submitted: November 5, 2020
Published: November 5, 2020

Abstract

This paper presents a system for the acquisition and processing of data from heterogeneous data sources. The design of the system is motivated by the use of big data sets in the training of machine learning models, since the quality of trained models depends strongly on the volume and variety of the data. The system's components are extensible and scalable in order to meet the needs of processing big data sets with varied structures. All acquired data is persisted in a data lake with its structure unaltered. Data processing then transforms the acquired data to suit the client’s needs. The system implemented in this paper is a proof of concept for the acquisition, persistence, and processing of big data sets, with the goal of preparing the data for training machine learning models.
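
The flow described in the abstract (acquire from differently structured sources, persist the raw payloads unaltered, then transform them for the client) can be illustrated with a minimal sketch. This is not the paper's implementation: it uses only the Python standard library, a local directory stands in for the data lake, and the names `csv_source`, `json_source`, `SOURCES`, `acquire`, and `process`, as well as the sample records and target schema, are hypothetical.

```python
import csv
import io
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, List


# Hypothetical source adapters: each yields raw payloads exactly as received.
def csv_source() -> Iterator[bytes]:
    yield b"id,text\n1,hello world\n"


def json_source() -> Iterator[bytes]:
    yield b'{"id": 2, "text": "data lake"}'


# Supporting a new source means registering one more adapter here.
SOURCES: Dict[str, Callable[[], Iterable[bytes]]] = {
    "csv": csv_source,
    "json": json_source,
}


def acquire(lake: Path) -> int:
    """Persist every payload unaltered, partitioned by source name."""
    count = 0
    for name, source in SOURCES.items():
        target = lake / name
        target.mkdir(parents=True, exist_ok=True)
        for i, payload in enumerate(source()):
            (target / f"{i:06d}.raw").write_bytes(payload)
            count += 1
    return count


def process(lake: Path) -> List[dict]:
    """Read raw payloads and normalise them to one assumed client schema."""
    rows: List[dict] = []
    for path in sorted(lake.glob("*/*.raw")):
        raw = path.read_bytes().decode("utf-8")
        if path.parent.name == "json":
            record = json.loads(raw)
            rows.append({"id": int(record["id"]), "text": record["text"]})
        elif path.parent.name == "csv":
            for rec in csv.DictReader(io.StringIO(raw)):
                rows.append({"id": int(rec["id"]), "text": rec["text"]})
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lake_dir = Path(tmp)
        print(acquire(lake_dir), "payloads persisted")
        print(process(lake_dir))
```

Keeping the raw files untouched means the processing step can be rerun later with a different target schema without acquiring the data again.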

References

[1] Yoni Iny, “Upsolver - Technical Whitepaper: The Modern Data Lake Architecture”, 2019
[2] https://kafka.apache.org/documentation/ (accessed August 2020)
[3] https://spark.apache.org/ (accessed July 2020)
[4] Tom White, Hadoop: The Definitive Guide, Fourth Edition, O'Reilly Media, Inc., 2015
[5] Kristina Chodorow, Michael Dirolf, MongoDB: The Definitive Guide, O'Reilly Media, Inc., 2015
[6] https://spark.apache.org/docs/latest/ml-guide.html (accessed August 2020)
[7] https://en.wikipedia.org/wiki/Natural_language_processing#Common_NLP_Tasks (accessed September 2020)