SYSTEM FOR ACQUISITION AND PROCESSING OF DATA FROM HETEROGENEOUS DATA SOURCES AND ITS PERSISTENCE IN A DATA LAKE

Authors

  • Milorad Trninić, Author

DOI:

https://doi.org/10.24867/10BE39Trninic

Keywords:

Big Data, distributed information systems, ETL

Abstract

This paper presents a system for acquiring and processing data from heterogeneous data sources. The design of the system is motivated by the use of big data sets for training machine learning models, where the quality of the trained models is directly proportional to the volume and variety of the data. The system's components are extensible and scalable so that it can process big data sets with varied structures. All acquired data is persisted in a data lake with its structure unaltered, and data processing then transforms the acquired data to suit the client's needs. The implemented system is a proof of concept for the acquisition, persistence, and processing of big data sets, with the goal of preparing the data for training machine learning models.
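The split between unaltered persistence and later transformation can be illustrated with a minimal sketch. It is not the implementation described in the paper: it uses only the Python standard library, a temporary local directory stands in for the data lake, and the names `persist_raw`, `load_rows`, and `demo`, the two sample sources, and the `id`/`text` fields are all hypothetical.

```python
from __future__ import annotations

import csv
import io
import json
import tempfile
from pathlib import Path


def persist_raw(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store the payload byte-for-byte under raw/<source>/ (no parsing, no changes)."""
    target = lake_root / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def load_rows(path: Path) -> list[dict]:
    """Read one raw file into a list of dicts, choosing a parser by file suffix."""
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".json":
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    if path.suffix == ".csv":
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"unsupported format: {path.suffix}")


def demo() -> list[dict]:
    """Persist two differently structured payloads, then unify them for a client."""
    json_payload = b'[{"id": 1, "text": "hello"}]'
    csv_payload = b"id,text\n2,world\n"
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)  # assumption: a local folder stands in for the data lake
        json_path = persist_raw(lake, "api", "events.json", json_payload)
        csv_path = persist_raw(lake, "export", "events.csv", csv_payload)
        # The stored copies are identical to what was acquired.
        assert json_path.read_bytes() == json_payload
        assert csv_path.read_bytes() == csv_payload
        rows = load_rows(json_path) + load_rows(csv_path)
        # Processing step: normalize both sources to one schema.
        return [{"id": int(r["id"]), "text": str(r["text"])} for r in rows]


if __name__ == "__main__":
    print(demo())
```

Because the raw files are never modified, a different client could later apply a different transformation to the same stored data without repeating the acquisition.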

References

[1] Yoni Iny, “Upsolver - Technical Whitepaper: The Modern Data Lake Architecture”, 2019
[2] https://kafka.apache.org/documentation/ (accessed August 2020)
[3] https://spark.apache.org/ (accessed July 2020)
[4] Tom White, Hadoop: The Definitive Guide, Fourth Edition, O'Reilly Media, Inc., 2009
[5] Kristina Chodorow, Michael Dirolf, MongoDB: The Definitive Guide, O'Reilly Media, Inc., 2015
[6] https://spark.apache.org/docs/latest/ml-guide.html (accessed August 2020)
[7] https://en.wikipedia.org/wiki/Natural_language_processing#Common_NLP_Tasks (accessed September 2020)

Published

2020-11-05

Issue

Section

Electrotechnical and Computer Engineering