SYSTEM FOR ACQUISITION AND PROCESSING OF DATA FROM HETEROGENEOUS DATA SOURCES AND ITS PERSISTENCE IN A DATA LAKE
DOI:
https://doi.org/10.24867/10BE39Trninic
Keywords:
Big Data, distributed information systems, ETL
Abstract
This paper presents a system for acquiring and processing data from heterogeneous data sources. The design is motivated by the use of big data sets in training machine learning models, where model quality generally improves with the volume and variety of the training data. The system’s components are extensible and scalable so that it can process big data sets with varying structures. All acquired data is persisted in a data lake with its structure unaltered, and a separate processing step transforms the acquired data to suit the client’s needs. The implemented system is a proof of concept for acquiring, persisting, and processing big data sets, with the goal of preparing data for training machine learning models.