MIKROSERVIS ZA EKSTRAKCIJU TEKSTA IZ WORD I PDF DOKUMENATA

Dejan Bešić

doi:10.24867/13BE26Besic

Electrotechnical and Computer Engineering

Vol. 36 No. 07 (2021): Proceedings of the Faculty of Technical Sciences

MICROSERVICE FOR TEXT EXTRACTION FROM WORD AND PDF DOCUMENTS

13BE26Besic.pdf (Serbian)

DOI: https://doi.org/10.24867/13BE26Besic
Submitted: July 4, 2021
Published: July 4, 2021

Abstract

This paper describes a solution for extracting text from documents in Word and PDF format. The solution is built on a microservice architecture. In addition to the implementation itself, the paper discusses the libraries required for extraction and for conversion. It also describes the structure of PDF documents, why they are in use, and the possible disadvantages of extracting text from this type of document. The problem of extracting text from both PDF and Word documents is reduced to the problem of extracting text from PDF documents.
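The reduction described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the extension list, and the two stub stages are assumptions made for illustration. In a real service the conversion stage would call a document conversion tool and the extraction stage would call a PDF library.

```python
from pathlib import Path
from typing import Callable

# Hypothetical stage signatures (assumptions, not taken from the paper).
Converter = Callable[[bytes], bytes]   # Word bytes -> PDF bytes
PdfExtractor = Callable[[bytes], str]  # PDF bytes -> plain text

# Assumed set of Word extensions handled by the conversion stage.
WORD_EXTENSIONS = {".doc", ".docx"}


def extract_text(filename: str, data: bytes,
                 word_to_pdf: Converter, pdf_to_text: PdfExtractor) -> str:
    """Reduce every supported input to PDF, then run one PDF extractor."""
    suffix = Path(filename).suffix.lower()
    if suffix in WORD_EXTENSIONS:
        # Word input: convert first, so only one extraction path exists.
        data = word_to_pdf(data)
    elif suffix != ".pdf":
        raise ValueError(f"unsupported document type: {suffix or filename}")
    return pdf_to_text(data)


# Stand-in stages for illustration only; they do no real parsing.
def fake_word_to_pdf(data: bytes) -> bytes:
    return b"%PDF-stub:" + data


def fake_pdf_to_text(data: bytes) -> str:
    return data.decode("utf-8").replace("%PDF-stub:", "", 1)
```

With this shape, only the PDF extraction stage has to deal with the quirks of PDF structure, and supporting another input format means adding one more converter rather than a second extractor.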

References

[1] Service-oriented architecture and business application integration, https://www2.masfak.ni.ac.rs/uploads/articles/www2_5._soa_skraceno.pdf
[2] Service-oriented architecture, IBM, https://www.ibm.com/cloud/learn/soa
[3] Microservice Architecture, https://microservices.io/patterns/microservices.html
[4] Apache POI, https://en.wikipedia.org/wiki/Apache_POI
[5] Gotenberg, https://thecodingmachine.github.io/gotenberg
[6] Apache PDFBox, https://en.wikipedia.org/wiki/Apache_PDFBox