MICROSERVICE FOR TEXT EXTRACTION FROM WORD AND PDF DOCUMENTS
DOI:
https://doi.org/10.24867/13BE26Besic
Keywords:
Microservice, extraction, text, Word, PDF, conversion
Abstract
This paper describes a solution for extracting text from documents in Word and PDF format. The solution is implemented as a microservice architecture. In addition to the implementation itself, the libraries required for extraction and conversion are discussed. The paper also describes the structure of PDF documents, why they are in use, and the possible disadvantages of extracting text from this type of document. The problem of extracting text from PDF and Word documents is reduced to the problem of extracting text from PDF documents.
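The reduction described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `extract_text`, the two callables, and the set of file extensions are hypothetical and chosen only for illustration. In a real service the converter could be backed by a tool such as Gotenberg and the extractor by a library such as Apache PDFBox, but here they are plain stand-in functions.

```python
from pathlib import PurePath
from typing import Callable

# Hypothetical stand-ins (assumptions, not the paper's API):
# a converter turns Word bytes into PDF bytes, an extractor turns PDF bytes into text.
Converter = Callable[[bytes], bytes]
PdfExtractor = Callable[[bytes], str]

WORD_EXTENSIONS = {".doc", ".docx"}


def extract_text(filename: str, data: bytes,
                 convert_to_pdf: Converter,
                 extract_from_pdf: PdfExtractor) -> str:
    """Extract text by reducing every supported input to the PDF case."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in WORD_EXTENSIONS:
        # Word documents are first converted to PDF.
        data = convert_to_pdf(data)
    elif suffix != ".pdf":
        raise ValueError(f"unsupported document type: {suffix!r}")
    # Both formats end up on the same PDF extraction path.
    return extract_from_pdf(data)
```

With this shape, only the PDF extraction step has to handle text extraction, and supporting Word input amounts to adding a conversion step in front of it.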
References
[1] Servisno orijentisana arhitektura i integrisanje poslovnih aplikacija (Service-oriented architecture and business application integration), https://www2.masfak.ni.ac.rs/uploads/articles/www2_5._soa_skraceno.pdf
[2] Service-oriented architecture, IBM, https://www.ibm.com/cloud/learn/soa
[3] Microservice Architecture, https://microservices.io/patterns/microservices.html
[4] Apache POI, https://en.wikipedia.org/wiki/Apache_POI
[5] Gotenberg, https://thecodingmachine.github.io/gotenberg
[6] Apache PDFBox, https://en.wikipedia.org/wiki/Apache_PDFBox
Published
2021-07-04
Section
Electrotechnical and Computer Engineering