Speeding up PDF indexing in Alfresco 3.3

I work a lot with PDF files and have noticed that Alfresco is really slow at indexing them. It’s not Alfresco’s fault per se, but rather the fault of the underlying library (PDFBox), which extracts the text from PDF documents so that Lucene can index it. PDF is a format that sometimes makes it really hard to extract text correctly. Not only is the content often compressed, but PDF also descends from the PostScript language, and its drawing instructions can place text on the page in an order that is neither organized nor logical (for example, a PDF can output the text at the bottom of a page first and the text at the top afterwards). That’s why the whole problem is not trivial.

PDFBox, the library used by Alfresco, is written in Java, which slows the whole thing down even more. Fortunately, there are solutions that can speed it up noticeably.

The idea of speeding up the indexer is not mine. The original idea was described on the Think Alfresco blog, but unfortunately the example configuration published there does not work with more recent Alfresco versions. I have updated it, and it now works correctly with Alfresco 3.3 and newer.

The solution is quite easy: all we have to do is define a new transformer that uses Xpdf’s pdftotext executable to extract the text. While this may sound “hacky”, it is not. Other format transformers in Alfresco work in a very similar way (for example, ImageMagick is used for some image-related transformations). And it is important to stress that pdftotext is REALLY much faster than PDFBox.

For example, my “reference” large PDF document, which is 70 MB and 13700 pages long, is processed by pdftotext in 30 seconds, while PDFBox needs 20 minutes.

OK, so how do we do it? It is actually very easy: we undefine the existing PDFBox transformer Spring bean and define a new one that invokes the pdftotext executable. In practice, all you have to do is take the Spring XML file shown below, copy it to $ALFRESCO_HOME/tomcat/shared/classes/alfresco/extension/pdf-indexer-extract-content-context.xml and restart the Tomcat server.

[cc lang="xml"]
<!-- text extraction command: Linux, then Windows -->
${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext -enc UTF-8 ${source} ${target}
${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext.exe -enc UTF-8 ${source} ${target}

<!-- check command: Linux, then Windows -->
chmod 775 ${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux
cmd.exe /C dir

<!-- source and target MIME types -->
application/pdf text/plain

<!-- worker bean name and the interface it implements -->
transformer.worker.PdfToTextTool org.alfresco.repo.content.transform.ContentTransformerWorker
[/cc]
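To show how the values above could fit together, here is a minimal sketch of a worker bean definition. Treat it as an illustration under assumptions, not as the exact file: the class names (RuntimeExecutableContentTransformerWorker, RuntimeExec, ExplictTransformationDetails, ProxyContentTransformer), the property names and the "Linux" / "Windows.*" map keys are taken from my understanding of how Alfresco wires its other command-line transformers, and the bean id transformer.PdfToTextTool is a hypothetical name. The sketch does not cover how the existing PDFBox transformer is undefined, so compare everything with the content-services-context.xml shipped with your Alfresco version before using it.

[cc lang="xml"]
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<!-- Sketch only: bean layout and class names are assumptions, verify them for your Alfresco version -->
<beans>

  <!-- Worker that runs the external pdftotext executable -->
  <bean id="transformer.worker.PdfToTextTool"
        class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
    <property name="mimetypeService"><ref bean="mimetypeService"/></property>

    <!-- Run once at startup to decide whether the transformer is available -->
    <property name="checkCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec">
        <property name="commandMap">
          <map>
            <entry key="Linux"><value>chmod 775 ${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux</value></entry>
            <entry key="Windows.*"><value>cmd.exe /C dir</value></entry>
          </map>
        </property>
      </bean>
    </property>

    <!-- Run for every PDF; Alfresco fills in ${source} and ${target} -->
    <property name="transformCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec">
        <property name="commandMap">
          <map>
            <entry key="Linux"><value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext -enc UTF-8 ${source} ${target}</value></entry>
            <entry key="Windows.*"><value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext.exe -enc UTF-8 ${source} ${target}</value></entry>
          </map>
        </property>
      </bean>
    </property>

    <!-- Restrict the worker to PDF to plain text -->
    <property name="explicitTransformations">
      <list>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
          <property name="sourceMimetype"><value>application/pdf</value></property>
          <property name="targetMimetype"><value>text/plain</value></property>
        </bean>
      </list>
    </property>
  </bean>

  <!-- Hypothetical bean id: registers the worker as a content transformer -->
  <bean id="transformer.PdfToTextTool"
        class="org.alfresco.repo.content.transform.ProxyContentTransformer"
        parent="baseContentTransformer">
    <property name="worker"><ref bean="transformer.worker.PdfToTextTool"/></property>
  </bean>

</beans>
[/cc]

Note that the check command touches a file called pdftotext-linux while the transform command calls pdftotext, so make sure both paths point at the binary you actually placed in WEB-INF/bin.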



Good luck! Let me know if this was useful for you!