Stamper is ready!

Stamper is ready. Finally 🙂 I would like to announce that the first version of an Alfresco extension for watermarking and securing PDF documents is available!

Stamper is a fairly extensive tool for Alfresco that simplifies watermarking PDF documents. Practically any kind of watermark is supported: not only static watermarks (JPG, PNG and other images) but also dynamic ones, called layers. Layers are SVG files that can change each time you view a document, because they contain control sequences (variables) that are replaced by their real values at viewing time.

Stamper is not only about watermarking. It can also sign PDF documents and generate electronic signature pages (even with scanned signatures!).

And of course Stamper is nicely integrated with Alfresco: you can call it from the Alfresco Java API or JavaScript API, or simply use it from the Alfresco Explorer GUI.

CMIS support for Stamper, as well as a Nuxeo version of Stamper, is on its way!

You can find out more about it on Stamper’s website: stamper.metasys.pl

Speeding up PDF indexing in Alfresco 3.3

I work a lot with PDF files and have noticed that Alfresco is really slow at indexing them. It’s not Alfresco’s fault per se but rather the fault of the underlying library (PDFBox), which extracts the text from PDF documents that Lucene then indexes. PDF is a format that sometimes makes it really hard to extract text correctly. Not only is the content often compressed, but PDF page content is also described with drawing operators derived from PostScript, and those operators can place text in an order that has nothing to do with reading order (for example, a PDF can output the text at the bottom of a page first and the text at the top afterwards). That’s why the whole problem is not trivial.

The PDFBox library used by Alfresco is written in Java, which slows the whole thing down even more. Fortunately, there are solutions that can speed it up noticeably.

The idea of speeding up the indexer is not mine. It was originally described on the Think Alfresco blog, but unfortunately the example configuration published there does not work with more recent Alfresco versions. I have updated it, and it now works correctly with Alfresco 3.3 and newer.

The solution is quite easy: all we have to do is define a new transformer that uses xpdf’s pdftotext executable to extract the text. While this may sound “hacky”, it is not. Other format transformers in Alfresco work in a very similar way (for example, ImageMagick is used for some image-related transformations). And it is important to stress that pdftotext is MUCH faster than PDFBox.

For example, my “reference” large PDF document, a 70 MB file with 13,700 pages, is processed by pdftotext in 30 seconds, while PDFBox needs 20 minutes. That is roughly a 40-fold difference.

OK, so how do we do it? It is actually very easy: we undefine the existing PDFBox transformer Spring bean and define a new one that invokes the pdftotext executable. In practice, all you have to do is take the Spring XML file shown below, copy it to $ALFRESCO_HOME/tomcat/shared/classes/alfresco/extension/pdf-indexer-extract-content-context.xml and restart the Tomcat server.

[cc lang="xml"]









${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext -enc UTF-8 ${source} ${target}


${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext.exe -enc UTF-8 ${source} ${target}


chmod 775 ${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux



cmd.exe /C dir

application/pdf text/plain

transformer.worker.PdfToTextTool org.alfresco.repo.content.transform.ContentTransformerWorker



application/pdf


[/cc]
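
For orientation, here is a minimal sketch of how the values in the listing above might be wired together in a Spring context file. It is not the exact file. The class names, the property names (commandMap, explicitTransformations, worker), the operating system keys "Linux" and "Windows.*" and the bean id transformer.PdfToTextTool are assumptions based on how the Alfresco 3.x transformer framework is usually configured, so check them against your Alfresco version. The sketch covers only the pdftotext transformer, not the part that undefines the stock PDFBox bean.

[cc lang="xml"]
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>

   <!-- Worker that shells out to pdftotext (class and property names are assumptions) -->
   <bean id="transformer.worker.PdfToTextTool"
         class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
      <property name="mimetypeService">
         <ref bean="mimetypeService"/>
      </property>

      <!-- Command that performs the transformation; the OS keys are assumed -->
      <property name="transformCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
               <map>
                  <entry key="Linux">
                     <value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext -enc UTF-8 ${source} ${target}</value>
                  </entry>
                  <entry key="Windows.*">
                     <value>${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext.exe -enc UTF-8 ${source} ${target}</value>
                  </entry>
               </map>
            </property>
         </bean>
      </property>

      <!-- Command used to check that the transformer is usable -->
      <property name="checkCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
               <map>
                  <entry key="Linux">
                     <!-- the listing names this file pdftotext-linux; use the name of your binary -->
                     <value>chmod 775 ${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux</value>
                  </entry>
                  <entry key="Windows.*">
                     <value>cmd.exe /C dir</value>
                  </entry>
               </map>
            </property>
         </bean>
      </property>

      <!-- Register this worker explicitly for PDF to plain text -->
      <property name="explicitTransformations">
         <list>
            <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
               <property name="sourceMimetype"><value>application/pdf</value></property>
               <property name="targetMimetype"><value>text/plain</value></property>
            </bean>
         </list>
      </property>
   </bean>

   <!-- Transformer bean that exposes the worker to Alfresco (bean id is hypothetical) -->
   <bean id="transformer.PdfToTextTool"
         class="org.alfresco.repo.content.transform.ProxyContentTransformer"
         parent="baseContentTransformer">
      <property name="worker">
         <ref bean="transformer.worker.PdfToTextTool"/>
      </property>
   </bean>

</beans>
[/cc]

The ${source} and ${target} placeholders are the ones used in the listing and stand for the input PDF and the output text file. The pdftotext binary has to be present at the path given in the commands, so adjust the paths if you keep it somewhere else.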

Good luck! Let me know if this was useful for you!