Speeding up PDF indexing in Alfresco 3.3

I work a lot with PDF files and have noticed that Alfresco is really slow at indexing them. It’s not Alfresco’s fault per se, but rather the fault of the underlying library (PDFBox), which extracts the text from PDF documents so that Lucene can index it. PDF is a format that can make it really hard to extract text correctly. Not only is the content often compressed, but PDF also grew out of the PostScript language, and a PDF file can draw its text in an order that has nothing to do with reading order (for example, it can output the text at the bottom of a page first and the text at the top afterwards). That’s why the whole problem is not trivial.

The PDFBox library used by Alfresco is written in Java, which slows the whole thing down even more. Fortunately, there are solutions which can speed it up noticeably.

The idea of speeding up the indexer is not mine. The original idea was described on the Think Alfresco blog, but unfortunately the example configuration code found there does not work with more recent Alfresco versions. I have updated it, and it now works correctly with Alfresco 3.3 and newer.

The solution is quite simple: all we have to do is define a new transformer that uses xpdf’s pdftotext executable to extract the text. While this may sound “hacky”, it is not; other format transformers in Alfresco work in a very similar way (for example, ImageMagick is used for some image-related transformations). And it is important to stress that pdftotext is REALLY much faster than PDFBox.

For example, my “reference” large PDF document (70 MB, 13,700 pages) is processed by pdftotext in 30 seconds, while PDFBox needs 20 minutes.
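
If you want to check numbers like these on your own documents, a minimal shell sketch along the following lines may help. It assumes pdftotext is on your PATH; the function name `time_pdftotext` and the file name in the usage comment are placeholders of mine, not part of Alfresco or xpdf.

```shell
#!/bin/sh
# Hypothetical helper (not part of Alfresco or xpdf): run pdftotext once on a
# PDF and print how many whole seconds the extraction took.
time_pdftotext() {
    pdf="$1"
    out="${pdf%.pdf}.txt"   # write the text next to the PDF

    # Assumption: pdftotext is installed and reachable via PATH.
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found on PATH" >&2
        return 127
    fi

    start=$(date +%s)
    # Same options as the transform command in the configuration: UTF-8 output.
    pdftotext -enc UTF-8 "$pdf" "$out" || return $?
    end=$(date +%s)

    echo "$((end - start))"
}

# Usage (file name is a placeholder):
#   time_pdftotext big-reference.pdf
```

This only measures the pdftotext side; the PDFBox figure would have to be observed inside Alfresco itself.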

OK, so how to do it? It is actually very easy: we undefine the existing PDFBox transformer Spring bean and define a new one which invokes the pdftotext executable. In practice, all you have to do is take the Spring XML file shown below, copy it to $ALFRESCO_HOME/tomcat/shared/classes/alfresco/extension/pdf-indexer-extract-content-context.xml and restart the Tomcat server.

[cc lang="xml"]









${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext -enc UTF-8 ${source} ${target}


${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext.exe -enc UTF-8 ${source} ${target}


chmod 775 ${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux



cmd.exe /C dir

application/pdf text/plain

transformer.worker.PdfToTextTool org.alfresco.repo.content.transform.ContentTransformerWorker



application/pdf


[/cc]
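
The installation steps can be sketched in shell roughly as follows. This is only an illustration under a few assumptions: `ALFRESCO_HOME` points at your installation, `${catalina.base}` resolves to `$ALFRESCO_HOME/tomcat` unless `CATALINA_BASE` is set, and the function names `install_pdf_context` and `link_pdftotext` are made up for this example. Linking a system-wide pdftotext into `WEB-INF/bin` is one way of putting the binary where the configuration expects it.

```shell
#!/bin/sh
# Hypothetical helpers; adjust paths to your own installation.

# Copy the Spring context file into Alfresco's extension directory
# and print the destination path.
install_pdf_context() {
    src="$1"
    ext_dir="$ALFRESCO_HOME/tomcat/shared/classes/alfresco/extension"
    dest="$ext_dir/pdf-indexer-extract-content-context.xml"

    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    mkdir -p "$ext_dir" || return 1
    cp "$src" "$dest" || return 1
    echo "$dest"
}

# Make an existing pdftotext binary available under WEB-INF/bin,
# where the transform command looks for it.
link_pdftotext() {
    bin_src="$1"    # e.g. /usr/bin/pdftotext
    # Assumption: catalina.base is $ALFRESCO_HOME/tomcat unless overridden.
    bin_dir="${CATALINA_BASE:-$ALFRESCO_HOME/tomcat}/webapps/alfresco/WEB-INF/bin"

    [ -x "$bin_src" ] || { echo "not executable: $bin_src" >&2; return 1; }
    mkdir -p "$bin_dir" || return 1
    ln -sf "$bin_src" "$bin_dir/pdftotext"
}

# Usage (paths are examples):
#   install_pdf_context ./pdf-indexer-extract-content-context.xml
#   link_pdftotext /usr/bin/pdftotext
# Then restart Tomcat using whatever start/stop script your installation provides.
```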

Good luck! Let me know if this was useful for you!

8 thoughts on “Speeding up PDF indexing in Alfresco 3.3”

  1. Hi Ken,

    That’s bizarre. I have just double-checked everything with a freshly installed Alfresco Community 3.4.d (Java version v1.6.0_18-b07) and the latest pdftotext, and it works just fine. The operating system I have right now is Windows XP, but it also worked fine on Linux. Can you paste the server log?

    Karol

  2. Hi, thanks for your contribution.

    I tried your solution in Alfresco-3.4.d but I can’t get it to work, and I hope you can help me and give me a hint, because I am really stuck.

    After adding your XML to $ALFRESCO_HOME/tomcat/shared/classes/alfresco/extension, Alfresco “explodes” and it’s not able to run. I saw in alfresco.log these lines, but I can’t guess any solution:

    Configuration problem: Failed to import bean definitions from URL location [classpath*:alfresco/extension/*-context.xml]
    Offending resource: class path resource [alfresco/application-context.xml]; nested exception is org.springframework.beans.factory.xml.XmlBeanDefinitionStoreException: Line 12 in XML document from file [/opt/alfresco-3.4.d/tomcat/shared/classes/alfresco/extension/pdf-indexer-extract-content-context.xml] is invalid; nested exception is org.xml.sax.SAXParseException: The content of element type "property" must match "(description?,(bean|ref|idref|value|null|list|set|map|props)?)".

    What I have at lines 10, 11 and 12 is:
     
         
       

    As far as I know, this error is related to the Spring framework (I’ve never worked with Spring), but I can’t see where the real problem is, because everything seems OK in the XML I downloaded from your site.

    Thanks in advance.

  3. Karol, I can’t thank you enough.

    You saved my Alfresco / Share installation. I was so fed up with the pain and downtime that pdf files going through pdfbox were causing, that I planned on finding replacement software for Alfresco and Share.

    I was going on 3 hours of down time for this go-round when I found your post. The pdfbox that came with the alfresco installation had been replaced with a current version long ago after attempting to access an alfresco share site with a pdf file in it would immediately bring down java and tomcat. However, the newer versions still didn’t perform well enough to be used.
    We have a lot of architectural, mechanical and HVAC, and electrical building drawings and plans. On upload, these files would choke pdfbox – the Share uploader would report that the upload failed and the whole system would be dead slow or even inaccessible for sometimes an hour or more while java/pdfbox chewed the processor.

    After using this fix, I crossed my fingers and restarted Alfresco and Share. It started normally and I could see pdftotext spawning, doing its thing and closing. Yeah, it worked!

    For information: Ubuntu server 10.04, Alfresco 3.4 d.
    I used apt to install xpdf-utils, created a webapps/alfresco/WEB-INF/bin/ directory with symlink to the /usr/bin/pdftotext binary, and used your xml file as instructed. I did have to modify line 38 (my file was pdftotext rather than pdftotext-linux).

    I really appreciate your effort. I’m also glad that I stumbled here because the pdf stamper that you’ve made looks like something that we might be interested in.

    • Hi JAO

      I am glad that it was useful for you! 🙂

      If you have any questions regarding Stamper let me know via email. I will be happy to answer them.

      Karol
