cmis-in-batch released with data generation mode

I have just pushed proper 1.0 release of the cmis-in-batch tool to github and bintray:

https://github.com/karolbe/cmis-in-batch

https://bintray.com/karolbe/metasys/cmis-in-batch

Here is a quote from the README.md:

Data generation is a useful feature that allows bulk importing of test data documents into CMIS compatible repository. Additionally it can populate metadata of documents with values coming from predefined dictionaries.

Sample script for generating thousands of documents can look like this one below.

Here is a brief description of what the script does:

* it will load three dictionaries from files /tmp/disciplines, /tmp/types, /tmp/subtypes. The dictionaries are simple text files where values are separated by new line characters. From the dictionary values Cartesian product will be calculated so for example, having three dictionaries:

1. level1A, level1B
2. level2A, level2B
3. level3A, level3B
following combinations will be generated:


[level1A, level2A, level3A]
[level1A, level2A, level3B]
[level1A, level2B, level3A]
[level1A, level2B, level3B]
[level1B, level2A, level3A]
[level1B, level2A, level3B]
[level1B, level2B, level3A]
[level1B, level2B, level3B]

* it will import each file in content-path location (“/media/kbryd/Media/work/sample_data/department”) to a location in repository defined with linking-rule: /Repository/${discipline}/static/${doctype}/sub/${docsubtype} – each ${} variable will be replaced by a value coming from appropriate dictionary.
* naming-rule defines what the object name should be. It can use variables from the dictionaries plus a few additional: ${file_name}, ${file_size}, ${file_path}, ${file_ext}, ${file_mime}
* mapping defines the mapping for populating metadata of each document, e.g. in this case discipline attribute will be populated with value of discipline.


   generate-random-data "set1" {
   doc-type "cara_document"
   linking-rule "/Repository/${discipline}/static/${doctype}/sub/${docsubtype}"
   naming-rule "${file_name} - ${doctype}"
   content-path "/media/kbryd/Media/work/sample_data/department"

   mapping {
       discipline {
          "${discipline}"
       }
       doc_type {
          "${doctype}"
       }
       doc_subtype {
          "${docsubtype}"
       }
   }

   dictionaries {
      discipline "/tmp/disciplines"
      doctype "/tmp/types"
      docsubtype "/tmp/subtypes"
   }
}

And that’s all! Have fun using it! 🙂

Simple CMIS Export tool

It is hard to believe that there was no basic (well, even extremely basic) tool that would allow exporting some folders and documents from a CMIS repository (Alfresco in my case) to a file-system in a hassle free way. Thanks to the OpenCMIS library writing such tool took around one hour and here is the result:

https://github.com/karolbe/cmis-export-tool

Currently the tool accepts following arguments:

usage: com.metasys.CMISExportTool
 -h                           Print help for this application
 -f <arg>                     Destination folder location
 -u <arg>                     User login
 -p <arg>                     Password
 -levels <number of levels>   Number of levels
 -s,--starting-path <arg>     Start path

-levels argument is not yet supported, but will be soon.

This tool can be very useful when used together with cmis-upload-maven-plugin (https://github.com/karolbe/cmis-upload-maven-plugin) for writing Unit tests. For example, you can export some files (e.g. configuration) from a repository that are required by your unit test, then you put them in your AMP project and, finally, using cmis-upload-maven-plugin you can automatically upload those files to your test repository that it started during execution of your unit tests (mvn test).

So here is an example. Let’s assume that you have a project that needs some configuration in the repository, let’s name it ‘stamper’. In order to make the bootstrap process more convenient you need to add following section to your pom.xml file:


        <profile>
            <id>upload-config</id>
            <activation>
                <property>
                    <name>action</name>
                    <value>upload-config</value>
                </property>
            </activation>

            <build>
                <plugins>
                    <plugin>
                        <groupId>com.metasys</groupId>
                        <artifactId>cmis-upload-maven-plugin</artifactId>
                        <version>1.0-SNAPSHOT</version>
                        <configuration>
                            <localPath>${project.build.directory}/stamper-resources</localPath>
                            <destPath>/</destPath>
                            <overwrite>true</overwrite>
                            <username>admin</username>
                            <password>admin</password>
                            <url>http://localhost:8080/alfresco/cmisatom</url>
                        </configuration>
                        <executions>
                            <execution>
                                <phase>package</phase>
                                <goals>
                                    <goal>cmis-upload</goal>
                                </goals>
                            </execution>
                        </executions>
                    </plugin>
                </plugins>
            </build>
        </profile>

This will upload all files from ${project.build.directory}/stamper-resources folder to the root folder of your repository. You could copy some files there using maven.resources.plugin, so let’s do it…

This will copy files from main/src/test/config to target/stamper-resources.


           <plugin>
                <artifactId>maven-resources-plugin</artifactId>
                <version>2.7</version>
                <executions>
                    <execution>
                        <id>copy-cara-resources</id>
                        <phase>validate</phase>
                        <goals>
                            <goal>copy-resources</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>${project.build.directory}/stamper-resources</outputDirectory>
                            <resources>
                                <resource>
                                    <directory>src/test/config</directory>
                                    <filtering>false</filtering>
                                </resource>
                            </resources>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

and now all you have to do to upload files is it type:

mvn -Dmaven.test.skip=true -Daction=upload-config package

Enjoy! 🙂

cmis-in-batch

I have just uploaded an initial version of tool which tries to simplify data import and manipulation in CMIS enabled repositories.

More details as well as source code (under Apache License 2.0) are available here:
https://code.google.com/p/cmis-in-batch/

It is still Work In Progress but it works, at least now, with Alfresco. I have just finished installation of Documentum 6.7 with newly added CMIS support and I am going to test it as soon as possible.

Data importer for CMIS repositories

When developing Stamper I missed a tool which would allow me to easily import some files into a repository. Alfresco provides import/export functionality through ACP files which can be used for that purpose but what about other repositories? What about Nuxeo, Documentum? Documentum has Composer/Application Builder, Nuxeo perhaps also has it’s own mechanisms, but I wanted one tool which would work with all those repositories without need to create a separate installation package for each system.

Hopefully there is a CMIS standard which can make this happen.

CMISetuper (working name :-)) will be a tool which will be able to:

  • connect to any repository which supports CMIS. This will be possible thanks to the OpenCMIS library.
  • import files
  • create folders
  • validate presence of types, objects, folders
  • modify content (replace, version etc.)
  • link and unlink objects from folders

Those actions will be described by a declarative language named SDL – Simple Declarative Language. Below is a fragment of a script:


execute {
    import-files "/myFiles" {
        "image1.jpg" cm_name="Some title" metasys_taken_on_date=2009/10/05
        "image2.jpg" metasys_color=false metasys_taken_on_date=2011/01/05
    }

    import-file "/Documents" "someotherfile.doc"
    delete-file "/Old document.doc" all-versions="true"

    replace-content "/Pictures/picture.jpg" "/tmp/newContent.jpg"
    update-properties "/Documents/someotherfile.doc" cm_title="another title"
    link-to-folder "/myHouse.jpg" "/myFiles"
    unlink-from-folder "myHouse.jpg" "/myFiles"
}

Basically the tool will execute scripts in a few stages:

  • pre-validate stage – this stage can check presence of types, files, folders etc. in repository and fail when they are present (or not).
  • prepare stage – at this stage some preparations can take place, for example creation of folders for which objects will be imported.
  • execute stage – this is the main stage which takes care of actual data importing, linking etc.
  • validation stage – the last stage which checks whether everything was imported correctly

Some global settings will be available which will control overwrite mode as well as whether to stop on errors.
I have also plan to implement rollback mode but perhaps not in the first version 🙂

At this moment I would say that around 60% of above functionality is implemented and I hope to have rest ready by end of this week, so stay tuned for updates!

If you have some new ideas please share them with me.

Writing Mavenized JUnit tests in Alfresco.

Writing JUnit tests in Alfresco projects using Maven is not very straightforward. The main problem is lack of a “parent POM” which would gather all Alfresco dependencies into one convenient package. I will show you how to create such parent POM as well as how to create a simple JUnit test project.

Prerequisites are:

  • Alfresco (I have used 3.3g)
  • Python
  • Maven (of course :))

We will use dependencies from Alfresco WAR file, perhaps the right way is to get them from Alfresco SDK, but since I haven’t experienced any problems with using JARs from alfresco.war and since it is also easier to build pom.xml from alfresco.war then this is what I recommend.

So, to create our parent pom.xml with all Alfresco dependencies we will need a small script:


import os
import sys

files = os.listdir(sys.argv[1])

groupId="alfresco-sdk"
version="3.3g"
classifier="community"

print """
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.metasys</groupId>
<artifactId>alfresco-sdk-parent</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>pom</packaging>
<name>alfresco-sdk-parent</name>
<dependencies>
"""

for file in files:
artifactId = file[:-4]
c = "mvn install:install-file -Dfile=" + os.path.join(sys.argv[1], file) + \
       " -DgroupId=" + groupId + \
       " -DartifactId=" + artifactId + \
       " -Dversion=" + version + \
       " -Dpackaging=jar -Dclassifier=" + classifier + \
       " -DgeneratePom=true -DcreateChecksum=true"
s = "\t<dependency>\n\t<scope>test</scope>\n\t<groupId>" + groupId + \
       "</groupId>\n\t<artifactId>"+ artifactId + \
       "</artifactId>\n\t<version>"+ version + \
       "</version>\n\t<classifier>community</classifier>\n\t</dependency>\n"
print s
if len(sys.argv) > 2 and sys.argv[2] == 'install':
      os.system(c)

print """
</dependencies>
</project>
"""

There are some hardcoded values in the script. Change them if you want, for example if you’re working with Alfresco 3.4 then change the version variable.
We will invoke it like this:


mkdir alfresco-parent-pom
cd alfresco-parent-pom
python alfresco-2-maven.py <Path to Alfresco's WEB-INF/lib folder> >pom.xml

Now, move the pom.xml to a newly created folder and install it using:


mvn install

At this moment we have a Maven project ready to use but we still don’t have dependencies in the Maven repository. To install them start the script again, this time with ‘install’ parameter:


python alfresco-2-maven.py <Path to Alfresco's WEB-INF/lib folder> install

This will take some time to finish but eventually you will end up with a local repository with all Alfresco dependencies.

No, let’s move on to a real project with some JUnit tests. Our test will have to start an Alfresco repository, each Alfresco repository apart from database and filestore needs configuration. You can find Alfresco configuration in alfresco/WEB-INF/classes/alfresco folder. In order to start the tests we will need that folder in our CLASSPATH or create a JAR file with configuration files. I prefer the second solution. So to prepare the configuration artifact follow these steps:


cd Alfresco/tomcat/webapps/alfresco/WEB-INF/classes
zip -r /tmp/config.jar .
mvn install:install-file -Dfile=config.jar -DgroupId=alfresco-sdk -DartifactId=config -Dversion=3.3g -Dpackaging=jar -Dclassifier=community -DgeneratePom=true -DcreateChecksum=true

You can cleanup the config.jar before installing it in the local Maven repository. If you want to have multiple configurations then just modify the classifier parameter or simply use different version modifier.

Now, finally, we can move on to our test class. We start with the pom.xml.


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.metasys</groupId>
    <artifactId>test-suite</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>test-suite</name>

    <parent>
        <groupId>com.metasys</groupId>
        <artifactId>alfresco-sdk-parent</artifactId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <build>
        <resources>
            <resource>
                <filtering>false</filtering>
                <directory>src/test/resources</directory>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.5</source>
                    <target>1.5</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <scope>test</scope>
            <groupId>alfresco-sdk</groupId>
            <artifactId>config</artifactId>
            <version>3.3g</version>
        </dependency>
  <dependency>
            <groupId>javax</groupId>
            <artifactId>servlet</artifactId>
            <version>1.2</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.6</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>

As you can see apart from referencing parent POM I have also added Mysql JDBC driver and servlet-api as dependencies. These two jars are present in Alfresco SDK but not in alfresco.war and that’s why they were not picked up the Python script. Of course there is also the Alfresco Configuration dependency we created before.

Now, since we have the pom.xml let’s move on to creating the Java class, we will put it in a package named com.metasys.tests, so let’s create the folder structure first:


mkdir -p JUnitTest/src/test/java/com/metasys/tests
mkdir -p JUnitTest/src/test/resources/alfresco/extension
mkdir -p JUnitTest/src/test/resources/alfresco/desktop

Change folder to src/test/java/com/metasys/tests and create a new Java class:


package com.metasys.tests;

import org.alfresco.model.ContentModel;
import org.alfresco.repo.security.authentication.AuthenticationComponent;
import org.alfresco.service.ServiceRegistry;
import org.alfresco.service.cmr.action.ActionService;
import org.alfresco.service.cmr.repository.ContentWriter;
import org.alfresco.service.cmr.repository.NodeRef;
import org.alfresco.service.cmr.repository.StoreRef;
import org.alfresco.service.cmr.search.ResultSet;
import org.alfresco.service.cmr.search.SearchService;
import org.alfresco.util.BaseAlfrescoTestCase;
import org.junit.Test;

public class SimpleTest extends BaseAlfrescoTestCase {

    protected NodeRef companyHomeRef;
    protected NodeRef rootFolderTestRef;

    @Override
    protected void setUp() throws Exception {
        setUpContext();

        this.serviceRegistry = (ServiceRegistry) ctx.getBean(ServiceRegistry.SERVICE_REGISTRY);
        this.nodeService = serviceRegistry.getNodeService();
        this.contentService = serviceRegistry.getContentService();
        this.authenticationComponent = (AuthenticationComponent) ctx.getBean("authenticationComponent");
        this.actionService = (ActionService) ctx.getBean("actionService");
        this.transactionService = serviceRegistry.getTransactionService();

        authenticationComponent.setCurrentUser("admin");
        SearchService searchService = this.serviceRegistry.getSearchService();
        ResultSet rs = searchService.query(StoreRef.STORE_REF_WORKSPACE_SPACESSTORE, SearchService.LANGUAGE_XPATH,
                "/app:company_home");
        if (rs.length() != 1) {
            fail("Could not find company home");
        }
        companyHomeRef = rs.getNodeRef(0);
    }

    @Test
    public void testSimpleWatermarking() throws Throwable {
        rootFolderTestRef = serviceRegistry.getNodeService().createNode(
                companyHomeRef, ContentModel.ASSOC_CONTAINS,
                ContentModel.TYPE_FOLDER, ContentModel.TYPE_FOLDER).getChildRef();
  assert(rootFolderTestRef != null);
        nodeService.setProperty(rootFolderTestRef, ContentModel.PROP_NAME, "TestObject");
  assert(nodeService.getProperty(rootFolderTestRef, ContentModel.PROP_NAME).equals("TestObject"));
    }
}

We will also need bootstrap configuration, normally these files are somewhere in Tomcat folder tree, since we’re not using Tomcat for starting up the repository we will have to find a place for them.

First file is dev-contex.xml, it has to be saved in src/test/resources/alfresco/extension:


<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<beans>
    <bean id="global-properties" class="org.alfresco.config.JndiPropertiesFactoryBean">
        <property name="locations">
            <list>
                <value>classpath:alfresco/repository.properties</value>
                <value>classpath:alfresco/domain/transaction.properties</value>
  <!-- <value>classpath:alfresco/jndi.properties</value> -->
  <!--  Overrides supplied by modules -->
                <value>classpath*:alfresco/module/*/alfresco-global.properties</value>
  <!--  Installer or user-provided defaults -->
                <value>classpath*:alfresco-global.properties</value>
                <value>classpath:alfresco/extension/dev.properties</value>
            </list>
        </property>
        <property name="systemPropertiesModeName">
            <value>SYSTEM_PROPERTIES_MODE_OVERRIDE</value>
        </property>
<!-- Extra properties that have no defaults that we allow to be defined through JNDI or System properties -->
        <property name="systemProperties">
            <list>
                <value>hibernate.dialect</value>
                <value>hibernate.query.substitutions</value>
                <value>hibernate.jdbc.use_get_generated_keys</value>
                <value>hibernate.default_schema</value>
            </list>
        </property>
    </bean>
</beans>

Next one is dev.properties, save it to the same folder as dev.properties:


dir.root=/home/kbryd/AlfrescoT2/alf_data
index.recovery.mode=AUTO
integrity.failOnError=true
db.name=alfrescoT2
db.username=alfrescoT2
db.password=alfrescoT2
db.host=localhost
db.port=3306
db.driver=org.gjt.mm.mysql.Driver
db.url=jdbc:mysql://${db.host}:${db.port}/${db.name}
hibernate.dialect=org.hibernate.dialect.MySQLInnoDBDialect

Obviously change the database name, username, password and path to the Alfresco filestore. The last (really!) missing bit is a package of Alfresco desktop files required during startup, just copy them from alfresco/WEB-INF/classes/alfresco/desktop to your project’s src/test/resources/alfresco/desktop folder.

Now, go to the root folder of the test project and use Maven to start it:


mvn test

It will start up the repository and then execute tests, you should see something like this:


Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 39.947 sec

Results :

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

[INFO] [jar:jar {execution: default-jar}]
[INFO] Building jar: /tmp/test-project/target/test-suite-1.0-SNAPSHOT.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 44 seconds
[INFO] Finished at: Sat Feb 26 14:45:26 CET 2011
[INFO] Final Memory: 30M/258M
[INFO] ------------------------------------------------------------------------

One ending note. You may wonder why I am not calling super.setUp() in the Java class. The only reason I am doing it is because the implementation of setUp() creates a separate Store for each test. I don’t like this default behavior and that’s why I am calling only setUpContext() and then creating all required beans in my test class.

I hope that this is useful, please email me or leave comments if you have any questions!

Stamper is ready!

Stamper is ready. Finally 🙂 I would like to announce that the first version of an extension for Alfresco for watermarking and securing PDF documents is available!

Stamper is a quite extensive tool for Alfresco which simplifies watermarking of PDF documents. Basically any form of watermarks is available, not only static watermarks (like JPG, PNG etc. images) are available but also dynamic watermarks (called layers). Layers are SVG files and can change each time you view a document because layers contain control sequences (variables) which will be replaced by their real values upon viewing.

Stamper is not only about watermarking. It can also sign PDF documents and generate electronic signature pages (even with scanned signatures!).

And of course Stamper is quite nicely integrated with Alfresco. It means that you can call it from Alfresco Java API, or JavaScript API or simply use it from Alfresco Explorer GUI.

CMIS support for Stamper as well as Nuxeo version of Stamper is on it’s way!

You can find out more about it on Stamper’s website: stamper.metasys.pl

Speeding up PDF indexing in Alfresco 3.3

I work a lot with PDF files and have noticed that Alfresco is really slow with indexing them. It’s not Alfresco’s fault per se but rather fault of the underlying library (PDFBox) which extracts text from PDF documents which is then indexed by Lucene. PDF is a format which sometimes makes it really hard to correctly extract text from a document, not only content is often compressed but also PDF is a subset of PostScript language and as every programming language PostScript can generate text in not organized or logical way (for example, sometimes PDF can output text on a bottom of page and then on top). That’s why the whole problem is not trivial.

PDFBox used by Alfresco is written in Java which slows down the whole thing even more. Hopefully there are solutions which can speed it up noticeably.

The idea of speeding up the indexer is not mine, original idea was described on a Think Alfresco blog but unfortunately the example configuration code which can be found there does not work with more recent Alfresco versions. I have updated it and now it works correctly with Alfresco 3.3 and newer.

Solution is quite easy, all we have to do is to define a new transformer which will be using xpdf’s pdftotext executable to extract text. While this may sound “hacky” it is not, there are other format transformers in Alfresco which work in very similar way (for example ImageMagick is used for some image related transformations). And it is important to stress out that pdftotext is REALLY much faster than PDFBox.

For example, my “reference” large PDF document which has 70MB and 13700 pages is processed in 30 seconds by pdftotext and PDFBox needs 20 minutes.

OK, so how to do it? It is actually very easy, all we have to do is to undefine existing PDFBox transformer Spring bean and define a new one which will be invoking pdftotext executable. In practice all you have to do is to take the Spring XML file as shown below and copy it to $ALFRESCO_HOME/tomcat/shared/classes/alfresco/extension/pdf-indexer-extract-content-context.xml and restart the Tomcat server.

[cc lang=”xml”]









${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext -enc UTF-8 ${source} ${target}


${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext.exe -enc UTF-8 ${source} ${target}


chmod 775 ${catalina.base}/webapps/alfresco/WEB-INF/bin/pdftotext-linux



cmd.exe /C dir

application/pdf text/plain

transformer.worker.PdfToTextTool org.alfresco.repo.content.transform.ContentTransformerWorker



application/pdf


[/cc]

Good luck! Let me know if this was useful for you!