Clean and fast indexing in Plone
Background and motivation
A previous blog post established that indexing in Plone can be improved. There we investigated the potential gain by applying an insane/inspired monkeypatch.
The monkeypatch made the Plone site temporarily ignore all indexing requests, which had side effects. For example, any additional indexing triggered during content creation, in a factory method or in initializeArchetype, was ignored as well.
Moving forward
From a performance point of view, the fundamental problem is that indexing is performed immediately instead of at the transaction boundary. By postponing indexing until the transaction commits, we are able to filter the indexing requests. This enables us to
- Ignore unnecessary indexing (add followed by delete)
- Only do one (re)indexing instead of many on the same object
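The two filtering rules above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in collective.indexing: the operation names, the `reduce_operations` function and the use of plain string ids are all assumptions made for the example.

```python
# Hypothetical operation names, chosen for this sketch only.
INDEX, REINDEX, UNINDEX = "index", "reindex", "unindex"


def reduce_operations(operations):
    """Collapse queued (operation, uid) pairs to at most one per object."""
    result = {}  # uid -> operation; dicts keep insertion order
    for op, uid in operations:
        previous = result.get(uid)
        if previous is None:
            result[uid] = op
        elif previous == INDEX and op == UNINDEX:
            # Added and deleted in the same transaction: nothing to do.
            del result[uid]
        elif previous == INDEX and op == REINDEX:
            # The pending index will already pick up the latest state.
            pass
        elif previous == UNINDEX and op in (INDEX, REINDEX):
            # Removed and then indexed again: one reindex is enough.
            result[uid] = REINDEX
        else:
            # Otherwise the latest request replaces the earlier one.
            result[uid] = op
    return [(op, uid) for uid, op in result.items()]


queued = [
    (INDEX, "a"),
    (REINDEX, "b"),
    (REINDEX, "b"),
    (UNINDEX, "a"),
    (REINDEX, "b"),
    (INDEX, "c"),
    (REINDEX, "c"),
]
reduced = reduce_operations(queued)
print(reduced)
```

Seven queued requests shrink to two here: object "a" disappears entirely, and "b" and "c" are each touched once.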
Technical details
We added an indexing queue, which was in turn controlled by a transaction manager. The transaction manager pattern was reused from enfold.solr.
On a modified event, or when (re)indexObject is called, the indexing request is put in the queue. When the transaction commits, the queue is processed. First, a default reducer filters out duplicates (it can be overridden by registering a new adapter). Then the remaining requests are dispatched to all registered queue processors. The default queue processor indexes content in portal_catalog, and dispatches to CatalogAware and CatalogMultiplex. You can easily add queue processors for asynchronous queue processing, or for external indexing.
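To make the flow concrete, here is a minimal, self-contained sketch of a queue that collects requests, reduces them and dispatches them to processors. All names in it (`IndexQueue`, `RecordingProcessor`, `drop_duplicates`, `commit`) are invented for this example and are not the collective.indexing API. In a real site the transaction machinery would trigger the processing step, and the reducer and processors would be looked up as adapters and utilities; here they are plain Python objects and `commit` is called by hand.

```python
def drop_duplicates(operations):
    """Reducer for this sketch: keep the first copy of each request."""
    seen = set()
    reduced = []
    for operation in operations:
        if operation not in seen:
            seen.add(operation)
            reduced.append(operation)
    return reduced


class RecordingProcessor:
    """Stand-in for a real processor, e.g. one that updates portal_catalog."""

    def __init__(self):
        self.handled = []

    def process(self, operation, uid):
        self.handled.append((operation, uid))


class IndexQueue:
    """Collects indexing requests and handles them in one go."""

    def __init__(self, reducer=drop_duplicates):
        self.reducer = reducer
        self.processors = []
        self.pending = []

    def register(self, processor):
        self.processors.append(processor)

    def add(self, operation, uid):
        # Called instead of touching the catalog right away.
        self.pending.append((operation, uid))

    def commit(self):
        # Reduce first, then hand every remaining request to each processor.
        operations = self.reducer(self.pending)
        self.pending = []
        for operation, uid in operations:
            for processor in self.processors:
                processor.process(operation, uid)
        return len(operations)


queue = IndexQueue()
catalog = RecordingProcessor()
queue.register(catalog)

queue.add("reindex", "news-1")
queue.add("reindex", "news-1")
queue.add("index", "news-2")

processed = queue.commit()
print(processed, catalog.handled)
```

An asynchronous or external indexer would, in this sketch, simply be another object with a `process` method passed to `register`.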
Results
To measure the improvement, we created 100 news articles using the JMeter Test Plan from the collective. Unmodified is plain Plone. Indexing refers to a site with the collective.indexing extension profile applied. Experimental refers to experimental.contentcreation (without the redundant reindex hack), and E+I is experimental.contentcreation and collective.indexing together.
Note that the test results are not useful as a measure of absolute performance or to be compared with other tests. We can see that the performance improves when avoiding redundant reindexing, and now it is done in a clean and efficient way.
In a real-life deployment with more indexes and metadata, and possibly file conversion as well, we expect the improvement to be bigger.
Try it out for yourself
You can use the buildout for testing this yourself.
svn co http://svn.plone.org/svn/collective/collective.indexing/buildout indexing
cd indexing
python bootstrap.py
bin/buildout
bin/instance start
Create a new Plone site, using the collective.indexing extension profile.
Future plans
While this component is fully usable, it is also about exploring indexing in Zope in general. During the Sorrento Sprint, Malthe Borch and Sylvain Viollon were looking into Xapian integration, and found the time to start splitting collective.indexing into z3c.indexing.dispatcher. That package will in turn dispatch to components like z3c.indexing.zcatalog, external indexers or asynchronous queues similar to the one in ore.xapian. For extracting data from content in Plone, there will be plone namespace packages with adapters for data extraction.
Because collective.indexing is about adapters and utilities and not data structures, it is simple to switch out parts of it as they become available in z3c packages. When that's done, we can move it into the plone namespace and hopefully also towards inclusion in Plone core...
Sponsored by
--
Andreas Zeidler and Helge Tesdal