Lucene Index Merge and Optimisation
Lucene index merging has some parameters that affect how the index is built. They have an impact on index operations other than search. The MergeFactor controls how many documents are stored within each segment before a new one is started, and how many segments are started before they are collected into a larger one. So a factor of 10 means 10 documents before aggregating, and 10 aggregated indexes of a certain size before aggregating again. Consequently MergeFactor controls the number of open files.
The higher the merge factor, the faster the index build, as segments are merged less frequently. However, this causes a significant slowdown in the speed at which an index can be added to an existing one, as this appears to depend on the number of files Lucene has to open.
The next one is the MaxBufferedDocs parameter, which controls the number of documents to buffer in memory before flushing to disk. For a batch index operation, the higher this is the higher the indexing performance, but the more memory will be consumed.
And then there is MaxMergeDocs, which limits the maximum number of documents within a segment; above this limit merging does not happen. This is used to limit the file size, so that no file is over 2GB on a 32-bit system.
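To make the merge arithmetic concrete, here is a toy simulation of the behaviour described above. It is only a sketch of the idea, not Lucene's actual merge policy: the class and method names are made up for illustration, every added document is assumed to start life as its own one-document segment, and a merge is assumed to happen only when the last mergeFactor segments are the same size and the result would not exceed maxMergeDocs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of merge-factor behaviour; this is NOT Lucene code.
final class MergeFactorSketch {

    /** Returns the segment sizes left after adding docCount documents one at a time. */
    static List<Integer> simulate(int docCount, int mergeFactor, int maxMergeDocs) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        List<Integer> segments = new ArrayList<>();
        for (int i = 0; i < docCount; i++) {
            segments.add(1); // assumption: each new document is a one-document segment
            mergeTail(segments, mergeFactor, maxMergeDocs);
        }
        return segments;
    }

    /** Repeatedly merges the last mergeFactor segments while they are all the same size. */
    private static void mergeTail(List<Integer> segments, int mergeFactor, int maxMergeDocs) {
        while (segments.size() >= mergeFactor) {
            int n = segments.size();
            int size = segments.get(n - 1);
            long merged = 0;
            for (int j = n - mergeFactor; j < n; j++) {
                if (segments.get(j) != size) {
                    return; // tail segments differ in size, nothing to merge yet
                }
                merged += size;
            }
            if (merged > maxMergeDocs) {
                return; // merged segment would exceed the MaxMergeDocs limit
            }
            for (int j = 0; j < mergeFactor; j++) {
                segments.remove(segments.size() - 1); // drop the merged inputs
            }
            segments.add((int) merged);
        }
    }
}
```

In this model, MergeFactorSketch.simulate(1234, 10, Integer.MAX_VALUE) leaves segments of sizes 1000, 100, 100, 10, 10, 10, 1, 1, 1, 1, so the segment count (and with it the number of files to open) grows with the merge factor. Capping maxMergeDocs at 100 leaves 1000 documents as ten segments of 100 that are never merged further.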
In running the Sakai search indexer operations I have noticed some things in this area:
- Once there are about 50 index directories in a merged index, merging takes 2s per merge. Performing an optimize on the index restores the addDirectory operation to 20ms or less. It makes sense to optimize an index when there are more than 50 directories in it.
- When performing a merge and optimize of a set of indexes, the optimize step can take a lot of time (minutes). However, I have observed that if the index directories are added to an empty index in the sequence in which they were created, the optimize operation is much faster. This may be because the aggregation steps are simpler. This is only an observation.
Installing Sources
To install jar sources you can run the mvn source:jar Maven command, which will put jar sources into your local repo so you can use them in Eclipse.
Reducing Working Code Size
How many of us load the whole of the Sakai code base into Eclipse, and wonder why it consumes so much memory? Most, I guess. Alternatively you can load just the code you are working on and use the local Maven repo for the Sakai jars; that way Eclipse will run in considerably less memory. When you need to access the source code, if the repo has the source jars, then they can be used instead of the live code base. Obviously this doesn’t allow you to edit any code anywhere… but then should we all be doing that anyway… except for those rare debugging exercises.
I did the above for search, editing the .classpath file for Eclipse, and now I can just work on search with all the other projects closed. Eclipse memory usage has dropped from 1G to closer to 128M. Once we package the core (bin and src) into a Maven repo, it’s going to make sense to use this approach. Fortunately Maven has support to help us.
Why Spring and Hibernate can’t be separated
After extracting spring-core, spring-hibernate3 and all the various parts of Sakai, and fixing the classloader issues surrounding IdGenerators etc, I find both Spring and Hibernate use CGLib for proxies, and if you separate Spring from Hibernate, they fight over CGLib. Either Hibernate can’t create proxies because it can’t see the Hibernate classes from CGLib, or Spring can’t get to CGLib because it’s not in shared. It looks like it’s not going to be possible to separate Spring and Hibernate into separate classloaders without providing some extra level of visibility between the classloaders.
Reloading components in webapps… now :)
All this talk of reloading components being a requirement for developers got me thinking. Webapps do load a Spring context, so what stops us writing a context listener for a webapp that loads that webapp up as a component? Well, at the moment, almost nothing, provided I treat the webapp the same as I would a component. I can change the packaging from component to war in pom.xml, add a web.xml to the WEB-INF folder with a modified context listener class that loads the component with a new classloader and the correct context classloader, start up Tomcat and hot deploy. The component comes up. I can then redeploy, and I get a whole bunch of Spring Infos about bean overrides. And reloading works, no Tomcat restart required.
So what about dependencies? Well, I am doing a tool and a component. My tool depends on my component, but nothing else depends on my component. Maven builds them in order, and so Tomcat deploys them in order: the component first and then the tool. So the load order is correct. Obviously, if you reload the component, you have to reload the tool and anything else that depends on the component, so this technique is not for all.
Problems: Yes. I am not saying this works 100%; I have only tried 2 reloads so far and at the moment they are not entirely successful. I’m trying RWiki, which has about 5 Spring config files in the component. It depends on Hibernate, and since I share the same global Session Factory as the rest of Sakai, which started long ago, Hibernate doesn’t know about my POJOs… but there are other tools that use Hibernate in tools, which is not great for production, but perhaps OK for faster development. Another problem I have encountered so far, but not tried to fix, is that some of the properties processing doesn’t work. Things like ${aproperty.value} and abc@springId don’t work, but only because the wrong property conversion mechanisms appear to be in place.
If I get it working well, I will commit the code somewhere.
Update:
Leaving the original components pack in place loads the Hibernate POJOs. Then when you create the new beans from the webapp, they replace the beans in the component. I can now reload RWiki’s component without restarting Tomcat. It is a full Entity component with Hibernate storage.
Google Calendar
Everyone keeps on telling me how cool Google Calendar is, and how we should throw away our Enterprise calendar systems. I sort of heard what they said and intended to get round to having a look one day. Finally I did. And it is.
Previously I shared iCal with WebDAV, and it was OK, but no one could see when I was busy or not, and it’s hard to manage. With Google Calendar the interface is OK, and you get control over who can do what, including allowing family members to edit some of your calendars. The best bit is that there is good sync support. I can sync to my iCal with Spanning Sync http://www.spanningsync.com/ which just works in the background all the time, bidirectionally. And by sending my Blackberry to http://m.google.com/sync I can now have my Blackberry synced online, no more messy Bluetooth or USB cables. It even integrates with the Blackberry Calendar and puts a “Sync with Google” menu item inside it.
So now I just need an application to generate plausible excuses for missing meetings… integrated with my calendar… as a post-event excuse generation alert. ‘If you have just missed this meeting, you might like to say…’
OSGi Components
Perhaps this is premature, but a quick look around Apache Felix and Spring OSGi (Sorry I should say Spring Modules… why the name change?) gives the impression that it’s not going to be too hard to make most of Sakai OSGi.
So what is needed?
Well we would have to upgrade to Spring 2.5, we would have to write a service definition for each bundle and convert/move the components.xml file into the right places to make them load properly… and then we would have to deal with all the out of band things that go on in Sakai over the classloader boundaries.
IMHO, this would be a more productive long term aim than trying to write our own re-loadable component manager in competition with the component lifecycle that OSGi provides.
I also notice that there is good support in Eclipse for automating the creation of new bundles, as Eclipse plugins are just OSGi bundles, but for us the more important thing appears to be that Felix and Spring Modules will work with little or no effort inside Tomcat if we use the top level container concepts.
Spring Proxies that don’t appear to quite work
Spring Auto Proxies take the hassle out of AOP, but they don’t always work. The situation: I have an implementation in a classloader (the component classloader) that can’t be seen by the classloader that Spring lives in (the shared classloader). The Service API to the implementation is in the shared classloader. So I can create an Auto Proxy on the Service API and all is OK. But then there are some configuration settings in the implementation, expressed as getters and setters, that are not present in the API. In Sakai, sakai.properties does this all over the place with a method@springID, where springID is traditionally the API id.
Unfortunately Auto Proxies can’t proxy the impl without being able to see it, so they don’t work. Now you could re-implement the Auto Proxy and put it in the same classloader as the impl, and then all would be cool, or you could do some AOP in the impl and not bother with the Auto Proxy… but that’s not quite as easy as an auto proxy.
Naturally the danger appears when you think you have proxied the impl, and something deep in the code binds to the impl via reflection… but you have only proxied the API.
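The effect is easy to reproduce outside Spring. The sketch below uses a plain JDK dynamic proxy from java.lang.reflect rather than a Spring Auto Proxy, and the service, impl and property names are invented for illustration. It shows that a proxy built on the API interface answers API calls, but is not an instance of the impl and does not expose the impl's setters to reflection.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

// Hypothetical names throughout; a JDK dynamic proxy stands in for a Spring Auto Proxy.
final class ApiProxySketch {

    /** The service API, visible to everyone. */
    interface SearchService {
        String search(String query);
    }

    /** The implementation, with a configuration setter that is not part of the API. */
    static final class SearchServiceImpl implements SearchService {
        private int maxResults = 10;

        public String search(String query) {
            return "results for " + query;
        }

        public void setMaxResults(int maxResults) {
            this.maxResults = maxResults;
        }

        public int getMaxResults() {
            return maxResults;
        }
    }

    /** Builds a proxy that exposes only the API interface and delegates to the target. */
    static SearchService proxyFor(SearchService target) {
        InvocationHandler handler = (proxy, method, args) -> {
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow the target's own exception
            }
        };
        return (SearchService) Proxy.newProxyInstance(
                SearchService.class.getClassLoader(),
                new Class<?>[] {SearchService.class},
                handler);
    }

    /** True if code that casts to the impl would succeed on this object. */
    static boolean canCastToImpl(SearchService service) {
        return service instanceof SearchServiceImpl;
    }

    /** True if reflective code looking for the impl's setter would find it. */
    static boolean hasSetter(Object candidate) {
        try {
            candidate.getClass().getMethod("setMaxResults", int.class);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }
}
```

Calling search on ApiProxySketch.proxyFor(new ApiProxySketch.SearchServiceImpl()) works as normal, while canCastToImpl and hasSetter both return false for the proxy. Any code that casts to the impl or looks up setMaxResults by reflection breaks as soon as it is handed the proxy instead of the real object.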
So… don’t bind to implementations at all… ever… especially not via reflective methods… or casting… it will catch you out eventually. Unfortunately there is a lot of code out there that has made the assumption that it lives in a single classloader… and binds internally to its impls… you have been warned.
Component Loaders
I have been looking at an alternative mechanism for loading non-webapp components in Sakai for about 2 weeks. Currently we load the component manager as a side effect of webapp startup. Using a static ComponentManager factory means that the first thing to access it causes it to start up. Consequently, the first webapp with a context listener will perform a start-up. There is nothing particularly wrong with this other than the lack of control over the startup and lifecycle of the component manager, coupled with the need to have a static ComponentManager factory.
So, I have written a Tomcat Container Lifecycle listener that attaches to the base Container Lifecycle. This starts the component manager as a bean, not a static factory, and then registers a JMX MBean that is used as the holder. So we get full control of the startup of the component manager and its classloader structure, and we can manage it through JMX (i.e. start and stop it). The component manager starts up before any webapps start, and so is not tied to the lifecycle of the webapp it started in.
In addition, the component manager is now a normal bean implementing an interface. There is a proxy bean that enables consuming components to do a new ComponentManagerProxy(); a bean they own, which connects to the JMX managed component manager. It should now be easy to create unit tests that run inside Eclipse without the entire framework.
In addition to the component manager loader, I have a hook into the Host container inside tomcat that allows us to control the startup of each webapp. This will enable us to replace the classloaders of the webapp with something that makes better use of perm space and binds to the component manager structure.
The drawback of this approach is that the container needs to be targeted, but it’s already clear that this is required, since the classloader structure of Tomcat 6 is totally different from that of Tomcat 5. Performing the start-up in this way isolates the classloader setup and gives us the flexibility to choose a scheme that suits our needs.
Currently I have implementations for tomcat5 and tomcat6. WebSphere community edition uses tomcat 6 as a webapp container. I haven’t looked closely at WebLogic, Glassfish or JBoss, but they do have some sort of container lifecycle below the J2EE lifecycle so it should be possible to apply the same approach. The code is in branch https://source.sakaiproject.org/svn/component/branches/SAK-12134 but be warned, at the time of writing this code is wild west.
Vectors … read the small print
Well I thought Vectors were synchronized. The Object is, but Iterators are not.
From the Javadoc:
” The Iterators returned by Vector’s iterator and listIterator methods are fail-fast: if the Vector is structurally modified at any time after the Iterator is created, in any way except through the Iterator’s own remove or add methods, the Iterator will throw a ConcurrentModificationException. Thus, in the face of concurrent modification, the Iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future. The Enumerations returned by Vector’s elements method are not fail-fast. ”
Hence, create an iterator on a Vector and you will get a ConcurrentModificationException from the iterator if the Vector is modified while you iterate, even though the Vector itself is synchronized.
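A small single-threaded sketch shows the behaviour, along with one common way round it. The class and method names are made up for illustration. The workaround relies on Vector's methods synchronizing on the Vector instance itself, so holding that same lock for the whole traversal keeps other threads' modifications out until the loop is done.

```java
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.Vector;

// Hypothetical demo class; only java.util types are used.
final class VectorIterationSketch {

    /** Modifies the Vector after creating an iterator; returns true if the iterator fails fast. */
    static boolean failsFast() {
        Vector<String> vector = new Vector<>(Arrays.asList("a", "b", "c"));
        Iterator<String> iterator = vector.iterator();
        iterator.next();
        vector.add("d"); // structural modification made outside the iterator
        try {
            iterator.next();
            return false;
        } catch (ConcurrentModificationException expected) {
            return true;
        }
    }

    /** Iterates while holding the Vector's own lock, so synchronized writers have to wait. */
    static int totalLength(Vector<String> vector) {
        int total = 0;
        synchronized (vector) {
            for (String item : vector) {
                total += item.length();
            }
        }
        return total;
    }
}
```

The lock only protects against other threads. Modifying the Vector from inside the loop body, other than through the iterator's own remove, still triggers the exception.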