Monthly Archives: April 2009

[How to] Make a custom search with Nutch (v1.0)

What is Nutch?

Nutch is an open source web crawler + search engine based on Lucene. These are a few things that make it great:

  1. Open source
  2. Has a web crawler that understands and indexes HTML, RTF and PDF documents, plus all the links it encounters
  3. Its search engine is based on Lucene
  4. Has a plugin based architecture, which means we can have our own plugins for indexing and searching, without a single line of code change to the core nutch.jar
  5. Uses Hadoop for storing indexes, so it's pretty scalable

Use case

Suppose we want to search for the author of the website by his email id.

First things first: let's index our custom data

Before we can search for our custom data, we need to index it. Nutch has a plugin architecture very similar to that of Eclipse. We can write our own plugin for indexing. Here is the source code:

Also, you need to create a plugin.xml:

Once this is done, create a new folder under $NUTCH_HOME/plugins and put the jar and the plugin.xml there.

Now we have to activate this plugin. To do this, we have to edit the conf/nutch-site.xml.

Now, how do I search my indexed data?

Option 1 [cumbersome]:

Add my own query plugin:

Do not forget to edit the plugin.xml.

This line is particularly important:

<parameter name="fields" value="email"/>

If you skip this line, the email field will never show up in search results.

The only catch here is that you have to prepend the keyword email: to the search key. For example, to find an author by address, you have to search for email: followed by the address, or simply email:jsmith.

There is an easier and more elegant way :), read on…

Option 2 [smart]

Use the existing query-basic plugin.

This involves editing just one file: conf/nutch-default.xml.

In the default distribution, you can see some commented lines like this:

All you have to do is uncomment them and put your custom field (email, in our case) in place of description. The resulting fragment will look like:

With this in place, you can simply enter the address, or a part of the name such as jsmit, without any prefix.

Building a Nutch plugin

The preferred way is to build with Ant, but I have used Maven with the following dependencies:

Useful links

Be warned that these are a bit outdated, so they may not be correct verbatim.

Jar Browser

This is a Swing app to look for a class, package or any resource in a set of given jar/zip files.

Use case

I have got an ugly NoClassDefFoundError and I suspect the class is present in one of five or so jars in a certain folder. I enter the class name in the text box, add the candidate jars via AddJars… and click Search. The matches are shown in the tree view.
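The core of such a search fits in a few lines of Java. The following is only a rough sketch and not the Jar Browser source: the class name JarSearchSketch and the method findEntries are made up for illustration, and the match is a plain substring test on entry names.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper, not taken from the Jar Browser sources.
class JarSearchSketch {

    // Returns "jarPath!entryName" for every entry whose name contains the query.
    // The query is tried as typed (e.g. log4j.properties) and in path form,
    // so that com.example.Foo also matches com/example/Foo.class.
    static List<String> findEntries(List<File> jars, String query) throws IOException {
        String pathForm = query.replace('.', '/');
        List<String> matches = new ArrayList<>();
        for (File jar : jars) {
            // JarFile reads plain zip files as well as jars.
            try (JarFile jarFile = new JarFile(jar)) {
                Enumeration<JarEntry> entries = jarFile.entries();
                while (entries.hasMoreElements()) {
                    String name = entries.nextElement().getName();
                    if (name.contains(query) || name.contains(pathForm)) {
                        matches.add(jar.getPath() + "!" + name);
                    }
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: JarSearchSketch <name> <jar-or-zip>...");
            return;
        }
        List<File> jars = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            jars.add(new File(args[i]));
        }
        for (String match : findEntries(jars, args[0])) {
            System.out.println(match);
        }
    }
}
```

A substring match keeps the sketch short; it will also report inner classes and similarly named resources, which is usually what you want when hunting for a missing class.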

How to run?

You need to have JDK1.6 or above. If you want to run it with 1.5, compile the sources with 1.5.

On Linux: java -jar JarBrowser.jar

On Windows: just double click on the JarBrowser.jar or use the same command.

Find the binaries here and the source here.


A simple wait notify example

We often need to fetch an object that might take a long time to arrive. The preferred way of doing that, especially on a UI thread, is to spawn a different thread so as to keep the UI responsive (this is just one of the many use cases that I can think of now). But since we need that object to proceed further in the current execution, we have to resort to some sort of wait/notify mechanism. The following code demonstrates a very simplistic approach using the regular wait()/notify().

Note: This is far from foolproof. One case where it will fail is if the long task finishes before lock.wait() is called: the notification is then lost and the waiting thread blocks forever.
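One common way to close that gap is to guard the wait with a flag that the worker sets before notifying, and to wait in a loop. The following is a minimal sketch under that assumption and not the original example; the names GuardedWaitSketch, done, result and slowComputation are made up for illustration.

```java
// Hypothetical sketch of a guarded wait; names are illustrative only.
// One-shot: create a new instance for every fetch.
class GuardedWaitSketch {
    private final Object lock = new Object();
    private boolean done = false;   // guarded by lock
    private String result;          // guarded by lock

    // Starts the slow task on another thread and blocks until it has finished.
    String fetch(final long workMillis) throws InterruptedException {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                String value = slowComputation(workMillis);
                synchronized (lock) {
                    result = value;
                    done = true;        // record completion before notifying
                    lock.notifyAll();
                }
            }
        });
        worker.start();

        synchronized (lock) {
            // Checking the flag in a loop covers a task that finished before
            // we got here, and also spurious wake-ups.
            while (!done) {
                lock.wait();
            }
            return result;
        }
    }

    // Stand-in for the long-running work.
    private static String slowComputation(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "fetched";
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(new GuardedWaitSketch().fetch(200)); // prints "fetched"
    }
}
```

Because the flag is read and written only while holding the lock, it no longer matters whether the worker or the waiting thread gets there first.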

Why won’t Teneo work with Ehcache?

The first problem is that by default EMF classes are not Serializable. The first stack trace you get comes in two parts and looks something like:

Stack Trace I

java.lang.ClassCastException: xxx.impl.SomeEmfClassImpl
at org.hibernate.type.AbstractType.disassemble(
at org.hibernate.type.TypeFactory.disassemble(
at org.hibernate.cache.entry.CacheEntry.<init>(
at org.hibernate.engine.TwoPhaseLoad.initializeEntity(
at org.hibernate.loader.Loader.initializeEntitiesAndCollections(
at org.hibernate.loader.Loader.doQuery(
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(

Stack Trace II

**Serious** 04/04/09 12:08:15 PM    org.hibernate.HibernateException: A collection with cascade="all-delete-orphan" was no longer referenced by the owning entity instance: Properties.trail
at org.hibernate.engine.Collections.processDereferencedCollection(
at org.hibernate.engine.Collections.processUnreachableCollection(
at org.hibernate.event.def.AbstractFlushingEventListener.flushCollections(
at org.hibernate.event.def.AbstractFlushingEventListener.flushEverythingToExecutions(
at org.hibernate.event.def.DefaultFlushEventListener.onFlush(
at org.hibernate.impl.SessionImpl.flush(
at org.hibernate.impl.SessionImpl.managedFlush(
at org.hibernate.transaction.JDBCTransaction.commit(

Stack Trace II is the most confusing. If you trust it, it will lead you off on a false scent. You will think that there is a problem with the way EMF handles collections. You will constantly rant and spit fire at them, thinking there is nothing to be done except selectively disabling caching for those queries.

But friends, here is the killer: Stack Trace II is caused by Stack Trace I. So let us explore that first. Look at its very first line: ClassCastException. What has caused this? Let us see.

at org.hibernate.type.AbstractType.disassemble(

In the source code, AbstractType.disassemble() looks like:

So in line 78, the EMF model is cast to Serializable, and that cast is what throws the exception.
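The failure is easy to reproduce outside Hibernate. The sketch below is not Hibernate code: PlainModel, SerializableModel and the disassemble helper are made-up stand-ins that only perform the same kind of cast to Serializable.

```java
import java.io.Serializable;

// Hypothetical stand-ins; none of these names come from Hibernate or EMF.
class SerializableCastSketch {

    // Plays the role of a generated EMF class that is not Serializable.
    static class PlainModel { }

    // Plays the role of a generated class once its interface extends Serializable.
    static class SerializableModel implements Serializable {
        private static final long serialVersionUID = 1L;
    }

    // The cast is the only thing that matters here.
    static Serializable disassemble(Object value) {
        return (Serializable) value;
    }

    public static void main(String[] args) {
        // The cast succeeds for the Serializable class.
        System.out.println(disassemble(new SerializableModel()) != null);

        // The same cast fails for the plain class.
        try {
            disassemble(new PlainModel());
        } catch (ClassCastException e) {
            System.out.println("ClassCastException for PlainModel");
        }
    }
}
```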


There is no way around it: the EMF models have to be made Serializable. How do we do that? We have to edit the default JET templates so that all interfaces generated by EMF extend Serializable. Here is an excellent article which will guide you. The template to edit is templates/model/Class.javajet.

<%if (isImplementation) {%>
public<%if (genClass.isAbstract()) {%> abstract<%}%> class <%=genClass.getClassName()%><%=genClass.getTypeParameters().trim()%><%=genClass.getClassExtends()%><%=genClass.getClassImplements()%>
<%} else {%>
public interface <%=genClass.getInterfaceName()%><%=genClass.getTypeParameters().trim()%><%=genClass.getInterfaceExtends()%>

Now re-generate the model code from genmodel and try again.

With this, some operations will of course work. But if you are using a Teneo version older than 1.0.4 (1.0.3 is the latest release as of now), most operations will produce a trace that looks something like:

Stack Trace III

04/04/09 01:05:47.164 PM  131221 WARN  org.apache.struts.chain.commands.AbstractExceptionHandler – Unhandled exception
at xxx.MyInterceptor.getEntityName(
at org.hibernate.impl.SessionImpl.guessEntityName(
at org.hibernate.impl.SessionImpl.bestGuessEntityName(
at org.eclipse.emf.teneo.hibernate.mapping.econtainer.EContainerUserType.assemble(
at org.hibernate.type.TypeFactory.assemble(
at org.hibernate.cache.entry.CacheEntry.assemble(

Before you get lost in the trace, let's take a quick look at what actually happens. Hibernate encounters a cache entry and tries to make sense of it and construct the actual object queried. In the process, the work is delegated to TypeFactory.assemble(), which determines the type of the entry and delegates in turn to an implementation of org.hibernate.type.Type. So far so good. So what goes wrong?

at org.eclipse.emf.teneo.hibernate.mapping.econtainer.EContainerUserType.assemble(

If you compare its source code with the assemble() method of its superclass, org.hibernate.type.AbstractType, you will notice one significant difference: in AbstractType.assemble(), if the cached parameter is null, it returns null, assuming that Hibernate will fetch the value from the database instead.

In EContainerUserType.assemble(), however, there is no null check, and that is what causes the big trace. Grab the Teneo sources, apply the fix, recompile and bingo!
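The shape of such a fix is a plain null guard. The sketch below is neither the Teneo patch nor Hibernate code; assembleUnchecked, assembleGuarded and resolve are hypothetical names that only illustrate why a missing cache entry has to be handled before it is dereferenced.

```java
import java.io.Serializable;

// Hypothetical illustration of a null guard; not taken from Teneo or Hibernate.
class NullGuardSketch {

    // Stand-in for turning a cached value back into an object.
    private static Object resolve(Serializable cached) {
        return "entity:" + cached.toString(); // dereferences the cached value
    }

    // Without a guard, a null cache entry fails inside resolve().
    static Object assembleUnchecked(Serializable cached) {
        return resolve(cached);
    }

    // With a guard, null is handed back so the caller can load the value elsewhere.
    static Object assembleGuarded(Serializable cached) {
        if (cached == null) {
            return null;
        }
        return resolve(cached);
    }

    public static void main(String[] args) {
        System.out.println(assembleGuarded("42"));   // prints "entity:42"
        System.out.println(assembleGuarded(null));   // prints "null"
        try {
            assembleUnchecked(null);
        } catch (NullPointerException e) {
            System.out.println("unchecked variant failed on a null entry");
        }
    }
}
```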

This fix will be available in the Teneo 1.0.4 release. I have raised a bug on this and Martin has been kind enough to check it in.

For this article, I am using EMF 2.3.0_v_200706262000, Teneo 0.8_v_200708101732,  Hibernate 3.3.1.GA and Ehcache 1.5.0. You will find the Teneo sources in the public CVS repository here

Configuring connection pooling with Teneo

Connection pooling is a must on a production server. Spring is becoming an extremely popular choice with Hibernate for its seamless integration, but with Teneo it's a different ball game altogether. Fortunately for us, Hibernate provides a hook for this: the Environment.CONNECTION_PROVIDER configuration property, which takes the name of a class implementing the ConnectionProvider interface.
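As a rough sketch of what that property amounts to, the snippet below builds a java.util.Properties object carrying it. The key is the string value of Hibernate's Environment.CONNECTION_PROVIDER constant; the provider class name com.example.DbcpConnectionProvider is invented for illustration, and handing the properties to Teneo is deliberately left out.

```java
import java.util.Properties;

// Hypothetical helper; only the property key comes from Hibernate.
class ConnectionProviderPropsSketch {

    // String value of Hibernate's Environment.CONNECTION_PROVIDER constant.
    static final String CONNECTION_PROVIDER = "hibernate.connection.provider_class";

    // Builds properties that name the ConnectionProvider implementation to use.
    static Properties poolingProperties(String providerClassName) {
        Properties props = new Properties();
        props.setProperty(CONNECTION_PROVIDER, providerClassName);
        return props;
    }

    public static void main(String[] args) {
        // com.example.DbcpConnectionProvider is a made-up class name.
        Properties props = poolingProperties("com.example.DbcpConnectionProvider");
        System.out.println(props.getProperty(CONNECTION_PROVIDER));
    }
}
```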

I prefer the Apache Commons DBCP. You will find the example here. When configuring Teneo, you need to set: