
[How to] Make custom search with Nutch (v1.0)?

What is Nutch?

Nutch is an open source web crawler + search engine based on Lucene. These are a few things that make it great:

  1. Open source
  2. Has a web crawler that parses and indexes HTML, RTF and PDF documents, and follows the links it encounters
  3. Its search engine is based on Lucene
  4. Has a plugin-based architecture, which means we can write our own plugins for indexing and searching without changing a single line of code in the core nutch.jar
  5. Uses Hadoop for storing indexes, so it's pretty scalable

Use case

Suppose we want to find the author of a website by searching for their email address.

First things first: let's index our custom data

Before we can search for our custom data, we need to index it. Nutch has a plugin architecture very similar to that of Eclipse. We can write our own plugin for indexing. Here is the source code:
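As a minimal sketch of the extraction step such an indexing plugin might perform, the plain-Java helper below pulls the first email-like token out of a page's text. The class name EmailExtractor, the method firstEmail and the regular expression are all assumptions made for illustration; it uses no Nutch classes, and wiring the value into the index under a field named email is left to the indexing filter itself.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds the first email-like token in a page's text. */
final class EmailExtractor {
    // Deliberately simple pattern (an assumption, not a full RFC 5322 matcher).
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    private EmailExtractor() {}

    /** Returns the first match in lower case, or empty if there is none. */
    static Optional<String> firstEmail(String pageText) {
        if (pageText == null) {
            return Optional.empty();
        }
        Matcher m = EMAIL.matcher(pageText);
        return m.find()
            ? Optional.of(m.group().toLowerCase(Locale.ROOT))
            : Optional.empty();
    }
}
```

Lower-casing the address before indexing is a design choice of this sketch, so that searches do not depend on how the author capitalised it on the page.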

Also, you need to create a plugin.xml:
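A plugin descriptor for an indexing filter might look roughly like the sketch below. The plugin id index-email, the jar name, and the class org.example.nutch.EmailIndexingFilter are hypothetical names chosen for illustration; the extension point is the standard Nutch indexing filter point.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical descriptor: ids, names and the class are placeholders -->
<plugin id="index-email"
        name="Email Indexing Filter"
        version="1.0.0"
        provider-name="example.org">

  <runtime>
    <!-- The jar that contains the filter class -->
    <library name="index-email.jar">
      <export name="*"/>
    </library>
  </runtime>

  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>

  <!-- Registers the class as an indexing filter -->
  <extension id="org.example.nutch.indexer.email"
             name="Email Indexing Filter"
             point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="EmailIndexingFilter"
                    class="org.example.nutch.EmailIndexingFilter"/>
  </extension>
</plugin>
```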

Once that is done, create a new folder under $NUTCH_HOME/plugins and put the jar and the plugin.xml in it.

Now we have to activate this plugin. To do this, edit conf/nutch-site.xml and add the plugin's id to the plugin.includes property.
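As a sketch, the override in conf/nutch-site.xml could look like the fragment below. The plugin id index-email is hypothetical, and the rest of the value is only an assumed example: copy the actual plugin.includes value from your own conf/nutch-default.xml and add your plugin's id to it.

```xml
<!-- Assumed value: start from the plugin.includes in your nutch-default.xml
     and add the id of your plugin (here the hypothetical index-email) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|anchor|email)|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
```

The value is a regular expression matched against plugin ids, so index-(basic|anchor|email) enables index-basic, index-anchor and index-email together.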

Now, how do I search my indexed data?

Option 1 [cumbersome]:

Add my own query plugin:

Do not forget to edit the plugin.xml.

This line is particularly important:

<parameter name="fields" value="email"/>

If you skip this line, the email field will never show up in search results.

The only catch here is that you have to prepend the keyword email: to the search term. For example, to find an author whose address starts with jsmith, you have to search for the full address prefixed with email:, or for email:jsmith.
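For context, the parameter sits inside the implementation element of the query plugin's extension. A sketch of that part of the plugin.xml, with a hypothetical extension id and a hypothetical class org.example.nutch.EmailQueryFilter, might be:

```xml
<!-- Hypothetical ids and class; the extension point is Nutch's query filter point -->
<extension id="org.example.nutch.searcher.email"
           name="Email Query Filter"
           point="org.apache.nutch.searcher.QueryFilter">
  <implementation id="EmailQueryFilter"
                  class="org.example.nutch.EmailQueryFilter">
    <!-- Declares which field this query filter handles -->
    <parameter name="fields" value="email"/>
  </implementation>
</extension>
```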

There is an easier and more elegant way :), read on…

Option 2 [smart]

Use the existing query-basic plugin.

This involves editing just one file: conf/nutch-default.xml.

In the default distribution, you can see some commented lines like this:

All you have to do is uncomment them and put your custom field (email, in our case) in place of description. The resulting fragment will look like:

With this in place, when looking for an author you can simply enter the address, or just a part of the name such as jsmit, without the email: prefix.
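As a sketch, and assuming the commented block in your distribution follows the query.basic.<field>.boost naming pattern, the edited property might read as follows; the boost value of 1.0 is an assumption carried over from the commented example.

```xml
<!-- Assumed shape: the commented query.basic.description.boost property,
     uncommented, with "description" replaced by the custom field "email" -->
<property>
  <name>query.basic.email.boost</name>
  <value>1.0</value>
  <description>Declares a custom field and its boost to be added to the
  default fields of the Lucene query.</description>
</property>
```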

Building a Nutch plugin

The preferred way is to build with Ant, but I have used Maven with the following dependencies:

Useful links

Be warned that these are a bit outdated, so they may not be correct verbatim.