Latest News

Monday, 9 November 2009

Develop Your Own Google with Apache Lucene (Java Nutch Solr)

Apache Lucene is Open Source API that allows a Java developer (.Net libraries available) to write indexing and full-text search capable applications. I have been writing applications based on Lucene for the last 3 years and some of the applications have been deployed at large corporations. I know there are other libraries available to developers who wish to write indexing engine but this blog will solely focus on Apache Lucene. I will not compare it to other API.

Lucene is a very mature API and can be found in NetBeans IDE, Liferay, JackRabbit among others. IBM has written a very good document about the Lucene architecture, therefore I will not delve into it here.







Lucene, alone, is pretty much useless as any other API. Let's now introduce Nutch. Nutch is a web crawler built-on top of Lucene to provide file crawling capabilities. Nutch was designed to handle large amount of data from the internet (http). Due to its plugin architecture, it was later extended to provide local network crawling such as FTP, databases and Microsoft Windows Shares (I am the author of the protocol-smb plugin and co-author of the index-extra plugin found on the Nutch site). We had extended Nutch and turned it into an Enterprise Search application but most of the source codes were locked behind closed doors (company politics).Anyway, Nutch has evolved to become but still very complex in its inner working. The initial Nutch was developed to process data in a batch but there are ways to turn it real-time but that's for another day. Ok, so Nutch is good for crawling and indexing of data but it does not handle search directly. There is a web application available with Nutch but it is quite poor so let's now introduce the Solr.

Solr is a powerful web-based search server built-on top of Lucene. The application was developed by CNET Networks and donated to the Apache Foundation. I believe, not too sure so I might need some references here, Solr was powering the search feature on their site but it is definitely used internally by the company. Late 2009, Lucid Imaginations receives $7.5 million in funding to provide commercial services built around Solr (and Lucene possibly?). Here is a very good presentation about Solr. Solr is a very good indexing engine. The keyword here is "indexing engine". It does not have any support for crawling data therefore requiring the developer to create applications that will feed it the data to index. I do believe that it is a good feature of the application as it gives the ability to integrate with various systems as long as they can post data over HTTP.

Nutch is a good crawler but it does not provide an enterprise-grade search interface to its data. Solr, in the other hand, is powerful indexer and has an enterprise grade search interface but it does not know how to gather data in its own. I am sure by now it has become obvious how we can integrate them both together.

We want Nutch to gather the data, by-pass its indexing cycle and feed the data directly to Solr. Lucid Imagination has a good tutorial about it here.







After reading the tutorial from Lucid Imagination, you will notice that Nutch is run by executing some bash files. This is something I strongly disagree with. If Nutch is based on Java (an OS independent language), why do we need to execute some UNIX/LINUX shell script. Also, the fact we need to install CygWin on MS Windows platform to be able to run is a big negative for me. I wrote a simple Java application that will launch Nutch and send the indexing to Solr but as you can see in the source code, you still need a UNIX like environment to run successfully. You can write a platform independent version by looking up Nutch API and calling the methods directly.

Well, I hope that this entry help you understand how to use Nutch and Solr built-on top of Apache Lucene. If you need any clarification, leave comments and I will try to gave ASAP if time permits.

package com.etapix.nutchsolr;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 *
 * @author Armel Nene
 */
public class Indexer {

    public static void main(String args[]) {
        if (args.length < 3) {
            System.out.println("Usage:" +
                    "\ncrawlName        -   This will be used to store crawler files in CrawlName directory" +
                    "\nurlFolder        -   The path to the folder containing the URL to crawl" +
                    "\nsolrUrl          -   The URL to the Solr server");
            return;
        }
        String crawlerName = args[0];
        String urlFolder = args[1];
        String solrUrl = args[2];
        String inject = "bash bin/nutch2.sh inject " + crawlerName + "/crawldb " + urlFolder;
        String generate = "bash bin/nutch2.sh generate " + crawlerName + "/crawldb " + crawlerName + "/segments -topN 10 -numFetchers 5";
        String export = crawlerName + "/segments/";

        String invertLinks = "bash bin/nutch2.sh invertlinks " + crawlerName + "/linkdb -dir " + crawlerName + "/segments";
        String indexSolr = "bash bin/nutch2.sh solrindex " + solrUrl + " " + crawlerName + "/crawldb " + crawlerName + "/linkdb " + crawlerName + "/segments/*";
        try {
            System.out.println("Injecting URLs in crawldb");
//            int state = 0;
            InputStream in = Runtime.getRuntime().exec(inject).getInputStream();
            System.out.println(convertStreamToString(in));


//            state = Runtime.getRuntime().exec(inject).waitFor();
//            System.out.println("process completed: " + state);

            for (int i = 0; i < 3; i++) {
                System.out.println("Generating segments");
                in = Runtime.getRuntime().exec(generate).getInputStream();
                System.out.println(convertStreamToString(in));

                System.out.println("Setting environment variable $SEGMENT");
//            String segs = convertStreamToString(Runtime.getRuntime().exec("ls -tr " + crawlerName + "/segments|tail -1").getInputStream());

                String segments = export + lastFileModified(export).getName();
                System.out.println("$segments: " + segments);
//            in = Runtime.getRuntime().exec(export + segs).getInputStream();
            System.out.println(convertStreamToString(in));

                String fetch = "bash bin/nutch2.sh fetch " + segments + " -noParsing";
                String parse = "bash bin/nutch2.sh parse " + segments;
                String update = "bash bin/nutch2.sh updatedb " + crawlerName + "/crawldb " + segments + " -filter -normalize";

                System.out.println("fetch segments");
                in = Runtime.getRuntime().exec(fetch).getInputStream();
                System.out.println(convertStreamToString(in));

                System.out.println("Parse segments");
                in = Runtime.getRuntime().exec(parse).getInputStream();
                System.out.println(convertStreamToString(in));

                System.out.println("Update crawldb");
                in = Runtime.getRuntime().exec(update).getInputStream();
                System.out.println(convertStreamToString(in));
            }
            System.out.println("Inverting links");
            in = Runtime.getRuntime().exec(invertLinks).getInputStream();
            System.out.println(convertStreamToString(in));

            System.out.println("Indexing contents to Solr " + solrUrl);
            in = Runtime.getRuntime().exec(indexSolr).getInputStream();
            System.out.println(convertStreamToString(in));

        } catch (Exception ex) {
            Logger.getLogger(Indexer.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public static String convertStreamToString(InputStream is) {
        /*
         * To convert the InputStream to String we use the BufferedReader.readLine()
         * method. We iterate until the BufferedReader return null which means
         * there's no more data to read. Each line will appended to a StringBuilder
         * and returned as String.
         */
        BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        StringBuilder sb = new StringBuilder();

        String line = null;
        System.out.println("Now converting inputstream to text");
        try {
            while ((line = reader.readLine()) != null) {
                sb.append(line + "\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                is.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        System.out.println("Finish converting to text");
        return sb.toString();
    }

    public static File lastFileModified(String dir) {
        File fl = new File(dir);
        File[] files = fl.listFiles(new FileFilter() {

            public boolean accept(File file) {
                return file.isDirectory();
            }
        });
        long lastMod = Long.MIN_VALUE;
        File choice = null;
        for (File file : files) {
            if (file.lastModified() > lastMod) {
                choice = file;
                lastMod = file.lastModified();
            }
        }
        return choice;
    }
}

  • Blogger Comments
  • Facebook Comments

5 comments :

  1. hmmmm nice an informative blog abt Lucene... i only use Hibernate Search API to add searching capablity in my application..well u say Nutch is a web craler can u explain it.. can u show sample example only in Nutch or can we use Nutch alone...

    ReplyDelete
  2. Hope this link helps. http://en.wikipedia.org/wiki/Web_crawler
    Nutch is a powerful web crawler that can be used as a standalone application. You can then develop your own Lucene search interface to search the repositories. Here is tutorial on how to setup Nutch to run as a standalone application http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
    Nutch is much more powerful than Hibernate and Lucene, to my opinion, is much better for full-text search.
    You can also use the Nutch API to embed Nutch into your application.

    ReplyDelete
  3. Hi
    Please could you let me know,
    how to index Solr using Web Services (WSDL)

    Thank you

    ReplyDelete
  4. Hi
    Please could you let me know
    How to index Solr using Web Services (WSDL)?

    thank you.

    ReplyDelete

Item Reviewed: Develop Your Own Google with Apache Lucene (Java Nutch Solr) Description: Rating: 5 Reviewed By: Unknown
Scroll to Top