Latest News

    • Web Services Architecture – When to Use SOAP vs REST

SOAP (Simple Object Access Protocol) and REST (Representational State Transfer) are both popular with developers working on system-integration projects. Software architects design an application from various perspectives and also decide, for various reasons, which approach to take when exposing a new API to third-party applications. As a software architect, it is good practice to involve your development team lead in the system architecture process. This article, based on my experience, discusses when to use SOAP or REST web services to expose your API to third-party clients.

Web Services Demystified

Web services are part of a service-oriented architecture, where they serve as the model for process decomposition and assembly. I have been involved in discussions where there was some confusion between web services and web APIs.

The W3C defines a web service generally as:

A software system designed to support interoperable machine-to-machine interaction over a network.

A web API, also known as a server-side web API, is a programmatic interface to a defined request-response message system, typically expressed in JSON or XML, which is exposed via the web, most commonly by means of an HTTP-based web server (extracted from Wikipedia).

Based on the definitions above, one might infer when SOAP should be used instead of REST and vice versa, but it is not as simple as it looks. We can agree that web services are not the same as web APIs. Accessing an image over the web is not calling a web service but retrieving a web resource using its Uniform Resource Identifier. HTML has a well-defined, standard approach to serving resources to clients and does not require a web service to fulfill the request.

Why Use REST over SOAP

Developers are passionate people. Let's briefly analyze some of the reasons they mention when arguing for REST over SOAP.

REST is easier than SOAP. I'm not sure what developers mean when they argue that REST is easier than SOAP. In my experience, depending on the requirements, developing REST services can quickly become just as complex as any other SOA project. What is your service abstracting from the client? What level of security is required? Is your service a long-running asynchronous process? These and many other requirements will increase the level of complexity.

Testability: apparently it is easier to test RESTful web services than their SOAP counterparts. This is only partially true. For simple REST services, developers only have to point their browser at the service endpoint and a result is returned in the response. But what happens once you need to add HTTP headers, pass tokens, validate parameters, and so on? The service is still testable, but chances are you will need a browser plugin to exercise those features, and if a plugin is required then the ease of testing is exactly the same as using SoapUI to test SOAP-based services.
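
To make the testability point concrete, below is a minimal sketch of the kind of client you end up writing once a browser alone is no longer enough. The endpoint URL and the bearer token are hypothetical; the point is only that custom headers and tokens push you towards a programmatic client (or a browser plugin):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestSmokeTest {

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; substitute your own service URL.
        URL url = new URL("https://api.example.com/quotes/BARC");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");
        // These headers are exactly what a plain browser address bar cannot send.
        con.setRequestProperty("Accept", "application/json");
        con.setRequestProperty("Authorization", "Bearer <your-token-here>");
        System.out.println("HTTP " + con.getResponseCode());
        BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}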

RESTful web services serve JSON, which is faster to parse than XML

This so-called benefit relates to consuming web services in a browser. RESTful web services can also serve XML, or any other MIME type you desire. This article is not focused on JSON vs XML, and I won't write a separate article on that topic either. JSON is related to JavaScript, and because JS is very close to the web, providing interaction alongside HTML and CSS, most developers automatically assume that it is also tied to interacting with RESTful web services. If you didn't know before, I'm sure you can guess that RESTful web services are language agnostic. As for the claim that XML is slower to process than JSON, a performance test conducted by David Lee, Lead Engineer at MarkLogic, found it to be a myth.

REST is built for the web

Well, this is true according to Roy Fielding's dissertation; after all, he is credited with the creation of the REST architectural style. REST, unlike SOAP, uses the underlying technology for transport and communication between clients and servers, and the style is optimized for the modern web architecture. The web, however, has outgrown its initial requirements, as can be seen through HTML5 and the standardization of WebSockets; it has become a platform in its own right, maybe a WebOS. Some applications, from financial systems to e-commerce, will require server-side state.

Caching

REST over HTTP can exploit features built into HTTP, such as caching, security in the form of TLS, and authentication. Architects know, however, that dynamic resources should not be cached. Consider an example: a RESTful web service that serves stock quotes for a given ticker. Stock quotes change by the millisecond; if we request BARC (Barclays Bank), the quote we received a minute ago may well differ from the one two minutes later. This shows that we cannot always use the caching features implemented in the protocol. HTTP caching is useful when clients request static content, but if the caching features of HTTP are not enough for your requirements, then you should also evaluate SOAP, because you will be building your own cache either way rather than relying on the protocol.
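
To make the caching point concrete, here is a sketch of the hypothetical stock-quote service opting out of HTTP caching because its representation changes constantly. QuoteServlet and lookupPrice are illustrative names, and the example assumes the standard servlet API:

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class QuoteServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String ticker = req.getParameter("ticker"); // e.g. BARC
        // Quotes change by the millisecond, so tell clients and intermediaries
        // not to cache this response at all.
        resp.setHeader("Cache-Control", "no-store, no-cache, must-revalidate");
        resp.setContentType("application/json");
        resp.getWriter().write("{\"ticker\":\"" + ticker + "\",\"price\":" + lookupPrice(ticker) + "}");
    }

    private double lookupPrice(String ticker) {
        return 123.45; // placeholder; a real service would query a market-data feed
    }
}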

HTTP Verb Binding

HTTP verb binding is supposedly a feature worth discussing when comparing REST and SOAP. Much of the public-facing API referred to as RESTful is really only REST-like and does not implement all the HTTP verbs in the manner they are intended. For example, when creating new resources most developers use POST instead of PUT, and even deletions are often sent as POST requests instead of DELETE. SOAP also defines a binding to the HTTP protocol; when bound to HTTP, all SOAP requests are sent as POST requests.

Security

Security is rarely mentioned when discussing the benefits of REST over SOAP. Two simple security measures are provided at the HTTP layer: basic authentication and communication encryption through TLS. SOAP security, by contrast, is well standardized through WS-Security. HTTP itself is not secure, as seen in the news all the time, so web services relying on the protocol need to implement their own rigorous security. Security goes beyond simple authentication and confidentiality; it also includes authorization and integrity. When it comes to ease of implementation, I believe SOAP is at the forefront.

Conclusion

This was meant to be a short blog post, but it seems we got too passionate about the subject. I accept that there are many other factors to consider when choosing between SOAP and REST, but I will oversimplify it here. For machine-to-machine communications, such as business processing with BPEL, or where transaction security and integrity are required, I suggest using SOAP. SOAP binding to HTTP is possible, and XML parsing is not noticeably slower than JSON in the browser. For building a public-facing API, REST is not the undisputed champion; consider the actual application requirements and evaluate the benefits.

The argument that REST is protocol agnostic and works on anything that has a URI is beside the point. According to its creator, REST was conceived for the evolution of the web, and most so-called RESTful web services available on the internet are really only REST-like, as they do not follow the principles of the architectural style. One good thing about working with REST is that applications do not need a service contract à la SOAP (WSDL). WADL was never standardized, and I do not believe developers would implement it; I remember looking for Twitter's WADL in order to integrate with it.

I will leave you to make your own conclusions. There is only so much I can write in one blog post. Feel free to leave comments to keep the discussion going.

Monday, 9 November 2009

Develop Your Own Google with Apache Lucene (Java Nutch Solr)

Apache Lucene is an open-source API that allows a Java developer (.NET libraries are also available) to write indexing and full-text-search-capable applications. I have been writing applications based on Lucene for the last three years, and some of those applications have been deployed at large corporations. I know there are other libraries available to developers who wish to write an indexing engine, but this post will focus solely on Apache Lucene; I will not compare it to other APIs.

Lucene is a very mature API and can be found in the NetBeans IDE, Liferay, and Apache Jackrabbit, among others. IBM has written a very good document about the Lucene architecture, so I will not delve into it here.
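
To give a feel for the API, here is a minimal sketch that indexes one document into an in-memory index and searches it. It assumes a Lucene 2.x-era API (current when this was written); class names and signatures changed in later releases:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class HelloLucene {

    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index a single document with one analyzed, stored field.
        IndexWriter writer = new IndexWriter(dir, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("content", "Apache Lucene brings full-text search to Java",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Search the freshly built index.
        IndexSearcher searcher = new IndexSearcher(dir);
        Query query = new QueryParser("content", analyzer).parse("lucene");
        TopDocs hits = searcher.search(query, 10);
        System.out.println("Matches: " + hits.totalHits);
        searcher.close();
    }
}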

Lucene alone, like any other API, does little by itself, so let's now introduce Nutch. Nutch is a web crawler built on top of Lucene to provide crawling capabilities. It was designed to handle large amounts of data from the internet (HTTP) and, thanks to its plugin architecture, was later extended to crawl local networks over protocols such as FTP, as well as databases and Microsoft Windows shares (I am the author of the protocol-smb plugin and co-author of the index-extra plugin found on the Nutch site). We had extended Nutch and turned it into an enterprise search application, but most of the source code was locked behind closed doors (company politics). Nutch has evolved considerably but remains very complex in its inner workings. It was originally developed to process data in batches; there are ways to make it closer to real-time, but that is a topic for another day. So Nutch is good for crawling and indexing data, but it does not handle search directly. There is a web application shipped with Nutch, but it is quite poor, so let's now introduce Solr.

Solr is a powerful web-based search server built on top of Lucene. The application was developed by CNET Networks and donated to the Apache Foundation. I believe (not too sure, so I might need some references here) Solr was powering the search feature on their site; it was definitely used internally by the company. In late 2009, Lucid Imagination received $7.5 million in funding to provide commercial services built around Solr (and possibly Lucene). Here is a very good presentation about Solr. Solr is a very good indexing engine, and the key phrase here is "indexing engine": it has no support for crawling data, so the developer must create applications that feed it the data to index. I consider this a good feature, as it allows Solr to integrate with all kinds of systems as long as they can post data over HTTP.
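
To illustrate "post data over HTTP", here is a minimal sketch that adds one document to a local Solr instance through its XML update handler. The host, port, and field names are assumptions and must match your Solr schema:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SolrPoster {

    public static void main(String[] args) throws Exception {
        // Assumes a local Solr instance whose schema defines "id" and "title" fields.
        URL url = new URL("http://localhost:8983/solr/update?commit=true");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setDoOutput(true);
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        String doc = "<add><doc>"
                + "<field name=\"id\">doc-1</field>"
                + "<field name=\"title\">Hello Solr</field>"
                + "</doc></add>";
        OutputStream out = con.getOutputStream();
        out.write(doc.getBytes("UTF-8"));
        out.close();
        System.out.println("Solr responded: HTTP " + con.getResponseCode());
    }
}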

Nutch is a good crawler, but it does not provide an enterprise-grade search interface over its data. Solr, on the other hand, is a powerful indexer with an enterprise-grade search interface, but it has no way to gather data on its own. I am sure it has become obvious by now how we can integrate the two.

We want Nutch to gather the data, bypass its own indexing cycle, and feed the data directly to Solr. Lucid Imagination has a good tutorial about this here.

After reading the tutorial from Lucid Imagination, you will notice that Nutch is run by executing bash scripts. This is something I strongly disagree with: if Nutch is written in Java, an OS-independent language, why do we need UNIX/Linux shell scripts to run it? The fact that we need to install Cygwin on the Microsoft Windows platform to run it is also a big negative for me. I wrote a simple Java application, listed below, that launches Nutch and sends the indexing output to Solr, but as you can see in the source code, you still need a UNIX-like environment for it to run successfully. You could write a platform-independent version by looking up the Nutch API and calling the methods directly, as sketched after the listing.

Well, I hope this entry helps you understand how to use Nutch and Solr, both built on top of Apache Lucene. If you need any clarification, leave a comment and I will try to answer ASAP, time permitting.

package com.etapix.nutchsolr;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 *
 * @author Armel Nene
 */
public class Indexer {

    public static void main(String args[]) {
        if (args.length < 3) {
            System.out.println("Usage:" +
                    "\ncrawlName        -   This will be used to store crawler files in CrawlName directory" +
                    "\nurlFolder        -   The path to the folder containing the URL to crawl" +
                    "\nsolrUrl          -   The URL to the Solr server");
            return;
        }
        String crawlerName = args[0];
        String urlFolder = args[1];
        String solrUrl = args[2];
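
        // Shell commands for each stage of the Nutch crawl cycle, run through
        // bin/nutch2.sh (presumably a local wrapper around the stock bin/nutch script).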
        String inject = "bash bin/nutch2.sh inject " + crawlerName + "/crawldb " + urlFolder;
        String generate = "bash bin/nutch2.sh generate " + crawlerName + "/crawldb " + crawlerName + "/segments -topN 10 -numFetchers 5";
        String export = crawlerName + "/segments/";

        String invertLinks = "bash bin/nutch2.sh invertlinks " + crawlerName + "/linkdb -dir " + crawlerName + "/segments";
        String indexSolr = "bash bin/nutch2.sh solrindex " + solrUrl + " " + crawlerName + "/crawldb " + crawlerName + "/linkdb " + crawlerName + "/segments/*";
        try {
            System.out.println("Injecting URLs in crawldb");
//            int state = 0;
            InputStream in = Runtime.getRuntime().exec(inject).getInputStream();
            System.out.println(convertStreamToString(in));


//            state = Runtime.getRuntime().exec(inject).waitFor();
//            System.out.println("process completed: " + state);

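            // Run three generate -> fetch -> parse -> updatedb rounds; increase
            // the loop count to crawl deeper from the injected seed URLs.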
            for (int i = 0; i < 3; i++) {
                System.out.println("Generating segments");
                in = Runtime.getRuntime().exec(generate).getInputStream();
                System.out.println(convertStreamToString(in));

                System.out.println("Setting environment variable $SEGMENT");
//            String segs = convertStreamToString(Runtime.getRuntime().exec("ls -tr " + crawlerName + "/segments|tail -1").getInputStream());

                String segments = export + lastFileModified(export).getName();
                System.out.println("$segments: " + segments);
//            in = Runtime.getRuntime().exec(export + segs).getInputStream();
            System.out.println(convertStreamToString(in));

                String fetch = "bash bin/nutch2.sh fetch " + segments + " -noParsing";
                String parse = "bash bin/nutch2.sh parse " + segments;
                String update = "bash bin/nutch2.sh updatedb " + crawlerName + "/crawldb " + segments + " -filter -normalize";

                System.out.println("fetch segments");
                in = Runtime.getRuntime().exec(fetch).getInputStream();
                System.out.println(convertStreamToString(in));

                System.out.println("Parse segments");
                in = Runtime.getRuntime().exec(parse).getInputStream();
                System.out.println(convertStreamToString(in));

                System.out.println("Update crawldb");
                in = Runtime.getRuntime().exec(update).getInputStream();
                System.out.println(convertStreamToString(in));
            }
            System.out.println("Inverting links");
            in = Runtime.getRuntime().exec(invertLinks).getInputStream();
            System.out.println(convertStreamToString(in));

            System.out.println("Indexing contents to Solr " + solrUrl);
            in = Runtime.getRuntime().exec(indexSolr).getInputStream();
            System.out.println(convertStreamToString(in));

        } catch (Exception ex) {
            Logger.getLogger(Indexer.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public static String convertStreamToString(InputStream is) {
        /*
         * To convert the InputStream to a String we use BufferedReader.readLine().
         * We iterate until readLine() returns null, which means there is no more
         * data to read. Each line is appended to a StringBuilder and the result
         * is returned as a String.
         */
        BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        StringBuilder sb = new StringBuilder();

        String line = null;
        System.out.println("Now converting inputstream to text");
        try {
            while ((line = reader.readLine()) != null) {
                sb.append(line + "\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                is.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        System.out.println("Finish converting to text");
        return sb.toString();
    }

    public static File lastFileModified(String dir) {
        File fl = new File(dir);
        File[] files = fl.listFiles(new FileFilter() {

            public boolean accept(File file) {
                return file.isDirectory();
            }
        });
        if (files == null) {
            // The directory does not exist or is not readable; fail loudly
            // instead of throwing a NullPointerException below.
            throw new IllegalStateException("Cannot list segment directories in " + dir);
        }
        long lastMod = Long.MIN_VALUE;
        File choice = null;
        for (File file : files) {
            if (file.lastModified() > lastMod) {
                choice = file;
                lastMod = file.lastModified();
            }
        }
        return choice;
    }
}
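
For a platform-independent variant, the sketch below drives one Nutch step in-process instead of shelling out. It assumes Nutch 1.x, where each stage (Injector, Generator, Fetcher, ParseSegment, CrawlDb, LinkDb, SolrIndexer) is a Hadoop Tool; check the exact argument lists against the Nutch API for your version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

public class PortableIndexer {

    public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        // The Injector is a Hadoop Tool, so it can run in-process; the
        // arguments mirror "bin/nutch inject <crawldb> <url_dir>".
        int exitCode = ToolRunner.run(conf, new Injector(),
                new String[]{"crawl/crawldb", "urls"});
        System.out.println("inject exit code: " + exitCode);
        // Repeat with Generator, Fetcher, ParseSegment, CrawlDb, LinkDb and
        // SolrIndexer to reproduce the whole pipeline without bash or Cygwin.
    }
}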


Comments:

  1. Hmm, a nice and informative blog about Lucene. I only use the Hibernate Search API to add search capability to my applications. You say Nutch is a web crawler; can you explain that? Can you show a sample example in Nutch, or can we use Nutch alone?

  2. Hope this link helps: http://en.wikipedia.org/wiki/Web_crawler
    Nutch is a powerful web crawler that can be used as a standalone application. You can then develop your own Lucene search interface to search the repositories. Here is a tutorial on how to set up Nutch as a standalone application: http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
    Nutch is much more powerful than Hibernate Search and, in my opinion, Lucene is much better for full-text search.
    You can also use the Nutch API to embed Nutch into your application.

  3. Hi,
    Please could you let me know how to index Solr using web services (WSDL)?

    Thank you
  4. I find this explanation of Lucene's ecosystem very helpful.

