source: https://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html
Lucene is a high-performance, scalable search engine technology. The Lucene API consists of both indexing and searching features. The first part of this article takes you through an example of using Lucene to index all the text files in a directory and its subdirectories. Before proceeding to examples of analysis and searching, we'll take a brief detour to discuss the format of the index directory.
Indexing
We'll begin by creating the Indexer class that will be used to index all the text files in a specified directory. This class is a utility class with a single public method, index(), that takes two arguments. The first argument is a File object, indexDir, that corresponds to the directory where the index will be created. The second argument is another File object, dataDir, that corresponds to the directory to be indexed.

public static void index(File indexDir, File dataDir) throws IOException {
    if (!dataDir.exists() || !dataDir.isDirectory()) {
        throw new IOException(dataDir + " does not exist or is not a directory");
    }

    IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
    indexDirectory(writer, dataDir);
    writer.close();
}
After checking that dataDir exists and is a directory, we instantiate the IndexWriter object that will be used to create the index. The IndexWriter constructor used above accepts as its first parameter the directory where the index will be created, with the last argument mandating that it be created from scratch rather than reusing an index that may already exist in that same location. The middle parameter is the analyzer to use for tokenized fields. Field analysis is described below, but for now we can take for granted that the important words in the file will be indexed thanks to the StandardAnalyzer.
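As a brief aside, the same constructor can also be used for incremental updates: passing false as the last argument reuses an index that already exists in indexDir and appends to it. The helper below is a minimal sketch of that variation (my own illustration, not part of the article's Indexer class):

private static void appendToIndex(File indexDir, File dataDir) throws IOException {
    // create=false: open the existing index in indexDir and add documents to it,
    // rather than building a new index from scratch.
    IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
    indexDirectory(writer, dataDir);
    writer.close();
}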
The indexDirectory() method walks the directory tree, scanning for .txt files. Any .txt file will be indexed using the indexFile() method, any directory will be processed recursively using the indexDirectory() method, and any other file will be ignored. Here is the code for indexDirectory():

private static void indexDirectory(IndexWriter writer, File dir) throws IOException {
    File[] files = dir.listFiles();

    for (int i = 0; i < files.length; i++) {
        File f = files[i];
        if (f.isDirectory()) {
            indexDirectory(writer, f);  // recurse
        } else if (f.getName().endsWith(".txt")) {
            indexFile(writer, f);
        }
    }
}
The indexDirectory() method itself is independent of Lucene. This is an example of Lucene usage in general -- using Lucene rarely involves much coding directly with the Lucene API, but rather relies on your cleverness around it. And finally in the Indexer class, we get to the heart of its purpose, indexing a single text file:

private static void indexFile(IndexWriter writer, File f) throws IOException {
    System.out.println("Indexing " + f.getName());

    Document doc = new Document();
    doc.add(Field.Text("contents", new FileReader(f)));
    doc.add(Field.Keyword("filename", f.getCanonicalPath()));

    writer.addDocument(doc);
}
And believe it or not, we're done! We've just indexed an entire directory tree of text files. Yes, it really is that simple. To summarize, all it took to create this index were these steps:
- Create an IndexWriter.
- Locate each file to be indexed by walking the directory and looking for file names ending in .txt.
- For each text file, create a Document with the desired Fields.
- Add the document to the IndexWriter instance.
Let's assemble these methods into an Indexer class and add the appropriate imports. You can index a directory of text files by calling Indexer.index(indexDir, dataDir). We've also added a main() method so the Indexer can be run from the command line with the two directories passed in as command-line parameters.
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import java.io.File;
import java.io.IOException;
import java.io.FileReader;
public class Indexer {

    public static void index(File indexDir, File dataDir) throws IOException {
        if (!dataDir.exists() || !dataDir.isDirectory()) {
            throw new IOException(dataDir + " does not exist or is not a directory");
        }

        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        indexDirectory(writer, dataDir);
        writer.close();
    }

    private static void indexDirectory(IndexWriter writer, File dir) throws IOException {
        File[] files = dir.listFiles();

        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            if (f.isDirectory()) {
                indexDirectory(writer, f);  // recurse
            } else if (f.getName().endsWith(".txt")) {
                indexFile(writer, f);
            }
        }
    }

    private static void indexFile(IndexWriter writer, File f) throws IOException {
        System.out.println("Indexing " + f.getName());

        Document doc = new Document();
        // Field.Text(String, Reader) creates a tokenized and indexed field that is not
        // stored. The Reader is only read when the document is added to the index, so it
        // must not be closed before IndexWriter.addDocument(Document) has been called.
        doc.add(Field.Text("contents", new FileReader(f)));
        doc.add(Field.Keyword("filename", f.getCanonicalPath()));

        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new Exception("Usage: " + Indexer.class.getName() + " <index dir> <data dir>");
        }

        File indexDir = new File(args[0]);
        File dataDir = new File(args[1]);

        index(indexDir, dataDir);
    }
}
In this example, two fields are part of each document: the contents of the text file and the full file name. The contents field gets some extra special treatment under the covers as the StandardAnalyzer, which is discussed below, processes it. The filename field is indexed as is. There is still more to explain about what is going on, of course. The Field static methods Text and Keyword will be explained in detail after we take a quick look inside a Lucene index.

Lucene Index Anatomy
The Lucene index format is a directory structure of several files. You can successfully use Lucene without understanding this directory structure. Feel free to skip this section and treat the directory as a black box without regard to what is inside. When you are ready to dig deeper you'll find that the files you created in the last section contain statistics and other data to facilitate rapid searching and ranking. An index contains a sequence of documents. In our indexing example, each document represents information about a text file.
Documents
Documents are the primary retrievable units from a Lucene query. Documents consist of a sequence of fields. Fields have a name ("contents" and "filename" in our example). Field values are a sequence of terms.
Terms
A term is the smallest piece of a particular field. Fields have three attributes of interest:
- Stored -- Original text is available in the documents returned from a search.
- Indexed -- Makes this field searchable.
- Tokenized -- The text added is run through an analyzer and broken into relevant
pieces (only makes sense for indexed fields).
Stored fields are handy for immediately having the original text available
from a search, such as a database primary key or filename. Stored fields
can dramatically increase the index size, so use them wisely. Indexed
field information is stored extremely efficiently, such that the same term
in the same field name across multiple documents is only stored once, with pointers to the documents that
contain it.
The Field class has a few static methods to construct fields with combinations of the various attributes. They are:

- Field.Keyword -- Indexed and stored, but not tokenized. Keyword fields are useful for data like filenames, part numbers, primary keys, and other text that needs to stay intact as is.
- Field.Text -- Indexed and tokenized. The text is also stored if added as a String, but not stored if added as a Reader.
- Field.UnIndexed -- Only stored. UnIndexed fields are not searchable.
- Field.UnStored -- Indexed and tokenized, but not stored. UnStored fields are ideal for text that you want to be searchable but whose original form you keep elsewhere or do not need to display with search results.
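To make these combinations concrete, here is a minimal sketch of a document built with all four factory methods. The class, field names, and values ("id", "summary", and so on) are hypothetical and chosen only to illustrate when each combination makes sense:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;

public class FieldDemo {
    public static Document buildDocument(File file, String id, String summary, String body)
            throws FileNotFoundException {
        Document doc = new Document();
        doc.add(Field.Keyword("id", id));                       // indexed + stored, kept intact
        doc.add(Field.Text("summary", summary));                // indexed + tokenized + stored (String)
        doc.add(Field.Text("contents", new FileReader(file)));  // indexed + tokenized, not stored (Reader)
        doc.add(Field.UnIndexed("path", file.getPath()));       // stored only, not searchable
        doc.add(Field.UnStored("body", body));                  // indexed + tokenized, not stored
        return doc;
    }
}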
Up to now, Lucene seems relatively simple. But don't be fooled into thinking that there is not much to what is under the covers. It's actually quite sophisticated. The heart of this sophistication comes in the analysis of text, and how terms are pulled from the field data.
Analysis
Tokenized fields are where the real fun happens. In our example, we are indexing the contents of text files. The goal is to have the words in the text file be searchable, but for practical purposes it doesn't make sense to index every word. Some words like "a", "and", and "the" are generally considered irrelevant for searching and can be optimized out -- these are called stop words.
Does case matter for searching? What are word boundaries? Are acronyms, email addresses, URLs, and other such textual constructs kept intact and made searchable? If a singular word is indexed, should searching on the plural form return the document? These are all very interesting and complex questions to ask when deciding on which analyzer to use, or whether to create your own.
In our example, we use Lucene's built-in StandardAnalyzer, but there are other built-in analyzers, as well as some optional ones (found currently in the Lucene "sandbox" CVS repository) that can be used. Here is some code that explores what several of these analyzers do to two different text strings:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.io.StringReader;
import java.io.IOException;
public class AnalysisDemo {
    private static final String[] strings = {
        "The quick brown fox jumped over the lazy dogs",
        "XY&Z Corporation - xyz@example.com"
    };

    private static final Analyzer[] analyzers = new Analyzer[]{
        new WhitespaceAnalyzer(),
        new SimpleAnalyzer(),
        new StopAnalyzer(),
        new StandardAnalyzer(),
        new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS),
    };

    public static void main(String[] args) throws IOException {
        for (int i = 0; i < strings.length; i++) {
            analyze(strings[i]);
        }
    }

    private static void analyze(String text) throws IOException {
        System.out.println("Analyzing \"" + text + "\"");

        for (int i = 0; i < analyzers.length; i++) {
            Analyzer analyzer = analyzers[i];
            System.out.println("\t" + analyzer.getClass().getName() + ":");
            System.out.print("\t\t");

            // Tokenize the text and print each term the analyzer produces.
            TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
            while (true) {
                Token token = stream.next();
                if (token == null) break;

                System.out.print("[" + token.termText() + "] ");
            }
            System.out.println("\n");
        }
    }
}
The analyze() method uses Lucene's API in an exploratory fashion. Your indexing code would not need to see the results of textual analysis, but it is helpful to see the terms that result from the various analyzers. Here are the results:

Analyzing "The quick brown fox jumped over the lazy dogs"
org.apache.lucene.analysis.WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
org.apache.lucene.analysis.SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
org.apache.lucene.analysis.StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
org.apache.lucene.analysis.standard.StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
org.apache.lucene.analysis.snowball.SnowballAnalyzer:
[quick] [brown] [fox] [jump] [over] [lazi] [dog]
Analzying "XY&Z Corporation - xyz@example.com"
org.apache.lucene.analysis.WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
org.apache.lucene.analysis.SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
org.apache.lucene.analysis.StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
org.apache.lucene.analysis.standard.StandardAnalyzer:
[xy&z] [corporation] [xyz@example] [com]
org.apache.lucene.analysis.snowball.SnowballAnalyzer:
[xy&z] [corpor] [xyz@exampl] [com]
The WhitespaceAnalyzer is the most basic, simply separating tokens based on, of course, whitespace. Note that not even capitalization was changed. Searches are case-sensitive, so a general best practice is to lowercase text during the analysis phase; the rest of the analyzers do so as part of their processing. The SimpleAnalyzer splits text at non-letter boundaries, such as the special characters '&', '@', and '.' in the second demo string. The StopAnalyzer builds upon the features of the SimpleAnalyzer and also removes common English stop words.
The most sophisticated analyzer built into Lucene's core is StandardAnalyzer. Under the covers it is a JavaCC-based parser with rules for email addresses, acronyms, hostnames, and floating-point numbers, as well as lowercasing and stop-word removal. Analyzers build upon a chaining-filter architecture, allowing single-purpose rules to be combined.
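To illustrate the chaining idea, here is a small sketch of a custom analyzer (my own example, not from the original article) that chains a tokenizer with a lowercasing filter and a stop-word filter, roughly the same kind of building blocks StandardAnalyzer composes internally:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import java.io.Reader;

public class ChainedAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Each filter wraps the previous TokenStream, adding one single-purpose rule.
        TokenStream stream = new StandardTokenizer(reader);                // split into raw tokens
        stream = new LowerCaseFilter(stream);                              // normalize case
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);  // drop stop words
        return stream;
    }
}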
The SnowballAnalyzer illustrated is not currently a built-in Lucene feature. It is part of the source code available in the jakarta-lucene-sandbox CVS repository. It has the most peculiar results of all the analyzers shown. The algorithm is language-specific and uses stemming. Stemming algorithms attempt to reduce a word to a common root form. This is seen with "lazy" being reduced to "lazi". The word "laziness" would also be reduced to "lazi", allowing searches for either word to find documents containing the other. Another interesting example of the SnowballAnalyzer in action is on the text "corporate corporation corporations corpse", which yielded these results:

[corpor] [corpor] [corpor] [corps]
This was not the case for a lot of .com's, which became synonymous with "corpse," although the stemming algorithm sees the difference.
There is far more to textual analysis than is covered here. It is the topic of many dissertations and patents, and certainly ongoing research. Let's now turn our attention to searching, with the knowledge of how tokens are pulled from the original text.
Searching
To match our indexing example, a Searcher class was created to display search results from the same index. Its skeleton main() is shown here:
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Hits;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.io.File;
public class Searcher {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new Exception("Usage: " + Searcher.class.getName() + " <index dir> <query>");
        }

        File indexDir = new File(args[0]);
        String q = args[1];

        if (!indexDir.exists() || !indexDir.isDirectory()) {
            throw new Exception(indexDir + " does not exist or is not a directory.");
        }

        search(indexDir, q);
    }
}
Again, we see nothing exciting here, just grabbing the command-line arguments representing the index directory (which must have previously been created) and the query to use. The interesting stuff happens in the search() method:

public static void search(File indexDir, String q) throws Exception {
    Directory fsDir = FSDirectory.getDirectory(indexDir, false);
    IndexSearcher is = new IndexSearcher(fsDir);

    Query query = QueryParser.parse(q, "contents", new StandardAnalyzer());
    Hits hits = is.search(query);
    System.out.println("Found " + hits.length() + " document(s) that matched query '" + q + "':");

    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        System.out.println(doc.get("filename"));
    }
}
Through Lucene's API, a Query object instance is created and handed to the IndexSearcher.search method. The Query object can be constructed through the API using the built-in Query subclasses, including:

- TermQuery
- BooleanQuery
- PrefixQuery
- WildcardQuery
- RangeQuery
- and a few others
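For example, combining a required TermQuery with a prohibited one in a BooleanQuery builds the equivalent of the "+contents:java -contents:microsoft" expression discussed below. The class below is a minimal sketch of that, written against the Lucene 1.x-era API this article uses; the three-argument BooleanQuery.add with required/prohibited flags is my assumption about that API version:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryBuildingDemo {
    // Build the equivalent of "+contents:java -contents:microsoft" programmatically.
    public static Query javaButNotMicrosoft() {
        BooleanQuery query = new BooleanQuery();
        // add(Query, required, prohibited): the first clause must match, the second must not.
        query.add(new TermQuery(new Term("contents", "java")), true, false);
        query.add(new TermQuery(new Term("contents", "microsoft")), false, true);
        return query;
    }
}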
In our search method, though, we are using the QueryParser to parse a user-entered query. QueryParser is a sophisticated JavaCC-based parser that turns Google-like search expressions into Lucene's API representation of a Query. Lucene's expression syntax is documented on the Lucene web site (see Resources); expressions may contain boolean operators, per-field queries, grouping, range queries, and more. An example query expression is "+java -microsoft", which returns hits for documents that contain the word "java" but not the word "microsoft."
QueryParser.parse requires that the developer specify the default field for searching, and in this case we specified the "contents" field. This makes the query equivalent to "+contents:java -contents:microsoft", but in a more user-friendly form.
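A quick way to see this equivalence is to parse the expression and print the resulting Query. The snippet below is a small sketch along those lines; relying on Query's toString method to display the parsed clauses with their fields is my assumption about a convenient way to inspect the result:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class QueryParserDemo {
    public static void main(String[] args) throws Exception {
        // Parse a Google-like expression against the default "contents" field.
        Query query = QueryParser.parse("+java -microsoft", "contents", new StandardAnalyzer());

        // Prints the parsed clauses with their fields made explicit,
        // e.g. something like: +contents:java -contents:microsoft
        System.out.println(query.toString());
    }
}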
The developer must also specify the analyzer to be used for tokenizing the query. In this case we use StandardAnalyzer, which is the same analyzer used for indexing. Typically the same analyzer should be used for both indexing and QueryParser searching. If we had used the SnowballAnalyzer shown in the analysis examples, this would enable "laziness" searches to find the "quick brown fox" document.
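As a concrete sketch of that last point (a hypothetical variation on this article's Indexer and Searcher, not code from the original), swapping in the SnowballAnalyzer means using it in both places, so index-time terms and query-time terms are reduced to the same root forms:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import java.io.File;
import java.io.IOException;

public class SnowballSetup {
    // One shared analyzer definition keeps indexing and searching consistent.
    private static Analyzer analyzer() {
        return new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS);
    }

    public static IndexWriter openWriter(File indexDir) throws IOException {
        return new IndexWriter(indexDir, analyzer(), true);               // index with stemming
    }

    public static Query parse(String userQuery) throws ParseException {
        return QueryParser.parse(userQuery, "contents", analyzer());      // search with stemming
    }
}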
After searching, a Hits collection is returned. The hits returned are ordered by Lucene's determination of score. It is beyond the scope of this article to delve into Lucene scoring, but rest assured that its default behavior is plenty good enough for the majority of applications, and it can be customized in the rare cases where the default behavior is insufficient.
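If you also want to see how strongly each document matched, Hits exposes a per-hit relevance score. The loop below is a small sketch of printing it alongside the filename (my own addition, a variation on the search loop above, assuming the same "filename" field used in the indexing example):

// Hypothetical variation on the search loop, printing each hit's score as well.
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);     // fetch the matching document on demand
    float score = hits.score(i);    // Lucene's relevance score for this hit
    System.out.println(score + " : " + doc.get("filename"));
}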
The Hits collection is itself not an actual collection of the documents that match the search. This is done for high-performance reasons; the documents themselves are only fetched when requested, and retrieving one is a simple method call. In our example we display the filename field value for each document that matches the query.

Summary
Lucene is a spectacularly top-notch piece of work. Even with its wondrous capabilities, it requires developer ingenuity to build applications around it. We've seen a glimpse of the decisions that developers need to make with the choice of analyzers. There is more to it than this choice, though. Here are some questions to ponder as you consider adding Lucene to your projects:
- What are my actual "documents"? (perhaps database rows or paragraphs rather than entire files)
- What are the fields that make up my documents?
- How do users want to search for documents?