Journal of Learning Apache Lucene - the core indexing classes

IndexWriter
Directory
Analyzer
Document
Field

#1 IndexWriter
IndexWriter is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, or updates documents in the index. Think of IndexWriter as an object that gives you write access to the index but doesn't let you read or search it. IndexWriter needs somewhere to store its index, and that’s what Directory is for.

#2 Directory
The Directory class represents the location of a Lucene index. It’s an abstract class that allows its subclasses to store the index as they see fit. In our Indexer example, we used FSDirectory.open to get a suitable concrete FSDirectory implementation that stores real files in a directory on the file system, and passed that in turn to Index-Writer’s constructor. Lucene includes a number of interesting Directory implementations, covered in section 2.10. IndexWriter can’t index text unless it’s first been broken into separate words, using an analyzer.

#3 Analyzer
Before text is indexed, it’s passed through an analyzer. The analyzer, specified in the IndexWriter constructor, is in charge of extracting those tokens out of text that should be indexed and eliminating the rest. If the content to be indexed isn't plain text, you should first extract plain text from it before indexing. Analyzer
is an abstract class, but Lucene comes with several implementations of it. Some of them deal with skipping stop words (frequently used words that don’t help distinguish one document from the other, such as a, an, the, in, and on); some deal with conversion of tokens to lowercase letters, so that searches aren’t case sensitive; and so on. Analyzers are an important part of Lucene and can be used for much more than simple
input filtering. For a developer integrating Lucene into an application, the choice of analyzer(s) is a critical element of application design.

#4 Document
The analysis process requires a document, containing separate fields to be indexed. The Document class represents a collection of fields. Think of it as a virtual document— a chunk of data, such as a web page, an email message, or a text file—that you want to make retrievable at a later time. Fields of a document represent the document or metadata associated with that document. The original source (such as a database
record, a Microsoft Word document, a chapter from a book, and so on) of document data is irrelevant to Lucene. It’s the text that you extract from such binary documents, and add as a Field instance, that Lucene processes. The metadata (such as author, title, subject and date modified) is indexed and stored separately as fields of a document.

#5 Field
Each document in an index contains one or more named fields, embodied in a class called Field. Each field has a name and corresponding value, and a bunch of options, that control precisely how Lucene will index the field’s value. A document may have more than one field with the same name. In this case, the values
of the fields are appended, during indexing, in the order they were added to the document. When searching, it’s exactly as if the text from all the fields were concatenated and treated as a single text field.

I love programming

Search This Blog

Journal of Learning Apache Lucene - the core indexing classes

Labels

Comments

Post a Comment

Popular posts from this blog

JasperReports - Configuration Reference

Stretch a row if data overflows in jasper reports

Live - solving the jasper report out of memory and high cpu usage problems