Skip to main content

Journal of Learning Apache Lucene - the core indexing classes


  • IndexWriter
  • Directory
  • Analyzer
  • Document
  • Field

#1 IndexWriter
IndexWriter is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, or updates documents in the index. Think of IndexWriter as an object that gives you write access to the index but doesn't let you read or search it. IndexWriter needs somewhere to store its index, and that’s what Directory is for.

#2 Directory
The Directory class represents the location of a Lucene index. It’s an abstract class that allows its subclasses to store the index as they see fit. In our Indexer example, we used FSDirectory.open to get a suitable concrete FSDirectory implementation that stores real files in a directory on the file system, and passed that in turn to Index-Writer’s constructor. Lucene includes a number of interesting Directory implementations, covered in section 2.10. IndexWriter can’t index text unless it’s first been broken into separate words, using an analyzer.

#3 Analyzer
Before text is indexed, it’s passed through an analyzer. The analyzer, specified in the IndexWriter constructor, is in charge of extracting those tokens out of text that should be indexed and eliminating the rest. If the content to be indexed isn't plain text, you should first extract plain text from it before indexing. Analyzer
is an abstract class, but Lucene comes with several implementations of it. Some of them deal with skipping stop words (frequently used words that don’t help distinguish one document from the other, such as a, an, the, in, and on); some deal with conversion of tokens to lowercase letters, so that searches aren’t case sensitive; and so on. Analyzers are an important part of Lucene and can be used for much more than simple
input filtering. For a developer integrating Lucene into an application, the choice of analyzer(s) is a critical element of application design.

#4 Document
The analysis process requires a document, containing separate fields to be indexed. The Document class represents a collection of fields. Think of it as a virtual document— a chunk of data, such as a web page, an email message, or a text file—that you want to make retrievable at a later time. Fields of a document represent the document or metadata associated with that document. The original source (such as a database
record, a Microsoft Word document, a chapter from a book, and so on) of document data is irrelevant to Lucene. It’s the text that you extract from such binary documents, and add as a Field instance, that Lucene processes. The metadata (such as author, title, subject and date modified) is indexed and stored separately as fields of a document.

#5 Field
Each document in an index contains one or more named fields, embodied in a class called Field. Each field has a name and corresponding value, and a bunch of options, that control precisely how Lucene will index the field’s value. A document may have more than one field with the same name. In this case, the values
of the fields are appended, during indexing, in the order they were added to the document. When searching, it’s exactly as if the text from all the fields were concatenated and treated as a single text field.



Comments

Popular posts from this blog

Stretch a row if data overflows in jasper reports

It is very common that some columns of the report need to stretch to show all the content in that column. But  if you just specify the property " stretch with overflow' to that column(we called text field in jasper report world) , it will just stretch that column and won't change other columns, so the row could be ridiculous. Haven't find the solution from internet yet. So I just review the properties in iReport one by one and find two useful properties(the bold  highlighted in example below) which resolve the problems.   example: <band height="20" splitType="Stretch" > <textField isStretchWithOverflow="true" pa...

Live - solving the jasper report out of memory and high cpu usage problems

I still can not find the solution. So I summary all the things and tell my boss about it. If any one knows the solution, please let me know. Symptom: 1.        The JVM became Out of memory when creating big consumption report 2.        Those JRTemplateElement-instances is still there occupied even if I logged out the system Reason:         1. There is a large number of JRTemplateElement-instances cached in the memory 2.     The clearobjects() method in ReportThread class has not been triggered when logging out Action I tried:      About the Virtualizer: 1.     Replacing the JRSwapFileVirtualizer with JRFileVirtualizer 2.     Not use any FileVirtualizer for c...

JasperReports - Configuration Reference

Data Source / Query Executer net.sf.jasperreports.csv.column.names.{arbitrary_name} net.sf.jasperreports.csv.date.pattern net.sf.jasperreports.csv.encoding net.sf.jasperreports.csv.field.delimiter net.sf.jasperreports.csv.locale.code net.sf.jasperreports.csv.number.pattern net.sf.jasperreports.csv.record.delimiter net.sf.jasperreports.csv.source net.sf.jasperreports.csv.timezone.id net.sf.jasperreports.ejbql.query.hint.{hint} net.sf.jasperreports.ejbql.query.page.size net.sf.jasperreports.hql.clear.cache net.sf.jasperreports.hql.field.mapping.descriptions net.sf.jasperreports.hql.query.list.page.size net.sf.jasperreports.hql.query.run.type net.sf.jasperreports.jdbc.concurrency net.sf.jasperreports.jdbc.fetch.size net.sf.jasperreports.jdbc.holdability net.sf.jasperreports.jdbc.max.field.size net.sf.jasperreports.jdbc.result.set.type net.sf.jasperreports.query.chunk.token.separators net.sf.jasperreports.query.executer.factory.{language} net.sf.jasperreports.xpath....