Skip to main content

Journal of Learning Apache Lucene - the core indexing classes


  • IndexWriter
  • Directory
  • Analyzer
  • Document
  • Field

#1 IndexWriter
IndexWriter is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, or updates documents in the index. Think of IndexWriter as an object that gives you write access to the index but doesn't let you read or search it. IndexWriter needs somewhere to store its index, and that’s what Directory is for.

#2 Directory
The Directory class represents the location of a Lucene index. It’s an abstract class that allows its subclasses to store the index as they see fit. In our Indexer example, we used FSDirectory.open to get a suitable concrete FSDirectory implementation that stores real files in a directory on the file system, and passed that in turn to Index-Writer’s constructor. Lucene includes a number of interesting Directory implementations, covered in section 2.10. IndexWriter can’t index text unless it’s first been broken into separate words, using an analyzer.

#3 Analyzer
Before text is indexed, it’s passed through an analyzer. The analyzer, specified in the IndexWriter constructor, is in charge of extracting those tokens out of text that should be indexed and eliminating the rest. If the content to be indexed isn't plain text, you should first extract plain text from it before indexing. Analyzer
is an abstract class, but Lucene comes with several implementations of it. Some of them deal with skipping stop words (frequently used words that don’t help distinguish one document from the other, such as a, an, the, in, and on); some deal with conversion of tokens to lowercase letters, so that searches aren’t case sensitive; and so on. Analyzers are an important part of Lucene and can be used for much more than simple
input filtering. For a developer integrating Lucene into an application, the choice of analyzer(s) is a critical element of application design.

#4 Document
The analysis process requires a document, containing separate fields to be indexed. The Document class represents a collection of fields. Think of it as a virtual document— a chunk of data, such as a web page, an email message, or a text file—that you want to make retrievable at a later time. Fields of a document represent the document or metadata associated with that document. The original source (such as a database
record, a Microsoft Word document, a chapter from a book, and so on) of document data is irrelevant to Lucene. It’s the text that you extract from such binary documents, and add as a Field instance, that Lucene processes. The metadata (such as author, title, subject and date modified) is indexed and stored separately as fields of a document.

#5 Field
Each document in an index contains one or more named fields, embodied in a class called Field. Each field has a name and corresponding value, and a bunch of options, that control precisely how Lucene will index the field’s value. A document may have more than one field with the same name. In this case, the values
of the fields are appended, during indexing, in the order they were added to the document. When searching, it’s exactly as if the text from all the fields were concatenated and treated as a single text field.



Comments

Popular posts from this blog

Stretch a row if data overflows in jasper reports

It is very common that some columns of the report need to stretch to show all the content in that column. But  if you just specify the property " stretch with overflow' to that column(we called text field in jasper report world) , it will just stretch that column and won't change other columns, so the row could be ridiculous. Haven't find the solution from internet yet. So I just review the properties in iReport one by one and find two useful properties(the bold highlighted in example below) which resolve the problems.   example:
<band height="20" splitType="Stretch"> <textField isStretchWithOverflow="true" pattern="" isBlankWhenNull="true"> <reportElement stretchType="RelativeToTallestObject" mode="Opaque" x="192" y="0" width="183" height="20"/> <box leftPadding="2"> <pen lineWidth="0.25"/> …

JasperReports - Configuration Reference

Spring - Operations with jdbcTemplate

This class manages all the database communication and exception handling using a java.sql.Connection that is obtained from the provided DataSource. JdbcTemplate is a stateless and threadsafe class and you can safely instantiate a single instance to be used for each DAO.


Use of Callback Methods
JdbcTemplate is based on a template style of programming common to many other parts of Spring. Some method calls are handled entirely by the JdbcTemplate, while others require the calling class to provide callback methods that contain the implementation for parts of the JDBC workflow. This is another form of Inversion of Control. Your application code hands over the responsibility of managing the database access to the template class. The template class in turn calls back to your application code when it needs some detail processing filled in. These callback methods are allowed to throw a java.sql.SQLException, since the framework will be able to catch this exception and use its built-in excepti…