I love programming

Posts

Showing posts from March, 2014

Journal of Learning Apache Lucene - Boosting documents and fields

Not all documents and fields are created equal, or at least you can make sure that’s the case by using boosting. Boosting may be done during indexing or during searching.

#1 Boosting documents

Document boosting makes it simple to implement a requirement that some documents count for more than others. By default, all documents have no boost; or, rather, they all have the same boost factor of 1.0. By changing a document’s boost factor, you can instruct Lucene to consider it more or less important with respect to other documents in the index when computing relevance. For example:

    if (isImportant(lowerDomain)) {
        doc.setBoost(1.5F);
    } else if (isUnimportant(lowerDomain)) {
        doc.setBoost(0.1F);
    }

#2 Boosting fields

Just as you can boost documents, you can also boost individual fields. When you boost a document, Lucene internally uses the same boost factor to boost each of its fields. Imagine that another requirement for the email-indexing application is to consider the subject field more important than t...
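To make the effect of a boost factor concrete, here is a minimal sketch of a toy ranking model. It is not Lucene's actual scoring formula: it simply assumes that a document's final score is its raw match score multiplied by its boost. The class `BoostSketch` and its members are hypothetical names chosen for illustration; the factors 1.5 and 0.1 are the ones used in the snippet above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: Lucene's real scoring is more involved.
// Assumption: final score = raw score * boost factor.
class BoostSketch {

    // A scored document with a boost factor (1.0 means "no boost").
    static final class ScoredDoc {
        final String name;
        final float rawScore;
        final float boost;

        ScoredDoc(String name, float rawScore, float boost) {
            this.name = name;
            this.rawScore = rawScore;
            this.boost = boost;
        }

        float boostedScore() {
            return rawScore * boost;
        }
    }

    // Returns document names ordered by boosted score, highest first.
    static List<String> rank(List<ScoredDoc> docs) {
        List<ScoredDoc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingDouble((ScoredDoc d) -> d.boostedScore()).reversed());
        List<String> names = new ArrayList<>();
        for (ScoredDoc d : sorted) {
            names.add(d.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<ScoredDoc> docs = new ArrayList<>();
        docs.add(new ScoredDoc("unimportant", 1.0f, 0.1f));
        docs.add(new ScoredDoc("default", 1.0f, 1.0f));
        docs.add(new ScoredDoc("important", 1.0f, 1.5f));
        System.out.println(rank(docs));
    }
}
```

With equal raw scores, the ordering is decided entirely by the boost, which is the intuition behind marking a document as more or less important.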

Journal of Learning Apache Lucene - the core indexing classes

The core indexing classes are IndexWriter, Directory, Analyzer, Document, and Field.

#1 IndexWriter

IndexWriter is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, or updates documents in the index. Think of IndexWriter as an object that gives you write access to the index but doesn't let you read or search it. IndexWriter needs somewhere to store its index, and that’s what Directory is for.

#2 Directory

The Directory class represents the location of a Lucene index. It’s an abstract class that allows its subclasses to store the index as they see fit. In our Indexer example, we used FSDirectory.open to get a suitable concrete FSDirectory implementation that stores real files in a directory on the file system, and passed that in turn to IndexWriter’s constructor. Lucene includes a number of interesting Directory implementations, covered in section 2.10. IndexWriter can’t index text unless it’s first been broken into separate words, usi...
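The division of labour described above (text is first broken into words, then a writer adds documents to an index held in some storage location) can be sketched with a toy in-memory inverted index. This is an illustration under simplifying assumptions and not Lucene's API: `ToyIndex`, `analyze`, `addDocument`, and `docsContaining` are hypothetical names, the stand-in analyzer only lowercases and splits on non-alphanumeric characters, and a `TreeMap` plays the role of the index storage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index; every name here is hypothetical, not part of Lucene.
class ToyIndex {
    // term -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Stand-in for an analyzer: lowercase, then split on non-alphanumeric runs.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stand-in for the writer's "add" operation: analyze the text, record each term.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Looks up the ids of documents containing the (case-insensitive) term.
    Set<Integer> docsContaining(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("Lucene in Action");
        index.addDocument("Learning Apache Lucene");
        System.out.println(index.docsContaining("lucene"));
    }
}
```

The point of the sketch is only that indexing needs tokens, not raw text, which is why an analyzer sits between the document and the writer.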

Journal of Learning Apache Lucene - the core searching classes

The basic search interface that Lucene provides is as straightforward as the one for indexing. Only a few classes are needed to perform the basic search operation: IndexSearcher, Term, Query, TermQuery, and TopDocs.

#1 IndexSearcher

IndexSearcher is to searching what IndexWriter is to indexing: the central link to the index that exposes several search methods. You can think of IndexSearcher as a class that opens an index in a read-only mode. It requires a Directory instance, holding the previously created index, and then offers a number of search methods, some of which are implemented in its abstract parent class Searcher; the simplest takes a Query object and an int topN count as parameters and returns a TopDocs object. A typical use of this method looks like this:

    // open the folder that holds the index
    Directory dir = FSDirectory.open(new File("/tmp/index"));
    IndexSearcher searcher = new IndexSearcher(dir);
    Query q = new TermQuery(new Term("contents", ...
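The shape of that simplest search method (a query plus a topN count in, a ranked list of hits out) can be sketched without Lucene. The following is a toy illustration, not Lucene's implementation: `ToySearcher`, `termFrequency`, and `search` are hypothetical names, documents are plain strings identified by their list position, and ranking by raw term frequency is an assumption made to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy single-term search; every name here is hypothetical, not part of Lucene.
class ToySearcher {
    private final List<String> docs;

    ToySearcher(List<String> docs) {
        this.docs = new ArrayList<>(docs);
    }

    // Counts how often the (already lowercased) term occurs as a whole word.
    static int termFrequency(String text, String term) {
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Returns ids (list positions) of at most topN matching documents, best first.
    List<Integer> search(String term, int topN) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<int[]> hits = new ArrayList<>(); // each entry: {docId, frequency}
        for (int id = 0; id < docs.size(); id++) {
            int tf = termFrequency(docs.get(id), needle);
            if (tf > 0) {
                hits.add(new int[] {id, tf});
            }
        }
        // Higher frequency first; List.sort is stable, so ties stay in docId order.
        hits.sort((x, y) -> Integer.compare(y[1], x[1]));
        List<Integer> top = new ArrayList<>();
        for (int i = 0; i < hits.size() && i < topN; i++) {
            top.add(hits.get(i)[0]);
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("Lucene in Action");
        docs.add("lucene, lucene, lucene");
        System.out.println(new ToySearcher(docs).search("lucene", 1));
    }
}
```

Capping the result at topN mirrors the idea that the caller asks only for the best few hits rather than every match.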

Journal of Learning Apache Lucene - Lucene In Action

source: https://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

Lucene is a high-performance, scalable, search engine technology. Both indexing and searching features make up the Lucene API. The first part of this article takes you through an example of using Lucene to index all the text files in a directory and its subdirectories. Before proceeding to examples of analysis and searching, we'll take a brief detour to discuss the format of the index directory.

Indexing

We'll begin by creating the Indexer class that will be used to index all the text files in a specified directory. This class is a utility class with a single public method index() that takes two arguments. The first argument is a File object indexDir that corresponds to the directory where the index will be created. The second argument is another File object dataDir that corresponds to the directory to be indexed.

    public stati...
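Before any file can be indexed, the files under dataDir have to be found. As a hedged sketch of just that traversal step (not the article's Indexer, and with no Lucene calls), the following collects text files from a directory and its subdirectories using only the JDK. The class name `TextFileCollector`, the use of `java.nio.file.Path` instead of `File`, and the assumption that a text file is one whose name ends in ".txt" are all choices made for this illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: finds candidate files for indexing, nothing more.
class TextFileCollector {

    // Returns every regular file ending in ".txt" under dataDir,
    // including files in subdirectories, in sorted path order.
    static List<Path> collect(Path dataDir) throws IOException {
        try (Stream<Path> paths = Files.walk(dataDir)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".txt"))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TextFileCollector <dataDir>
        for (Path p : collect(Path.of(args[0]))) {
            System.out.println(p);
        }
    }
}
```

Each returned path would then be handed to whatever code reads the file and adds it to the index.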