Skip to main content

Journal of Learning Apache Lucene - High Level Concepts

It’s crucial to recognize that Lucene is simply a search library, and you’ll need to handle the other components of a search application (crawling, document filtering, runtime server, user interface, administration, etc.) as your application requires.

#1 Typical components of search application

It’s important to grasp the big picture so that you have a clear understanding of which parts Lucene can handle and which parts your application must separately handle. A common misconception is that Lucene is an entire search application, when in fact it’s simply the core indexing and searching component. In the figure above , only the shaded components show are handled by Lucene.

The first step, at the bottom of the above figure, is to acquire content. This process, which involves using a crawler or spider, gathers and scopes the content that needs to be indexed. That may be trivial, for example, if you’re indexing a set of XML files that resides in a specific directory in the file system or if all your content resides in a wellorganized database. Alternatively, it may be horribly complex and messy if the content is scattered in all sorts of places (file systems, content management systems, Microsoft Exchange, Lotus Domino, various websites, databases, local XML files, CGI scripts running on intranet servers, and so forth).

Once you have the raw content that needs to be indexed, you must translate the content into the units (usually called documents) used by the search engine. The document typically consists of several separately named fields with values, such as title, body, abstract, author, and url. You’ll have to carefully design how to divide the raw content into documents and fields as well as how to compute the value for each of those fields. Often the approach is obvious: one email message becomes one document, or one PDF file or web page is one document. But sometimes it’s less clear: how should you handle attachments on an email message? Should you glom together all text extracted from the attachments into a single document, or make separate documents, somehow linked back to the original email message, for each attachment.

No search engine indexes text directly: rather, the text must be broken into a series of individual atomic elements called tokens. This is what happens during the Analyze Document step. Each token corresponds roughly to a “word” in the language, and this step determines how the textual fields in the document are divided into a series of tokens. There are all sorts of interesting questions here: how do you handle compound words? Should you apply spell correction (if your content itself has typos)? and so on.

 Lucene provides an array of built-in analyzers that give you fine control over this process. It’s also straightforward to build your own analyzer, or create arbitrary analyzer chains combining Lucene’s tokenizers and token filters, to customize how tokens are created. The final step is to index the document.

#5 Indexing
Suppose you need to search a large number of files, and you want to find files that contain a certain word or a phrase. How would you go about writing a program to do this? A na?ve approach would be to sequentially scan each file for the given word or phrase. Although this approach would work, it has a number of flaws, the most obvious of which is that it doesn’t scale to larger file sets or cases where files are very large. Here’s where indexing comes in: to search large amounts of text quickly, you must first index that text and convert it into a format that will let you search it rapidly, eliminating the slow sequential scanning process. This conversion process is called indexing, and its output is called an index.

During the indexing step, the document is added to the index. Lucene provides everything
necessary for this step, and works quite a bit of magic under a surprisingly simple

When you manage to entice a user to use your search application, she or he issues a search request, often as the result of an HTML form or Ajax request submitted by a browser to your server. You must then translate the request into the search engine’s Query object. We call this the Build Query step. Query objects can be simple or complex. Lucene provides a powerful package, called QueryParser, to process the user’s text into a query object according to a common search syntax.

Run Query is the process of consulting the search index and retrieving the documents matching the Query, sorted in the requested sort order. This component covers the complex inner workings of the search engine, and Lucene handles all of it for you. Lucene is also wonderfully extensible at this point, so if you’d like to customize how results are gathered, filtered, sorted, and so forth, it’s straightforward.

There are three common theoretical models of search:

  •  Pure Boolean model—Documents either match or don’t match the provided query, and no scoring is done. In this model there are no relevance scores associated with matching documents, and the matching documents are unordered; a query simply identifies a subset of the overall corpus as matching the query. 
  • Vector space model—Both queries and documents are modeled as vectors in a high dimensional space, where each unique term is a dimension. Relevance, or similarity, between a query and a document is computed by a vector distance measure between these vectors.

  • Probabilistic model—In this model, you compute the probability that a document is a good match to a query using a full probabilistic approach. Lucene’s approach combines the vector space and pure Boolean models, and offers you controls to decide which model you’d like to use on a search-by-search basis.Finally, Lucene returns documents that you next must render in a consumable way for your users.
Once you have the raw set of documents that match the query, sorted in the right order, you then render them to the user in an intuitive, consumable manner. The UI should also offer a clear path for follow-on searches or actions, such as clicking to the next page, refining the search, or finding documents similar to one of the matches, so that the user never hits a dead end.


Popular posts from this blog

Stretch a row if data overflows in jasper reports

It is very common that some columns of the report need to stretch to show all the content in that column. But  if you just specify the property " stretch with overflow' to that column(we called text field in jasper report world) , it will just stretch that column and won't change other columns, so the row could be ridiculous. Haven't find the solution from internet yet. So I just review the properties in iReport one by one and find two useful properties(the bold highlighted in example below) which resolve the problems.   example:
<band height="20" splitType="Stretch"> <textField isStretchWithOverflow="true" pattern="" isBlankWhenNull="true"> <reportElement stretchType="RelativeToTallestObject" mode="Opaque" x="192" y="0" width="183" height="20"/> <box leftPadding="2"> <pen lineWidth="0.25"/> …

JasperReports - Configuration Reference

Live - solving the jasper report out of memory and high cpu usage problems

I still can not find the solution. So I summary all the things and tell my boss about it. If any one knows the solution, please let me know.

Symptom: 1.The JVM became Out of memory when creating big consumption report 2.Those JRTemplateElement-instances is still there occupied even if I logged out the system
Reason:         1. There is a large number of JRTemplateElement-instances cached in the memory 2.The clearobjects() method in ReportThread class has not been triggered when logging out
Action I tried:      About the Virtualizer: 1.Replacing the JRSwapFileVirtualizer with JRFileVirtualizer 2.Not use any FileVirtualizer for cache the report in the hard disk Result: The japserreport still creating the a large number of JRTemplateElement-instances in the memory     About the work around below,      I tried: item 3(in below work around list) – result: it helps to reduce  the size of the JRTemplateElement Object                Item 4,5 – result : it helps a lot to reduce the number of  JRTemplateE…