Lucene is an Information Retrieval library, not a ready-to-use product, and it most certainly is not a web crawler, as people new to Lucene sometimes think.
This article picks up where "Meet Lucene" left off, with searching an index. Here we conclude the discussion of indexing and move on to working with the search API and considering alternative products.
1.5 Understanding the core indexing classes
As you saw in our Indexer class, you need the following classes to perform the simplest indexing procedure:
- IndexWriter
- Directory
- Analyzer
- Document
- Field
What follows is a brief overview of these classes, to give you a rough idea about their role in Lucene. We'll use these classes throughout this book.
IndexWriter is the central component of the indexing process. This class creates a new index and adds documents to an existing index. You can think of IndexWriter as an object that gives you write access to the index but doesn't let you read or search it. Despite its name, IndexWriter isn't the only class that's used to modify an index; section 2.2 describes how to use the Lucene API to modify an index.
The Directory class represents the location of a Lucene index. It's an abstract class that allows its subclasses (two of which are included in Lucene) to store the index as they see fit. In our Indexer example, we used a path to an actual file system directory to obtain an instance of Directory, which we passed to IndexWriter's constructor. IndexWriter then used one of the concrete Directory implementations, FSDirectory, and created our index in a directory in the file system.
In your applications, you will most likely be storing a Lucene index on a disk. To do so, use FSDirectory, a Directory subclass that maintains a list of real files in the file system, as we did in Indexer.
The other implementation of Directory is a class called RAMDirectory. Although it exposes an interface identical to that of FSDirectory, RAMDirectory holds all its data in memory. This implementation is therefore useful for smaller indices that can be fully loaded in memory and can be destroyed upon the termination of an application. Because all data is held in the fast-access memory and not on a slower hard disk, RAMDirectory is suitable for situations where you need very quick access to the index, whether during indexing or searching. For instance, Lucene's developers make extensive use of RAMDirectory in all their unit tests: When a test runs, a fast in-memory index is created or searched; and when a test completes, the index is automatically destroyed, leaving no residuals on the disk. Of course, the performance difference between RAMDirectory and FSDirectory is less visible when Lucene is used on operating systems that cache files in memory. You'll see both Directory implementations used in code snippets in this book.
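To make the idea of a pluggable storage location concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's Directory API: the class names ToyDirectory, ToyRamDirectory, and ToyFsDirectory, and the writeFile and readFile methods, are our own assumptions, chosen only to show how code written against an abstract parent can keep its data in memory or on disk.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Toy storage abstraction (hypothetical); NOT Lucene's Directory API.
abstract class ToyDirectory {
    abstract void writeFile(String name, byte[] data);
    abstract byte[] readFile(String name);

    public static void main(String[] args) {
        // Callers depend only on the abstract type.
        ToyDirectory dir = new ToyRamDirectory();
        dir.writeFile("segments", "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(dir.readFile("segments"), StandardCharsets.UTF_8));
    }
}

// Keeps every "file" in a map; the contents vanish when the JVM exits.
class ToyRamDirectory extends ToyDirectory {
    private final Map<String, byte[]> files = new HashMap<>();

    void writeFile(String name, byte[] data) {
        files.put(name, data.clone());
    }

    byte[] readFile(String name) {
        // Throws NullPointerException for an unknown name; acceptable for a sketch.
        return files.get(name).clone();
    }
}

// Keeps each "file" as a real file under an existing base directory.
class ToyFsDirectory extends ToyDirectory {
    private final Path base;

    ToyFsDirectory(Path base) {
        this.base = base;
    }

    void writeFile(String name, byte[] data) {
        try {
            Files.write(base.resolve(name), data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    byte[] readFile(String name) {
        try {
            return Files.readAllBytes(base.resolve(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the calling code sees only ToyDirectory, replacing new ToyRamDirectory() with new ToyFsDirectory(somePath) changes where the bytes live and nothing else. That is the same separation the paragraph above describes for RAMDirectory and FSDirectory.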
Before text is indexed, it's passed through an Analyzer. The Analyzer, specified in the IndexWriter constructor, is in charge of extracting tokens out of text to be indexed and eliminating the rest. If the content to be indexed isn't plain text, it should first be converted to it, as depicted in figure 2.1. Chapter 7 shows how to extract text from the most common rich-media document formats. Analyzer is an abstract class, but Lucene comes with several implementations of it. Some of them deal with skipping stop words (frequently used words that don't help distinguish one document from the other, such as a, an, the, in, and on); some deal with conversion of tokens to lowercase letters, so that searches aren't case-sensitive; and so on. Analyzers are an important part of Lucene and can be used for much more than simple input filtering. For a developer integrating Lucene into an application, the choice of analyzer(s) is a critical element of application design. You'll learn much more about them in chapter 4.
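The two analysis steps mentioned above, lowercasing and stop-word removal, can be sketched in a few lines of plain Java. This is a toy illustration under our own assumptions, not Lucene's Analyzer class: the name ToyAnalyzer, the analyze method, and the rule of splitting on anything that isn't a letter or digit are all hypothetical, and the stop list is just the five example words from the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer (hypothetical); NOT Lucene's Analyzer class.
class ToyAnalyzer {
    // Stop list assumed for illustration: the example words from the text.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "in", "on"));

    // Splits on runs of non-letter, non-digit characters, lowercases each
    // piece, and drops empty pieces and stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [quick, fox, jumped, log]
        System.out.println(analyze("The Quick Fox jumped on a Log"));
    }
}
```

A query must be analyzed the same way as the indexed text, or a search for "Fox" would never match the indexed token "fox". This is one reason the choice of analyzer is a design decision rather than a detail.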
A Document represents a collection of fields. You can think of it as a virtual document—a chunk of data, such as a web page, an email message, or a text file—that you want to make retrievable at a later time. Fields of a document represent the document or meta-data associated with that document. The original source (such as a database record, a Word document, a chapter from a book, and so on) of document data is irrelevant to Lucene. The meta-data such as author, title, subject, date modified, and so on, are indexed and stored separately as fields of a document.
Note: When we refer to a document in this book, we mean a Microsoft Word, RTF, PDF, or other type of a document; we aren't talking about Lucene's Document class. Note the distinction in the case and font.
Lucene only deals with text. Lucene's core does not itself handle anything but java.lang.String and java.io.Reader. Although various types of documents can be indexed and made searchable, processing them isn't as straightforward as processing purely textual content that can easily be converted to a String or Reader Java type. You'll learn more about handling nontext documents in chapter 7.
In our Indexer, we're concerned with indexing text files. So, for each text file we find, we create a new instance of the Document class, populate it with Fields (described next), and add that Document to the index, effectively indexing the file.
Each Document in an index contains one or more named fields, embodied in a class called Field. Each field corresponds to a piece of data that is either queried against or retrieved from the index during search.
Lucene offers four different types of fields from which you can choose:
- Keyword—Isn't analyzed, but is indexed and stored in the index verbatim. This type is suitable for fields whose original value should be preserved in its entirety, such as URLs, file system paths, dates, personal names, Social Security numbers, telephone numbers, and so on. For example, we used the file system path in Indexer (listing 1.1) as a Keyword field.
- UnIndexed—Is neither analyzed nor indexed, but its value is stored in the index as is. This type is suitable for fields that you need to display with search results (such as a URL or database primary key), but whose values you'll never search directly. Since the original value of a field of this type is stored in the index, this type isn't suitable for storing fields with very large values, if index size is an issue.
- UnStored—The opposite of UnIndexed. This field type is analyzed and indexed but isn't stored in the index. It's suitable for indexing a large amount of text that doesn't need to be retrieved in its original form, such as bodies of web pages, or any other type of text document.
- Text—Is analyzed, and is indexed. This implies that fields of this type can be searched against, but be cautious about the field size. If the data indexed is a String, it's also stored; but if the data (as in our Indexer example) is from a Reader, it isn't stored. This is often a source of confusion, so take note of this difference when using Field.Text.
All fields consist of a name and value pair. Which field type you should use depends on how you want to use that field and its values. Strictly speaking, Lucene has a single Field type: Fields are distinguished from each other based on their characteristics. Some are analyzed, but others aren't; some are indexed, whereas others are stored verbatim; and so on.
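One way to picture the single Field type with differing characteristics is as a name/value pair carrying three boolean flags. The sketch below is a toy model in plain Java, not Lucene's Field class: ToyField and its factory methods are hypothetical names that mirror the four types listed above, and the text factory models only the String case, in which the value is stored.

```java
// Toy model of field characteristics (hypothetical); NOT Lucene's Field class.
final class ToyField {
    final String name;
    final String value;
    final boolean analyzed;
    final boolean indexed;
    final boolean stored;

    private ToyField(String name, String value,
                     boolean analyzed, boolean indexed, boolean stored) {
        this.name = name;
        this.value = value;
        this.analyzed = analyzed;
        this.indexed = indexed;
        this.stored = stored;
    }

    // Not analyzed, but indexed and stored verbatim.
    static ToyField keyword(String name, String value) {
        return new ToyField(name, value, false, true, true);
    }

    // Neither analyzed nor indexed; only stored.
    static ToyField unIndexed(String name, String value) {
        return new ToyField(name, value, false, false, true);
    }

    // Analyzed and indexed, but not stored.
    static ToyField unStored(String name, String value) {
        return new ToyField(name, value, true, true, false);
    }

    // Analyzed, indexed, and (for a String value) stored.
    static ToyField text(String name, String value) {
        return new ToyField(name, value, true, true, true);
    }

    public static void main(String[] args) {
        ToyField path = ToyField.keyword("path", "/tmp/notes.txt");
        System.out.println(path.name + ": analyzed=" + path.analyzed
            + ", indexed=" + path.indexed + ", stored=" + path.stored);
    }
}
```

Read this way, choosing a field type amounts to answering two questions: do you need to search on the value (indexed, and possibly analyzed), and do you need to get the original value back with the search results (stored)?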
Table 1.2 provides a summary of different field characteristics, showing you how fields are created, along with common usage examples.
Table 1.2 An overview of different field types, their characteristics, and their usage
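|Field method/type|Analyzed|Indexed|Stored|Example usage|
|Field.Keyword(String, String), Field.Keyword(String, Date)|No|Yes|Yes|Telephone and Social Security numbers, URLs, personal names|
|Field.UnIndexed(String, String)|No|No|Yes|Document type (PDF, HTML, and so on), if not used as a search criterion|
|Field.UnStored(String, String)|Yes|Yes|No|Document titles and content|
|Field.Text(String, String)|Yes|Yes|Yes|Document titles and content|
|Field.Text(String, Reader)|Yes|Yes|No|Document titles and content|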
Notice that all field types can be constructed with two Strings that represent the field's name and its value. In addition, a Keyword field can be passed both a String and a Date object, and the Text field accepts a Reader object in addition to the String. In all cases, the value is converted to a Reader before indexing; these additional methods exist to provide a friendlier API.
Note: Note the distinction between Field.Text(String, String) and Field.Text(String, Reader). The String variant stores the field data, whereas the Reader variant does not. To index a String, but not store it, use Field.UnStored(String, String).
Finally, UnStored and Text fields can be used to create term vectors (an advanced topic, covered in section 5.7). To instruct Lucene to create term vectors for a given UnStored or Text field, you can use Field.UnStored(String, String, true), Field.Text(String, String, true), or Field.Text(String, Reader, true).
You'll apply this handful of classes most often when using Lucene for indexing. In order to implement basic search functionality, you need to be familiar with an equally small and simple set of Lucene search classes.
1.6 Understanding the core searching classes
The basic search interface that Lucene provides is as straightforward as the one for indexing. Only a few classes are needed to perform the basic search operation:
- IndexSearcher
- Term
- Query
- TermQuery
- Hits
The following sections provide a brief introduction to these classes. We'll expand on these explanations in the chapters that follow, before we dive into more advanced topics.
IndexSearcher is to searching what IndexWriter is to indexing: the central link to the index that exposes several search methods. You can think of IndexSearcher as a class that opens an index in a read-only mode. It offers a number of search methods, some of which are implemented in its abstract parent class Searcher; the simplest takes a single Query object as a parameter and returns a Hits object. A typical use of this method looks like this:
IndexSearcher is = new IndexSearcher(indexDir);  // indexDir is a placeholder for the location of your index
Query q = new TermQuery(new Term("contents", "lucene"));
Hits hits = is.search(q);
We cover the details of IndexSearcher in chapter 3, along with more advanced information in chapters 5 and 6.
A Term is the basic unit for searching. Similar to the Field object, it consists of a pair of string elements: the name of the field and the value of that field. Note that Term objects are also involved in the indexing process. However, they're created by Lucene's internals, so you typically don't need to think about them while indexing. During searching, you may construct Term objects and use them together with TermQuery:
Query q = new TermQuery(new Term("contents", "lucene"));
Hits hits = is.search(q);
This code instructs Lucene to find all documents that contain the word lucene in a field named contents. Because the TermQuery object is derived from the abstract parent class Query, you can use the Query type on the left side of the statement.
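To see why a term is a pair of field name and value, it can help to picture the index as a map from such pairs to the documents that contain them. The following is a toy inverted index in plain Java, not Lucene's index format or API: ToyIndex, its add and search methods, and the "field:text" key encoding are all assumptions made for this sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (hypothetical); NOT Lucene's index format or API.
class ToyIndex {
    // Maps "field:text" to the sorted IDs of documents containing that term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Records that document docId contains the given token in the given field.
    void add(int docId, String field, String token) {
        postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<>()).add(docId);
    }

    // In the spirit of a term query: exact match on field name and term text.
    Set<Integer> search(String field, String text) {
        return postings.getOrDefault(field + ":" + text, Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(0, "contents", "lucene");
        index.add(1, "contents", "egothor");
        index.add(2, "contents", "lucene");
        index.add(2, "title", "lucene");
        System.out.println(index.search("contents", "lucene")); // prints [0, 2]
        System.out.println(index.search("title", "egothor"));   // prints []
    }
}
```

Note that the same word in a different field is a different term: document 2 matches "lucene" in both contents and title, but those are two separate entries in the map.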
Lucene comes with a number of concrete Query subclasses. So far in this chapter we've mentioned only the most basic Lucene Query: TermQuery. Other Query types are BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, RangeQuery, FilteredQuery, and SpanQuery. All of these are covered in chapter 3. Query is the common, abstract parent class. It contains several utility methods, the most interesting of which is setBoost(float), described in section 3.5.9.
TermQuery is the most basic type of query supported by Lucene, and it's one of the primitive query types. It's used for matching documents that contain fields with specific values, as you've seen in the last few paragraphs.
The Hits class is a simple container of pointers to ranked search results—documents that match a given query. For performance reasons, Hits instances don't load from the index all documents that match a query, but only a small portion of them at a time. Chapter 3 describes this in more detail.
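The idea of holding pointers to results and loading documents only on demand can be sketched as follows. This is a toy container in plain Java, not Lucene's Hits class: LazyHits, its loader function, and the one-document-at-a-time caching policy are our own assumptions, and the real class may fetch and cache documents differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Toy lazily loading result container (hypothetical); NOT Lucene's Hits class.
class LazyHits {
    private final int[] rankedDocIds;          // cheap pointers, best match first
    private final IntFunction<String> loader;  // stand-in for reading a document from the index
    private final Map<Integer, String> cache = new HashMap<>();
    private int loads = 0;

    LazyHits(int[] rankedDocIds, IntFunction<String> loader) {
        this.rankedDocIds = rankedDocIds.clone();
        this.loader = loader;
    }

    // Number of matching documents; answering this loads nothing.
    int length() {
        return rankedDocIds.length;
    }

    // Loads the n-th ranked document on first access and caches it afterward.
    String doc(int n) {
        int id = rankedDocIds[n];
        return cache.computeIfAbsent(id, key -> {
            loads++;
            return loader.apply(key);
        });
    }

    // How many documents have actually been loaded so far.
    int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        LazyHits hits = new LazyHits(new int[] {7, 3, 9}, id -> "document " + id);
        System.out.println(hits.length() + " hits, " + hits.loadCount() + " loaded");
        System.out.println(hits.doc(0));
        System.out.println(hits.loadCount() + " loaded");
    }
}
```

An application that displays only the first page of results pays the loading cost for that page alone, no matter how many documents matched.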
1.7 Review of alternate search products
Before you select Lucene as your IR library of choice, you may want to review other solutions in the same domain. We did some research into alternate products that you may want to consider and evaluate; this section summarizes our findings. We group these products in two major categories:
- Information Retrieval libraries
- Indexing and searching applications
The first group is smaller; it consists of full-text indexing and searching libraries similar to Lucene. Products in this group let you embed them in your application, as shown earlier in figure 1.5.
The second, larger group is made up of ready-to-use indexing and searching software. This software is typically designed to index and search a particular type of data, such as web pages, and is less flexible than software in the former group. However, some of these products also expose their lower-level API, so you can sometimes use them as IR libraries as well.
1.7.1 IR libraries
In our research for this chapter, we found two IR libraries—Egothor and Xapian—that offer a comparable set of features and are aimed at roughly the same audience: developers. We also found MG4J, which isn't an IR library but is rather a set of tools useful for building an IR library; we think developers working with IR ought to know about it. Here are our reviews of all three products.
A full-text indexing and searching Java library, Egothor uses core algorithms that are very similar to those used by Lucene. It has been in existence for several years and has a small but active developer and user community. The lead developer is Czech developer Leo Galambos, a PhD student with a solid academic background in the field of IR. He sometimes participates in Lucene's user and developer mailing list discussions.
Egothor supports an extended Boolean model, which allows it to function as both the pure Boolean model and the Vector model. You can tune which model to use via a simple query-time parameter. This software features a number of different query types, supports similar search syntax, and allows multithreaded querying, which can come in handy if you're working on a multi-CPU computer or searching remote indices.
The Egothor distribution comes with several ready-to-use applications, such as a web crawler called Capek, a file indexer with a Swing GUI, and more. It also provides parsers for several rich-text document formats, such as PDF and Microsoft Word documents. As such, Egothor and Capek are comparable to the Lucene/Nutch combination, and Egothor's file indexer and document parsers are similar to the small document parsing and indexing framework presented in chapter 7 of this book.
Free, open source, and released under a BSD-like license, the Egothor project is comparable to Lucene in most aspects. If you have yet to choose a full-text indexing and searching library, you may want to evaluate Egothor in addition to Lucene. Egothor's home page is at http://www.egothor.org/; as of this writing, it features a demo of its web crawler and search functionality.
Xapian is a Probabilistic Information Retrieval library written in C++ and released under GPL. This project (or, rather, its predecessors) has an interesting history: The company that developed and owned it went through more than half a dozen acquisitions, name changes, shifts in focus, and such.
Xapian is actively developed software. It's currently at version 0.8.3, but it has a long history behind it and is based on decades of experience in the IR field. Its web site, http://www.xapian.org/, shows that it has a rich set of features, much like Lucene. It supports a wide range of queries and has a query parser that supports human-friendly search syntax; stemmers based on Dr. Martin Porter's Snowball project; parsers for several rich-document types; bindings for Perl, Python, PHP, and (soon) Java; remote index searching; and so on.
In addition to providing an IR library, Xapian comes with a web site search application called Omega, which you can download separately.
Although MG4J (Managing Gigabytes for Java) isn't an IR library like Lucene, Egothor, and Xapian, we believe that every software engineer reading this book should be aware of it because it provides low-level support for building Java IR libraries. MG4J is named after a popular IR book, Managing Gigabytes: Compressing and Indexing Documents and Images, written by Ian H. Witten, Alistair Moffat, and Timothy C. Bell. After collecting large amounts of web data with their distributed, fault-tolerant web crawler called UbiCrawler, its authors needed software capable of analyzing the collected data; out of that need, MG4J was born.
The library provides optimized classes for manipulating I/O, inverted index compression, and more. The project home page is at http://mg4j.dsi.unimi.it/; the library is free, open source, released under LGPL, and currently at version 0.8.2.
1.7.2 Indexing and searching applications
The other group of available software, both free and commercial, is assembled into prepackaged products. Such software usually doesn't expose a lot of its API and doesn't require you to build a custom application on top of it. Most of this software exposes a mechanism that lets you control a limited set of parameters but not enough to use the software in a way that's drastically different from its assumed use. (To be fair, there are notable exceptions to this rule.)
As such, we can't compare this software to Lucene directly. However, some of these products may be sufficient for your needs and let you get running quickly, even if Lucene or some other IR library turns out to be a better choice in the long run. Here's a short list of several popular products in this category:
1.7.3 Online resources
The previous sections provide only brief overviews of the related products. Several resources will help you find other IR libraries and products beyond those we've mentioned:
We've provided positive reviews of some alternatives to Lucene, but we're confident that your requisite homework will lead you to Lucene as the best choice!
1.8 Summary
In this chapter, you've gained some basic Lucene knowledge. You now know that Lucene is an Information Retrieval library, not a ready-to-use product, and that it most certainly is not a web crawler, as people new to Lucene sometimes think. You've also learned a bit about how Lucene came to be and about the key people and the organization behind it.
In the spirit of Manning's in Action books, we quickly got to the point by showing you two standalone applications, Indexer and Searcher, which are capable of indexing and searching text files stored in a file system. We then briefly described each of the Lucene classes used in these two applications. Finally, we presented our research findings for some products similar to Lucene.
Search is everywhere, and chances are that if you're reading this book, you're interested in search being an integral part of your applications. Depending on your needs, integrating Lucene may be trivial, or it may involve architectural considerations.
We've organized the next couple of chapters as we did this chapter. The first thing we need to do is index some documents; we discuss this process in detail in chapter 2.
About the Authors
Erik Hatcher codes, writes, and speaks on technical topics that he finds fun and challenging. He has written software for a number of diverse industries using many different technologies and languages. Erik coauthored Java Development with Ant (Manning, 2002) with Steve Loughran, a book that has received wonderful industry acclaim. Since the release of Erik's first book, he has spoken at numerous venues including the No Fluff, Just Stuff symposium circuit, JavaOne, O'Reilly's Open Source Convention, the Open Source Content Management Conference, and many Java User Group meetings. As an Apache Software Foundation member, he is an active contributor and committer on several Apache projects including Lucene, Ant, and Tapestry. Erik currently works at the University of Virginia's Humanities department supporting Applied Research in Patacriticism.
Otis Gospodnetic has been an active Lucene developer for four years and maintains the jGuru Lucene FAQ. He is a Software Engineer at Wireless Generations, a company that develops technology solutions for educational assessments of students and teachers. In his spare time, he develops Simpy, a Personal Web Service that uses Lucene, which he created out of his passion for knowledge, information retrieval, and management. Previous technical publications include several articles about Lucene, published by O'Reilly Network and IBM developerWorks. Otis also wrote To Choose and Be Chosen: Pursuing Education in America, a guidebook for foreigners wishing to study in the United States; it's based on his own experience.
About the Book
Lucene in Action
by Erik Hatcher and Otis Gospodnetic
Foreword by Doug Cutting, the inventor of Lucene
Published December 2004, Softbound, 456 pages
Published by Manning Publications Co.
Retail price: $44.95
Ebook price: $22.50. To purchase the ebook go to http://www.manning.com/hatcher2.
This material is from Chapter 1 of the book.