Wednesday, February 7, 2018

Lucene vs Solr

Lucene vs Solr:
A simple way to conceptualize the relationship between Solr and Lucene is that of a car and its engine. You can't drive an engine, but you can drive a car. Similarly, Lucene is a programmatic library which you can't use as-is, whereas Solr is a complete application which you can use out-of-box.

What is Solr?
Apache Solr is a web application built around Lucene with all kinds of goodies.

It adds functionality like

XML/HTTP and JSON APIs
Hit highlighting
Faceted Search and Filtering
Geospatial Search
Fast Incremental Updates and Index Replication
Caching
Replication
Web administration interface etc
Unlike Lucene, Solr is a web application (WAR) which can be deployed in any servlet container, e.g. Jetty, Tomcat, Resin, etc.


Lucene: Lucene is an inverted full-text index. This means that it takes all the documents, splits them into words, and then builds an index for each word. Since the index is an exact string-match, unordered, it can be extremely fast. Hypothetically, an SQL unordered index on a varchar field could be just as fast, and in fact I think you'll find the big databases can do a simple string-equality query very quickly in that case.

Lucene:
Let's look at a sample corpus of five documents:
My sister is coming for the holidays.
The holidays are a chance for family meeting.
Who did your sister meet?
It takes an hour to make fudge.
My sister makes awesome fudge.

What does Lucene do? Lucene is a full text search library.
Search has two principal stages: indexing and "retrieval."

During indexing, each document is broken into words, and the list of documents containing each word is stored in a list called the "postings list".
The posting list for the word "My" is:
My --> 1,5
And the posting list for the word "fudge" is:
fudge --> 4,5
The index consists of all the posting lists for the words in the corpus.
Indexing must be done before retrieval, and we can only retrieve documents that were indexed.

Retrieval is the process starting with a query and ending with a ranked list of documents. Say the query is [my fudge]. (The brackets denote the borders of the query). In order to find matches for the query, we break it into the individual words, and go to the posting lists. The full list of documents containing the keywords is [1,4,5]. Because document 5 contains both words and documents 1 and 4 contain just a single word from the query, a possible ranking is: 5, 1, 4 (document 5 appears first, then document 4, then document 1).
In general, indexing is a batch, preprocessing stage, and retrieval is a quick online stage, but there are exceptions.

This is the gist of Lucene. The rest of Lucene is (many important) specific bells and whistles for the indexing and retrieval processes.
https://www.quora.com/What-is-an-intuitive-description-of-how-Lucene-works

No comments:

Post a Comment