Oct 28, 2019 - #Lucene
We will be using the following libraries:
compile group: 'org.apache.lucene', name: 'lucene-core', version: '8.2.0'
compile group: 'org.apache.lucene', name: 'lucene-queryparser', version: '8.2.0'
compile group: 'org.apache.lucene', name: 'lucene-suggest', version: '8.2.0'
Directory indexDirectory = FSDirectory.open(Paths.get("./index/"));
The index will be saved to disk in this folder.
Analyzer analyzer = new StandardAnalyzer();
The analyzer defines how text is split into tokens before it is indexed. The StandardAnalyzer uses the Lucene StandardTokenizer with a LowerCaseFilter and a StopFilter.
This means all tokens will be normalized to lowercase and any configured stop words will be removed.
The basics of analysis are also described in the Lucene docs: Analysis overview
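To see what the analyzer actually does, you can run it over a sample string by hand. A small sketch (the class name, helper method, and sample text are mine; the field name passed to tokenStream is arbitrary here):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class AnalyzerDemo {
    // Runs the given analyzer over the text and collects the emitted tokens.
    static List<String> analyze(Analyzer analyzer, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset(); // mandatory before the first incrementToken()
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // StandardTokenizer splits on whitespace/punctuation,
        // LowerCaseFilter normalizes the case.
        System.out.println(analyze(new StandardAnalyzer(), "Introducing Apache Lucene!"));
        // -> [introducing, apache, lucene]
    }
}
```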
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter indexWriter = new IndexWriter(indexDirectory, config);
Document document = new Document();
document.add(new TextField("field", "value", Field.Store.YES));
indexWriter.addDocument(document);
Make sure to create the index writer only once, not for every individual document, and to close it when indexing is done so pending changes are committed.
In the example I used a TextField to tokenize and index the field. However, there are other field types, as described here: Lucene Field Documentation
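The advice above can be sketched like this. The sample IDs and titles are made up, and I chose OpenMode.CREATE so the index is recreated on every run; the sketch also shows StringField next to TextField:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class BatchIndexing {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // CREATE overwrites an existing index; the default is CREATE_OR_APPEND.
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        // One writer for the whole batch; creating a writer per document is slow.
        try (IndexWriter indexWriter = new IndexWriter(
                FSDirectory.open(Paths.get("./index/")), config)) {
            String[][] entries = {{"post-1", "Getting started with Lucene"},
                                  {"post-2", "Score boosting and suggestions"}};
            for (String[] entry : entries) {
                Document document = new Document();
                // StringField: indexed as a single, untokenized term (good for IDs)
                document.add(new StringField("id", entry[0], Field.Store.YES));
                // TextField: analyzed and tokenized full text
                document.add(new TextField("title", entry[1], Field.Store.YES));
                indexWriter.addDocument(document);
            }
        } // close() commits pending changes
    }
}
```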
Directory indexDirectory = [...]; // Directory with index
IndexReader indexReader = DirectoryReader.open(indexDirectory);
Analyzer analyzer = [...]; // Same Analyzer used to build the index
String[] fields = [...]; // Array of all the fields to search
String queryString = [...]; // The query entered by the user
int numberOfResults = 10;
IndexSearcher searcher = new IndexSearcher(indexReader);
Query query = new MultiFieldQueryParser(fields, analyzer).parse(queryString);
TopDocs topDocs = searcher.search(query, numberOfResults);
// Extracting the results
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document doc = searcher.doc(scoreDoc.doc);
    System.out.println(doc.toString()); // doc.get("fieldName") returns a single stored value
}
This setup searches multiple fields of all documents.
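Putting the pieces together, here is a self-contained sketch that indexes two documents and searches them. The field names and sample texts are mine, and I use the in-memory ByteBuffersDirectory instead of FSDirectory so the example leaves no files behind:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

import java.util.ArrayList;
import java.util.List;

public class SearchExample {
    // Indexes two small documents in memory and returns the titles matching queryString.
    static List<String> search(String queryString) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory, convenient for demos
        Analyzer analyzer = new StandardAnalyzer();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            writer.addDocument(doc("Lucene intro", "a full-text search library"));
            writer.addDocument(doc("Gradle intro", "a build tool"));
        }
        List<String> titles = new ArrayList<>();
        try (IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            String[] fields = {"title", "content"};
            Query query = new MultiFieldQueryParser(fields, analyzer).parse(queryString);
            TopDocs topDocs = searcher.search(query, 10);
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                titles.add(searcher.doc(scoreDoc.doc).get("title")); // read a stored field
            }
        }
        return titles;
    }

    private static Document doc(String title, String content) {
        Document document = new Document();
        document.add(new TextField("title", title, Field.Store.YES));
        document.add(new TextField("content", content, Field.Store.NO));
        return document;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(search("search library")); // -> [Lucene intro]
    }
}
```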
Sometimes you might want to change the ranking of the documents based on which fields the query string was found in.
Lucene calls this concept ‘Score Boosting’. The score boost will be multiplied into the total score of the result.
Boosting can be done directly when creating the query:
String[] fields = {"importantField", "normalField"};
Map<String, Float> boosts = new HashMap<>();
boosts.put("importantField", 1.5f);
boosts.put("normalField", 1.0f);
Query query = new MultiFieldQueryParser(fields, analyzer, boosts)
        .parse(queryString);
Unfortunately this does not work for prefix/wildcard queries. There is a workaround, however: changing the rewrite method makes Lucene take the boosts into account, at the cost of a less efficient scoring algorithm.
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(fields, analyzer, boosts);
queryParser.setMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_REWRITE);
Lucene can provide suggestions to the user. For example if the query contains a spelling mistake, Lucene can
suggest the correct spelling.
To do this, the lucene-suggest module is needed. Using the already existing index of documents, we can build a
LuceneDictionary and use this as an input for our suggester.
Directory tempDirectory = [...]; // Directory for the suggester's temporary files
IndexReader indexReader = [...]; // Reader to the existing index
Analyzer analyzer = [...]; // Analyzer for the suggester
String filePrefix = [...]; // Prefix for the suggester's temporary files
String field = [...]; // Name of the field that should be used for suggestions
Lookup suggester = new FuzzySuggester(tempDirectory, filePrefix, analyzer);
LuceneDictionary dictionary = new LuceneDictionary(indexReader, field);
suggester.build(dictionary);
As a suggester we are using FuzzySuggester, which tolerates small spelling mistakes in the entered text. An alternative is AnalyzingInfixSuggester, which also matches tokens in the middle of a suggestion instead of only its prefix.
Then the suggestion lookup can be done like this:
int numberOfResults = 10;
List<Lookup.LookupResult> lookup = suggester.lookup(queryString, false, numberOfResults);
for (Lookup.LookupResult lookupResult : lookup) {
    System.out.println(lookupResult.key);
}