Low-level testing your Lucene TokenFilters

Dmitry Kan
5 min read · Oct 24, 2024


Another re-blog: this time about Lucene's TokenFilters (originally published on 9 June 2014). For those into neural search from scratch, I also wrote this piece, which deals with embeddings at the Lucene level.

At the recent Berlin Buzzwords conference, in a talk on Apache Lucene 4, Robert Muir mentioned Lucene's internal testing library. This library is essentially a collection of classes and methods that form the test bed for Lucene committers, but the same library can be put to perfectly good use in your own code. Dawid Weiss has talked about randomized testing with Lucene, which is not the focus of this post, but is a great way of running your usual static tests with randomization.

This post will show a few code snippets that illustrate how to use the Lucene test library to verify the consistency of your custom TokenFilters at a lower level than you may be used to.

Credits: http://blog.csdn.net/caoxu1987728/article/details/3294145
I'm putting this fancy graph in to prove that posts with images are opened more often than those without. OK, it has relevant parts too: in particular, we are looking into creating our own TokenFilter alongside StopFilter, LowerCaseFilter, StandardFilter and PorterStemFilter.

In the naming-convention spirit of the previous post, where custom classes started with a GroundShaking prefix, let's create our own MindBlowingTokenFilter class. For the sake of illustration, our token filter will take each term from the token stream, append a "mindblowing" suffix to it and store the result in the stream as a new term. This class will be the basis for writing unit tests.

package com.dmitrykan.blogspot;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

import java.io.IOException;

/**
 * Created by dmitry on 6/9/14.
 */
public final class MindBlowingTokenFilter extends TokenFilter {

    public static final String MIND_BLOWING_SUFFIX = "mindblowing";

    private final CharTermAttribute termAtt;
    private final PositionIncrementAttribute posAtt;
    // dummy attribute, needed to comply with BaseTokenStreamTestCase assertions;
    // don't remove it, otherwise the low-level test will fail
    private final PositionLengthAttribute posLenAtt;

    // holds the original term between two calls to incrementToken()
    private State save;

    /**
     * Construct a token stream filtering the given input.
     *
     * @param input the upstream token stream
     */
    protected MindBlowingTokenFilter(TokenStream input) {
        super(input);
        this.termAtt = addAttribute(CharTermAttribute.class);
        this.posAtt = addAttribute(PositionIncrementAttribute.class);
        this.posLenAtt = addAttribute(PositionLengthAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // second call for the current term: emit the saved original term,
        // whose position increment was set to 0 below
        if (save != null) {
            restoreState(save);
            save = null;
            return true;
        }

        if (input.incrementToken()) {
            // pass through zero-length terms
            int oldLen = termAtt.length();
            if (oldLen == 0) return true;
            int origPosIncrement = posAtt.getPositionIncrement();

            // save the original term with a zero position increment,
            // so that it lands in the same position as the suffixed term
            posAtt.setPositionIncrement(0);
            save = captureState();

            // append the suffix to the term buffer
            char[] buffer = termAtt.resizeBuffer(oldLen + MIND_BLOWING_SUFFIX.length());
            for (int i = 0; i < MIND_BLOWING_SUFFIX.length(); i++) {
                buffer[oldLen + i] = MIND_BLOWING_SUFFIX.charAt(i);
            }

            // emit the suffixed term first, with the original position increment
            posAtt.setPositionIncrement(origPosIncrement);
            termAtt.copyBuffer(buffer, 0, oldLen + MIND_BLOWING_SUFFIX.length());

            return true;
        }
        return false;
    }
}
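
To see what the filter actually emits, here is a minimal consumption sketch, assuming Lucene 4.x (the WhitespaceTokenizer constructor with a Version argument and the LUCENE_47 constant are version-specific assumptions; the demo class itself is hypothetical and must live in the same package, since the filter's constructor is protected):

package com.dmitrykan.blogspot;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;

import java.io.StringReader;

public class MindBlowingDemo {
    public static void main(String[] args) throws Exception {
        // hypothetical demo: tokenize "your queries" on whitespace and pipe it through the filter
        Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_47, new StringReader("your queries"));
        TokenStream stream = new MindBlowingTokenFilter(tokenizer);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posInc = stream.addAttribute(PositionIncrementAttribute.class);

        stream.reset();
        while (stream.incrementToken()) {
            // each input term comes out twice: first suffixed (posInc=1), then as-is (posInc=0)
            System.out.println(term + " (posInc=" + posInc.getPositionIncrement() + ")");
        }
        stream.end();
        stream.close();
    }
}

For "your queries" this prints yourmindblowing (1), your (0), queriesmindblowing (1), queries (0): two terms per input term, sharing one position.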

The next thing we would like to do is write a Lucene-level test suite for this class. We will extend BaseTokenStreamTestCase rather than the standard TestCase or some other class from a testing framework you may be used to. The reason is that we'd like to utilize Lucene's internal test functionality, which lets you access and cross-check lower-level items like term position increments, position lengths, and position start and end offsets.
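
Note that BaseTokenStreamTestCase ships in the lucene-test-framework artifact, not in lucene-core, so it needs to be on your test classpath. A minimal Maven sketch, assuming Lucene 4.x (substitute the version you actually build against):

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-test-framework</artifactId>
    <version>4.9.0</version>
    <scope>test</scope>
</dependency>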

You can see roughly the same information on Apache Solr's analysis page if you enable verbose mode. While the analysis page is good for visually debugging your code, the unit test is meant to run every time you change and build your code. If you decide to first visually examine the term positions and the start and end offsets with Solr, you'll need to wrap the token filter in a factory and register it on your field type in the schema. The factory code:

package com.dmitrykan.blogspot;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

import java.util.Map;

/**
 * Created by dmitry on 6/9/14.
 */
public class MindBlowingTokenFilterFactory extends TokenFilterFactory {

    public MindBlowingTokenFilterFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public MindBlowingTokenFilter create(TokenStream input) {
        return new MindBlowingTokenFilter(input);
    }
}
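
With the factory packaged into a jar and made visible to Solr (for example via a lib directory), registering it on a field type in schema.xml could look like the sketch below; the field type name and the tokenizer choice are illustrative assumptions:

<fieldType name="text_mindblowing" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="com.dmitrykan.blogspot.MindBlowingTokenFilterFactory"/>
  </analyzer>
</fieldType>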

Here is the test class in all its glory:

package com.dmitrykan.blogspot;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.analysis.Tokenizer;

import java.io.IOException;
import java.io.Reader;

/**
 * Created by dmitry on 6/9/14.
 */
public class TestMindBlowingTokenFilter extends BaseTokenStreamTestCase {

    private Analyzer analyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new MockTokenizer(reader, MockTokenizer.WHITESPACE, true);
            return new TokenStreamComponents(source, new MindBlowingTokenFilter(source));
        }
    };

    public void testPositionIncrementsSingleTerm() throws IOException {
        String[] output = {"queries" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "queries"};
        // the position increment of the first term must be 1 and that of the second must be 0,
        // because the second term is stored in the same position in the token filter stream
        int[] posIncrements = {1, 0};
        // dummy values, but the test does not run without them
        int[] posLengths = {1, 1};

        assertAnalyzesToPositions(analyzer, "queries", output, posIncrements, posLengths);
    }

    public void testPositionIncrementsTwoTerms() throws IOException {
        String[] output = {
                "your" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "your",
                "queries" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "queries"};
        // again the 1-0 pattern: each suffixed term advances the position,
        // each original term is inserted into the same position
        int[] posIncrements = {1, 0, 1, 0};
        // dummy values, but the test does not run without them
        int[] posLengths = {1, 1, 1, 1};

        assertAnalyzesToPositions(analyzer, "your queries", output, posIncrements, posLengths);
    }

    public void testPositionIncrementsFourTerms() throws IOException {
        String[] output = {
                "your" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "your",
                "queries" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "queries",
                "are" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "are",
                "fast" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "fast"};
        // position increments follow the 1-0 pattern, because for each next term we insert
        // a new term into the same position (i.e. its position increment is 0)
        int[] posIncrements = {
                1, 0,
                1, 0,
                1, 0,
                1, 0};
        // dummy values, but the test does not run without them
        int[] posLengths = {
                1, 1,
                1, 1,
                1, 1,
                1, 1};

        assertAnalyzesToPositions(analyzer, "your queries are fast", output, posIncrements, posLengths);
    }

    public void testPositionOffsetsFourTerms() throws IOException {
        String[] output = {
                "your" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "your",
                "queries" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "queries",
                "are" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "are",
                "fast" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "fast"};
        // both terms in each pair share the offsets of the original term in the input string:
        // "your" = [0, 4), "queries" = [5, 12), "are" = [13, 16), "fast" = [17, 21)
        int[] startOffsets = {
                0, 0,
                5, 5,
                13, 13,
                17, 17};
        int[] endOffsets = {
                4, 4,
                12, 12,
                16, 16,
                21, 21};

        assertAnalyzesTo(analyzer, "your queries are fast", output, startOffsets, endOffsets);
    }
}

All tests should pass, and yes, the same numbers are present on Solr's analysis page:

Solr’s admin page with token analysis

Happy unit testing with Lucene!

your @dmitrykan
