It is exciting to read the latest research advances in computational linguistics. In particular, the better the language models we build, the more accurate the downstream NLP systems we can design.
Update: if you are looking to run neural search with the latest Solr versions (starting from version 8.x), I have just published a new blog post where I walk you through the low-level implementation of the vector format and search, and the story of upgrading from 6.x to 8.x: https://firstname.lastname@example.org/fun-with-apache-lucene-and-bert-embeddings-c2c496baa559
Having a background in production systems, I have a strong conviction that it is important to deploy the latest theoretical achievements into real-life systems. This allows you to:
- see NLP in action in practical systems
- identify possible shortcomings and continue research and experimentation in promising directions
- iterate to achieve better performance: quality, speed and other important parameters, like memory consumption
For this story I’ve chosen to deploy BERT, a language model by Google, into Apache Solr, a production-grade search engine, to implement neural search. Traditionally, out-of-the-box search engines rank the documents they find with some form of TF-IDF or, more recently, BM25. TF-IDF, for instance, computes a cosine similarity between the vector representations of a query and a document. It operates over the space of TFs (term frequencies) and IDFs (inverse document frequencies), which combined tend to favour documents with a higher signal-to-noise ratio with respect to the input query. BM25 improves on this model, especially around the term frequency saturation problem: the contribution of a term levels off as it repeats, instead of continuing to grow.
The hot topic these days is neural search as an alternative to TF-IDF/BM25. That is, use a neural model to encode the query and the documents, and compute a similarity measure (not necessarily cosine) based on these encodings.
BERT stands for Bidirectional Encoder Representations from Transformers and is, in the words of its authors, “the first deeply bidirectional, unsupervised language representation, pretrained using only a plain text corpus”. In practice, this means that BERT will encode the “meaning” of a word, like bank in the sentence “I accessed the bank account”, using the previous context “I accessed the”…
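To make the TF-IDF and BM25 ideas mentioned above a bit more tangible, here is a minimal sketch in plain Python. It is not how Solr or Lucene implement scoring: the whitespace tokenizer, the smoothed IDF variant, the toy documents and the helper names (`tfidf_vector`, `cosine`, `bm25_tf`, `rank`) are all assumptions made for illustration. The values `k1=1.2` and `b=0.75` are commonly used BM25 defaults.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration only.
DOCS = [
    "the bank approved the loan",
    "the river bank was muddy",
    "the weather is sunny today",
]


def tokenize(text):
    """Lowercase whitespace tokenizer (real analyzers do much more)."""
    return text.lower().split()


def idf(term, docs_tokens):
    """Smoothed inverse document frequency: rarer terms get larger weights."""
    n_docs = len(docs_tokens)
    df = sum(1 for tokens in docs_tokens if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1.0


def tfidf_vector(tokens, docs_tokens):
    """Sparse TF-IDF vector as a dict: term -> tf * idf."""
    counts = Counter(tokens)
    return {term: tf * idf(term, docs_tokens) for term, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return the TF-IDF cosine score of the query against each document."""
    docs_tokens = [tokenize(d) for d in docs]
    query_vec = tfidf_vector(tokenize(query), docs_tokens)
    return [cosine(query_vec, tfidf_vector(t, docs_tokens)) for t in docs_tokens]


def bm25_tf(tf, k1=1.2, b=0.75, doc_len=1.0, avg_doc_len=1.0):
    """BM25 term-frequency component; it saturates and never reaches k1 + 1."""
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + length_norm)


if __name__ == "__main__":
    for doc, score in zip(DOCS, rank("bank loan", DOCS)):
        print(f"{score:.3f}  {doc}")
    # Repeating a term 100 times barely helps more than repeating it 10 times.
    for tf in (1, 10, 100):
        print(tf, round(bm25_tf(tf), 3))
```

In this toy setup the document that contains both query terms scores highest, the one that shares only "bank" scores lower, and the unrelated one scores zero. A neural approach would replace `tfidf_vector` with a model-produced dense vector while keeping the same "encode, then compare" shape.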