Brice Thomas dev, code, nlp and stuff.

Inverted index in Scala

Last week I had to build an inverted index to greatly speed up a program doing parallel document identification, and I was quite impressed by how simple it was in Scala!

Inverted index?

It’s simple. Let’s say you have this index:

For each line you have a document identifier followed by the words appearing in the document. Here, word1 is in document0.txt, but not in document1.txt.

And you want to turn this index into:

For each word you have the documents in which it appears. Here, word0 appears in document0.txt and document1.txt.

This is... Read more
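The transformation described above can be sketched in a few lines of Scala. This is only a minimal illustration, not necessarily the code from the full post: the names `InvertedIndexSketch`, `forwardIndex` and `invert` are hypothetical, and the sample data (including `word2`) is made up to match the description, where word0 is in both documents and word1 is only in document0.txt.

```scala
object InvertedIndexSketch {
  // Hypothetical forward index: document identifier -> words it contains.
  val forwardIndex: Map[String, Seq[String]] = Map(
    "document0.txt" -> Seq("word0", "word1"),
    "document1.txt" -> Seq("word0", "word2")
  )

  // Invert it: word -> set of documents in which the word appears.
  def invert(index: Map[String, Seq[String]]): Map[String, Set[String]] =
    index.toSeq
      // Flatten into (word, document) pairs.
      .flatMap { case (doc, words) => words.map(word => word -> doc) }
      // Group the pairs by word.
      .groupBy { case (word, _) => word }
      // Keep only the document identifiers for each word.
      .map { case (word, pairs) => word -> pairs.map(_._2).toSet }
}
```

Calling `InvertedIndexSketch.invert(InvertedIndexSketch.forwardIndex)` gives a map in which `word0` points to both documents and `word1` only to document0.txt. A `Set` is used for the documents so that a word repeated within one document is listed once.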

N-gram models in Scala

So simple it hurts, thanks to Scala and the power of functional programming.

Let’s start!

Wait! N-gram?

I’ll assume you know what an n-gram is. If not, read this.
By the way, in NLP, an n-gram model is one kind of what we call a Language Model (LM).

Let’s build it, step by step

So, we need some text data to learn from.
We’ll go with a simple example:

“The blue sky is near the red koala near the blue sky”

Now, let’s tokenize it.
Here, we will simply split it... Read more
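A minimal sketch of the tokenizing and n-gram steps on the example sentence, under stated assumptions: tokens are produced by lowercasing and splitting on whitespace, and n-grams come from a sliding window over the tokens. The names `NGramSketch`, `tokenize`, `ngrams` and `counts` are hypothetical and may differ from what the full post uses.

```scala
object NGramSketch {
  // The example sentence from the post.
  val text = "The blue sky is near the red koala near the blue sky"

  // Tokenize: lowercase, then split on runs of whitespace (an assumption).
  def tokenize(s: String): Seq[String] =
    s.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // All n-grams of size n, taken with a sliding window over the tokens.
  // The guard avoids the single short window that sliding would otherwise
  // return when there are fewer than n tokens.
  def ngrams(tokens: Seq[String], n: Int): Seq[Seq[String]] =
    if (tokens.length < n) Seq.empty else tokens.sliding(n).toSeq

  // Count how many times each n-gram occurs.
  def counts(tokens: Seq[String], n: Int): Map[Seq[String], Int] =
    ngrams(tokens, n)
      .groupBy(identity)
      .map { case (gram, occurrences) => gram -> occurrences.size }
}
```

With this sketch, `NGramSketch.counts(NGramSketch.tokenize(NGramSketch.text), 2)` counts the bigrams of the sentence. For instance, "the blue" and "near the" each occur twice, while "red koala" occurs once.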

