Brice Thomas dev, code, nlp and stuff.

Inverted index in Scala

Last week I had to build an inverted index to greatly speed up a program doing parallel document identification. I was quite impressed by how simple it was in Scala!

Inverted index?

It’s simple. Let’s say you have this index:

For each line you have a document identifier followed by the words appearing in the document. Here, word1 is in document0.txt, but not in document1.txt.

And you want to turn this index into:

For each word you have the documents in which it appears. Here, word0 appears in document0.txt and document1.txt.
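Here is a minimal sketch of that transformation. The sample lines, the whitespace-separated layout, and the names InvertedIndexSketch, lines and invert are all assumptions made for illustration, not the code from the original post.

```scala
object InvertedIndexSketch {
  // Hypothetical forward index: one line per document,
  // a document identifier followed by the words it contains.
  val lines: Seq[String] = Seq(
    "document0.txt word0 word1",
    "document1.txt word0 word2"
  )

  // Turn "doc -> words" lines into a "word -> docs" map.
  def invert(lines: Seq[String]): Map[String, Set[String]] =
    lines
      .map(_.trim.split("\\s+").toList)
      // First token is the document identifier, the rest are its words.
      .collect { case doc :: words => words.map(word => word -> doc) }
      .flatten
      // Group the (word, doc) pairs by word and keep the distinct documents.
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).toSet }
}
```

With this sample input, InvertedIndexSketch.invert(InvertedIndexSketch.lines) maps word0 to both documents and word1 to document0.txt only.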

This is... Read more

N-gram models in Scala

So simple it hurts.
Thanks to Scala and the power of functional programming.

Let’s start!

Wait! N-gram?

I’ll assume you know what an n-gram is. If not, read this.
By the way, in NLP, an n-gram model is one kind of what we call a Language Model (LM).

Let’s build it, step by step

So, we need some text data to learn from.
We’ll go with a simple example:

“The blue sky is near the red koala near the blue sky”

Now, let’s tokenize it.
Here, we will simply split it... Read more
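As a rough sketch of where this is heading: the excerpt is cut off before it says how the text is split, so splitting on whitespace is an assumption here, and so are the names NGramSketch, tokenize, ngrams and counts. The n-grams themselves come from the standard sliding method on Scala collections.

```scala
object NGramSketch {
  val text = "The blue sky is near the red koala near the blue sky"

  // Assumption: tokens are separated by whitespace, with no lowercasing,
  // so "The" and "the" are treated as different tokens.
  def tokenize(s: String): List[String] =
    s.trim.split("\\s+").toList

  // Every window of n consecutive tokens.
  def ngrams(tokens: List[String], n: Int): List[List[String]] = {
    require(n >= 1, "n must be at least 1")
    // sliding yields one shorter window when there are fewer than n tokens,
    // so keep only the full-size windows.
    tokens.sliding(n).filter(_.size == n).toList
  }

  // How many times each n-gram occurs in the token list.
  def counts(tokens: List[String], n: Int): Map[List[String], Int] =
    ngrams(tokens, n).groupBy(identity).map { case (gram, occ) => gram -> occ.size }
}
```

On the example sentence this gives 12 tokens and 11 bigrams, and the bigram ("blue", "sky") is counted twice.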

Welcome

This is it.
Something is coming.