
N-gram models in Scala

So simple it hurts.
Thanks to Scala and the power of functional programming.

Let’s start!

Wait! N-gram?

I’ll assume you know what an n-gram is. If not, read this.
By the way, in NLP, an n-gram model is what we call a Language Model (LM).

Let’s build it, step by step

So, we need some text data to learn from.
We’ll go with a simple example:

“The blue sky is near the red koala near the blue sky”

Now, let’s tokenize it.
Here, we will simply split it on whitespace:
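A minimal sketch of that step (the variable names are mine):

// Split the sentence on whitespace to get a list of tokens.
val text = "The blue sky is near the red koala near the blue sky"
val tokens = text.split("\\s+").toList
// List(The, blue, sky, is, near, the, red, koala, near, the, blue, sky)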

Then we need to create all the tuples of tokens. In a bigram (2-gram), we need every adjacent couple. For trigram (3-gram), every adjacent triple, etc.

Well, we can do it in one line with sliding:
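For a bigram, sliding(2) gives every adjacent pair (a sketch, reusing tokens from above):

// Every adjacent pair of tokens, i.e. all the bigrams.
val n = 2
val ngrams = tokens.sliding(n).toList
// List(List(The, blue), List(blue, sky), List(sky, is), List(is, near), ...)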

We will stick with the bigram for simplicity.

For each tuple, we want its probability. To compute it, we first need to count the occurrences of each unique tuple.

In Scala, we can simply group the elements of the list with groupBy(identity), which returns a Map. And because identical tuples have the same identity, they will be grouped together! Then, we just have to mapValues(_.size) it to get the counts.

Literally one simple line:
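Something like this (assuming Scala 2.12-style collections; on 2.13+ you would write .view.mapValues(_.size).toMap):

// Count how many times each unique n-gram occurs.
val ngramWithCount = ngrams.groupBy(identity).mapValues(_.size)
// Map(List(near, the) -> 2, List(blue, sky) -> 2, List(sky, is) -> 1, ...)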

And now, we can compute probabilities. It’s a little harder, but not so much:
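A naive version divides each n-gram’s count by the total count of all n-grams sharing the same prefix, i.e. the same first n - 1 tokens (again a 2.12-style sketch):

// P(last word | prefix) = count(n-gram) / count(all n-grams with the same prefix)
val ngramWithProbability = ngramWithCount.map { case (ngram, count) =>
  val prefixTotal = ngramWithCount.filterKeys(_.init == ngram.init).values.sum
  ngram -> count.toDouble / prefixTotal
}
// e.g. P(blue | the) = count(the, blue) / count(the, *) = 1.0 / 2.0 = 0.5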

Nice, right?

But the ngramWithCount.filterKeys(...).values.sum is very costly if you call it for each entry of the n-gram… When working on medium-sized (or larger) documents, it will take forever :(

Of course we can fix it, by creating an index for sums:
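For example, along these lines (prefixCounts is just my name for that index):

// Pre-compute, once, the total count for each prefix.
val prefixCounts = ngramWithCount
  .groupBy { case (ngram, _) => ngram.init }
  .mapValues(_.values.sum)

// Now each probability is a single lookup instead of a full scan.
val ngramWithProbability = ngramWithCount.map { case (ngram, count) =>
  ngram -> count.toDouble / prefixCounts(ngram.init)
}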

Way faster!

Make it LM-ready

Now we can predict the probability of a word appearing, given the n - 1 previous word(s). But in a Language Model (LM), it’s important to take the beginning and the end of a sentence into consideration. And right now we don’t.

By convention, we use <s> and </s> markers. We need to prepend n - 1 <s> markers to our list of tokens and append one </s>.

And that’s all! Then we can sliding(n) it and use the same functions as before to build the n-gram.
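For a bigram (n = 2), that padding looks roughly like this:

// Prepend n - 1 start markers and append one end marker, then slide as before.
val n = 2
val paddedTokens = List.fill(n - 1)("<s>") ++ tokens :+ "</s>"
val paddedNgrams = paddedTokens.sliding(n).toList
// List(List(<s>, The), List(The, blue), ..., List(blue, sky), List(sky, </s>))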

Turn it into a module

We now have everything we need to turn all of this into a small Scala module that creates n-gram models in one line.
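Here is a sketch of what such a module could look like, wrapping the steps above (the object name NGram and the build signature are chosen to match the usage below; it assumes the text source is a plain String):

object NGram {

  type Gram = List[String]

  // Build an n-gram model from raw text: a Map from each n-gram to its probability.
  def build(text: String, n: Int): Map[Gram, Double] = {
    // Tokenize and add the sentence markers.
    val tokens = text.split("\\s+").toList
    val padded = List.fill(n - 1)("<s>") ++ tokens :+ "</s>"

    // Count each unique n-gram.
    val counts = padded.sliding(n).toList.groupBy(identity).mapValues(_.size).toMap

    // Index the counts by prefix, so we never re-sum for every entry.
    val prefixCounts = counts.groupBy { case (gram, _) => gram.init }.mapValues(_.values.sum).toMap

    // Probability of each n-gram given its prefix.
    counts.map { case (gram, count) => gram -> count.toDouble / prefixCounts(gram.init) }
  }
}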

There is nothing new here. With this module, the end user only has one line to write. For example, to create a trigram:

val trigram = NGram.build(myTextDataSource, 3)

And we are done!

To be continued…

In a future post, we will see how to generate text from these n-gram models. It’s really fun, and the results are sometimes even impressive! Of course, we will need a BIGGER source of text to train our n-grams…