5 Things About Scala (that I wish I knew 6 years ago)

Like a lot of developers, I first heard about Scala 6 years ago, when Twitter ported their failure-prone Ruby backend to a language I'd never heard of, one that looked kind of like Python but performed like Java or C++. If I had to give a one-sentence elevator pitch for Scala, I'd still say what I read back then: Scala is a statically compiled language with all of the performance, safety, and tooling of Java, without the verbosity, so that a developer can write Scala code as fast as Ruby, Python, or JavaScript.

However, the one big hurdle I had while learning Scala is that it is an unusually large language, with a lot of syntax. Because it has strong support for both object-oriented and functional programming idioms, it's entirely possible to write several successful Scala programs, then encounter one in a different style and feel like you're starting from scratch--it's certainly happened to me more than once. This post is an attempt to work through a few of the really big-picture "aha" moments I've had since 2009, not just because they're informative, but because at each step, my appreciation for Scala as a language deepened substantially.

Type inference != dynamic typing

If you've seen Java code at all, you've probably seen code in this idiom:

To unpack, the first ThisIsMyClass declares the type of a variable, the lower-case thisIsMyClass names it, and the new ThisIsMyClass() invokes the constructor that actually creates an instance of the class. Scala is different. An equivalent Scala line would be:
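A minimal sketch of that line, assuming ThisIsMyClass is a placeholder class with a no-argument constructor (the Java idiom is shown as a comment for comparison):

```scala
// Placeholder class, assumed purely for illustration.
class ThisIsMyClass

// Java: ThisIsMyClass thisIsMyClass = new ThisIsMyClass();
var thisIsMyClass = new ThisIsMyClass() // the type ThisIsMyClass is inferred
```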

Under the hood, what's going on to allow this is type inference, where the compiler sets the type of the var thisIsMyClass to be whatever the result of ThisIsMyClass() happens to be. This is different from what happens in dynamic languages, like JavaScript, because there is still a static type assigned to the variable, and that isn't allowed to change. So this will produce a type error:
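A small sketch of the kind of code that triggers it. The offending assignment is commented out so that the snippet still compiles:

```scala
var x = 1   // x is inferred to be an Int
// x = "1"  // uncommenting this fails to compile: a String is not an Int
```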

because var x has the inferred type Int, and the String "1" can't be assigned to it.

Likewise, if we want to create an empty list and then add an item to it, we can run into problems:
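Something along these lines shows the problem (the variable name is just for illustration, and the failing line is commented out):

```scala
var xs = List()  // no element type given, so this is inferred as List[Nothing]
// xs = 1 :: xs  // fails to compile: 1 :: xs is a List[Int], not a List[Nothing]
```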

This gives us a type error: the empty list was inferred to be a List[Nothing], so when we add an Int to it we get a List[Int], which can't be stored back in a List[Nothing] variable. To get this to work, we have to do it like this:
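A sketch of the working version, with the element type stated up front (again, the variable name is only illustrative):

```scala
var xs = List[Int]() // empty for now, but explicitly a List[Int]
xs = 1 :: xs         // xs is now List(1)
```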

(For background, the double colon :: operator is called cons, and it's used to add an element on the front of a linked list. The :: representation is from ML, a lovely functional language that was a strong influence on Scala's design.) Basically, we've told the compiler here that even though the list is empty now, it will eventually hold Ints, and nothing but Ints.

This example also highlights my favorite feature of Scala: type-safe generics. In contrast, in many other languages, like Python, Objective-C, or JavaScript, there is a single general-purpose List, and you can put whatever you want in it; as a result, there's no guarantee what the next object you pull out of it will be, which can cause all kinds of errors. An experienced programmer could certainly argue that the discipline of only putting one kind of object into a given list is pretty basic, but I think this is one of those places where Scala really strikes the perfect balance between conciseness and safety.

Options can protect you from errors

Another irritation that every Java programmer (and probably every Java end user) has seen is something called a NullPointerException. Null pointers can happen when a computer stores the address of an object which itself lives at another location in memory--which is generally fine, as long as you always remember to create the object, and keep its value up to date. If a computer tries to look up the object, and it doesn't exist any more, you'll probably get a segmentation fault. If the object hasn't been created at all, the pointer will probably point at the address 0, a.k.a. null, and you'll get a NullPointerException instead.

Segmentation faults are quite rare in Java, but NPEs are incredibly frequent, because it's quite common to use null to represent the result of a function that can fail: imagine a lookup function that either returns the object matching some criteria, or null if it doesn't find anything. Again, an experienced programmer will point out that as long as the developer knows to check for a null result, it's fine, and without this convention, it's way more work to represent behaviors that can fail, which is a very good point. Scala's solution is a type called Option.

Just like the List we used above, Option is a generic type, which means that it takes another type as a parameter. Option can be thought of as a special container, that can hold either exactly one object of a given type, or nothing at all. What's cool is that Scala provides a lot of ways for us to use Option where the compiler can check that we're using it right, and handle that as compactly or verbosely as we need for the situation at hand. For example, if we have
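A minimal sketch, taking the names s1 and s2 and the value "foo" from the discussion that follows:

```scala
val s1: Option[String] = Some("foo") // holds exactly one String
val s2: Option[String] = None        // holds nothing at all
```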

and we want to get the value out of both, we can just do s1.get which returns "foo", but s2.get raises an error. We would be much safer using getOrElse(), which lets us supply a default value; thus, s1.getOrElse("") still returns "foo", but s2.getOrElse("") now returns the empty string. But that's not all we can do. Now let's put the two Options together into a List of Options
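For instance (the name options is my own choice here):

```scala
val s1: Option[String] = Some("foo")
val s2: Option[String] = None

val options = List(s1, s2) // a List[Option[String]]
```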

now we can do
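A sketch using pattern matching to supply the empty string as a default, assuming the list is called options:

```scala
val options: List[Option[String]] = List(Some("foo"), None)

val withDefaults = options.map {
  case Some(value) => value // unwrap the value when there is one
  case None        => ""    // fall back to the empty string otherwise
}
```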

Which gives us List("foo",""), which is good, but what if we don't want the None values in the result at all? Then we can do:
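One way to do that, sketched here with the standard flatten method (flatMap would work as well):

```scala
val options: List[Option[String]] = List(Some("foo"), None)

val present = options.flatten // drops the None and unwraps the Some
```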

which just gives us List("foo"), which is probably exactly what we wanted in the first place.

This might have been a bit in-depth, but what I want to illustrate is how Scala lets us transform operations that may succeed or fail at multiple stages, which are incredibly hard to get right, into operations on zero or more items, which are easy to combine, generalize, and reuse. As programs become larger and more complex, this ability becomes incredibly valuable.

Functions are objects

In many object-oriented languages, most notably Java, standalone functions aren't allowed at all; instead there are only methods, which are behaviors attached to objects. Likewise, many dynamic languages allow functions to be stored in variables and data structures, more or less like objects, but offer no facility for checking that the input or return types of a function match a given interface. In contrast, Scala's type system allows us to describe functions elegantly and precisely. For example, if we expressed the matching operation above as a function, we would do it like this:
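A sketch of such a function value, where the name unwrap is an assumption of mine:

```scala
val unwrap = (x: Option[String]) => x match {
  case Some(value) => value
  case None        => ""
}
```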

Which gives us a result of type: (x: Option[String]) => String, which roughly means "a function that takes an argument named x of type Option[String] and returns a String". This function uses Scala's pattern-matching syntax, which is pretty straightforward, but worth reading up on if you're interested. An even more concise example would be something like:
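For instance, a two-argument addition function (the name add is illustrative):

```scala
val add = (x: Int, y: Int) => x + y
```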

Which is (x:Int, y:Int) => Int. Since this is a type, just like any other, we can specify it as a function parameter, to define a safe second-order function. For example,
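A minimal sketch, using the names binop, op, x, and y from the description that follows:

```scala
// op can be any function from two Ints to an Int.
def binop(op: (Int, Int) => Int, x: Int, y: Int): Int = op(x, y)

val sum     = binop((a, b) => a + b, 2, 3) // 5
val product = binop(_ * _, 2, 3)           // 6
```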

binop takes op as its first argument, that is, any function that can take two Ints as arguments and return an Int; binop also takes two Ints, x and y, and when called just passes them into op and returns the result. Although it's not an obviously useful example, unless you're implementing a pocket calculator, it really illustrates Scala's power at designing and specifying interfaces. If you were building a database library, or a web framework, or some other large piece of modular code, you could allow other developers to extend it with safe, well-defined functional behaviors, to, say, serialize a type for storage in a database, or render it to a web page.

Streams process big data (or infinite data) efficiently

Functional programming can be an obscure art at times, with monads, monoids, semigroups, combinators, and so on; however, it also has concepts that have become so pervasive in mainstream programming as to be indispensable: lexical scope, closures, and lambdas are definitely in that category, and I would argue that lazy lists should be another. But first I have to talk about linked lists.

A linked list is a simple recursive data structure: it consists of a series of cells, each of which has two parts: a head, which is a single piece of data, followed by a tail, which is either (a) a pointer to another cell, or else (b) nothing. In the functional idiom, linked lists are processed one item at a time, starting from the front: you process the head, then if the tail is non-empty, you follow the pointer to the next item, process its data, and continue until you get to a cell with an empty tail, which means you're done. In a purely functional language, all this can be done elegantly and efficiently, but it's certainly possible to implement in any language with pointers. In Scala, it's pretty simple also:
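A sketch of one possible definition, chosen to match the MyList(head, tail) shape that appears in the output further down. Making it generic in A is my own assumption:

```scala
// A cell holds one piece of data plus an optional link to the next cell.
case class MyList[A](head: A, tail: Option[MyList[A]])
```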

With this definition, I can create small lists by hand like this:
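For example, assuming the case class sketched from the description above:

```scala
case class MyList[A](head: A, tail: Option[MyList[A]])

val one   = MyList(1, None)
val two   = MyList(1, Some(MyList(2, None)))
val three = MyList(1, Some(MyList(2, Some(MyList(3, None)))))
```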

And so on. Which is kind of clunky, so let's write a function that's cleaner.
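One possible helper, sketched here under the hypothetical name prepend:

```scala
case class MyList[A](head: A, tail: Option[MyList[A]])

// Builds a new cell whose tail points at the existing list.
def prepend[A](item: A, list: MyList[A]): MyList[A] =
  MyList(item, Some(list))
```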

This function adds to the front of the list, so we'd start at the back:
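A sketch of that construction, assuming a helper called prepend (a hypothetical name) that puts one item in front of an existing list:

```scala
case class MyList[A](head: A, tail: Option[MyList[A]])

def prepend[A](item: A, list: MyList[A]): MyList[A] =
  MyList(item, Some(list))

// Start with the last cell, then add 2 and 1 in front of it.
val built = prepend(1, prepend(2, MyList(3, None)))
```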

which gives us MyList(1,Some(MyList(2,Some(MyList(3,None))))), so we know they're equivalent. Whereas to access items from the list, we can do this:
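A sketch of getFromList, following the description below; the exact signature is my guess:

```scala
case class MyList[A](head: A, tail: Option[MyList[A]])

def getFromList[A](list: MyList[A], n: Int): Option[A] =
  if (n == 0) Some(list.head) // counted down to the item we want
  else list.tail match {
    case Some(rest) => getFromList(rest, n - 1) // keep walking
    case None       => None                     // fell off the end
  }
```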

getFromList will call itself, traversing the list an item at a time, while counting down n, until either (a) it gets to n = 0, at which point it returns Some() of the current item, or (b) it goes off the end of the list, and returns None. And that, in a nutshell, is a linked list.

Before you run off and start using this code, I need to note that you should absolutely not use the code above, because Scala gives you a much, much nicer implementation: in fact, it's the same built-in List() class that we used at the beginning of this post, and Scala has much more convenient and robust tools for manipulating these than I can hack together in a blog post. For example, we've seen that the cons operator :: lets us build up a list like so:
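For example, with the built-in List and Nil marking the empty end of the list:

```scala
val xs = 1 :: 2 :: 3 :: Nil // same as List(1, 2, 3)
```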

which is way cleaner, or we can just do List(1,2,3), for that matter.

So if that's a linked list, what's a lazy list? A lazy list is similar in many respects. It is cell oriented, and processed one item at a time, in the same head/tail pattern we used before. However, instead of having a pointer to the next item in the tail, a lazy list just has a function to compute the next item, which may not even exist yet. Not only does Scala include a powerful lazy list type in the Stream class, there's even a lazy cons operator #:: that will build it up just like a linked list:
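A minimal sketch. Note that in Scala 2.13 and later, Stream is deprecated in favor of LazyList, which supports the same #:: operator:

```scala
val s = 1 #:: 2 #:: 3 #:: Stream.empty[Int]
```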

Which is kind of pointless, but because the right-hand side of the #:: operator isn't evaluated until it's used, it's actually possible to create infinite lists:
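For example, a recursive definition with no end (the name numsFrom is my own):

```scala
// The recursive call on the right of #:: only runs when the tail is needed.
def numsFrom(n: Int): Stream[Int] = n #:: numsFrom(n + 1)
```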

and now we can do
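A sketch, assuming an infinite-stream helper called numsFrom (a hypothetical name):

```scala
def numsFrom(n: Int): Stream[Int] = n #:: numsFrom(n + 1)

numsFrom(25).take(10).foreach(println)
```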

which creates an infinite list of numbers, starting at 25, takes the first 10, and prints each one, giving us the numbers 25 through 34, one per line.

Which is, again, cool but kind of pointless. Mathematically-inclined functional programmers will talk about the power of lazy sequences to create infinite data structures: the Fibonacci sequence, every prime number, all the digits of pi, and so on; however, I'm a systems programmer at heart, and I care about disks, networks, data, and processor cycles. To me, lazy evaluation is an irreplaceable tool for solving those kinds of problems.

Let's say I have millions of items in a remote database, and want to write a function that scans the whole database and returns only those items having some property or matching some constraint. But I have too many items to hold in memory at once, and besides, my user might only want the first 10 matches, so I shouldn't need to retrieve the whole sequence anyway; on the other hand, I don't know exactly how many database items to pull to get exactly 10 items in the filtered result. Now, this is the sort of thing that any programmer, in any language, should be able to do in 15 minutes or so, but Scala lets us define reusable, general purpose abstractions that will do all the boilerplate for us:
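One way such an abstraction might look is a hand-rolled lazy filter. The name lazyFilter and its signature are assumptions of mine, and the built-in filter method on Stream already behaves much the same way:

```scala
// Skips ahead to the next match, emits it, and defers the rest of the scan
// until the caller actually asks for more.
def lazyFilter[A](s: Stream[A])(p: A => Boolean): Stream[A] = {
  val rest = s.dropWhile(x => !p(x))
  if (rest.isEmpty) Stream.empty
  else rest.head #:: lazyFilter(rest.tail)(p)
}
```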

which lets us do things like get all the numbers, starting from 1, that are divisible by 3:
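A sketch, assuming a lazy filter helper named lazyFilter (hypothetical) and the standard Stream.from to count upward from 1:

```scala
def lazyFilter[A](s: Stream[A])(p: A => Boolean): Stream[A] = {
  val rest = s.dropWhile(x => !p(x))
  if (rest.isEmpty) Stream.empty
  else rest.head #:: lazyFilter(rest.tail)(p)
}

// Infinite, but only computed as far as it is consumed.
val threes = lazyFilter(Stream.from(1))(_ % 3 == 0)
```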

and verify it like this:
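For instance, by printing the first ten matches. Here the built-in filter stands in for whatever filtering helper is in use, and the count of ten is my own choice:

```scala
val threes = Stream.from(1).filter(_ % 3 == 0)

threes.take(10).foreach(println) // prints 3, 6, 9, ... up to 30, one per line
```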

which gives us the multiples of 3, in order: 3, 6, 9, and so on.

Which I think is pretty special, not because of any fancy syntax, but because we've created abstractions that allow developers to create and manipulate infinite sequences without having to worry about the underlying implementation. This power of safe, precise abstraction, and a community devoted to it, is what makes Scala special, and I hope to write more about it soon.

Futures are a safe and easy way to do concurrent processing

So, let's try to actually construct a simple client that retrieves JSON arrays in batches from a simple REST service, parses them, and presents the result as a Stream[Int]. To start, let's just code the basic retrieval function:
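A self-contained sketch of the shape such a function could take. To keep it runnable without extra libraries, it makes several assumptions that are not in the original: the HTTP call is a hypothetical fetch hook passed in as a parameter, the JSON parsing is a tiny stand-in that only understands flat arrays of integers, and the query parameter names offset and count are guesses.

```scala
// Simplified stand-in for a real JSON library: parses "[1, 2, 3]" and nothing fancier.
def parseIntArray(body: String): Stream[Int] = {
  val inner = body.trim.stripPrefix("[").stripSuffix("]").trim
  if (inner.isEmpty) Stream.empty
  else inner.split(",").toStream.map(_.trim.toInt) // NumberFormatException on a non-integer
}

// fetch is a hypothetical hook mapping a URL to a response body.
def getJSON(fetch: String => String)(baseUrl: String, params: Map[String, String],
                                     offset: Int, count: Int): Stream[Int] = {
  val allParams = params + ("offset" -> offset.toString) + ("count" -> count.toString)
  val query = allParams.map { case (k, v) => s"$k=$v" }.mkString("&")
  parseIntArray(fetch(s"$baseUrl?$query"))
}
```

Against a live service, the hook could be something like url => scala.io.Source.fromURL(url).mkString.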

Assuming a complementary REST service exists at base_url, this just pulls out count records, starting from offset, with any additional query params. Then it parses the result as JSON, and tries to convert the result to a Stream, rather than a List, of Ints. If you're trying at home, you'll need to add the Play Framework's JSON library, but if so, you can test it out like this:
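Without a live service or the library at hand, the parsing half can still be exercised on its own. This sketch uses a hand-written stand-in parser (an assumption, not Play's API) on a canned response body:

```scala
// Stand-in parser for flat JSON arrays of integers only.
def parseIntArray(body: String): Stream[Int] = {
  val inner = body.trim.stripPrefix("[").stripSuffix("]").trim
  if (inner.isEmpty) Stream.empty
  else inner.split(",").toStream.map(_.trim.toInt)
}

val ints = parseIntArray("[1, 2, 3, 4, 5]") // a Stream[Int]
```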

whereas this will get you a conversion exception:
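With the same kind of stand-in parser (an assumption, not Play's API), feeding in an array of strings fails with a NumberFormatException. A real JSON library would raise its own exception type instead:

```scala
// Stand-in parser for flat JSON arrays of integers only.
def parseIntArray(body: String): Stream[Int] = {
  val inner = body.trim.stripPrefix("[").stripSuffix("]").trim
  if (inner.isEmpty) Stream.empty
  else inner.split(",").toStream.map(_.trim.toInt)
}

// Wrapped in Try so the failure can be inspected instead of crashing the script.
val attempt = scala.util.Try(parseIntArray("""["a", "b"]""").toList)
```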

Which is enough type safety for the purposes of this talk.

Now, we have a Stream[Int] here from a REST service, but it is by no means infinite, so the "lazy" Stream is sort of beside the point. Let's try to stream the whole thing together:
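A sketch of the batching logic, assuming a hypothetical fetchBatch(offset, count) function that stands for the retrieval call with the URL and params already filled in, and assuming that an empty batch means the server has nothing left:

```scala
def streamAll(fetchBatch: (Int, Int) => Stream[Int], offset: Int, count: Int): Stream[Int] = {
  val batch = fetchBatch(offset, count)
  if (batch.isEmpty) Stream.empty
  // #::: is lazy on its right-hand side, so the next fetch only happens
  // once the current batch has been used up.
  else batch #::: streamAll(fetchBatch, offset + count, count)
}
```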

So now we're fetching JSON in batches, streaming Ints back one at a time, and going back to the server for more when we run out, until the server doesn't have anything left. Not bad for eight lines of code. Implementing a service to test this against is totally outside the scope of this talk, but I'm going to put up something fake with a flat file to do a sanity check.

So, we've successfully abstracted a remote service into the same Stream[Int] type that we were generating before, which is awesome, but it still has one problem, which is that the whole program is going to stop to fetch those JSON results every time the list runs out. If we're just doing simple integer manipulations, all the computation done by the program may go by thousands, or even millions, of times faster than we can gulp down from the internet, so we're just going to be sitting and wasting processor cycles for most of the program's lifecycle.
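One way to overlap fetching with processing is to wrap the next request in a Future. This is a sketch under assumptions of my own: fetchBatch(offset, count) is a hypothetical stand-in for the retrieval call, and the 30-second timeout is arbitrary.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

def prefetchingStream(fetchBatch: (Int, Int) => Stream[Int], offset: Int, count: Int): Stream[Int] = {
  def loop(current: Stream[Int], currentOffset: Int): Stream[Int] =
    if (current.isEmpty) Stream.empty
    else {
      val nextOffset = currentOffset + count
      // Kick off the next request right away, on the global thread pool.
      val next = Future(fetchBatch(nextOffset, count))
      // Await only runs when the current batch has been fully consumed.
      current #::: loop(Await.result(next, 30.seconds), nextOffset)
    }
  loop(fetchBatch(offset, count), offset)
}
```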

Not bad. Now we're going to start retrieving the next batch as soon as we start processing the current one. That concurrent.ExecutionContext.Implicits.global gives us a thread pool to execute tasks on, and Future { getJSON(base_url,params,offset + count, offset) } sets up a task, and returns a Future[Stream[Int]] that will eventually contain a response, but doesn't have to finish until we get to the Await.result() call, only after all the current items have been consumed.

So now we have parallelism, but we might not have enough. We can start the requests a little bit ahead of time, but as long as we're only pulling 50 results in at a time, we'll still be spinning our wheels a lot of the time. We have a few ways we can tune this; one way is to increase the count parameter, either statically or with some kind of ramp-up factor, which is probably adequate for a lot of cases, but in this case, I'm going to do multiple concurrent queries to try to max out my throughput.
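A simplified sketch of the idea, not the full solution: it fires off a group of requests at once and then consumes their results in order. The fetchBatch function, the parallelism parameter, and the 30-second timeout are all assumptions of mine, and unlike a fully asynchronous version, this one blocks until the first group of responses has arrived.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

def concurrentStream(fetchBatch: (Int, Int) => Stream[Int],
                     offset: Int, count: Int, parallelism: Int): Stream[Int] = {
  // Start `parallelism` requests for consecutive batches, all at once.
  def launch(from: Int): Vector[Future[Stream[Int]]] =
    Vector.tabulate(parallelism)(i => Future(fetchBatch(from + i * count, count)))

  def loop(pending: Vector[Future[Stream[Int]]], nextOffset: Int): Stream[Int] = {
    val batches = pending.map(f => Await.result(f, 30.seconds)) // collected in request order
    val items = batches.toStream.flatten
    if (batches.exists(_.isEmpty)) items // an empty batch means the server is exhausted
    else {
      val following = launch(nextOffset) // next group starts before this one is consumed
      items #::: loop(following, nextOffset + parallelism * count)
    }
  }
  loop(launch(offset), offset + parallelism * count)
}
```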

Whew! So, this is actually a tricky real-world problem, and this is by no means a comprehensive solution. It's not even particularly idiomatic Scala. But it does start to demonstrate the full power of Scala's type system. I've constructed a stream of single Int results, parsed in order from concurrent REST queries, interleaved with lazy function calls to generate more query batches as needed. It also returns immediately, without waiting for the initial query to return, so it's fully asynchronous.