Ask a data scientist what the de facto language of data science is, and they'll probably respond with something like Python, R, or Julia. Some may say that language choice doesn't really matter, since most low-level number crunching is written in C/C++ anyway, and anything higher level just provides a binding to the language where the real work takes place.
The world took a big leap forward when it started bundling data together into objects. They could be manipulated with methods, passed around, put into or removed from storage, and more. However, in data science, we don't just care about manipulating raw objects (datapoints) any more. We also care about manipulating transformations over those objects, and how the transformations themselves can be transformed.
There are many reasons why it makes sense to use fully-powered functional programming for doing modern data science work, and we think there are more than a few benefits to using a safe, high-quality type system as well. Of course, any language choice that restricts what programmers are allowed to do (no mutable variables!?) will have its drawbacks, but we've found that we're actually able to get more done with the compiler guiding us to correctness and safety. We understand that what we've chosen isn't right for everyone in every situation, but it's certainly been working well for us.
These are some of the reasons we've chosen to use Haskell as the main development language at Hylomorphic:
Datapoints are immutable. Let's say you're recording the daily temperatures of a few cities near you. While the temperature changes over time, a single full datapoint captures the temperature, the city, and the time it was created. Datapoints don't update, so it makes sense to work with them in a language that not only allows you to mark something immutable (const, final, etc.) but in fact requires that they be immutable.
Once you've taken this step, it's not hard to see that data transformations slot nicely into the place of pure mathematical functions. Want the average of temperatures for each city? That's a mapping of the average function over the list of temperatures in each city. Want only the ten hottest cities? Take that transformation, sort by average temperature, and give back the first ten. The huge benefits from referential transparency—and therefore easy composability of transformations—vastly outweigh the ability to change a datapoint in place.
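Here's a minimal sketch of that pipeline (the names `average` and `tenHottest` and the sample data are ours, purely for illustration):

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

type City = String

-- Plain average of a list of readings.
average :: [Double] -> Double
average xs = sum xs / fromIntegral (length xs)

-- The ten hottest cities: average each city's temperatures,
-- sort descending by that average, keep the first ten.
tenHottest :: [(City, [Double])] -> [(City, Double)]
tenHottest =
  take 10
    . sortBy (comparing (Down . snd))
    . map (\(city, temps) -> (city, average temps))
```

Each stage is a pure function, so the whole thing is just composition: swap stages, add stages, or test any stage in isolation without worrying about hidden state.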
Of course, there are computational objects in the field of data science which do update over time. A simple example is a neuron in a neural network. While training the network, we want the neuron to update its weights in each step, and then propagate those weights along. Compare this to the typical functional way of thinking: at each step, create a new neural network with the same dimensions and architecture, but with different weights than the network in the previous step. Luckily, the existence of frameworks like TensorFlow allows us to stay in immutable-land. We build up a computational graph in our lovely pure functional environment, and can pass it off to the low-level code which does the dirty work of—fragile readers look away now—mutating objects.
Okay, so we can make trees and lists, protect ourselves from NULL with Maybe, and do pattern matching. What's the big deal?
The largest advantage we've found to using ADTs over traditional objects is the ease with which we can make impossible states impossible to represent. One classic example is a deck of cards. Each card will have a suit and a rank, of course. The easiest thing may be to reach for a string or an integer to represent suit and rank. Using strings leads to a host of representation problems: what if someone types "herats" instead of "hearts" by accident? What does "" (empty string) represent? If we use numbers, we're scarcely better off. Did 2 mean clubs or spades? If 2 is clubs, and 1 is spades, is spades truly less than clubs? The < operator seems to think so.
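A quick sketch of the ADT version (our own illustrative types, not from any particular library):

```haskell
-- Suit deliberately has no Ord instance, so asking whether
-- Spades < Clubs is a compile error rather than a silent answer.
data Suit = Clubs | Diamonds | Hearts | Spades
  deriving (Eq, Show, Enum, Bounded)

data Rank
  = Two | Three | Four | Five | Six | Seven | Eight | Nine | Ten
  | Jack | Queen | King | Ace
  deriving (Eq, Ord, Show, Enum, Bounded)

data Card = Card { rank :: Rank, suit :: Suit }
  deriving (Eq, Show)

-- Every card exactly once; "herats" simply cannot be written down.
fullDeck :: [Card]
fullDeck = [Card r s | s <- [minBound ..], r <- [minBound ..]]
```

There's no empty-string card, no thirteenth suit, and no accidental ordering on suits unless we choose to define one.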
Of course, we're being a little overdramatic here. It is possible to use enums in many languages, although it's often not as easy, and may not be the first thing programmers tend to reach for. Additionally, in some languages it's possible to add options to an enum at runtime, ruining the advantages of having a known representation at compile time.
The final big helper we get from ADTs is erasing NULL, nil, None, etc. Using Maybe, Either, or just plain well-designed representational ADTs lets us completely sidestep a whole class of bugs, and ensure that we handle all possibilities at compile time.
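For instance, here's one way parsing a messy field might look (the function names and sample inputs are hypothetical):

```haskell
import Text.Read (readMaybe)

-- A raw field from a messy CSV either parses to a temperature or
-- it doesn't; Maybe forces every caller to handle both cases.
parseTemp :: String -> Maybe Double
parseTemp = readMaybe

-- Average over only the fields that parsed; Nothing if none did.
averageValid :: [String] -> Maybe Double
averageValid fields =
  case [t | Just t <- map parseTemp fields] of
    [] -> Nothing
    ts -> Just (sum ts / fromIntegral (length ts))
```

Nothing can be forgotten here: the compiler rejects any code that uses the result of `averageValid` without saying what to do when there's no average at all.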
How does all this apply to data transformation and ML tasks? One of the hardest parts of data science is cleanup and organization of data, even before running any algorithms on it. With a well-designed system for representing data, and ensuring upfront that we can't accidentally use data in a way it wasn't meant to be used, we actually save a lot of work later on. Stronger guarantees may seem harder to work with initially, but make life so much easier when we need those guarantees in the future, instead of relying on a shaky tower of assumptions.
It's often useful to be able to write code in a way that directly mirrors how you're thinking about a problem within the problem's domain. One feature of Haskell that helps us with this is its laziness (non-strictness), which helps us think of entire programs as data transformation pipelines that only execute necessary parts.
One typical example of the surprising power of laziness is the ability to use infinite data structures. Think of the list of natural numbers, [0..]. In a strict language, this list would be evaluated, eventually causing the machine to run out of memory, since it doesn't have infinite space. Instead of evaluating immediately, lazy languages only evaluate what is needed later on. This lets us print out the first few elements of any infinite list without overloading memory, for example.
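Concretely:

```haskell
-- The full list of naturals: defined, but never materialized.
naturals :: [Integer]
naturals = [0 ..]

-- take forces only the first five even elements into existence;
-- the rest of the infinite list is never computed.
firstFiveEvens :: [Integer]
firstFiveEvens = take 5 (filter even naturals)
```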
While it's rare to run into infinite datasets, another way this notion is useful is with infinite streams of data. We don't have access to all the data right now, since more could come down the stream later. Yet we want to use the data we do have, and update as more data comes in. One way to handle this is only evaluating as far in the stream as we have so far. This lets us build shockingly efficient data pipelines, which consume data as it comes in, process it, and get us the best result available at any given moment.
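As a toy example of that idea (our own sketch, not a production streaming setup):

```haskell
-- A running mean over a (possibly endless) stream of readings:
-- element n of the output is the mean of the first n+1 inputs,
-- so at any moment we have the best estimate for the data so far.
runningMean :: [Double] -> [Double]
runningMean xs =
  zipWith (/) (scanl1 (+) xs) (map fromIntegral [1 :: Int ..])
```

Because the output is produced lazily, this works just as well on an unbounded stream as on a finite list: we only ever evaluate as far as the data we've actually received.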
There's not too much to say here, especially since a common technique is farming out the hard work to C/C++ where it can run on a GPU or be parallelized or any number of fancy optimizations. However, well-written Haskell code can often run much faster than equivalent Python or R. Laziness lets us avoid doing unnecessary work. And of course, the static type system allows for plenty of optimizations and eliminates the need for run-time type checking. Haskell is fast.
You're working on a list of integers, and writing a function that you're going to apply to each number in the list. Is it possible that there may be a non-integer in the list? Should we worry about checking for that? In a type-safe language, the answer to both of these is no. Static typing lets us guarantee certain things are true, and eliminates potential impossible representations.
A list of integers may only contain integers, and simple facts like these help eliminate a huge portion of potential headache. How often is a dataset improperly cleaned, such that there's some sort of string "NaN" or "n/a" that could corrupt our beautiful ints? How often is there some sort of null? Type safety absolutely annihilates these kinds of problems. If we know something is a list of Int at compile time, it will never be anything else.
It's also possible to enforce that dimensions line up. A common problem in data science is having two columns of different sizes that should align. They can become mismatched if we update one column but not the other, whether adding new data, deleting nulls, accidentally splicing at different points, etc. Through the use of certain constraints, we can ensure at compile time that dimensions that should be equal are in fact equal.
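A minimal sketch of the idea, using GHC's type-level naturals (libraries such as vector-sized do this for real, with smart constructors so the length can't be wrong; `Column`, `zipCols`, and the sample data here are our own toy names):

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Data.Proxy (Proxy (..))
import GHC.TypeLits (KnownNat, Nat, natVal)

-- A column tagged with its length at the type level.
newtype Column (n :: Nat) a = Column { unColumn :: [a] }

-- Both arguments must share the same type-level length n, so
-- mismatched columns are rejected before the program ever runs.
zipCols :: Column n a -> Column n b -> Column n (a, b)
zipCols (Column xs) (Column ys) = Column (zip xs ys)

-- The length is recoverable from the type alone.
colLength :: forall n a. KnownNat n => Column n a -> Integer
colLength _ = natVal (Proxy :: Proxy n)
```

Try to `zipCols` a `Column 3` with a `Column 4` and the program simply doesn't compile, which is exactly the mismatched-columns bug caught at the earliest possible moment.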
Letting the compiler keep track of exactly what's moving where, and what can be operated on in what way, is a lifesaver. It removes tedium, and eliminates entire classes of errors. Computers are very good at dealing with tedious bookkeeping tasks, and humans are not, so let's let them handle it.
There are many functional programming concepts that make sense for data science. Being purely functional means it's much easier to parallelize algorithms, as there's less potential for some part of the program to mutate another part from underneath us. Representing data transforms in a way that closely resembles mathematical functions helps us quickly apply new research to existing pipelines.
Currying alone provides some significant benefits. In Haskell, all functions are curried, meaning we can partially apply them, or apply them to one argument at a time in sequence. This lets us do some interesting things, like reordering our pipelines. Consider a case where we partially apply a transformation to some dataset, go off and do other work that relies on that partial transformation, and then come back and finish what we started. Curried functions also let us easily build lists of functions to apply to other lists of data, letting us climb one rung higher on the ladder of abstraction than single functions on datapoints.
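That "list of functions" idea fits in a few lines (the pipeline stages here are invented for illustration):

```haskell
-- Every Haskell function is curried: scaleBy 2 is itself a
-- function, a transformation waiting for its data.
scaleBy :: Double -> [Double] -> [Double]
scaleBy k = map (* k)

-- Partially applied functions collect naturally into a list...
pipeline :: [[Double] -> [Double]]
pipeline = [scaleBy 2, map (+ 1), filter (> 0)]

-- ...and fold into a single transformation, composed right to left
-- like mathematical function composition.
applyAll :: [[Double] -> [Double]] -> [Double] -> [Double]
applyAll = foldr (.) id
```

Reordering the pipeline is just reordering a list; no stage knows or cares what came before it.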
Finally, the ability to get our code as close as possible to mathematics, rather than a list of instructions, is often useful. We can factor out large chunks of functionality to create easily composable pieces. We can apply certain theorems to our code more easily, and test it with frameworks that automatically generate more test coverage than we'd usually bother writing ourselves.
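One small example of a theorem applied directly to code is the map-fusion law, map (f . g) = map f . map g, stated as a checkable property (a framework like QuickCheck would generate the inputs automatically; here we check a few by hand to keep the sketch dependency-free):

```haskell
-- Map fusion: mapping a composition equals composing the maps.
-- Written as a property so a generator can throw inputs at it.
prop_mapFusion :: [Int] -> Bool
prop_mapFusion xs =
  map ((+ 1) . (* 2)) xs == (map (+ 1) . map (* 2)) xs
```

Laws like this one also justify rewriting a two-pass pipeline into a single pass without fear of changing its meaning.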
Haskell isn't the best language for everything ever, or maybe even for most things most of the time. But it's certainly a great fit for the work we do, and we're happy with where it's taken us so far.