## Using Higher-Order Functions for Score Assignment

In this post we'll be building a simple tool called "Dupe Recoup" which helps us find the rows in a CSV file most likely to be duplicates of each other. Full code for this project can be found at github.com/hylomorphic/dupe-recoup.

The way we'll be using higher-order functions is fairly simple. Each column of the CSV may contain a different kind of data (categorical, string, numeric, completely useless, etc.). We want to create "scoring" functions that allow us to reuse scoring code in a very simple way: a single list of Scorer functions will give us all we need to compare two entire rows from the CSV.

Though most of this code was written top-down (using a function we'd like to have and defining that function afterwards), we'll go through it bottom-up, so it's easier to see how things build on each other and could be extended.

#### Dataset

Of course, before we write any code it will be helpful to examine a sample dataset that we can test with. In this case, we have a simple file listing the 100 most popular baby names of 2017. Each line has a number (the rank of that name's popularity), a baby name, and either "girl" or "boy" as given by our data source.

1,Sophia,girl
1,Jackson,boy
2,Olivia,girl
2,Liam,boy
3,Emma,girl
3,Noah,boy


Even with this simple data, it's clear that some rows of the CSV will be more similar than others. For example, the name "Charlie" appears in two rows (one for "girl" and one for "boy"). We can also begin to see which scorers may be useful: since the ranking doesn't tell us much about name similarity, it may be wise to ignore it.

#### Let's Write Some Functions

We're going to start with the most abstract functions we'll need. First up is applyToEach, which takes a list of functions [f1, f2..] and two lists of type [a] and [b], and applies each function to its corresponding places in the other lists in order, giving us back [f1 a1 b1, f2 a2 b2..]

applyToEach :: [a -> b -> c] -> [a] -> [b] -> [c]
applyToEach [] _ _ = []
applyToEach _ [] _ = []
applyToEach _ _ [] = []
applyToEach (f:fs) (x:xs) (y:ys) = f x y : applyToEach fs xs ys


This is quite abstract, but we'll see in a moment why the ability to apply a list of functions to two lists of arguments is useful when comparing rows in our CSV.

We'll also want a helper that gets us all the unique pairs of elements from a list

pairs :: Ord a => [a] -> [(a, a)]
pairs l = [(x,y) | x <- l, y <- l, x < y]


With these in place we can start in on the domain-specific code.

#### Starting in on the domain-specific code

Here's a nice function that lets us go from a list of Scorer functions and two rows from the csv, to a single float that gives us a nice score for the similarity between the two rows (where zero is more similar).

scoreOnePair :: [Scorer] -> Row -> Row -> Float
scoreOnePair scorers row1 row2 =
maxLength = fromIntegral . maximum $map T.length [a, b] levenshtein :: Text -> Text -> Int levenshtein a b = levenshteinDistance defaultEditCosts (T.unpack a) (T.unpack b)  This final Scorer will be useful for comparing the similarity of two names. #### Tying it all together Here's the final chunk of code we'll need to make our scorer work. First, we initialize a list of scorers ([Scorer]) which thanks to Haskell's type system, can never contain anything other than functions Text -> Text -> Float. If we try to add a different function (maybe Text -> Text -> Int) we get a compilation error, but we also get the assurance we can't make a silly mistake in the signature of any function we're using as a Scorer. main :: IO () main = do let scorers = [ignore, normalizedLevenshtein, exactMatch] parsed <- parseCSVFromFile "data/baby_names.csv" case parsed of Left err -> print err Right csv -> mapM_ print . take 10 . sortByScore . score scorers . map (map T.pack) . init$ csv


As you can see, all we need to do then is a simple call to a library which parses the CSV file for us, and then we print out the top 10 most similar pairs of names, sorted by scores as defined in scorers.

Here's the result!

(0.125,["42","Isabelle","girl"],["5","Isabella","girl"])
(0.16666667,["1","Sophia","girl"],["75","Sophie","girl"])
(0.16666667,["13","Aubrey","girl"],["64","Audrey","girl"])
(0.16666667,["37","Hailey","girl"],["90","Bailey","girl"])
(0.2,["5","Lucas","boy"],["75","Luca","boy"])
(0.2,["8","Mason","boy"],["94","Jason","boy"])
(0.25,["14","Lily","girl"],["60","Lila","girl"])
(0.25,["27","Mila","girl"],["6","Mia","girl"])
(0.25,["27","Mila","girl"],["60","Lila","girl"])
(0.2857143,["1","Jackson","boy"],["94","Jason","boy"])


#### So What?

Higher-order functions exist in many languages. It's possible to write code similar to what we just wrote in many of them, although sometimes that code is more verbose or could contain side effects.

What an advanced type system really gives us here is peace of mind. When we go back to extend this code for other types of data we might find in a CSV, we know as long as we write our Scorer correctly as Text -> Text -> Float, it should slot right into place.

This is a very simple use case for higher-order functions, but the ability to use them here saves us a bunch of processing headache of separating out and recombining columns of the CSV. If we ever want to pick a different scorer for a column, we only need to change a single function name in a single place.

Once again, if you'd like to see the full code for this mini project, you can find it at github.com/hylomorphic/dupe-recoup.