In this post we'll be building a simple tool called "Dupe Recoup" which helps us find the rows in a CSV file most likely to be duplicates of each other. Full code for this project can be found at github.com/hylomorphic/dupe-recoup.
The way we'll be using higher-order functions is fairly simple. Each column of the CSV may contain a different kind of data (categorical, string, numeric, completely useless, etc.). We want to create "scoring" functions that allow us to reuse scoring code in a very simple way: a single list of
Scorer functions will give us all we need to compare two entire rows from the CSV.
Though most of this code was written top-down (using a function we'd like to have and defining that function afterwards), we'll go through it bottom-up, so it's easier to see how things build on each other and could be extended.
Of course, before we write any code it will be helpful to examine a sample dataset that we can test with. In this case, we have a simple file listing the 100 most popular baby names of 2017. Each line has a number (the rank of that name's popularity), a baby name, and either "girl" or "boy" as given by our data source.
1,Sophia,girl 1,Jackson,boy 2,Olivia,girl 2,Liam,boy 3,Emma,girl 3,Noah,boy
Even with this simple data, it's clear that some rows of the CSV will be more similar than others. For example, the name "Charlie" appears in two rows (one for "girl" and one for "boy"). We can also begin to see which scorers may be useful: since the ranking doesn't tell us much about name similarity, it may be wise to ignore it.
We're going to start with the most abstract functions we'll need. First up is
applyToEach, which takes a list of functions
[f1, f2..] and two lists of type
[b], and applies each function to its corresponding places in the other lists in order, giving us back
[f1 a1 b1, f2 a2 b2..]
applyToEach :: [a -> b -> c] -> [a] -> [b] -> [c] applyToEach  _ _ =  applyToEach _  _ =  applyToEach _ _  =  applyToEach (f:fs) (x:xs) (y:ys) = f x y : applyToEach fs xs ys
This is quite abstract, but we'll see in a moment why the ability to apply a list of functions to two lists of arguments is useful when comparing rows in our CSV.
We'll also want a helper that gets us all the unique pairs of elements from a list
pairs :: Ord a => [a] -> [(a, a)] pairs l = [(x,y) | x <- l, y <- l, x < y]
With these in place we can start in on the domain-specific code.
Here's a nice function that lets us go from a list of Scorer functions and two rows from the csv, to a single float that gives us a nice score for the similarity between the two rows (where zero is more similar).
scoreOnePair :: [Scorer] -> Row -> Row -> Float scoreOnePair scorers row1 row2 = sum $ applyToEach scorers row1 row2
That's nice, the score for a pair of rows is just the sum of scores for each pair of cells in the two rows. But wait, hang on a second! Where did all these types come from? Here you go:
type Scorer = Text -> Text -> Float type Row = [Text]
Scorer is a function that takes in two
Texts and gives us back a
Float score for their similarity. This helps us compare two cells in the CSV with each other. Similarly, since a row is a list of cells, and our cells are represented as
Row is a list of
Text. Why do we bother renaming
Text -> Text -> Float as
Scorer? It's mostly for simplicity. We know if we see
myFunction :: Scorer, it's going to follow the same interface as the other scorers, and can be used in our list of
Scorers that we apply to pairs of rows. This renaming is cheap, and makes it a bit easier to reason about the code, so let's go ahead and do it.
We have a way to get a score between a single pair of rows, so the obvious thing to do now is generalize this over all pairs of rows. What we'll get back is a list of tuples of scores and the pair they're the score for.
score :: [Scorer] -> [Row] -> [(Float,Row,Row)] score scorers csv = [(scoreOnePair scorers t1 t2, t1, t2) | (t1, t2) <- pairs csv]
Finally, now that we have the list of scores and their pairs, it's pretty easy to sort by the scores.
sortByScore :: [(Float,Row,Row)] -> [(Float,Row,Row)] sortByScore = sortBy (compare `on` first) where first (x,_,_) = x
There are many ways we could score the similarity between two
Text cells in a csv, but we're only going to need a few. By convention, 0 means perfect similarity, and 1 perfect dissimilarity, with other scores in between. The first
Scorer we'll write is a trivial comparison that doesn't add any dissimilarity.
ignore :: Scorer ignore _ _ = 0
We'll later use this for the popularity rank of each baby name, since we don't really care.
Scorer we could make is whether the two
Texts are an exact match. This will help us deal with categorical data, since the strings representing the category of a datapoint have to be exact matches for two points to be in the same category.
exactMatch :: Scorer exactMatch a b | a == b = 0 | otherwise = 1
Finally, we'll need a way to compare the similarity of two strings. Here, I've opted for a version of Levenshtein distance which is squished to between 0 and 1 by dividing by the length of the longer string.
normalizedLevenshtein :: Scorer normalizedLevenshtein a b = if lev == 0 then 0 else lev / maxLength where lev = fromIntegral $ levenshtein a b maxLength = fromIntegral . maximum $ map T.length [a, b] levenshtein :: Text -> Text -> Int levenshtein a b = levenshteinDistance defaultEditCosts (T.unpack a) (T.unpack b)
Scorer will be useful for comparing the similarity of two names.
Here's the final chunk of code we'll need to make our scorer work. First, we initialize a list of scorers (
[Scorer]) which thanks to Haskell's type system, can never contain anything other than functions
Text -> Text -> Float. If we try to add a different function (maybe
Text -> Text -> Int) we get a compilation error, but we also get the assurance we can't make a silly mistake in the signature of any function we're using as a
main :: IO () main = do let scorers = [ignore, normalizedLevenshtein, exactMatch] parsed <- parseCSVFromFile "data/baby_names.csv" case parsed of Left err -> print err Right csv -> mapM_ print . take 10 . sortByScore . score scorers . map (map T.pack) . init $ csv
As you can see, all we need to do then is a simple call to a library which parses the CSV file for us, and then we print out the top 10 most similar pairs of names, sorted by scores as defined in
Here's the result!
(0.125,["42","Isabelle","girl"],["5","Isabella","girl"]) (0.16666667,["1","Sophia","girl"],["75","Sophie","girl"]) (0.16666667,["13","Aubrey","girl"],["64","Audrey","girl"]) (0.16666667,["37","Hailey","girl"],["90","Bailey","girl"]) (0.2,["5","Lucas","boy"],["75","Luca","boy"]) (0.2,["8","Mason","boy"],["94","Jason","boy"]) (0.25,["14","Lily","girl"],["60","Lila","girl"]) (0.25,["27","Mila","girl"],["6","Mia","girl"]) (0.25,["27","Mila","girl"],["60","Lila","girl"]) (0.2857143,["1","Jackson","boy"],["94","Jason","boy"])
Higher-order functions exist in many languages. It's possible to write code similar to what we just wrote in many of them, although sometimes that code is more verbose or could contain side effects.
What an advanced type system really gives us here is peace of mind. When we go back to extend this code for other types of data we might find in a CSV, we know as long as we write our
Scorer correctly as
Text -> Text -> Float, it should slot right into place.
This is a very simple use case for higher-order functions, but the ability to use them here saves us a bunch of processing headache of separating out and recombining columns of the CSV. If we ever want to pick a different scorer for a column, we only need to change a single function name in a single place.
Once again, if you'd like to see the full code for this mini project, you can find it at github.com/hylomorphic/dupe-recoup.