Return of the Data Munging Kata

> {-# LANGUAGE FlexibleInstances #-} 
> {-# LANGUAGE ScopedTypeVariables #-}
> {-# LANGUAGE InstanceSigs #-}

> module Kata where

This problem asks you to revisit the data munging Kata from the first homework assignment, to see how type classes can help structure programs to separate library routines from business logic.

In this problem, you can use any function in the Haskell Prelude or the following libraries:

> import qualified Data.Char as Char
> import qualified Data.List as List
> import qualified Data.Maybe as Maybe
> import qualified Data.Foldable as Foldable

> import System.IO

The code below also uses the following functions that we'll import explicitly.

> import Text.Read  (readMaybe)
> import Test.HUnit ( (~?=), Test(TestList) )

We'll also use the following library for NonEmpty lists

> import Data.List.NonEmpty (NonEmpty(..)) -- Lists that are NonEmpty, see below
> import qualified Data.List.NonEmpty as NE

Prelude: Working with `NonEmpty` lists

This time around, you need to be more careful about the use of functions that could trigger a runtime error. For example, calling the maximum function with an empty list triggers a runtime error.

  *Kata> maximum []
  maximum []
  *** Exception: Prelude.maximum: empty list

Furthermore, this kind of exeception is difficult to recover from in Haskell! Because it can be difficult to know for certain that a list is guaranteed to not be null, a good approach is to get the type system to help out.

The Haskell base library includes the following module that defines the parameterized type NonEmpty a to represent non-empty lists. This type is represented using standard lists and supports many of the usual operations that lists do.

For example, we can construct a nonempty list out of a regular list and a distinguished first element. For example,

    *Kata> let x = 'a' :| ['b', 'c', 'd']
    let x = 'a' :| ['b', 'c', 'd']
    *Kata> :type x
    :type x
    x :: NonEmpty Char

Because we know that there is a first element, the head operation is guaranteed to succeed. This function is defined in the Data.List.NonEmpty module, which we have imported using the shorter qualifier NE.

    *Kata> NE.head x
    NE.head x
    'a'

Furthermore, many of the usual list functions, such as maximum and length are overloaded, so they are available for both regular and NonEmpty lists.

    *Kata> length x
    length x
    4
    *Kata> maximum x
    maximum x
    'd'

Kata revisited

The goal of this problem is to develop a library for working with ad hoc data files like we saw in the first homework assignment.

This library should support good design: we want to cleanly separate the job of parsing the data file from any sort of processing that we want to do with that data. Furthermore, because we are creating a reusable library, we'll work with a simplified form of a standard data format---data files that separate values by commas. (The design of the library is loosely inspired by Cassava, a Haskell library for working with CSV data.)

A key idea of this library is that it should support working with a data model that represents only the data from the file that is relevant to our application.

For example, for the weather data, we might want to define a type that includes only the information from the file that we care about: the day number and the minimum and maximum temperatures.

> -- | Weather model
> data Weather = Weather {
>     day :: Day, maxTemp :: Int, minTemp :: Int 
>     } deriving (Eq, Show)

Furthermore, we will use a special purpose type to represent the day of the month. (A newtype is a restricted form of datatype that allows only one variant and one field.)

> -- | Type for Day of the month information: an Int in the range 1-31
> newtype Day = Day Int deriving (Eq, Show)

Using the library

The main operation of this library is the function parse, defined below, of type

  parse :: ParseRecord a => String -> Maybe (NonEmpty a)

This function parses the input file into a non-empty list data records. In the case of weather files, these records will be of type Weather, representing the weather data in each row of the input. The parse function needs extra information to figure out how to interpret the input string. The first line of input should be a header row that specifies the columns of the file, separated by commas. Then each row of the file can be split up into columns and parsed into the data model, using the overloaded parseRecord function for that type.

For example, for the Weather type, the definition looks like this:

> -- | Parse weather information from rows of tabulated strings
> instance ParseRecord Weather where
>     parseRecord :: [(String,String)] -> Maybe Weather
>     parseRecord row = Just Weather <**> row !? "DY" <**> row !? "MAX" <**> row !? "MIN"

Above, the parseRecord function expects a data row that has been converted into a simple dictionary of type [(String,String)]. The first component of each pair in the list is the name of the column (from the header row) and the second component should be a string with the value in that row. For example the data from the first line of jul21.csv would look like:

  [("DY"," 1"),("MAX"," 86"),("MIN"," 70"), ...]

To take this input and create our data model concisely, we use two infix operators. The first (!?) is just a wrapper for the library function lookup from Data.List. The second (<**>) is a function that we'll define below in our library.

> (!?) :: [(String,String)] -> String -> Maybe String
> row !? name = List.lookup name row

The (<**>) function has to know how to convert each data string into the appropriate type for the data model. For this, it relies on another overloaded function, parseField. (See below for the declaration of the ParseField class.)

For example, to parse a Day field, we create the following instance:

> -- | Parse day information, returns Nothing when the int is out of range
> instance ParseField Day where
>     parseField :: String -> Maybe Day
>     parseField str = case readMaybe str of 
>                        Nothing -> Nothing
>                        Just x -> if x < 0 || x > 31 then Nothing else Just (Day x)

This function converts a given string (such as " 1") into a Day while validating that it is in range. This parsing works in two steps: first we use the (overloaded) readMaybe function from the Text.Read library to ensure that the String represents an Int value. Then we validate the range of the Int before constructing a Day.

The library contains instances of ParseField for built-in types, like Int (see below). So this is all that we need to parse weather files.

> parseWeather :: String -> Maybe (NonEmpty Weather)
> parseWeather = parse

Business logic

Once we have defined our data model, the main program is short and sweet. For example, the program below computes the day of the month with the minimum temperature difference.

> -- | Compare two weather records based on the difference between their
> -- minimum and maximum temperatures
> compareTempDiff :: Weather -> Weather -> Ordering
> compareTempDiff w1 w2 = compare (diff w1) (diff w2) where
>     diff w = maxTemp w - minTemp w

> -- | A program that computes information about the weather
> weatherProgram :: IO ()
> weatherProgram = do
>     chars <- readFile "jul20.csv"
>     case parse chars of
>        Nothing ->
>          putStrLn "Cannot parse weather file."
>        Just ws -> do
>          let ans = day (Foldable.minimumBy compareTempDiff ws)  
>          putStrLn $ "The day with minimum temperature change was " ++ show ans

The punchline of this problem is that all we need to do to calculate the minimum temperature spread for weather files is the code to this point. Everything after this line is part of the implementation of a reusable library.

The Library

Now, we develop our general purpose library for processing data files. Your job for this section is to get weatherProgram above to work.

Unlike the first assignment, this library will assume that the data is stored in a CSV file. Notice that in both jul21.csv and soccer20.csv the first line of the file is a header row.

Your job is to implement the following two functions that divide up the header row by commas and divide up each data row, matching it up with the headers. If there are not enough header or data values, splitRow should produce as much output as it can.

> -- >>>  splitHeader "A,B,C"
> -- ["A","B","C"]
> splitHeader :: String -> [String]
> splitHeader = undefined

> -- >>> splitRow ["A","B","C"] "1,2,3"
> -- [("A","1"),("B","2"),("C","3")]
> -- >>> splitRow ["A", "B"] "1,2,3"
> -- [("A","1"),("B","2")]
> -- >>> splitRow ["A","B","C"] "1,2"
> -- [("A","1"),("B","2")]
> splitRow :: [String] -> String -> [(String, String)]
> splitRow header row = undefined

Note that a real library for CSV parsing would ignore ,s that appear inside quotes and treat them as part of the data (or header name). You do not need to do so for this exercise.

With these these functions, we can now define a general purpose parser for data files. This parser can produce any sort of result data, as long as there is an instance of the ParseRecord class (such as the one for Weather records above).

> class ParseRecord a where
>     -- Convert a list of string elements into row
>     parseRecord :: [(String,String)] -> Maybe a

> -- | Parse a data file into a list of data rows
> -- First line of the file must be a header line
> -- any rows in the data file that are unparseable are ignored
> parse :: ParseRecord a => String -> Maybe (NonEmpty a)
> parse str = NE.nonEmpty $ Maybe.mapMaybe parseRecord (tabulate str) where
>     tabulate :: String -> [[(String,String)]]
>     tabulate s = case lines s of
>         []        -> []
>         hd : rows -> map (splitRow header) rows where
>             header = splitHeader hd

There is nothing more for you to do for this part, other than make sure that you understand how parse works.

Field parsing

The library also includes an auxiliary class that we can use to tell the library how to parse each field of a record (such as Weather).

> class ParseField a where
>     parseField :: String -> Maybe a

For example, many of the fields of the Weather record are Ints. Therefore, we need to define an instance of this class for the Int type.

> instance ParseField Int where
>     parseField = readMaybe

In fact, many of Haskell's types can be parsed using Haskell's readMaybe library function, which also ignores any spaces before or after the value.

> instance ParseField Bool where
>     parseField = readMaybe

> instance ParseField Double where
>     parseField = readMaybe

We also can add an instance for Strings that doesn't do any parsing at all --- just passes the value through.

> instance ParseField String where
>     parseField = Just

Field chaining

The implementation of ParseRecord uses ParseField implicitly through the use of the binary operator <**>. This operator, defined below, combines together the various fields in the row. It uses ParseField to allow the type of each field in the structure to determine how it should be parsed as a Haskell value.

For example, suppose we have this example data model

> data Example = Example Int Day deriving (Show)

Then we can build/validate input field by field, returning Nothing if any of them fail to parse.

> -- >>> Just Example <**> Just "12 " <**> Just " 12"
> -- Just (Example 12 (Day 12))
> -- >>> Just Example <**> Just "47 " <**> Just " 47"
> -- Nothing

Your job is to fix the implementation of this operation, using parseField. The infixl tells the Haskell parser what precedence to use for this infix operator.

> infixl 4 <**>
> (<**>) :: ParseField a => Maybe (a -> b) -> Maybe String -> Maybe b
> 
> _ <**> _              = Nothing

At this point you should be able to run the weatherProgram above. Make sure that it works correctly before continuing.

Parsing different types of fields

The Weather data constructor only takes Day and Int arguments, but parseRecord is more general than that. All we need is an instance of ParseField to parse other types of data.

For example, String fields need no conversion, so they can be returned immediately.

A more full featured library would include additional instances for the basic Haskell types.

> -----------------------------------------------------------------
> -- Weather events

However, let's do more domain-specific parsing.

For example, the weather file includes a column (marked WX) for weather events. We can represent these events with a datatype. If multiple events happen on the same day, there will be multiple entries in the column.

> data Event = Fog | IntenseFog | Thunder | Ice | Hail
>            | FreezingRain | Duststorm | Smoke | BlowingSnow | Tornado
>     deriving (Eq, Ord, Enum, Show)

Your next job is to parse single events according to the legend shown in the data file. More specifically, your implementation should look at a single character ('1'-'9' or 'X') and produce the appropriate event above.

> parseEventChar :: Char -> Maybe Event
> 
> parseEventChar  _  = Nothing

> testParseEventChar :: Test
> testParseEventChar = TestList [
>     parseEventChar '1' ~?= Just Fog,
>     parseEventChar 'X' ~?= Just Tornado,
>     parseEventChar 'Y' ~?= (Nothing :: Maybe Event)
>   ]

We can use this parser for events to parse the sequence of events in the WX column. We'll start by defining a new type for a list of events.

> newtype Events = Events { getEvents :: [Event] } deriving (Eq,Show)

This newtype allows us to define a special purpose parser that parses each character individually and ignores any invalid events.

> -- >>> parseField "12 " :: Maybe Events
> -- Just (Events {getEvents = [Fog,IntenseFog]})
> -- >>> parseField "1D3 " :: Maybe Events
> -- Just (Events {getEvents = [Fog,Thunder]})

> instance ParseField Events where
>     parseField = undefined

Above, you might find functions in the Data.Maybe library useful.

Working with the Library

In the weather data file, the column marked DEP indicates the difference between the day's average temperature and the usual average temperature for that day. Modify the types above (or define new ones!) and fill in the definitions below so that we can also calculate the day where the temperature is the most unseasonable... i.e. the day with the greatest departure from normal.

> -- | return the day of the month with the largest departure from normal temperature
> mostUnseasonable :: NonEmpty Weather -> Day
> mostUnseasonable = undefined

Next, write a function that returns how many days of the month had some sort of precipitation. The column marked WTR contains this information, when that column has a T, that indicates a trace amount, which should be included in the result.

> -- | return how many days had some sort of precipitation
> numPrecip :: NonEmpty Weather -> Int
> numPrecip = undefined

Finally, write a function that returns the list of all foggy days. This function should count all days that include Fog or IntenseFog as a weather event.

> -- | return a list of all dates with fog events
> foggyDays :: NonEmpty Weather -> [Day]
> foggyDays = undefined

Make sure that you write some tests for these functions! This time we've given you two weather data files to play with: jul20.csv and jul21.csv. Make sure that your implementation works for both of them.

Premier League data

Finally, to make sure that you understand how this library works, use it to process Premier League tables, calculating the place # of the team with the smallest absolute difference between the number of wins W and number of losses L. If there are two teams with the same difference, the program should return the first one.

> soccer :: String -> Maybe Int
> soccer = undefined
> 
> 
> 
> soccerProgram :: IO ()
> soccerProgram = do
>     chars <- readFile "soccer20.csv"
>     case soccer chars of
>       Just answer ->
>          putStrLn $ "The team with the smallest difference placed " ++ show answer ++ "."
>       Nothing ->
>          putStrLn "Couldn't read input file"

CIS 552: Advanced Programming

Fall 2021

Return of the Data Munging Kata

Prelude: Working with `NonEmpty` lists

Kata revisited

Using the library

Business logic

The Library

Field parsing

Field chaining

Parsing different types of fields

Working with the Library

Premier League data

CIS 552: Advanced Programming

Fall 2021

Return of the Data Munging Kata

Prelude: Working with NonEmpty lists

Kata revisited

Using the library

Business logic

The Library

Field parsing

Field chaining

Parsing different types of fields

Working with the Library

Premier League data

Prelude: Working with `NonEmpty` lists