Packages

So far in this class, we have focused on “programming in the small”: functions, algorithms, data structures, etc. We did not talk about how to structure the code itself. We will do that today by taking a closer look at modules and showing how to create your own packages.

This is relevant for your project: you should package it up this way, so that its dependencies are well specified.

Modules

We start with the code from last class, where we ran some statistics over some data. The module contains code that does various things: it parses the data, computes the statistics, and writes out the result.

Obviously, this is not good style in the long run, as the file will just grow too big. It also does not encourage reuse: in another project, one will likely want to use the same parsing code, but then do different things with the data.

Name and path

Therefore, we will put the parsing code into a module of its own. For that, we need to come up with a module name. You have already seen the pattern: capitalized words separated by dots. A good name might be Data.Lahmann.Parse.

The module name dictates the filename, with dots replaced by directory separators. We therefore move the parsing code (some of which we have to rip out of the main function) into a file named Parse.hs in a directory Data/Lahmann, which has to start with

module Data.Lahmann.Parse where

In our main program, we can now import Data.Lahmann.Parse and everything works as before. A few imports can be removed, which is a good sign: The user of Data.Lahmann.Parse should not have to worry about whether the data is stored in CSV files or any other format. Using the flag -fwarn-unused-imports assists in finding useless imports.

It may make sense not to hard-code the directory name of the files into the parsing code, but rather to pass it in from the main function.
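A minimal sketch of that idea follows. The helper names battingFile and readBattingRaw and the file name Batting.csv are assumptions made for illustration, not taken from the lecture code: the parsing side only knows file names, and the caller supplies the directory.

```haskell
import System.FilePath ((</>))

-- | Hypothetical helper: where the batting table lives inside a data
-- directory. The file name "Batting.csv" is an assumption.
battingFile :: FilePath -> FilePath
battingFile dir = dir </> "Batting.csv"

-- | Stand-in for the real parser entry point: it receives the directory
-- as an argument instead of hard-coding it.
readBattingRaw :: FilePath -> IO String
readBattingRaw dir = readFile (battingFile dir)

main :: IO ()
main = do
  -- The main program, not the parser, decides where the data lives.
  let dir = "data"
  putStrLn ("Would read " ++ battingFile dir)
```

With this shape, a different program can reuse the parsing code on data stored elsewhere simply by passing another directory.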

Exporting only what is required

By saying import Data.Lahmann.Parse in the main file, we are importing everything that that module exports. The module currently exports everything that is defined in that module, including internal helpers such as addBattingSums. We can see that by firing up GHCi and looking at the output of :browse Data.Lahmann.Parse.

This is bad style: the user of the Parse module should be oblivious to the internal workings of the module, and the programmer responsible for the Parse module should have the freedom to add, remove or change any internal functions without breaking other code.

Therefore, a module should always declare its public interface by stating which functions, types and constructors are exported. This is done via an export list that names all exported entities after the module name:

module Data.Lahmann.Parse
  ( PlayerId         -- exporting a type synonym
  , Batting(Batting) -- exporting a type with a constructor
  , Player(..)       -- exporting a type with all its constructors
  , …
  ) where
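For illustration, a complete (if tiny) module with an export list might look as follows. Only the names PlayerId, Batting and addBattingSums come from the text above; the record fields, the function totalBatting and all function bodies are made up for this sketch.

```haskell
module Data.Lahmann.Parse
  ( PlayerId          -- type synonym
  , Batting(Batting)  -- type together with its constructor
  , atBats, hits      -- field accessors have to be listed as well
  , totalBatting      -- public function
  ) where

type PlayerId = String

-- Hypothetical fields; the real record has more columns.
data Batting = Batting
  { atBats :: Int
  , hits   :: Int
  } deriving (Show, Eq)

-- Internal helper: it is not in the export list, so importers cannot
-- see it, and we are free to rename or remove it later.
addBattingSums :: Batting -> Batting -> Batting
addBattingSums (Batting ab1 h1) (Batting ab2 h2) =
  Batting (ab1 + ab2) (h1 + h2)

-- | Sums up a list of batting records.
totalBatting :: [Batting] -> Batting
totalBatting = foldr addBattingSums (Batting 0 0)
```

With such an export list in place, :browse Data.Lahmann.Parse in GHCi should no longer list addBattingSums.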

Where to put the types?

Let us put the statistics code into a module Data.Lahmann.Stats. We now run into a problem: where do we put the type definitions (data Batting, data Player)? They currently reside in Data.Lahmann.Parse, but having the statistics module import the parser feels very wrong. So would moving the type definitions to the statistics module and importing that from the parser.

The only way out is to create yet another module, say, Data.Lahmann.Types, that defines all the common types of our project.
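As a rough sketch, such a types module could look like this. The fields are hypothetical placeholders; only the type names and the module names are taken from the text. The comments at the end indicate how the other two modules would then depend on it, and not on each other.

```haskell
module Data.Lahmann.Types
  ( PlayerId
  , Batting(..)
  , Player(..)
  ) where

type PlayerId = String

-- Hypothetical fields, for illustration only.
data Batting = Batting
  { atBats :: Int
  , hits   :: Int
  } deriving (Show, Eq)

data Player = Player
  { playerId   :: PlayerId
  , playerName :: String
  , batting    :: Batting
  } deriving (Show, Eq)

-- In Data/Lahmann/Parse.hs and Data/Lahmann/Stats.hs, one would then write
--
--   import Data.Lahmann.Types
--
-- so that neither of the two needs to import the other.
```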

And finally, the module Data.Lahmann.Stats.Html contains code that takes the data, runs the stats, and writes the result to a given file.

Packages

We now have a bunch of modules and a main program. These are not independent artifacts; they are inherently connected, and we want to express this connection. A bunch of related modules is called a package.

We have worked with packages before: We installed existing packages from Hackage using the cabal tool. Now we will create our own package.

A cabal file

To do so, we simply run cabal init in the current directory and answer all the questions. We start with a library package. Afterwards, we open the lahmann-stats.cabal file and clean it up a bit, so that it might look like this:

name:                lahmann-stats
version:             0.1.0.0
synopsis:            Baseball stats
license:             BSD3
license-file:        LICENSE
author:              Joachim Breitner
maintainer:          mail@joachim-breitner.de
category:            Data
build-type:          Simple
extra-source-files:  ChangeLog.md
cabal-version:       >=1.10


library
  exposed-modules:
    Data.Lahmann.Types,
    Data.Lahmann.Parse,
    Data.Lahmann.Stats,
    Data.Lahmann.Stats.Html

  build-depends:
     base >=4.8 && <4.9,
     bytestring >=0.10 && <0.11,
     blaze-html >=0.8 && <0.9,
     cassava >=0.4 && <0.5,
     vector >=0.11 && <0.12,
     containers >=0.5 && <0.6,
     filepath >=1.4 && <1.5,
     statistics >=0.13 && <0.14
  default-language:    Haskell2010

I removed the Main module from exposed-modules, because this is not part of the library, but rather of the program.

Building and installing

Running cabal install in this directory will now build this library and install it in your local Haskell package database. You can now use it from other projects. Neat.

Adding an executable

Our project is not only a library, but also an executable program. We can tell this to cabal as well, by adding a new stanza for our program:

executable lahmann-stats
  main-is: Main.hs
  build-depends:
     base >=4.8 && <4.9,
     bytestring >=0.10 && <0.11,
     blaze-html >=0.8 && <0.9,
     cassava >=0.4 && <0.5,
     vector >=0.11 && <0.12,
     containers >=0.5 && <0.6,
     filepath >=1.4 && <1.5,
     statistics >=0.13 && <0.14
  default-language:    Haskell2010

For now I was lazy and simply copied the build-depends from above. It would also have been possible to depend on the lahmann-stats library here, but then Main.hs would have to live in a different directory, for technical reasons (namely that cabal cannot tell GHC not to use the other modules if they are in the same directory).

When we run cabal install now, we find the compiled program in ~/.cabal/bin/.

Creating documentation

Another benefit of having a package is that it is easy to create API documentation for the library. Simply run cabal haddock and it will create documentation that looks much like the documentation that we have seen so far.

We notice that only exported functions appear in the documentation. That is good and makes sense.

But we also notice that there is no text. Unfortunately, documentation does not write itself.

To add documentation to a function or type, we add a comment before it that starts with a | (or, alternatively, a comment after it that starts with a ^):

-- | Reads the player data from a downloaded copy of Lahmann's baseball
-- statistics <http://seanlahman.com/baseball-archive/statistics/>.
--
-- The first parameter indicates the directory name where the files can
-- be found.
readPlayerStats :: FilePath -> IO (M.Map PlayerID Player)

The tool that creates this documentation is called haddock, and you should consult the haddock manual for more information on the markup language you can use here, and on other features of haddock.
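To illustrate both comment forms in one place, here is a small self-contained sketch. The Batting record and the function battingAverage are hypothetical and only serve to show the markup; in a real package they would live in a library module rather than next to main.

```haskell
-- | A batting line for one season. The fields are hypothetical.
data Batting = Batting
  { atBats :: Int  -- ^ Number of at-bats.
  , hits   :: Int  -- ^ Number of hits.
  } deriving (Show, Eq)

-- | Computes the batting average, i.e. 'hits' divided by 'atBats'.
--
-- Returns 'Nothing' if there were no at-bats.
--
-- >>> battingAverage (Batting 4 1)
-- Just 0.25
battingAverage
  :: Batting       -- ^ The season to summarise.
  -> Maybe Double  -- ^ The average, if it is defined.
battingAverage (Batting ab h)
  | ab == 0   = Nothing
  | otherwise = Just (fromIntegral h / fromIntegral ab)

main :: IO ()
main = print (battingAverage (Batting 4 1))
```

The comments starting with | document the declaration that follows them, the ones starting with ^ document the field or argument they come after, single quotes turn an identifier into a link, and lines starting with >>> are rendered as an example.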

Creating a tarball

If you want to distribute your finished package, you can use the command cabal sdist to bundle everything required into a .tar.gz file.

Annoyingly, cabal does not check if you have included all necessary modules in the exposed-modules section, so make sure to unpack the resulting file somewhere else and try to cabal build it!

Further reading

Cabal files can also specify test suites and benchmark suites, and have a few more features that you might want to know about. See the cabal manual and also the wiki page on the topic.