*nix tips: LibZip 0.1: read and write zip archives from Haskell

I am happy to present a major release of my Haskell bindings to the libzip library for manipulating zip archives. It took longer than I initially expected, but I finally like the result. Essential links first:

Hackage page
Source repository
Documentation
Test coverage report
An example

What's new

LibZip 0.1 is a complete rewrite. Under the hood, the bindings are now built with bindings-DSL instead of C2Hs. The new LibZip offers a lot:

Support for almost all features of libzip: creating, reading, updating, renaming, and deleting files in zip archives, and reading and writing file and archive comments. (LibZip 0.0 was read-only.)
Support of various data sources: supply contents of a file from a list, from a file on disk, from a file in another archive, or even from a Haskell function.
A new monadic interface. It takes care of managing handles and pointers behind the scenes. Less to type, less room for user error.
Unit tests, better documentation and examples.
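
As a rough illustration of the write support and the monadic interface, here is a minimal sketch that creates an archive from an in-memory String and then lists it. It assumes that addFile and the CreateFlag open flag are exported by Codec.Archive.LibZip as described in the Hackage documentation; createAndList is a hypothetical helper name, not part of the library.

```haskell
import Codec.Archive.LibZip

-- | Create a new archive at the given path, add one entry built from an
-- in-memory String, then reopen the archive and return its entry names.
-- Assumes the archive does not exist yet (or has no "hello.txt" entry).
createAndList :: FilePath -> IO [FilePath]
createAndList zipPath = do
  -- CreateFlag (assumed name) asks libzip to create the archive if missing.
  withArchive [CreateFlag] zipPath $ do
    src <- sourceBuffer "Hello, zip!\n"  -- data source backed by a list
    _ <- addFile "hello.txt" src         -- assumed to return the entry index
    return ()
  -- The archive is written when the first withArchive block finishes.
  withArchive [] zipPath (fileNames [])
```

No handles or pointers appear in user code; both blocks run inside the Archive monad and withArchive opens and closes the archive around them.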

LibZip is the only non-GPL Haskell library for working with zip archives. It is also fast when dealing with large on-disk archives (more about this below).

Users of LibZip 0.0 (all 1.5 of them) may still use the old API by importing Codec.Archive.LibZip.LegacyZeroZero instead of Codec.Archive.LibZip. However, the old API is deprecated and will not be supported in the future.

LibZip vs zip-archive

There is another Haskell library to deal with zip archives, namely zip-archive. And here are the differences:

	LibZip 0.1/libzip	Zip-Archive 0.1
License	BSD	GPL v.2
Pure?	No	Yes
Large on-disk archives	Fast	Slow

A few notes about the last row. It is the actual reason why LibZip exists. Zip-archive was unacceptably memory-hungry and slow when dealing with large archives, so I started working on LibZip. I suppose the problem with zip-archive is that it works with lazy bytestrings, not files. Bytestrings are sequential and don't have fseek, so there is no reliable way to implement random access.
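
The random-access point can be illustrated without either library. A zip file keeps its directory at the end, so a reader that can seek only needs to touch the tail of the file. The sketch below uses only base; readTail, entryCount and eocdEntryCount are hypothetical helper names. It assumes the simplest case: an archive without a trailing comment, where the end-of-central-directory record is exactly the last 22 bytes. Real implementations such as libzip search backwards for the record instead.

```haskell
import Control.Monad (replicateM)
import Data.Char (ord)
import System.IO

-- | Read at most the last @n@ bytes of a file by seeking to them directly;
-- the rest of the file is never read.
readTail :: FilePath -> Int -> IO [Int]
readTail path n =
  withBinaryFile path ReadMode $ \h -> do
    size <- hFileSize h
    hSeek h AbsoluteSeek (max 0 (size - fromIntegral n))
    let count = fromIntegral (min (fromIntegral n) size)
    bytes <- replicateM count (hGetChar h)
    return (map ord bytes)

-- | Total number of entries stored in an end-of-central-directory record,
-- or Nothing if the bytes do not look like one.
entryCount :: [Int] -> Maybe Int
entryCount bs
  | length bs == 22 && take 4 bs == [0x50, 0x4B, 0x05, 0x06] =
      Just (bs !! 10 + 256 * bs !! 11)  -- 16-bit little-endian field
  | otherwise = Nothing

-- | Entry count of an archive, assuming it has no trailing comment.
eocdEntryCount :: FilePath -> IO (Maybe Int)
eocdEntryCount path = fmap entryCount (readTail path 22)
```

The cost of this lookup does not depend on the size of the archive, which is the property a purely sequential reader cannot offer.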

To get an idea of the problem, download a moderately sized zip archive from the web (for example, this one, 22 MiB) and print the list of files using both libraries.

With zip-archive, I used this code:

import Codec.Archive.Zip
import Control.Monad (liftM)
import System.Environment (getArgs)
import qualified Data.ByteString.Lazy as BS

main = mapM_ list =<< getArgs

list file = do
  a <- toArchive `liftM` BS.readFile file
  mapM_ print $ filesInArchive a

On my laptop, it takes 3.5 seconds to run against the downloaded archive:

$ time ./zip-archive-ls pak128-1.4.6--102.2.zip > /dev/null

real 0m3.499s
user 0m3.430s
sys 0m0.050s

And this is the code using LibZip:

import Codec.Archive.LibZip
import System.Environment (getArgs)

main = mapM_ list =<< getArgs

list file =
    withArchive [] file $ do
      names <- fileNames []
      lift $ mapM_ print names

It takes 0.05 seconds on the same file:

$ time ./libzip-ls pak128-1.4.6--102.2.zip > /dev/null

real 0m0.051s
user 0m0.040s
sys 0m0.010s

The difference gets more dramatic as the size of the archive increases. So, in my opinion, license issues aside, LibZip is the better choice when dealing with large archives on disk. Zip-archive may be a more suitable choice for generating small archives in memory, without even hitting the disk.

Some implementation notes

I switched from C2Hs to bindings-DSL to implement the FFI bindings, and I actually liked bindings-DSL more. It is simple and makes fewer assumptions about the semantics of the C code. As a result, I had working low-level bindings very early. The rest was just wrapping them in a higher-level API to my liking. The C2Hs experience was less smooth: when a C function is not designed the way C2Hs expects (for example, a function that both returns a value and writes something to memory), I had to write wrappers manually anyway. Bindings-DSL also seems to be better supported right now.

I changed the order of file names and file access flags in all API functions. Putting the file name last is more useful for partial function application. An example of this order of arguments is:

fileSize :: [FileFlag] -> FilePath -> Archive Int
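
A minimal sketch of what this order buys: the flags are fixed once, and the partially applied function is mapped over the file names. It uses only withArchive, fileNames and fileSize as they appear in this post; entrySizes is a hypothetical helper name.

```haskell
import Codec.Archive.LibZip

-- | Names and sizes of all entries in an archive.
entrySizes :: FilePath -> IO [(FilePath, Integer)]
entrySizes zipPath =
  withArchive [] zipPath $ do
    names <- fileNames []
    -- fileSize [] is a partial application: flags are supplied once here,
    -- and mapM feeds in each file name as the last argument.
    sizes <- mapM (fileSize []) names
    -- fromIntegral keeps the sketch independent of the exact integral
    -- type that fileSize returns.
    return (zip names (map fromIntegral sizes))
```

With the file name first, the same line would need a lambda or flip.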

I ditched ByteString support from the new API. ByteStrings are my Haskell nightmare: there are too many flavours of them to support, and they are not interchangeable. With LibZip 0.0 I had to support two versions of otherwise identical code, only to discover some time later that I needed to pack . unpack bytestrings in the application code anyway (an impedance mismatch with another library, which chose to use a different flavour of bytestrings).

In this version I chose to use lists as input and output buffers for some functions (sourceBuffer, sourcePure, readBytes, readContents). I suspect this may have a negative performance impact, but that needs to be measured. Marshalling of the byte buffers is another question. I suppose that sourceFile and sourcePure may help to work around this problem if it actually arises. User feedback is needed.

Some of the library functions (most notably sourceBuffer) accept Strings as data buffers. This is convenient for testing, but those Strings should not contain code points above 255. The library doesn't handle text encodings; the user is responsible for providing a correctly encoded byte stream to the library.
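
One way to meet this requirement is to encode the text yourself before handing it over. The sketch below is a small UTF-8 encoder using only base; toUtf8String and fitsInBytes are hypothetical helper names, not part of LibZip, and UTF-8 is just one possible choice of encoding.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- | UTF-8 encoding of a single character, as a list of byte values.
utf8Bytes :: Char -> [Int]
utf8Bytes c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [ 0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12)
                  , cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    -- Continuation byte: 10xxxxxx with the low six bits of the argument.
    cont x = 0x80 .|. (x .&. 0x3F)

-- | Encode text as UTF-8, one Char per byte, so every Char is below 256.
toUtf8String :: String -> String
toUtf8String = map chr . concatMap utf8Bytes

-- | True if every character fits in a single byte.
fitsInBytes :: String -> Bool
fitsInBytes = all ((<= 255) . ord)
```

The encoded String can then be used wherever a byte buffer is expected, for example as the argument of sourceBuffer; whoever reads the entry back has to decode it as UTF-8.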

Libzip can use C callbacks as a data source. The LibZip bindings can wrap a pure Haskell function and make the C library call it when necessary (see sourcePure). It is not as convenient as the usual lazy evaluation in Haskell, but, hopefully, it may somewhat compensate for the impurity of the library. I am also considering adding sourceIO.

Thanks?

LibZip is under the BSD3 license, so it is Free. If you want to say “thanks”, consider using this Flattr button:

I think Flattr is a great idea and I'll be glad if more people start using it.

If you use the library, please let me know. It will make me happier, and will motivate me more to support and improve the library.


2010-09-04
