Introducing utilities for Iterator::Simple

June 24, 2012
perl

I recently uploaded my first module to CPAN. Iterator::Simple::Util implements most of the functions from List::Util and List::MoreUtils for the Iterator::Simple framework.

Iterators provide a simple interface for traversing (iterating over) collections. This interface typically consists of two methods: one that tests whether the iterator is exhausted, and another that returns the next element. The main power of iterators is the abstraction they provide over a collection. If you write code that takes an iterator as input, it will work whether the data is an in-memory array, records retrieved lazily from a database, or lines parsed from a file. For an extensive introduction to the subject, check out chapter 4 of Higher-Order Perl by Mark Jason Dominus (available free for download).
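To make the two-method interface concrete, here is a minimal sketch in plain Perl. The helper names (array_iterator, range_iterator, total) and the hash-of-closures layout are my own illustration, not the API of any CPAN module; the point is only that the consumer works the same way whether the values come from an array or are generated on demand.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap an array reference in a two-method iterator.
sub array_iterator {
    my ($items) = @_;
    my $pos = 0;
    return {
        is_exhausted => sub { $pos >= @$items },
        next_value   => sub { $items->[ $pos++ ] },
    };
}

# Hypothetical helper: count from $from up to $to, computing values on demand.
sub range_iterator {
    my ( $from, $to ) = @_;
    my $current = $from;
    return {
        is_exhausted => sub { $current > $to },
        next_value   => sub { $current++ },
    };
}

# The consumer sees only the interface, never the underlying collection.
sub total {
    my ($it) = @_;
    my $sum = 0;
    $sum += $it->{next_value}->() until $it->{is_exhausted}->();
    return $sum;
}

my $from_array = total( array_iterator( [ 10, 20, 30 ] ) );    # 60
my $from_range = total( range_iterator( 1, 4 ) );              # 10
```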

There are two main implementations of iterators on CPAN. Iterator aims to be the definitive implementation; it uses exceptions to signal that the iterator is exhausted, which allows it to work with collections containing undef values. Iterator::Simple is, as the name suggests, a simpler implementation; it signals that an iterator is exhausted by returning undef, which means you cannot use it to iterate over collections that might contain undef as a value. I prefer the Iterator::Simple interface, so I tend to use this module when I know the data I’m working with does not contain undefined values.
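The undef limitation is easy to see with a hand-rolled iterator that follows the same convention. This is a sketch in plain Perl rather than Iterator::Simple itself, and undef_terminated is a hypothetical helper written for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: an iterator that returns undef once it is exhausted.
sub undef_terminated {
    my @items = @_;
    return sub { return @items ? shift @items : undef };
}

my $next = undef_terminated( 'a', undef, 'c' );

my @seen;
while ( defined( my $item = $next->() ) ) {
    push @seen, $item;
}

# @seen is ('a'): the loop stopped at the undef in the data and never reached 'c'.
```

The consumer cannot tell an undef value in the data from the end-of-data signal, which is exactly the trade-off an exception-based design avoids.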

Iterator::Simple implements a number of utility functions for working with iterators: filter, flatten, chain, zip, enumerate, slice, head and skip (corresponding functions for Iterator can be found in the Iterator::Util module). However, neither module provides the wealth of functions you will find for working with lists in the List::Util and List::MoreUtils modules. Enter Iterator::Simple::Util. This module implements all of the familiar list utilities: ireduce, isum, imax, imin, imax_by, imin_by, imaxstr, iminstr, imaxstr_by, iminstr_by, iany, inone, inotall, ifirstval, ilastval, ibefore, iafter, ibefore_incl, iafter_incl, and inatatime.

Examples

Here are some simple examples to get you started. Suppose we are working with the following data. This is a small dataset, but iterators allow us to work with datasets bigger than will fit in memory: if the data were read from a file or database, most iterator functions would load only one record into memory at a time.
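As a rough illustration of the one-record-at-a-time point, the sketch below reads records lazily from a filehandle and tracks the maximum salary without ever building a list. It uses plain Perl only; the comma-separated "region,household,salary" layout, the in-memory filehandle standing in for a large file, and the $next_record closure are all assumptions made for this example.

```perl
use strict;
use warnings;

# Stand-in for a large file: an in-memory filehandle with
# "region,household,salary" lines (an assumed format).
my $text = "1,1,10000\n1,2,10000\n3,6,15000\n4,8,12000\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

# Hypothetical helper: returns the next record as a hash reference,
# or undef at end of input.
my $next_record = sub {
    my $line = <$fh>;
    return undef unless defined $line;
    chomp $line;
    my ( $region, $household, $salary ) = split /,/, $line;
    return { region => $region, household => $household, salary => $salary };
};

# Only the current record and the running maximum are held in memory.
my $max_salary;
while ( my $record = $next_record->() ) {
    $max_salary = $record->{salary}
        if !defined $max_salary || $record->{salary} > $max_salary;
}
close $fh;

# $max_salary is 15000
```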

    my @data = (
        { region => 1, household => 1, salary => 10000 },
        { region => 1, household => 2, salary => 10000 },
        { region => 1, household => 3, salary => 12000 },
        { region => 2, household => 4, salary => 10000 },
        { region => 2, household => 5, salary => 12000 },
        { region => 3, household => 6, salary => 15000 },
        { region => 3, household => 7, salary => 12000 },
        { region => 4, household => 8, salary => 12000 }
    );

We construct an iterator like so:

    use Iterator::Simple qw( iter );

    my $it = iter \@data;

Use imap to extract the salary field; imap returns an iterator that we can pass as an argument to a utility function, in this case imax:

    use Iterator::Simple qw( iter imap );
    use Iterator::Simple::Util qw( imax );

    my $max_salary = imax imap { $_->{salary} } iter \@data;

    # 15000

Sometimes, we want to extract the entire record with the maximum salary. That’s where imax_by comes into play:

    use Iterator::Simple qw( iter );
    use Iterator::Simple::Util qw( imax_by );

    my $top_earner = imax_by { $_->{salary} } iter \@data;
    # $top_earner is { household => 6, region => 3, salary => 15000 }

The igroup function is my attempt to implement a common pattern: processing subgroups of a sorted dataset. For example, suppose you want to extract the record with the maximum salary in each region. Our dataset is already sorted by region, so we just need to tell igroup to group by region, using a block that compares two records ($a and $b) and returns true when they belong to the same group. igroup returns an iterator; each element returned by the iterator is in turn an iterator that will return all records in the matching group.

    use Iterator::Simple qw( iter );
    use Iterator::Simple::Util qw( igroup imax_by );

    my $by_region = igroup { $a->{region} == $b->{region} } iter \@data;

    my @region_max;
    while( my $it = $by_region->next ) {
        push @region_max, imax_by { $_->{salary} } $it;
    }

    # @region_max contains 4 elements (one for each region):
    #  { region => 1, household => 3, salary => 12000 },
    #  { region => 2, household => 5, salary => 12000 },
    #  { region => 3, household => 6, salary => 15000 },
    #  { region => 4, household => 8, salary => 12000 }