# Think Stats: Exploring Data

This is the third instalment of our *Think Stats* study group; we are
working through Allen Downey's *Think Stats*, implementing everything
in Clojure. In the previous part we showed how to use functions from
the Incanter library to explore and transform a dataset. Now we build
on that knowledge to explore the National Survey of Family Growth
(NSFG) data and answer the question *do first babies arrive late?* This
takes us to the end of chapter 1 of the book.

If you'd like to follow along, start by cloning our thinkstats repository from GitHub:

```
git clone https://github.com/ray1729/thinkstats.git --recursive
```

Change into the project directory and fire up Gorilla REPL:

```
cd thinkstats
lein gorilla
```

## Getting Started

Our project includes the namespace `thinkstats.incanter` that brings
together our general Incanter utility functions, and
`thinkstats.family-growth` for the functions we developed last time
for cleaning and augmenting the female pregnancy data.

Let's start by importing these and the Incanter namespaces we're going to need this time:

```
(ns mysterious-aurora
  (:require [incanter.core :as i
             :refer [$ $map $where $rollup $order $fn $group-by $join]]
            [incanter.stats :as s]
            [thinkstats.gorilla]
            [thinkstats.incanter :as ie :refer [$! $not-nil]]
            [thinkstats.family-growth :as f]))
```

(We've also included `thinkstats.gorilla`, which just includes some
functionality to render Incanter datasets more nicely in Gorilla REPL.)

The function `thinkstats.family-growth/fem-preg-ds` combines reading
the data set with `clean-and-augment-fem-preg`:

```
(def ds (f/fem-preg-ds))
```

## Validating Data

There are a couple of things covered in chapter 1 of the book that we
haven't done yet: looking at frequencies of values in particular
columns of the NSFG data and validating against the code book, and
building a function to index rows by `:caseid`.

We can use the core Clojure `frequencies` function in conjunction with
Incanter's `$` to select values of a column and return a map of value
to frequency:

```
(frequencies ($ :outcome ds))
;=> {1 9148, 2 1862, 4 1921, 5 190, 3 120, 6 352}
```

Incanter's `$rollup` function can be used to compute a summary function
over a column or set of columns, and has built-in support for `:min`,
`:max`, `:mean`, `:sum`, and `:count`. Rolling up `:outcome` by `:count`
will compute the frequency for each outcome and return a new dataset:

```
(i/$rollup :count :total :outcome ds)
```

:outcome | :total |
---|---|
1 | 9148 |
2 | 1862 |
4 | 1921 |
5 | 190 |
3 | 120 |
6 | 352 |

Compare this with the table in the code book.
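If you'd rather not compare by eye, the check is easy to automate. Here is a minimal sketch in plain Clojure, assuming you have transcribed the code book's counts into a map by hand. The helper `frequency-mismatches` and the toy data are hypothetical, not part of our repository:

```
(defn frequency-mismatches
  "Returns a map of value -> {:expected n, :observed m} for every value
  whose observed count differs from the expected count."
  [expected values]
  (let [observed (frequencies values)]
    (into {}
          (for [k (set (concat (keys expected) (keys observed)))
                :let [e (get expected k 0)
                      o (get observed k 0)]
                :when (not= e o)]
            [k {:expected e :observed o}]))))

;; Toy data for illustration only; these are not the NSFG counts.
(def sample-outcomes [1 1 1 2 4 4 6])

;; Everything agrees, so there is nothing to report.
(frequency-mismatches {1 3, 2 1, 4 2, 6 1} sample-outcomes)
;=> {}

;; Value 2 has the wrong count and value 6 is missing from the expected map.
(sort (keys (frequency-mismatches {1 3, 2 2, 4 2} sample-outcomes)))
;=> (2 6)
```

Against the real data, the second argument would be `($ :outcome ds)`.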

## Exploring and Interpreting Data

We saw previously that we can use `$where` to select rows matching a
predicate. For example, to select rows for a given `:caseid`:

```
($where {:caseid "10229"} ds)
```

This could be quite slow for a large dataset as it has to examine every row. An alternative strategy is to build an index in advance then use that to select the desired rows. Here's how we might do this:

```
(defn build-column-ix
  [col-name ds]
  (reduce (fn [accum [row-ix v]]
            (update accum v (fnil conj []) row-ix))
          {}
          (map-indexed vector ($ col-name ds))))

(def caseid-ix (build-column-ix :caseid ds))
```
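To see the shape of the index this produces, here is the same reduction applied to a plain vector instead of an Incanter column. This is only a sketch: `build-ix` and the made-up identifiers are for illustration.

```
(defn build-ix
  "Maps each distinct value in coll to a vector of the positions at which
  it occurs."
  [coll]
  (reduce (fn [accum [row-ix v]]
            ;; (fnil conj []) starts a fresh vector the first time we see v
            (update accum v (fnil conj []) row-ix))
          {}
          (map-indexed vector coll)))

;; Made-up identifiers standing in for :caseid values.
(def toy-ix (build-ix ["a" "b" "a" "c" "b" "a"]))

(toy-ix "a") ;=> [0 2 5]
(toy-ix "b") ;=> [1 4]
(toy-ix "z") ;=> nil
```

Each lookup is a single map access returning row numbers, which is what lets us avoid scanning every row.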

Now we can quickly select rows for a given `:caseid` using this index:

```
(i/sel ds :rows (caseid-ix "10229"))
```

Recall that we can also select a subset of columns at the same time:

```
(i/sel ds :rows (caseid-ix "10229") :cols [:pregordr :agepreg :outcome])
```

pregordr | agepreg | outcome |
---|---|---|
1 | 19.58 | 4 |
2 | 21.75 | 4 |
3 | 23.83 | 4 |
4 | 25.5 | 4 |
5 | 29.08 | 4 |
6 | 32.16 | 4 |
7 | 33.16 | 1 |

Recall also the meaning of `:outcome`; a value of `4` indicates a
miscarriage and `1` a live birth. So this respondent suffered 6
miscarriages between the ages of 19 and 32, finally seeing a live
birth at age 33.

We can use functions from the `incanter.stats` namespace to compute
basic statistics on our data:

```
(s/mean ($! :totalwgt-lb ds))
(s/median ($! :totalwgt-lb ds))
```

(Note the use of `$!` to exclude nil values, which would otherwise
trigger a null pointer exception.)
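The underlying problem is not specific to Incanter: Clojure's arithmetic functions throw when handed `nil`. A small sketch in plain Clojure, where `plain-mean` and the toy weights are hypothetical stand-ins for `s/mean` and the `:totalwgt-lb` column:

```
(defn plain-mean
  "Arithmetic mean of a non-empty sequence of numbers."
  [xs]
  (/ (reduce + xs) (double (count xs))))

;; Toy data with missing values, for illustration only.
(def weights [7.5 nil 6.25 8.0 nil])

;; Dropping the nils first gives a sensible answer...
(plain-mean (remove nil? weights))
;=> 7.25

;; ...whereas (plain-mean weights) throws a NullPointerException,
;; because + cannot add nil to a number.
```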

To compute several statistics at once:

```
(s/summary ($! [:totalwgt-lb] ds))
;=> ({:col :totalwgt-lb, :min 0.0, :max 15.4375, :mean 7.2623018494055485, :median 7.375, :is-numeric true})
```

Note that, while `mean` and `median` take a sequence of values (so the
argument to `$!` is just a keyword), the `summary` function expects a
dataset (so the argument to `$!` is a vector).

## Do First Babies Arrive Late?

We now know enough to have a first attempt at answering this question. The columns we'll use are:

Column | Description |
---|---|
`:outcome` | Pregnancy outcome (1 == live birth) |
`:birthord` | Birth order |
`:prglngth` | Duration of completed pregnancy in weeks |

Compute the mean pregnancy length for the first birth:

```
(s/mean ($! :prglngth ($where {:outcome 1 :birthord 1} ds)))
;=> 38.60095173351461
```

...and for subsequent births:

```
(s/mean ($! :prglngth ($where {:outcome 1 :birthord {:$ne 1}} ds)))
;=> 38.52291446673706
```

The difference between these two values is just 0.08 weeks, so I'd say that these data do not indicate that first babies arrive late.
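To put that gap into more familiar units, here is the arithmetic as a quick sketch. The two means are copied from the results above; the var names are just for illustration:

```
(def first-mean 38.60095173351461)  ; mean length, first births (weeks)
(def other-mean 38.52291446673706)  ; mean length, subsequent births (weeks)

(def diff-weeks (- first-mean other-mean))

;; Convert weeks to hours: 7 days of 24 hours each.
(def diff-hours (* diff-weeks 7 24))

(Math/round (double diff-hours))
;=> 13
```

So on average first babies in this sample arrive roughly 13 hours later, which is tiny next to a pregnancy of around 38 weeks.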

Here we've computed mean pregnancy length for first baby and others; if
we want a table of mean pregnancy length by birth order, we can use
`$rollup` again:

```
($rollup :mean :prglngth :birthord (i/$where {:outcome 1 :prglngth $not-nil} ds))
```

:birthord | :prglngth |
---|---|
3 | 47501/1234 |
4 | 16187/421 |
5 | 2419/63 |
10 | 36 |
9 | 75/2 |
7 | 763/20 |
1 | 56782/1471 |
8 | 263/7 |
6 | 1903/50 |
2 | 55420/1437 |

The mean has been returned as a rational, but we can use `transform-col`
to convert it to a floating-point number:

```
(as-> ds x
  ($where {:outcome 1 :prglngth $not-nil} x)
  ($rollup :mean :prglngth :birthord x)
  (i/transform-col x :prglngth float))
```

:birthord | :prglngth |
---|---|
3 | 38.49352 |
4 | 38.448933 |
5 | 38.396824 |
10 | 36.0 |
9 | 37.5 |
7 | 38.15 |
1 | 38.600952 |
8 | 37.57143 |
6 | 38.06 |
2 | 38.56646 |
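The rationals in the earlier table are ordinary Clojure behaviour rather than anything specific to Incanter: dividing one integer by another with `/` gives an exact ratio unless the result is whole, which is presumably why birth order 10 showed a plain 36. A quick REPL sketch, using a couple of the fractions from that table:

```
;; Dividing integers yields an exact Ratio when the result isn't whole...
(/ 75 2)           ;=> 75/2
(ratio? (/ 75 2))  ;=> true

;; ...and a plain integer when it is.
(/ 72 2)           ;=> 36

;; `float` coerces to single precision, as in the table above;
;; `double` would keep more digits.
(float (/ 75 2))   ;=> 37.5
(float (/ 263 7))  ;=> 37.57143
```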

Finally, we can use `$order` to sort this dataset on birth order:

```
(as-> ds x
  ($where {:outcome 1 :prglngth $not-nil} x)
  ($rollup :mean :prglngth :birthord x)
  (i/transform-col x :prglngth float)
  ($order :birthord :asc x))
```

:birthord | :prglngth |
---|---|
1 | 38.600952 |
2 | 38.56646 |
3 | 38.49352 |
4 | 38.448933 |
5 | 38.396824 |
6 | 38.06 |
7 | 38.15 |
8 | 37.57143 |
9 | 37.5 |
10 | 36.0 |
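For comparison, here is roughly the same filter, group, average, and sort pipeline in plain Clojure over a sequence of maps. It is a sketch to show what the rollup is doing conceptually, not how Incanter implements it; `mean-by` and the toy rows are hypothetical:

```
(defn mean-by
  "Groups rows by group-key and returns a sorted map from each group to
  the mean of its non-nil value-key values, as a double."
  [group-key value-key rows]
  (into (sorted-map)
        (for [[g rs] (group-by group-key rows)
              :let [vs (remove nil? (map value-key rs))]
              :when (seq vs)]
          [g (double (/ (reduce + vs) (count vs)))])))

;; Toy rows for illustration only.
(def toy-rows
  [{:birthord 1 :prglngth 39}
   {:birthord 1 :prglngth 41}
   {:birthord 2 :prglngth 38}
   {:birthord 2 :prglngth nil}])

(mean-by :birthord :prglngth toy-rows)
;=> {1 40.0, 2 38.0}
```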

The Incanter functions `$where`, `$rollup`, `$order`, etc. all take a
dataset to act on as their last argument. If this argument is omitted,
they use the dynamic `$data` variable that is usually bound using
`with-data`. So the following two expressions are equivalent:

```
($where {:outcome 1 :prglngth $not-nil} ds)

;; with-data lives in incanter.core, which we aliased as i above
(i/with-data ds
  ($where {:outcome 1 :prglngth $not-nil}))
```
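The mechanism behind this is an ordinary Clojure dynamic var. The following is a minimal sketch of the pattern in plain Clojure, not Incanter's actual implementation; `*data*` and `where` are hypothetical names:

```
;; A dynamic var that callers can rebind for the extent of a `binding` form.
(def ^:dynamic *data* nil)

(defn where
  "Keeps the rows of ds matching pred; falls back to *data* when ds is
  omitted."
  ([pred] (where pred *data*))
  ([pred ds] (filterv pred ds)))

;; Toy rows for illustration only.
(def rows [{:outcome 1 :prglngth 39}
           {:outcome 4 :prglngth 8}
           {:outcome 1 :prglngth 40}])

(defn live-birth? [row] (= 1 (:outcome row)))

;; Passing the data explicitly and binding it dynamically give the same result.
(= (where live-birth? rows)
   (binding [*data* rows]
     (where live-birth?)))
;=> true
```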

It's a bit annoying that we have to use `as->` when we add
`transform-col` to the mix, as this function takes the dataset as its
first argument. Let's add the following to our `thinkstats.incanter`
namespace:

```
(defn $transform
  "Like Incanter's `transform-col`, but takes the dataset as an optional
  last argument and, when not specified, uses the dynamically-bound
  `$data`."
  [col f & args]
  (let [[ds args] (if (or (i/matrix? (last args)) (i/dataset? (last args)))
                    [(last args) (butlast args)]
                    [i/$data args])]
    (apply i/transform-col ds col f args)))
```

Now we can use the `->>` threading macro:

```
(->> ($where {:outcome 1 :prglngth $not-nil} ds)
     ($rollup :mean :prglngth :birthord)
     ($transform :prglngth float)
     ($order :birthord :asc))
```

We have now met most of the core Incanter functions for manipulating datasets, and a few of the statistics functions. I hope that, as we get further into the book, we'll learn how to calculate error bounds for computed values, and how to decide when we have a statistically significant result.

*This article originally appeared in the Metail Tech Blog.*