Think Stats: Our First Histograms
This is the fourth instalment of our Think Stats study group; we are
working through Allen Downey’s Think
Stats, implementing
everything in Clojure. This week we made a start on chapter 2 of the
book, which introduces us to statistical distributions by way of
histograms. This was our first encounter with the incanter.charts
namespace, which we use to plot histograms of some values from the
National Survey for Family Growth dataset we have worked with in
previous sessions.
You can find previous instalments from the study group on our blog:
If you’d like to follow along, start by cloning our thinkstats repository from Github:
git clone https://github.com/ray1729/thinkstats.git --recursive
Change into the project directory and fire up Gorilla REPL:
cd thinkstats
lein gorilla
Getting Started #
As usual, we start out with a namespace declaration that loads the namespaces we’ll need:
(ns radioactive-darkness
(:require [incanter.core :as i
:refer [$ $map $where $rollup $order $fn $group-by $join]]
[incanter.stats :as s]
[incanter.charts :as c]
[incanter-gorilla.render :refer [chart-view]]
[thinkstats.gorilla]
[thinkstats.incanter :as ie :refer [$! $not-nil]]
[thinkstats.family-growth :as f]))
There are two additions since last time: incanter.charts
mentioned
above, and incanter-gorilla.render
that provides a function to display
Incanter charts in Gorilla REPL.
We start by generating a vector of random integers to play with:
(def xs (repeatedly 100 #(rand-int 5)))
We can generate a histogram from these data:
(def h (c/histogram xs))
This returns a JFreeChart
object that we can display in Gorilla REPL
with chart-view
:
(chart-view h)
If you’re running from a standard Clojure REPL, you should use the
view
function from incanter.core
instead:
(i/view h)
The first thing we notice about this is that the default number of bins
is not optimal for our data; let’s look at the documentation for
histogram
to see how we might change this.
(require '[clojure.repl :refer [doc]])
(doc c/histogram)
We see that the :nbins
option controls the number of bins. We can also
set the title and labels for the axes by specifiyng :title
, :x-label
and :y-label
respectively.
(chart-view (c/histogram xs :nbins 5
:title "Our first histogram"
:x-label "Value"
:y-label "Frequency"))
We can save the histogram as a PNG file:
(i/save (c/histogram xs :nbins 5
:title "Our first histogram"
:x-label "Value"
:y-label "Frequency")
"histogram-1.png")
Birth Weight #
Now that we know how to plot histograms, we can start to visualize values from the NSFG data set. We start by loading the data:
(def ds (f/fem-preg-ds))
Plot the pounds part of birth weight (note the use of $!
to exclude
nil values):
(chart-view (c/histogram ($! :birthwgt-lb ds) :x-label "Birth weight (lb)"))
…and the ounces part of birth weight:
(chart-view (c/histogram ($! :birthwgt-oz ds) :x-label "Birth weight (oz)"))
We can see immediately that these charts are very different, reflecting the different “shapes” of the data. What we see fits well with our intuition: we expect the ounces component of the weight to be distributed fairly evenly, while most newborns are around 7lb or 8lb and babies bigger than 10lb at birth are rarely seen.
Recall that we also computed the total weight in pounds and added
:totalwgt-lb
to the dataset:
(chart-view (c/histogram ($! :totalwgt-lb ds) :x-label "Total weight (lb)"))
This does not look much different from the :birthwgt-lb
histogram, as
this value dominates ounces in the computaiton
A Few More Histograms #
The shape of a histogram tells us how the data are distributed: it may
be approximately flat like the :birthwgt-oz
histogram, or bell-shaped
like :birthwgt-lb
, or an asymetrical bell (with longer tail to the
left or to the right) like the following two.
(chart-view (c/histogram ($! :ageatend ds)
:x-label "Age"
:title "Mother's age at end of pregnancy"))
Let’s try that again, excluding the outliers with an age over 60:
(chart-view (c/histogram (filter #(< % 60) ($! :ageatend ds))
:x-label "Age"
:title "Mother's age at end of pregnancy"))
Finally, let’s look at pregnancy length for live births:
(chart-view (c/histogram ($! :prglngth ($where {:outcome 1} ds))
:x-label "Weeks"
:title "Pregnancy length (live births)"))
We have now reached the end of section 2.4 of the book, and will pick up next time with section 2.5.
This article originally appeared in the Metail Tech Blog.