Blog Post 0 - Data Visualization of the Palmer Penguins Data Set

In this post, I’ll construct some interesting data visualizations of the Palmer Penguins data set.


The Palmer Penguins data set was collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. Download the CSV data. It contains measurements on three penguin species: Chinstrap, Gentoo, and Adelie.


Illustrations of the penguin species in the Palmer Penguins data set, by Allison Horst.

Exploring and Understanding Data

I will import the required Python modules at the beginning for convenience.

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

It’s essential to understand the data we have. In this section, I will explore the penguin dataset to decide which information is helpful and which data visualizations to construct for the Palmer Penguins data set.

The next cell imports the penguin dataset as a pandas DataFrame called penguins.

Once the dataset has been read into a pandas DataFrame, we can take a look at its first five rows using penguins.head().

url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)
penguins.head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN
2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN
3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN Adult not sampled.
4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426 NaN

The data set contains 17 columns, and some important column heading variables have the following meanings:

  • Species: Three penguin species (Adelie, Chinstrap, Gentoo).
  • Island: Three islands (Torgersen, Biscoe, Dream).
  • Date Egg: The date the penguin’s study nest was observed with one egg.
  • Culmen Length (mm): penguin’s culmen length (mm).
  • Culmen Depth (mm): penguin’s culmen depth (mm).
  • Flipper Length (mm): penguin’s flipper length (mm).
  • Body Mass (g): penguin’s body mass (g).
  • Sex: penguin’s sex.
  • Delta 15 N (o/oo): a measure of nitrogen isotopes in the penguin’s blood.
  • Delta 13 C (o/oo): a measure of carbon isotopes in the penguin’s blood.

abird.jpeg

A bird (probably not a penguin) getting its culmen length measured

Next, I will use the function df.nunique(), which counts the number of distinct elements in each column without counting NaN values. Knowing how many unique values each column has can help me decide which columns to drop later.

penguins.nunique()
    studyName                3
    Sample Number          152
    Species                  3
    Region                   1
    Island                   3
    Stage                    1
    Individual ID          190
    Clutch Completion        2
    Date Egg                50
    Culmen Length (mm)     164
    Culmen Depth (mm)       80
    Flipper Length (mm)     55
    Body Mass (g)           94
    Sex                      3
    Delta 15 N (o/oo)      330
    Delta 13 C (o/oo)      331
    Comments                 7
    dtype: int64
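The output above can also be filtered programmatically. Below is a minimal sketch, assuming a small made-up data frame (toy is a hypothetical stand-in, not the real penguins data), that picks out the columns holding only one distinct value.

```python
import pandas as pd

# Hypothetical miniature stand-in for the penguins data frame
toy = pd.DataFrame({
    "Species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "Region": ["Anvers", "Anvers", "Anvers", "Anvers"],
    "Stage": ["Adult, 1 Egg Stage"] * 4,
    "Body Mass (g)": [3750.0, 5000.0, None, 3800.0],
})

# nunique() ignores NaN by default, so a column with a single
# distinct non-missing value is reported as 1
unique_counts = toy.nunique()

# keep only the names of the columns with at most one distinct value
constant_columns = unique_counts[unique_counts <= 1].index.tolist()
print(constant_columns)
```

Columns found this way carry no information that distinguishes one row from another, which is why they are natural candidates for dropping.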

From above, I notice that all penguins are from the same Region and have the same Stage status. I will drop these two columns as they are not valuable for my visualization. I also notice that there are three different values for Sex, which is a little bit strange.

In the next cell, I will check what the third value of Sex is, besides MALE and FEMALE.

penguins["Sex"].unique()
array(['MALE', 'FEMALE', nan, '.'], dtype=object)

In the next cell, we can find which rows have "." as their Sex value.

penguins.index[penguins["Sex"] == "."].tolist()
[336]
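A related check is to tally every Sex value, missing ones included. The following is only a sketch on a made-up Series (the values are hypothetical, chosen to mimic the categories seen above); value_counts(dropna=False) and isin() are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that mimics the categories found in the Sex column
sex = pd.Series(["MALE", "FEMALE", np.nan, ".", "MALE"], name="Sex")

# dropna=False keeps NaN as its own category in the tally
sex_counts = sex.value_counts(dropna=False)
print(sex_counts)

# entries that are neither MALE nor FEMALE (missing values included)
unexpected = sex[~sex.isin(["MALE", "FEMALE"])]
print(unexpected.index.tolist())
```

Compared with unique(), this shows how often each odd value occurs, which helps to judge whether dropping those rows loses much data.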

Next, I will use df.isna() to flag all NaN values in the data, then use df.sum() to count how many NaN values are in each column, and save the result in nan_values_summary.

nan_values_summary = penguins.isna().sum()
nan_values_summary

    studyName                0
    Sample Number            0
    Species                  0
    Region                   0
    Island                   0
    Stage                    0
    Individual ID            0
    Clutch Completion        0
    Date Egg                 0
    Culmen Length (mm)       2
    Culmen Depth (mm)        2
    Flipper Length (mm)      2
    Body Mass (g)            2
    Sex                     10
    Delta 15 N (o/oo)       14
    Delta 13 C (o/oo)       13
    Comments               318
    dtype: int64

From the above summary, we notice that most of the Comments values are missing, so I should consider dropping this column. There are also more than ten missing values in each of Delta 15 N (o/oo) (14) and Delta 13 C (o/oo) (13). I might drop these two columns if I decide not to use them in my visualization.
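Raw counts are easier to judge next to the column length. As a rough sketch, assuming a tiny made-up data frame (toy is hypothetical), the fraction of missing entries per column can be computed by taking the mean of isna().

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data frame with some missing entries
toy = pd.DataFrame({
    "Body Mass (g)": [3750.0, np.nan, 3250.0, 3450.0],
    "Comments": [np.nan, np.nan, np.nan, "Adult not sampled."],
    "Island": ["Torgersen"] * 4,
})

# isna() gives booleans; their mean is the fraction of missing entries
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# columns where more than half of the entries are missing
mostly_missing = missing_fraction[missing_fraction > 0.5].index.tolist()
print(mostly_missing)
```

The 0.5 threshold is an arbitrary choice for illustration; the point is only that a mostly empty column, like Comments here, stands out immediately.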

We can also create a bar plot for nan values by using matplotlib to understand the number of missing data.

# create the figure and axis at the desired size
fig, ax = plt.subplots(1, figsize=(16, 5))
# create the bar plot on that axis
nan_values_summary.plot.bar(ax=ax)
ax.set(xlabel = "Column Name",
       ylabel = "Nan Counts",
       title = "Bar Chart of the Missing Data")

ax.legend(labels = ['Nan Values'],
          bbox_to_anchor = (0.12, 1))

# remove border
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

# I found how to annotate each bar online at
# https://www.geeksforgeeks.org/how-to-annotate-bars-in-barplot-with-matplotlib-in-python/
# Iterating over the bars one by one
for bar in ax.patches:
    # Using Matplotlib's annotate function and
    # passing the coordinates where the annotation shall be placed
    # x-coordinate: bar.get_x() + bar.get_width() / 2
    # y-coordinate: bar.get_height()
    # offset that leaves space above the bar: xytext=(0, 6) points
    # ha and va stand for the horizontal and vertical alignment
    ax.annotate(bar.get_height(),
               (bar.get_x() + bar.get_width() / 2 ,
                bar.get_height()), ha='center', va='center',
                size=10,
                xytext=(0, 6),
                textcoords='offset points',
                color="firebrick") 

b0-bar-charts-nan.png

Making the Plot

In this section, I will create some fun and interesting data visualizations of the Palmer Penguins data set.

For fun, I will pick Date Egg and check how Date Egg associates with Body Mass (g) or Culmen Length (mm).

Cleaning Data

Before making plots, I should clean my data set.

First, I will drop the columns Region and Stage, as well as the columns that are not helpful in my visualization, such as studyName, Sample Number, Individual ID, and Comments.

I don’t think I will use Delta 15 N (o/oo), Delta 13 C (o/oo), and Clutch Completion so I will remove these three columns too.

Then I will save all these column names in the list drop_list.

# Drop columns 
drop_list = ["Region", "Stage", "studyName", "Sample Number", "Individual ID", "Comments","Delta 15 N (o/oo)", "Delta 13 C (o/oo)", "Clutch Completion"]

Next, I will write the function clean_penguins_data() to shorten the species name, drop the rows whose Sex is ".", and drop the columns in drop_list. The function clean_penguins_data() accepts two arguments: data_df (the data frame to be cleaned) and drop_list (a list of column names to drop from the data frame).

def clean_penguins_data(data_df, drop_list):
    """
    This function will shorten the name of the Penguin Species, 
    drop unneeded columns, and remove the rows with missing (nan) values.
    
    Parameters:
    data_df: data frame to be cleaned
    drop_list: a list of column names I will drop from the input data frame
    
    Return:
    A cleaned dataframe df.
    """
    
    df = data_df.copy() # avoid polluting original data set
    
    # Shorten the species name
    df["Species"] = df["Species"].str.split().str.get(0)
    
    # Remove the entries where sex was recorded as "."
    df = df[df["Sex"] != "."]

    #Drop columns 
    df = df.drop(drop_list, axis = 1)
    
    #Find and drop the rows with nan values
    nan_df = df.isna()
    nan_columns = nan_df.any()
    columns_with_nan = df.columns[nan_columns].tolist() 
    df = df.dropna(subset = columns_with_nan)

    return df
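The NaN-dropping step at the end of clean_penguins_data() can be illustrated on a small example. This is only a sketch with a made-up data frame (toy is hypothetical); it suggests that listing the columns that contain NaN and calling dropna(subset=...) gives the same rows as a plain dropna().

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data frame with missing values in two columns
toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo"],
    "Body Mass (g)": [3750.0, np.nan, 5000.0],
    "Sex": ["MALE", "FEMALE", np.nan],
})

# the approach used in clean_penguins_data(): name the columns holding NaN
columns_with_nan = toy.columns[toy.isna().any()].tolist()
by_subset = toy.dropna(subset=columns_with_nan)

# plain dropna() drops a row if any of its columns is missing
by_default = toy.dropna()

print(columns_with_nan)
print(by_subset.equals(by_default))
```

The explicit subset version has the advantage of making it visible which columns caused rows to be removed.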

Let’s check the cleaned data set.

df = clean_penguins_data(data_df = penguins, drop_list = drop_list)
df
Species Island Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex
0 Adelie Torgersen 11/11/07 39.1 18.7 181.0 3750.0 MALE
1 Adelie Torgersen 11/11/07 39.5 17.4 186.0 3800.0 FEMALE
2 Adelie Torgersen 11/16/07 40.3 18.0 195.0 3250.0 FEMALE
4 Adelie Torgersen 11/16/07 36.7 19.3 193.0 3450.0 FEMALE
5 Adelie Torgersen 11/16/07 39.3 20.6 190.0 3650.0 MALE
... ... ... ... ... ... ... ... ...
338 Gentoo Biscoe 12/1/09 47.2 13.7 214.0 4925.0 FEMALE
340 Gentoo Biscoe 11/22/09 46.8 14.3 215.0 4850.0 FEMALE
341 Gentoo Biscoe 11/22/09 50.4 15.7 222.0 5750.0 MALE
342 Gentoo Biscoe 11/22/09 45.2 14.8 212.0 5200.0 FEMALE
343 Gentoo Biscoe 11/22/09 49.9 16.1 213.0 5400.0 MALE

333 rows × 8 columns

Preparing Data

In the plots below, I will mainly use Species, Island, Date Egg, and Body Mass (g); the remaining columns stay in df but are not needed.

Next, I will convert the values in Date Egg to datetime values that reflect year, month, and day (YYYY-MM-DD). We can do the conversion using the built-in pandas function pd.to_datetime(). The nice thing about this function is that it can automatically detect several common date-time string formats. Then we can use Series.dt.year to get the year of each datetime and save it in the new column year.

We can also sort the value based on the time of the Date Egg by using df.sort_values().

df["Date Egg"] = pd.to_datetime(df["Date Egg"])
df['year'] = df['Date Egg'].dt.year
df = df.sort_values(by=["Date Egg"])
df
Species Island Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex year
33 Adelie Dream 2007-11-09 40.9 18.9 184.0 3900.0 MALE 2007
32 Adelie Dream 2007-11-09 39.5 17.8 188.0 3300.0 FEMALE 2007
31 Adelie Dream 2007-11-09 37.2 18.1 178.0 3900.0 MALE 2007
30 Adelie Dream 2007-11-09 39.5 16.7 178.0 3250.0 FEMALE 2007
29 Adelie Biscoe 2007-11-10 40.5 18.9 180.0 3950.0 MALE 2007
... ... ... ... ... ... ... ... ... ...
325 Gentoo Biscoe 2009-12-01 46.8 16.1 215.0 5500.0 MALE 2009
337 Gentoo Biscoe 2009-12-01 48.8 16.2 222.0 6000.0 MALE 2009
338 Gentoo Biscoe 2009-12-01 47.2 13.7 214.0 4925.0 FEMALE 2009
313 Gentoo Biscoe 2009-12-01 49.5 16.1 224.0 5650.0 MALE 2009
312 Gentoo Biscoe 2009-12-01 45.5 14.5 212.0 4750.0 FEMALE 2009

333 rows × 9 columns
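Automatic format detection is convenient, but the format can also be spelled out. Here is a minimal sketch, assuming the raw strings follow a month/day/two-digit-year pattern as in the rows shown earlier; the sample values are typed in by hand, and format is a standard parameter of pd.to_datetime().

```python
import pandas as pd

# A few date strings in the same style as the raw Date Egg column
raw_dates = pd.Series(["11/11/07", "11/16/07", "12/1/09"])

# %m/%d/%y = month/day/two-digit year, matching strings such as "11/11/07"
parsed = pd.to_datetime(raw_dates, format="%m/%d/%y")

print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```

Stating the format removes any doubt about whether a string such as 12/1/09 is read as December 1 or January 12.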

Plotting

First, we can create a histogram of the Date Egg values for each species on each island by using matplotlib.

# colour map for species
species_mapper = {
    "Adelie" : "red",
    "Gentoo" : "green",
    "Chinstrap":"purple"
}

# create the 5 plots, for different species from different islands.
fig, axes = plt.subplots(1, 5, figsize = (30,10), sharey = True, sharex = True)
axes[0].set_ylabel("Date Egg", fontsize = 15) 
ax_list = axes.tolist()                       # get the list of axes
   
def plot_hist(df, colname, alpha):
    """
    This function is called once for each (Species, Island) group; 
    it draws the histogram of "Date Egg" for that group.
    
    Parameters
    ----------
    df: data frame; the rows of one (Species, Island) group
    colname: string; "Date Egg" column
    alpha: float; a user-specified number for transparency

    Return 
    ----------
    No return value 
    """

    specie_name = df["Species"].unique()[0]   # get current species name 
    island_name = df["Island"].unique()[0]    # get current Island name 
    ax = ax_list.pop(0)                       # get an axis 
    # set title name for each plot
    ax.set_title(specie_name + " - " + island_name, fontsize = 20)
    ax.hist(df[colname],
            alpha = alpha,
            label = specie_name,
            color = species_mapper[specie_name],
            orientation ='horizontal')
    ax.set_xlabel('Number of Eggs', size=17)
    ax.legend(fontsize = 15)

# draw one histogram per (Species, Island) group
for _, group in df.groupby(["Species", "Island"]):
    plot_hist(group, "Date Egg", 0.5)

# add a suptitle to the figure.
fig.suptitle("Histogram of 'Date Egg' ",
             fontweight ="bold", size=26)
# show the plot
plt.show()

b0-hist-nan.png

We can see that most of the Date Egg values fall in November.
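The reading of the histogram can be cross-checked by tallying the Date Egg values by month. The sketch below uses a handful of made-up rows (toy is hypothetical, not the real data) and the standard pandas helpers dt.month_name() and pd.crosstab().

```python
import pandas as pd

# Hypothetical miniature version of the cleaned data frame
toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "Date Egg": pd.to_datetime(
        ["2007-11-09", "2007-11-16", "2009-11-22", "2009-12-01", "2008-11-25"]
    ),
})

# dt.month_name() turns each date into "November", "December", ...
months = toy["Date Egg"].dt.month_name()

# overall tally per month
month_counts = months.value_counts()
print(month_counts)

# the same tally split by species
by_species = pd.crosstab(toy["Species"], months)
print(by_species)
```

Run on the real df, the same two calls would give exact counts per month instead of an impression read off the bars.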


We can also create the scatterplot of Date Egg against Body Mass (g) on each Island by using seaborn.

fgrid = sns.relplot(data=df,
                    x = "Date Egg",
                    y = "Body Mass (g)",
                    hue = "Species",
                    col = "Island")

fgrid.fig.suptitle("Scatterplot of 'Date Egg' Against 'Body Mass (g)' ", size=16)
fgrid.fig.subplots_adjust(top = 0.85)
# turn 2D into 1d, easier to iterate
axes = fgrid.axes.flatten()
# rotate xticks for each plot
for ax in axes:
    ax.tick_params(axis="x", labelsize=10, labelrotation=40, labelcolor="firebrick")

b0-scatterplot-dateegg-bodymass.png

From the plots above, I wonder why every penguin in the data set has a Date Egg in November or December.


If we want to see the scatterplot of Date Egg against Body Mass (g) for each species, split by sex and by Island, we can use interactive data graphics from plotly.

Plotly includes a very large catalog of interesting plotting capabilities. The Plotly Express module allows us to create several of the most important kinds of plots using convenient, high-level functions.

We can also change the plot appearance themes by using plotly.io.

We can also use faceting to create multiple small plots, each of which displays a subset of the data. Plotly supports the easy creation of facets using the facet_col and facet_row arguments.

Because these contents are from PIC 16B, I will import the corresponding modules here instead of at the beginning of this blog.

from plotly import express as px
import plotly.io as pio

# pio.templates.default = "ggplot2"

pio.templates.default = "plotly_white"

fig = px.scatter(data_frame = df,      # data set
                 x = "Date Egg",       # column for x axis
                 y = "Body Mass (g)",  # column for y axis
                 color = "Species",    # column for dot color
                 width = 900,          # width of figure
                 height = 700,         # height of figure
                 opacity = 0.6,        # transparency for dot
                 facet_col = "Sex",    # assign marks to facetted subplots in the horizontal direction.
                 facet_row = "Island", # assign marks to facetted subplots in the vertical direction.
                 title = "Scatterplot of 'Date Egg' Against 'Body Mass (g)' ")  # make a title

# make title centered
fig.update(layout=dict(title=dict(x=0.5)))
# reduce whitespace
fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})
# show the plot
fig.show()


© French Bulldog, 2022
Written on January 12, 2022