Blog Post 1 - Data Graphics of Climate Data

In this blog post, I will create several interesting, interactive data graphics using the NOAA climate data.

I will import the required Python modules at begining for convenience.

import pandas as pd
import sqlite3
import numpy as np
from plotly import express as px
from sklearn.linear_model import LinearRegression
import seaborn as sns

§1. Create a Database

Since our data is large and reading large data into memory can be slow. Database allows us to store data and to get data back in structured ways without having to load in all the data at once. I will create a database by using sqlite3 package. This package can use to create, manipulate, and query databases. Then we can use pandas package to read the result of database queries as data frames.

The first thing is connect to a database after importing the sqilite3 package.

conn = sqlite3.connect("hw1_database.db")  # create a database in current directory called "hw1_database.db"

I will use pd.read_csv() function directly write our data frame to a database. Supplyong a value of chunksize will return an iterator with the number of rows equal to chunksize.

df_iter = pd.read_csv("temps.csv", chunksize =10000)

I will use df_iter.__next__() to get the first chunksize=(10000) rows of data.

df = df_iter.__next__()  

df

	ID	Year	VALUE1	VALUE2	VALUE3	VALUE4	VALUE5	VALUE6	VALUE7	VALUE8	VALUE9	VALUE10	VALUE11	VALUE12
0	ACW00011604	1961	-89.0	236.0	472.0	773.0	1128.0	1599.0	1570.0	1481.0	1413.0	1174.0	510.0	-39.0
1	ACW00011604	1962	113.0	85.0	-154.0	635.0	908.0	1381.0	1510.0	1393.0	1163.0	994.0	323.0	-126.0
2	ACW00011604	1963	-713.0	-553.0	-99.0	541.0	1224.0	1627.0	1620.0	1596.0	1332.0	940.0	566.0	-108.0
3	ACW00011604	1964	62.0	-85.0	55.0	738.0	1219.0	1442.0	1506.0	1557.0	1221.0	788.0	546.0	112.0
4	ACW00011604	1965	44.0	-105.0	38.0	590.0	987.0	1500.0	1487.0	1477.0	1377.0	974.0	31.0	-178.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9995	ARM00087222	1997	2858.0	2525.0	2594.0	2435.0	1901.0	1312.0	1529.0	1727.0	2200.0	2350.0	2621.0	2822.0
9996	ARM00087222	1998	2703.0	2438.0	2365.0	2024.0	1730.0	1677.0	NaN	1615.0	1810.0	2442.0	2578.0	3151.0
9997	ARM00087222	1999	2514.0	2826.0	2454.0	1887.0	2004.0	1230.0	1143.0	1585.0	2209.0	2302.0	2340.0	2604.0
9998	ARM00087222	2000	2715.0	NaN	2325.0	2142.0	1535.0	1215.0	1090.0	1580.0	1910.0	2360.0	2418.0	2725.0
9999	ARM00087222	2001	2803.0	2865.0	2593.0	2047.0	1547.0	1305.0	1320.0	1739.0	1816.0	2368.0	2489.0	2729.0

10000 rows × 14 columns

Next, I will create a function prepare_temps_df() to clean the temps.csv data before incorporating this data into my database. Inside this function, I will use df.set_index(),then I will use the df.stack() method to “stacking” all of the data values on top of each other. Then I will use df.reset_index() to recover the columns which I used in df.set_index().

def prepare_temps_df(df):
    """
    This function will clean the `temps.csv` data 
    before incorporating this data into my database
    
    Parameters
    ----------
    df: datafram; temps.csv
    
    Return
    ----------
    No return value
    """
    df = df.set_index(keys=["ID","Year"])
    df = df.stack()
    df = df.reset_index()
    df = df.rename(columns = {"level_2": "Month", 0 : "Temp"})
    df["Month"] = df["Month"].str[5:].astype(int)  # Value1 into 1 as month
    df["FIPS"] = df["ID"].str[0:2]
    df["Temp"]=df["Temp"]/100                      #change to degree C
    return (df)

check clened data

df = prepare_temps_df(df)
df.head()

	ID	Year	Month	Temp	FIPS
0	ACW00011604	1961	1	-0.89	AC
1	ACW00011604	1961	2	2.36	AC
2	ACW00011604	1961	3	4.72	AC
3	ACW00011604	1961	4	7.73	AC
4	ACW00011604	1961	5	11.28	AC

I will populate the table called temperatures in my database. The table is similar to the data frame-like object. The df.to_sql() method writes to a specified table in the database.

conn allows us to put that table in the database to which I have my current connection.

if_exists = append shows if table exists; I don’t want to replace it with the chunk. It will add all the chunks together.

index = False will ignore the index because I don’t want to put the index in its own column.

df_iter = pd.read_csv("temps.csv", chunksize =10000)
for df in df_iter:
    df = prepare_temps_df(df)
    df.to_sql("temperatures", conn, if_exists = "append", index = False)

Since the data station-metadata.csv is small, I don’t need to read it in chunks.

stations = pd.read_csv("station-metadata.csv")
stations.head()

	ID	LATITUDE	LONGITUDE	STNELEV	NAME
0	ACW00011604	57.7667	11.8667	18.0	SAVE
1	AE000041196	25.3330	55.5170	34.0	SHARJAH_INTER_AIRP
2	AEM00041184	25.6170	55.9330	31.0	RAS_AL_KHAIMAH_INTE
3	AEM00041194	25.2550	55.3640	10.4	DUBAI_INTL
4	AEM00041216	24.4300	54.4700	3.0	ABU_DHABI_BATEEN_AIR

add the data "station-metadata.csv" to my database with name stations.

stations.to_sql("stations", conn, if_exists = "replace", index = False)

Since the data "countries.csv" is small, I don’t need to read it in by chuncks.

countries = pd.read_csv("countries.csv")
countries["FIPS"]=countries["FIPS 10-4"]  # I will use 'FIPS' instead of 'FIPS 10-4' for convenience
countries.head()

	FIPS 10-4	ISO 3166	Name	FIPS
0	AF	AF	Afghanistan	AF
1	AX	-	Akrotiri	AX
2	AL	AL	Albania	AL
3	AG	DZ	Algeria	AG
4	AQ	AS	American Samoa	AQ

add the data “countries.csv” to my database with name countries.

countries.to_sql("countries", conn, if_exists = "replace", index = False)

Next, I will check the three tables that I just created in my database.

The cursor is the primary way to intersect with database. The cursor executes SQL commands.

cursor = conn.cursor()

The I will execute command SELECT name FROM sqlite_master WHERE type=’table’.

SELECT name: Show me the entries in the name column
FROM sqlite_master : of the sqlite_master table
WHERE type='table: subject to the condition that the entry in the type column of sqlite_master is equal to table (i.e. don’t include other kinds of objects)

Also, cursor.fetchall() returns the a list containing all the items returned by the query, which we then print.

Now, I have a database containing three tables; I will check that this is indeed the case.

cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())

[('temperatures',), ('stations',), ('countries',)]

I will check the items in each table to make sure I have populated my database correctly.

cursor.execute("SELECT sql FROM sqlite_master WHERE type='table';")
for result in cursor.fetchall():
    print(result[0])

CREATE TABLE "temperatures" (
"ID" TEXT,
  "Year" INTEGER,
  "Month" INTEGER,
  "Temp" REAL,
  "FIPS" TEXT
)
CREATE TABLE "stations" (
"ID" TEXT,
  "LATITUDE" REAL,
  "LONGITUDE" REAL,
  "STNELEV" REAL,
  "NAME" TEXT
)
CREATE TABLE "countries" (
"FIPS 10-4" TEXT,
  "ISO 3166" TEXT,
  "Name" TEXT,
  "FIPS" TEXT
)

Now I will close my database. I will reopen the connection if I need to pull some data from my database again.

conn.close()  # close database

§2. Write a Query Function

In this section I will write a function called query_climate_database() which accepts four arguments: country, a string giving the name of a country for which data should be returned. year_begin and year_end, two integers giving the earliest and latest years for which should be returned. month, an integer giving the month of the year for which should be returned.

The return value of query_climate_database() is a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year. This dataframe should have columns for:

The station name.
The latitude of the station.
The longitude of the station.
The name of the country in which the station is located.
The year in which the reading was taken.
The month in which the reading was taken.
The average temperature at the specified station during the specified year and month. (Note: the temperatures in the raw data are already averages by month, so I don’t have to do any aggregation at this stage.)

def query_climate_database (country, year_begin, year_end, month):
    """
    This function will create a dataframe from a user-specified country,
    with associated station name, the station's latitude, 
    the station's longitude, the year in which the reading was taken,
    and the month in which the reading was taken.
    
    Parameters
    ----------
    country: string; giving the name of a country for which data should be returned
    year_begin: int;  giving the earliest year for which should be returned
    year_end: int;  giving the latest year for which should be returned
    month: int;  giving the month of the year for which should be returned
    
    Return
    ----------
    a Pandas dataframe of temperature readings for the specified country,
    in the specified date range, in the specified month of the year.
    """
    
    where_condition = "WHERE year>=" + str(year_begin) + " AND " + "year<=" + str(year_end)+ " AND month=" +str(month)+ " AND C.name ='" + country+ "'"  
    
    cmd =\
    """
    SELECT  S.name, S.latitude, S.longitude, C.name, T.year, T.month, T.temp
    FROM temperatures T
    JOIN stations S ON T.id = S.id
    JOIN countries C ON T.fips = C.fips
    """
    cmd += where_condition 
      
    return pd.read_sql_query(cmd, conn)

Finally, we can call our function query_climate_database() to make the pandas dataframe of temperature readings for India, in the years 1980-2020, in January.

conn = sqlite3.connect("hw1_database.db")               # reopen the connection 
query_climate_database(country="India", year_begin=1980, year_end=2020, month=1)

	NAME	LATITUDE	LONGITUDE	Name	Year	Month	Temp
0	PBO_ANANTAPUR	14.583	77.633	India	1980	1	23.48
1	PBO_ANANTAPUR	14.583	77.633	India	1981	1	24.57
2	PBO_ANANTAPUR	14.583	77.633	India	1982	1	24.19
3	PBO_ANANTAPUR	14.583	77.633	India	1983	1	23.51
4	PBO_ANANTAPUR	14.583	77.633	India	1984	1	24.81
...	...	...	...	...	...	...	...
3147	DARJEELING	27.050	88.270	India	1983	1	5.10
3148	DARJEELING	27.050	88.270	India	1986	1	6.90
3149	DARJEELING	27.050	88.270	India	1994	1	8.10
3150	DARJEELING	27.050	88.270	India	1995	1	5.60
3151	DARJEELING	27.050	88.270	India	1997	1	5.70

3152 rows × 7 columns

conn.close()  # close database

§3. Write a Geographic Scatter Function for Yearly Temperature Increases

In this part, I will write a function temperature_coefficient_plot() to create visualizations that address the following question:

How does the average yearly change in temperature vary within a given country?

We will visualization that:

How does the average yearly change in temperature vary within India from 1980 to 2020?

First, I will create a function coef() to compute the first coefficient of a linear regression mode at the user-specified station.

def coef(data_group):
    """
    This function computes the first coefficient of a linear regression model
    at the user-specified station
    
    Parameters
    ----------
    data_group: datafram; giving the name of a country for which data should be returned

    Return
    ----------
    An array of the first coefficient
    """
    
    x = data_group[["Year"]] # x is df
    y = data_group["Temp"]   # y is a series
    LR = LinearRegression()
    LR.fit(x, y)
    return LR.coef_[0]
    

Then I will write a function called temperature_coefficient_plot(). This function should accept five explicit arguments, and an undetermined number of keyword arguments. country, year_begin, year_end, and month should be as in the previous part. min_obs, the minimum required number of years of data for any given station. Only data for stations with at least min_obs years worth of data in the specified month should be plotted; the others should be filtered out.

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
    """
    This function will create a Geographic Scatter plot for yearly temperature increases
    from the user-specified country and between a specified year range with a specified 
    month. The year should have more than a user-specified minimum required the number 
    of years.
    
    Parameters
    ----------
    country: string; giving the name of a country for which data should be returned
    year_begin: int;  giving the earliest year for which should be returned
    year_end: int;  giving the latest year for which should be returned
    month: int;  giving the month of the year for which should be returned
    min_obs: int; the minimum required number of years of data for any given station
    **kwargs: additional keyword arguments passed to px.scatter_mapbox()
    
    Return
    ----------
    a Pandas dataframe of the average yearly change in temperature vary within a given country
    """

    
    month_mapper ={1 : "January",  2 : "February",  3 : "March",
                   4 : "April",    5 : "May",       6 : "Jun",
                   7 : "July",     8 : "August",    9 : "September",
                  10 : "October", 11 : "November", 12 : "December" }
    
    # retrieve the data from function query_climate_database()
    df_1 = query_climate_database(country=country, year_begin=year_begin, year_end=year_end, month=month)
    
    df_2 = df_1.groupby(["NAME","LATITUDE","LONGITUDE","Month"]).apply(coef)
    
    df_2 = df_2.reset_index()

    df_2["Estimated Yearly Increase (\N{DEGREE SIGN}C)"] = df_2[0].round(4) # round estimates
    
    df_2 = df_2.drop([0], axis =1)

    # make sure we have the minimum required number of years of data for any given station.
    if (len(df_1["Year"].unique()) >=  min_obs):
        fig = px.scatter_mapbox(df_2,
                                lat = "LATITUDE",
                                lon = "LONGITUDE",
                                hover_name="NAME",
                                height = 300,
                                color ="Estimated Yearly Increase (\N{DEGREE SIGN}C)",
                                color_continuous_midpoint = 0,  # make colorbar is centered at 0
                                **kwargs)
        
        title_name = "Estimates of yearly increase in temperature in " + month_mapper[1]  + "<br>for stations in " + country + ", years " + str(year_begin) + "-" +str(year_end)
        fig.update_layout(margin={"r":0,"t":50,"l":10,"b":0},
                          title = title_name)
        fig.layout.coloraxis.colorbar.title ="Estimated Yearly <br>Increase (\N{DEGREE SIGN}C)"
      
        return fig
    
    else:
        print("We do not have enough " + str(min_obs) + " years of data")
        

visualization our data for “Estimates of yearly increase in temperature in January for stations in India, years 1980-2020.

conn = sqlite3.connect("hw1_database.db")  # connect the database

color_map = px.colors.diverging.RdGy_r # choose a colormap

fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 3,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)
fig.show()

conn.close()  # close the database

§4. Create Two More Interesting Figures

In this section I will create two more complex and interesting interactive data visualizations using the same data set with different types. In each case, I will construct my visualization from data obtained by querying the database that I created in part 1.

4.1

I will make a scatterplot of the station name starting with a user specified letter in the user specified country within a giving range of years to address the following question:

How does the monthly change in temperature in the given years vary within a given station?

I will creat the function query_function_4_1() to extracting data from uer specified country within the given yearls.

def query_function_4_1 (country, year_begin, year_end):
    """
    This function will extracting data and create a datafram
    from user specified country, with associated station name,
    latitude of the station, longitude of the station, year in which the reading was taken,
    and month in which the reading was taken.
    
    Parameters
    ----------
    country: string; giving the name of a country for which data should be returned
    year_begin: int;  giving the earliest year for which should be returned
    year_end: int;  giving the latest year for which should be returned
    
    Return
    ----------
    a Pandas dataframe of the user specified country with associated informations
    """
    
 
    where_condition = "WHERE year>=" + str(year_begin) + " AND " + "year<=" + str(year_end)+ " AND C.name ='" + country+ "'" 
    
    cmd =\
    """
    SELECT S.name, S.latitude, S.longitude, C.name, T.year, T.month, T.temp
    FROM temperatures T
    JOIN stations S ON T.id = S.id
    JOIN countries C ON T.fips = C.fips
    """

    cmd += where_condition 
    
    return pd.read_sql_query(cmd, conn)

Then I will creat the function plot4_1() to make my plot.

def plot4_1 (country, year_begin, year_end, starting_letter):
    """
    This function make scatterplot of the station name starting with "G" 
    in the user specified country within a giving range of years
    
    Parameters
    ----------
    country: string; giving the name of a country for which data should be returned
    year_begin: int;  giving the earliest year for which should be returned
    year_end: int;  giving the latest year for which should be returned
    starting_letter: char; Stations name which starting with starting_letter.
    
    Return
    ----------
    No return value
    
    """
    
    # extracting data
    df_4 = query_function_4_1(country=country, year_begin=year_begin, year_end=year_end)
    # station name starting with starting_letter 
    begins_with = df_4[df_4["NAME"].str[0]== starting_letter]
    
    begins_with = begins_with.rename(columns ={"Temp" : "Temperatures (\N{DEGREE SIGN}C)"})
    fig = px.scatter(data_frame=begins_with,
                     x = "Month",
                     y = "Temperatures (\N{DEGREE SIGN}C)",
                     facet_col = "Year",
                     facet_row ="NAME")
    
    
    # make and center the title
    fig.update_layout(title_text="Scatterplot of Temperatures (\N{DEGREE SIGN}C) Against Month", title_x=0.5)
    fig.show()
  

Visualize our data for the monthly change in temperature in Japan’s stations with a name starting with “G,” the year 2000-2001.

conn = sqlite3.connect("hw1_database.db")

# stations from Japan with name starting with "G", in year 2000-2001
plot4_1("Japan",2000,2001, starting_letter="G")

conn.close()  # close the database

4.2

I will make a line plot of the stations at the user specified latitude to address the following question:

How does the temperature vary within given stations at the same latitude?

First, I will make a query_function_4_2() function to extract station data at the specified latitude.

def query_function_4_2 (latitude_at):
    """
    This function will extracting data and create a datafram
    for stations where its latitude = latitude_at.
    The returned data has columns 
    id, year, month, temp, name, latitude, longitude
    
    Parameters
    ----------
    latitude_at: int; a user specified latitude 
    
    Return
    ----------
    a Pandas dataframe of the ave
    
    
    
    """
    where_condition = "WHERE S.latitude=" + str(latitude_at)
    
    cmd =\
    """
    SELECT  T.id, T.year, T.month, T.temp, S.name, S.latitude, S.longitude
    FROM temperatures T
    JOIN stations S ON T.id = S.id
    WHERE S.latitude=0
    """

    return pd.read_sql_query(cmd, conn)

Then I will make a function called plot4_2() to visualize how the temperature varies within a given station at the same latitude.

Inside the function plot4_2(), I will create a DateTime column from the returned data of the function query_function_4_2() that reflects year, month, and day (YYYY-MM-DD). We can convert the values to DateTime using the built-in pandas function pd.to_datetime(). The nice thing about this function is that it can automatically detect several common formats of date-time string.

def plot4_2 (latitude_at):
    """
    This function will make a line graph of the stations at
    the user specified latitude.
    
    Parameters
    ----------
    latitude_at: int; stations' latitude
    
    ----------
    Return 
    No return value
    """
    
    df_4_2 = query_function_4_2(latitude_at=latitude_at)  # get the data where stations at LATITUDE=0
    
    df_4_2["Date"]= df_4_2["Year"].astype(str)+ "-" + df_4_2["Month"].astype(str)
    
    # create datetime columnfrom
    df_4_2["Date"]= pd.to_datetime(df_4_2["Date"])
    
    # compute the average temperature across all stations where LATITUDE=0 in each month
    averges = df_4_2.groupby(["NAME", "Date"])[["Temp"]].mean()
    
    # recover the "NAME","Date" columns using reset_index()
    averges=averges.reset_index()
    averges = averges.rename(columns ={"NAME" : "Station Name"})
    
    fig = px.line(data_frame=averges,
              x = "Date" ,
              y = "Temp",
              color= "Station Name")
    
    # make and center the title
    fig.update_layout(title_text="Line Graph of Stations at Latitude at "+str(latitude_at), title_x=0.5)

    fig.show()
    
    

Visualize our data for the temperature variation within stations at latitude 0.

conn = sqlite3.connect("hw1_database.db")
# make a line graph of the stations at latitude = 0
plot4_2(latitude_at=0)
conn.close()

Written on January 18, 2022