Blog Post 1 - Data Graphics of Climate Data
In this blog post, I will create several interesting, interactive data graphics using the NOAA climate data.
I will import the required Python modules at begining for convenience.
import pandas as pd
import sqlite3
import numpy as np
from plotly import express as px
from sklearn.linear_model import LinearRegression
import seaborn as sns
§1. Create a Database
Since our data is large and reading large data into memory can be slow. Database allows us to store data and to get data back in structured ways without having to load in all the data at once. I will create a database by using sqlite3
package. This package can use to create, manipulate, and query databases. Then we can use pandas
package to read the result of database queries as data frames.
The first thing is connect to a database after importing the sqilite3
package.
conn = sqlite3.connect("hw1_database.db") # create a database in current directory called "hw1_database.db"
I will use pd.read_csv()
function directly write our data frame to a database. Supplyong a value of chunksize
will return an iterator
with the number of rows equal to chunksize
.
df_iter = pd.read_csv("temps.csv", chunksize =10000)
I will use df_iter.__next__()
to get the first chunksize
=(10000) rows of data.
df = df_iter.__next__()
df
ID | Year | VALUE1 | VALUE2 | VALUE3 | VALUE4 | VALUE5 | VALUE6 | VALUE7 | VALUE8 | VALUE9 | VALUE10 | VALUE11 | VALUE12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACW00011604 | 1961 | -89.0 | 236.0 | 472.0 | 773.0 | 1128.0 | 1599.0 | 1570.0 | 1481.0 | 1413.0 | 1174.0 | 510.0 | -39.0 |
1 | ACW00011604 | 1962 | 113.0 | 85.0 | -154.0 | 635.0 | 908.0 | 1381.0 | 1510.0 | 1393.0 | 1163.0 | 994.0 | 323.0 | -126.0 |
2 | ACW00011604 | 1963 | -713.0 | -553.0 | -99.0 | 541.0 | 1224.0 | 1627.0 | 1620.0 | 1596.0 | 1332.0 | 940.0 | 566.0 | -108.0 |
3 | ACW00011604 | 1964 | 62.0 | -85.0 | 55.0 | 738.0 | 1219.0 | 1442.0 | 1506.0 | 1557.0 | 1221.0 | 788.0 | 546.0 | 112.0 |
4 | ACW00011604 | 1965 | 44.0 | -105.0 | 38.0 | 590.0 | 987.0 | 1500.0 | 1487.0 | 1477.0 | 1377.0 | 974.0 | 31.0 | -178.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | ARM00087222 | 1997 | 2858.0 | 2525.0 | 2594.0 | 2435.0 | 1901.0 | 1312.0 | 1529.0 | 1727.0 | 2200.0 | 2350.0 | 2621.0 | 2822.0 |
9996 | ARM00087222 | 1998 | 2703.0 | 2438.0 | 2365.0 | 2024.0 | 1730.0 | 1677.0 | NaN | 1615.0 | 1810.0 | 2442.0 | 2578.0 | 3151.0 |
9997 | ARM00087222 | 1999 | 2514.0 | 2826.0 | 2454.0 | 1887.0 | 2004.0 | 1230.0 | 1143.0 | 1585.0 | 2209.0 | 2302.0 | 2340.0 | 2604.0 |
9998 | ARM00087222 | 2000 | 2715.0 | NaN | 2325.0 | 2142.0 | 1535.0 | 1215.0 | 1090.0 | 1580.0 | 1910.0 | 2360.0 | 2418.0 | 2725.0 |
9999 | ARM00087222 | 2001 | 2803.0 | 2865.0 | 2593.0 | 2047.0 | 1547.0 | 1305.0 | 1320.0 | 1739.0 | 1816.0 | 2368.0 | 2489.0 | 2729.0 |
10000 rows × 14 columns
Next, I will create a function prepare_temps_df()
to clean the temps.csv
data before incorporating this data into my database.
Inside this function, I will use df.set_index()
,then I will use the df.stack()
method to “stacking” all of the data values on top of each other. Then I will use df.reset_index()
to recover the columns which I used in df.set_index()
.
def prepare_temps_df(df):
"""
This function will clean the `temps.csv` data
before incorporating this data into my database
Parameters
----------
df: datafram; temps.csv
Return
----------
No return value
"""
df = df.set_index(keys=["ID","Year"])
df = df.stack()
df = df.reset_index()
df = df.rename(columns = {"level_2": "Month", 0 : "Temp"})
df["Month"] = df["Month"].str[5:].astype(int) # Value1 into 1 as month
df["FIPS"] = df["ID"].str[0:2]
df["Temp"]=df["Temp"]/100 #change to degree C
return (df)
check clened data
df = prepare_temps_df(df)
df.head()
ID | Year | Month | Temp | FIPS | |
---|---|---|---|---|---|
0 | ACW00011604 | 1961 | 1 | -0.89 | AC |
1 | ACW00011604 | 1961 | 2 | 2.36 | AC |
2 | ACW00011604 | 1961 | 3 | 4.72 | AC |
3 | ACW00011604 | 1961 | 4 | 7.73 | AC |
4 | ACW00011604 | 1961 | 5 | 11.28 | AC |
I will populate the table called temperatures
in my database. The table is similar to the data frame-like object. The df.to_sql()
method writes to a specified table in the database.
conn
allows us to put that table in the database to which I have my current connection.
if_exists = append
shows if table exists; I don’t want to replace it with the chunk. It will add all the chunks together.
index = False
will ignore the index because I don’t want to put the index in its own column.
df_iter = pd.read_csv("temps.csv", chunksize =10000)
for df in df_iter:
df = prepare_temps_df(df)
df.to_sql("temperatures", conn, if_exists = "append", index = False)
Since the data station-metadata.csv
is small, I don’t need to read it in chunks.
stations = pd.read_csv("station-metadata.csv")
stations.head()
ID | LATITUDE | LONGITUDE | STNELEV | NAME | |
---|---|---|---|---|---|
0 | ACW00011604 | 57.7667 | 11.8667 | 18.0 | SAVE |
1 | AE000041196 | 25.3330 | 55.5170 | 34.0 | SHARJAH_INTER_AIRP |
2 | AEM00041184 | 25.6170 | 55.9330 | 31.0 | RAS_AL_KHAIMAH_INTE |
3 | AEM00041194 | 25.2550 | 55.3640 | 10.4 | DUBAI_INTL |
4 | AEM00041216 | 24.4300 | 54.4700 | 3.0 | ABU_DHABI_BATEEN_AIR |
add the data "station-metadata.csv"
to my database with name stations
.
stations.to_sql("stations", conn, if_exists = "replace", index = False)
Since the data "countries.csv"
is small, I don’t need to read it in by chuncks.
countries = pd.read_csv("countries.csv")
countries["FIPS"]=countries["FIPS 10-4"] # I will use 'FIPS' instead of 'FIPS 10-4' for convenience
countries.head()
FIPS 10-4 | ISO 3166 | Name | FIPS | |
---|---|---|---|---|
0 | AF | AF | Afghanistan | AF |
1 | AX | - | Akrotiri | AX |
2 | AL | AL | Albania | AL |
3 | AG | DZ | Algeria | AG |
4 | AQ | AS | American Samoa | AQ |
add the data “countries.csv” to my database with name countries.
countries.to_sql("countries", conn, if_exists = "replace", index = False)
Next, I will check the three tables that I just created in my database.
The cursor
is the primary way to intersect with database. The cursor execute
s SQL commands.
cursor = conn.cursor()
The I will execute command SELECT name FROM sqlite_master WHERE type=’table’.
SELECT name
: Show me the entries in thename
columnFROM sqlite_master
: of thesqlite_master
tableWHERE type='table
: subject to the condition that the entry in the type column of sqlite_master is equal to table (i.e. don’t include other kinds of objects)
Also, cursor.fetchall()
returns the a list containing all the items returned by the query, which we then print.
Now, I have a database containing three tables; I will check that this is indeed the case.
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
[('temperatures',), ('stations',), ('countries',)]
I will check the items in each table to make sure I have populated my database correctly.
cursor.execute("SELECT sql FROM sqlite_master WHERE type='table';")
for result in cursor.fetchall():
print(result[0])
CREATE TABLE "temperatures" (
"ID" TEXT,
"Year" INTEGER,
"Month" INTEGER,
"Temp" REAL,
"FIPS" TEXT
)
CREATE TABLE "stations" (
"ID" TEXT,
"LATITUDE" REAL,
"LONGITUDE" REAL,
"STNELEV" REAL,
"NAME" TEXT
)
CREATE TABLE "countries" (
"FIPS 10-4" TEXT,
"ISO 3166" TEXT,
"Name" TEXT,
"FIPS" TEXT
)
Now I will close my database. I will reopen the connection if I need to pull some data from my database again.
conn.close() # close database
§2. Write a Query Function
In this section I will write a function called query_climate_database()
which accepts four
arguments:
country
, a string giving the name of a country for which data should be returned.
year_begin
and year_end
, two integers giving the earliest and latest years for which should be returned.
month
, an integer giving the month of the year for which should be returned.
The return value of query_climate_database()
is a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year. This dataframe should have columns for:
-
The station name.
-
The latitude of the station.
-
The longitude of the station.
-
The name of the country in which the station is located.
-
The year in which the reading was taken.
-
The month in which the reading was taken.
-
The average temperature at the specified station during the specified year and month. (Note: the temperatures in the raw data are already averages by month, so I don’t have to do any aggregation at this stage.)
def query_climate_database (country, year_begin, year_end, month):
"""
This function will create a dataframe from a user-specified country,
with associated station name, the station's latitude,
the station's longitude, the year in which the reading was taken,
and the month in which the reading was taken.
Parameters
----------
country: string; giving the name of a country for which data should be returned
year_begin: int; giving the earliest year for which should be returned
year_end: int; giving the latest year for which should be returned
month: int; giving the month of the year for which should be returned
Return
----------
a Pandas dataframe of temperature readings for the specified country,
in the specified date range, in the specified month of the year.
"""
where_condition = "WHERE year>=" + str(year_begin) + " AND " + "year<=" + str(year_end)+ " AND month=" +str(month)+ " AND C.name ='" + country+ "'"
cmd =\
"""
SELECT S.name, S.latitude, S.longitude, C.name, T.year, T.month, T.temp
FROM temperatures T
JOIN stations S ON T.id = S.id
JOIN countries C ON T.fips = C.fips
"""
cmd += where_condition
return pd.read_sql_query(cmd, conn)
Finally, we can call our function query_climate_database()
to make the pandas dataframe of temperature readings for India, in the years 1980-2020, in January.
conn = sqlite3.connect("hw1_database.db") # reopen the connection
query_climate_database(country="India", year_begin=1980, year_end=2020, month=1)
NAME | LATITUDE | LONGITUDE | Name | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
0 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1980 | 1 | 23.48 |
1 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1981 | 1 | 24.57 |
2 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1982 | 1 | 24.19 |
3 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1983 | 1 | 23.51 |
4 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1984 | 1 | 24.81 |
... | ... | ... | ... | ... | ... | ... | ... |
3147 | DARJEELING | 27.050 | 88.270 | India | 1983 | 1 | 5.10 |
3148 | DARJEELING | 27.050 | 88.270 | India | 1986 | 1 | 6.90 |
3149 | DARJEELING | 27.050 | 88.270 | India | 1994 | 1 | 8.10 |
3150 | DARJEELING | 27.050 | 88.270 | India | 1995 | 1 | 5.60 |
3151 | DARJEELING | 27.050 | 88.270 | India | 1997 | 1 | 5.70 |
3152 rows × 7 columns
conn.close() # close database
§3. Write a Geographic Scatter Function for Yearly Temperature Increases
In this part, I will write a function temperature_coefficient_plot()
to create visualizations that address the following question:
How does the average yearly change in temperature vary within a given country?
We will visualization that:
How does the average yearly change in temperature vary within India from 1980 to 2020?
First, I will create a function coef()
to compute the first coefficient of a linear regression mode at the user-specified station.
def coef(data_group):
"""
This function computes the first coefficient of a linear regression model
at the user-specified station
Parameters
----------
data_group: datafram; giving the name of a country for which data should be returned
Return
----------
An array of the first coefficient
"""
x = data_group[["Year"]] # x is df
y = data_group["Temp"] # y is a series
LR = LinearRegression()
LR.fit(x, y)
return LR.coef_[0]
Then I will write a function called temperature_coefficient_plot()
. This function should accept five
explicit arguments, and an undetermined number of keyword arguments
.
country
, year_begin
, year_end
, and month
should be as in the previous part.
min_obs
, the minimum required number of years of data for any given station. Only data for stations with at least min_obs years worth of data in the specified month should be plotted; the others should be filtered out.
def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
"""
This function will create a Geographic Scatter plot for yearly temperature increases
from the user-specified country and between a specified year range with a specified
month. The year should have more than a user-specified minimum required the number
of years.
Parameters
----------
country: string; giving the name of a country for which data should be returned
year_begin: int; giving the earliest year for which should be returned
year_end: int; giving the latest year for which should be returned
month: int; giving the month of the year for which should be returned
min_obs: int; the minimum required number of years of data for any given station
**kwargs: additional keyword arguments passed to px.scatter_mapbox()
Return
----------
a Pandas dataframe of the average yearly change in temperature vary within a given country
"""
month_mapper ={1 : "January", 2 : "February", 3 : "March",
4 : "April", 5 : "May", 6 : "Jun",
7 : "July", 8 : "August", 9 : "September",
10 : "October", 11 : "November", 12 : "December" }
# retrieve the data from function query_climate_database()
df_1 = query_climate_database(country=country, year_begin=year_begin, year_end=year_end, month=month)
df_2 = df_1.groupby(["NAME","LATITUDE","LONGITUDE","Month"]).apply(coef)
df_2 = df_2.reset_index()
df_2["Estimated Yearly Increase (\N{DEGREE SIGN}C)"] = df_2[0].round(4) # round estimates
df_2 = df_2.drop([0], axis =1)
# make sure we have the minimum required number of years of data for any given station.
if (len(df_1["Year"].unique()) >= min_obs):
fig = px.scatter_mapbox(df_2,
lat = "LATITUDE",
lon = "LONGITUDE",
hover_name="NAME",
height = 300,
color ="Estimated Yearly Increase (\N{DEGREE SIGN}C)",
color_continuous_midpoint = 0, # make colorbar is centered at 0
**kwargs)
title_name = "Estimates of yearly increase in temperature in " + month_mapper[1] + "<br>for stations in " + country + ", years " + str(year_begin) + "-" +str(year_end)
fig.update_layout(margin={"r":0,"t":50,"l":10,"b":0},
title = title_name)
fig.layout.coloraxis.colorbar.title ="Estimated Yearly <br>Increase (\N{DEGREE SIGN}C)"
return fig
else:
print("We do not have enough " + str(min_obs) + " years of data")
visualization our data for “Estimates of yearly increase in temperature in January for stations in India, years 1980-2020.
conn = sqlite3.connect("hw1_database.db") # connect the database
color_map = px.colors.diverging.RdGy_r # choose a colormap
fig = temperature_coefficient_plot("India", 1980, 2020, 1,
min_obs = 10,
zoom = 3,
mapbox_style="carto-positron",
color_continuous_scale=color_map)
fig.show()
conn.close() # close the database
§4. Create Two More Interesting Figures
In this section I will create two more complex and interesting interactive data visualizations using the same data set with different types. In each case, I will construct my visualization from data obtained by querying the database that I created in part 1.
4.1
I will make a scatterplot of the station name starting with a user specified letter in the user specified country within a giving range of years to address the following question:
How does the monthly change in temperature in the given years vary within a given station?
I will creat the function query_function_4_1()
to extracting data from uer specified country within the given yearls.
def query_function_4_1 (country, year_begin, year_end):
"""
This function will extracting data and create a datafram
from user specified country, with associated station name,
latitude of the station, longitude of the station, year in which the reading was taken,
and month in which the reading was taken.
Parameters
----------
country: string; giving the name of a country for which data should be returned
year_begin: int; giving the earliest year for which should be returned
year_end: int; giving the latest year for which should be returned
Return
----------
a Pandas dataframe of the user specified country with associated informations
"""
where_condition = "WHERE year>=" + str(year_begin) + " AND " + "year<=" + str(year_end)+ " AND C.name ='" + country+ "'"
cmd =\
"""
SELECT S.name, S.latitude, S.longitude, C.name, T.year, T.month, T.temp
FROM temperatures T
JOIN stations S ON T.id = S.id
JOIN countries C ON T.fips = C.fips
"""
cmd += where_condition
return pd.read_sql_query(cmd, conn)
Then I will creat the function plot4_1()
to make my plot.
def plot4_1 (country, year_begin, year_end, starting_letter):
"""
This function make scatterplot of the station name starting with "G"
in the user specified country within a giving range of years
Parameters
----------
country: string; giving the name of a country for which data should be returned
year_begin: int; giving the earliest year for which should be returned
year_end: int; giving the latest year for which should be returned
starting_letter: char; Stations name which starting with starting_letter.
Return
----------
No return value
"""
# extracting data
df_4 = query_function_4_1(country=country, year_begin=year_begin, year_end=year_end)
# station name starting with starting_letter
begins_with = df_4[df_4["NAME"].str[0]== starting_letter]
begins_with = begins_with.rename(columns ={"Temp" : "Temperatures (\N{DEGREE SIGN}C)"})
fig = px.scatter(data_frame=begins_with,
x = "Month",
y = "Temperatures (\N{DEGREE SIGN}C)",
facet_col = "Year",
facet_row ="NAME")
# make and center the title
fig.update_layout(title_text="Scatterplot of Temperatures (\N{DEGREE SIGN}C) Against Month", title_x=0.5)
fig.show()
Visualize our data for the monthly change in temperature in Japan’s stations with a name starting with “G,” the year 2000-2001.
conn = sqlite3.connect("hw1_database.db")
# stations from Japan with name starting with "G", in year 2000-2001
plot4_1("Japan",2000,2001, starting_letter="G")
conn.close() # close the database
4.2
I will make a line plot of the stations at the user specified latitude to address the following question:
How does the temperature vary within given stations at the same latitude?
First, I will make a query_function_4_2()
function to extract station data at the specified latitude.
def query_function_4_2 (latitude_at):
"""
This function will extracting data and create a datafram
for stations where its latitude = latitude_at.
The returned data has columns
id, year, month, temp, name, latitude, longitude
Parameters
----------
latitude_at: int; a user specified latitude
Return
----------
a Pandas dataframe of the ave
"""
where_condition = "WHERE S.latitude=" + str(latitude_at)
cmd =\
"""
SELECT T.id, T.year, T.month, T.temp, S.name, S.latitude, S.longitude
FROM temperatures T
JOIN stations S ON T.id = S.id
WHERE S.latitude=0
"""
return pd.read_sql_query(cmd, conn)
Then I will make a function called plot4_2()
to visualize how the temperature varies within a given station at the same latitude.
Inside the function plot4_2()
, I will create a DateTime column from the returned data of the function query_function_4_2()
that reflects year, month, and day (YYYY-MM-DD). We can convert the values to DateTime using the built-in pandas function pd.to_datetime()
. The nice thing about this function is that it can automatically detect several common formats of date-time string.
def plot4_2 (latitude_at):
"""
This function will make a line graph of the stations at
the user specified latitude.
Parameters
----------
latitude_at: int; stations' latitude
----------
Return
No return value
"""
df_4_2 = query_function_4_2(latitude_at=latitude_at) # get the data where stations at LATITUDE=0
df_4_2["Date"]= df_4_2["Year"].astype(str)+ "-" + df_4_2["Month"].astype(str)
# create datetime columnfrom
df_4_2["Date"]= pd.to_datetime(df_4_2["Date"])
# compute the average temperature across all stations where LATITUDE=0 in each month
averges = df_4_2.groupby(["NAME", "Date"])[["Temp"]].mean()
# recover the "NAME","Date" columns using reset_index()
averges=averges.reset_index()
averges = averges.rename(columns ={"NAME" : "Station Name"})
fig = px.line(data_frame=averges,
x = "Date" ,
y = "Temp",
color= "Station Name")
# make and center the title
fig.update_layout(title_text="Line Graph of Stations at Latitude at "+str(latitude_at), title_x=0.5)
fig.show()
Visualize our data for the temperature variation within stations at latitude 0.
conn = sqlite3.connect("hw1_database.db")
# make a line graph of the stations at latitude = 0
plot4_2(latitude_at=0)
conn.close()