For the past few weeks, the Camp Fire in Northern California has been a deadly and destructive force, receiving enormous media coverage. The fire, which began on November 8, 2018, has left 85 people dead and 249 missing, and has destroyed thousands of buildings. It has also caused major air quality issues, leading many to wear masks or stay indoors to protect their health. But this is not the first time a major wildfire has ravaged the west coast of the United States.
In this tutorial, we will analyze west coast wildfire data from 2010 to 2016, available through the University of California, Irvine Data Science Initiative GitHub organization. This data was obtained via NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) instrument. We will merge this wildfire data with daily weather data obtained from the NOAA Climate Data Online tool.
We will begin by loading the data and performing exploratory data analysis before diving deeper into regression analysis and classification analysis using basic machine learning methods. These data science methods will allow us to look at past trends and make inferences about future west coast wildfires.
We will begin by importing Python packages that will help us to load and wrangle the data:
import pandas as pd
import numpy as np
The wildfire dataset is contained in a comma-separated values (CSV) file located on GitHub. Pandas has a read_csv function that allows us to load a data frame from a CSV over the web. We will specify the data types of each column using a Python dict to ensure that pandas loads the data using the data types that we want to use in our analysis.
data_url = 'https://raw.githubusercontent.com/UCIDataScienceInitiative/Climate_Hackathon/master/west_coast_fires/fires.csv'
dtype = {
'confidence': float,
'day': int,
'frp': float,
'lat': float,
'lon': float,
'month': int,
'year': int,
'x': float,
'y': float,
'dayofyear': int,
'vpd': float,
'temp': float,
'humidity': float
}
df = pd.read_csv(data_url, dtype=dtype)
df.head()
We can use the shape attribute to learn how many fire observations are contained in this dataset:
df.shape
The next step is to determine what this wildfire data means so that we can explore it. Luckily, the README for the GitHub repository provides a description of each column:
- id: unique identifier of the fire in the overall context of the world dataset
- confidence: how much confidence the satellite has that this is actually a fire detection (percent)
- day: the day of the month
- frp: Fire Radiative Power, the strength of the fire
- lat: latitude
- lon: longitude
- month: month of the year
- year: year
- x: x position in a uniformly-spaced grid
- y: y position in a uniformly-spaced grid
- dayofyear: day of the year (from 0 to 364, or 365 for leap years)
- vpd: Vapor Pressure Deficit, the difference between the moisture in the air and the amount of moisture the air could hold
- temp: temperature (degrees Kelvin)
- humidity: humidity (percent)

To be extra cautious, we can restrict our analysis to those fires with high confidence.
confidence_threshold = 0.8
df = df.loc[df['confidence'] > confidence_threshold]
We can observe the size of our data after filtering by the confidence threshold.
df.shape
We will also want to exclude fires with a strength of zero.
frp_threshold = 1
df = df.loc[df['frp'] > frp_threshold]
We can again observe the size of our data, this time after filtering by the frp threshold.
df.shape
The wildfire dataset README notes that the same fire may be split into separate observations if it spreads over multiple locations or lasts multiple days. We can try to remedy this by consolidating fire observations that occur on the same or consecutive days and lie within the maximum possible daily spread distance of one another.
According to the Wildfire page on Wikipedia, forest fires can spread as fast as 10.8 km per hour. Using this figure, a fire could spread approximately 259 kilometers in a single day:
wildfire_max_daily_spread = 10.8*24
wildfire_max_daily_spread
To easily identify consecutive days, we will first add a column called dayofdataset, which is similar to dayofyear but does not restart at the end of the year.
min_year = df['year'].min()
max_year = df['year'].max()
df['dayofdataset'] = df.apply(lambda row: (row['year'] - min_year) * 365 + row['dayofyear'], axis='columns')
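As an aside, the same column can be computed without apply, using vectorized column arithmetic (an equivalent and typically faster form, using the same 365-day-per-year approximation):
# Equivalent vectorized computation of the dayofdataset column
df['dayofdataset'] = (df['year'] - min_year) * 365 + df['dayofyear']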
Next we will take the difference of dayofdataset for each row with the previous row. The resulting groups of rows that are 0 or 1 days apart will be the candidates for our consolidation procedure.
df['dayofdataset_diff'] = df['dayofdataset'].diff()
df.head()
Based on this date difference column, we can give each group of consecutive dates a different index, so that we can use groupby to iterate over each group of dates.
df = df.reset_index(drop=True)
df['date_group_index'] = np.nan
curr_date_group_index = 1
for index, row in df.iterrows():
if pd.isna(row['dayofdataset_diff']) or row['dayofdataset_diff'] <= 1.0:
df.at[index, 'date_group_index'] = curr_date_group_index
    else:
        # Start a new date group: the current day is over one day
        # past the day of the previous fire observation
        curr_date_group_index += 1
        df.at[index, 'date_group_index'] = curr_date_group_index
df.head()
After grouping fires by date, the next step is to group the fires in a specific date group by location.
Our first thought might be to do pairwise comparisons of distances between fires, using something like the Haversine formula, which computes distances between locations from their latitude and longitude values.
# Adapted from this StackOverflow post: https://stackoverflow.com/a/4913653
def haversine(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
r = 6371 # radius of earth in kilometers
return c * r
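As a quick sanity check of this function, the great-circle distance between approximate San Francisco and Los Angeles coordinates (hypothetical test values) should come out near the familiar ~560 kilometers:
# Approximate coordinates: San Francisco (37.77, -122.42), Los Angeles (34.05, -118.24)
haversine(-122.42, 37.77, -118.24, 34.05)  # roughly 559 km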
However, the size of our largest "date group" is over 20,000 rows:
max([date_group_df.shape[0] for date_group_index, date_group_df in df.groupby(['date_group_index'])])
so the number of pairwise comparisons we would need to perform is massive: $$ {23155 \choose 2} = 268065435 $$
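As a quick check, we can compute this count directly:
n = 23155  # size of the largest "date group" from above
n * (n - 1) // 2  # number of unordered pairs: 268,065,435
Rather than performing all of these pairwise comparisons, we can instead turn to a spatial index.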
from h3 import h3
h3 is a "hierarchical hexagonal geospatial indexing system" that divides geographical space into hexagons, which have nice properties, including approximating circles:
We can use the h3 table of resolutions to determine the hexagon size that best suits our needs. The maximum wildfire spread per day is approximately 259 km, and a circle with radius $\frac{259}{2}$ km has the following area (in square kilometers):
np.pi*(259/2)**2
We will choose resolution 3: the area of a resolution 2 hexagon is too large, so resolution 3 is the closest match without going over.
Resolution 3 has the following properties:
| Average Hexagon Area (km²) | Average Hexagon Edge Length (km) | Number of unique indexes |
| --- | --- | --- |
| 12,392.2648621 | 59.810857940 | 41,162 |
h3_resolution = 3
Now we can assign each fire to a hexagon at this resolution:
def get_h3_address(row):
return h3.geo_to_h3(row['lat'], row['lon'], h3_resolution)
df['h3_address'] = df.apply(get_h3_address, axis='columns')
df.head()
We will use the h3 k_ring functionality to return the set of hexagons neighboring a given hexagon address:
h3.k_ring("8348ebfffffffff", 1) # demonstrating the k_ring function
We will do this for all fires in each "date group" and assign a new "fire group" index to those fires within one hexagon of each other:
curr_max_fire_group_index = df.shape[0]
df['fire_group_index'] = np.arange(curr_max_fire_group_index)
# Iterate over every "date group"
for date_group_index, date_group_df in df.groupby(['date_group_index']):
# Get the set of all unique hexagon addresses for this date group
date_group_h3_address_set = set(date_group_df['h3_address'].unique())
# While the set of addresses is not empty compute fire groups
while len(date_group_h3_address_set) > 0:
# Pop a hexagon address from the set
h3_address = date_group_h3_address_set.pop()
# Get the addresses of the neighboring hexagons
h3_address_k_ring_set = h3.k_ring(h3_address, 1)
# Get all observations in this date group which are in this hexagon or the neighboring hexagons
h3_address_k_ring_set_df = date_group_df.loc[date_group_df['h3_address'].isin(h3_address_k_ring_set)]
# Assign these observations to the same "fire group" unless they are over a day apart
curr_max_fire_group_index += 1
curr_date = h3_address_k_ring_set_df.reset_index(drop=True).loc[0]['dayofdataset']
for fire_index_in_group in list(h3_address_k_ring_set_df.index.values):
fire_date = df.at[fire_index_in_group, 'dayofdataset']
if fire_date - curr_date > 1:
curr_max_fire_group_index += 1
curr_date = fire_date
df.at[fire_index_in_group, 'fire_group_index'] = curr_max_fire_group_index
date_group_h3_address_set = (date_group_h3_address_set - h3_address_k_ring_set)
df.head()
Later we will be able to use the fire_group_index column to identify our groups of fire observations as single fires.
Using the NOAA Climate Data Online (CDO) service, we can obtain daily climate data from 2010 to 2016 from weather stations on the west coast. The CDO search tool allows us to select NOAA stations of interest and the types of data we want to download for a specified time period. In this case, I manually selected 75 weather stations spread throughout the west coast region.
Along with the location (latitude and longitude) of each weather station, I selected data comprising the following variables:

Air Temperature
- TAVG: Average temperature
- TMAX: Maximum temperature
- TMIN: Minimum temperature

Precipitation
- MDPR: Multiday precipitation total (use with DAPR and DWPR, if available)
- DAPR: Number of days included in the multiday precipitation total (MDPR)
- PRCP: Precipitation
- SNWD: Snow depth
- SNOW: Snowfall

Sunshine
- PSUN: Daily percent of possible sunshine for the period
- TSUN: Total sunshine for the period

Wind
- AWND: Average wind speed
- WDF2: Direction of fastest 2-minute wind
- WDF5: Direction of fastest 5-second wind
- WSF2: Fastest 2-minute wind speed
- WSF5: Fastest 5-second wind speed
- PGTM: Peak gust time
- FMTM: Time of fastest mile or fastest 1-minute wind

Weather Type (binary indicators of specific weather events)
- WT01: Fog, ice fog, or freezing fog (may include heavy fog)
- WT02: Heavy fog or heavy freezing fog (not always distinguished from fog)
- WT03: Thunder
- WT04: Ice pellets, sleet, snow pellets, or small hail
- WT05: Hail (may include small hail)
- WT06: Glaze or rime
- WT07: Dust, volcanic ash, blowing dust, blowing sand, or blowing obstruction
- WT08: Smoke or haze
- WT09: Blowing or drifting snow
- WT10: Tornado, waterspout, or funnel cloud
- WT11: High or damaging winds
- WT13: Mist
- WT14: Drizzle
- WT15: Freezing drizzle
- WT16: Rain (may include freezing rain, drizzle, and freezing drizzle)
- WT17: Freezing rain
- WT18: Snow, snow pellets, snow grains, or ice crystals
- WT19: Unknown source of precipitation
- WT21: Ground fog
- WT22: Ice fog or freezing fog

After selecting data, the CDO tool sends an email with the requested data to download. I have done this and placed the downloaded file in this GitHub repository so that we can easily load it with pandas.
cdo_df = pd.read_csv('data/1572799.csv')
cdo_df.head()
And taking note of the size of this data frame:
cdo_df.shape
To be able to match this weather data with our wildfire data, we will look at the weather from the closest station to each fire observation.
We will first need to map each station to its location.
cdo_loc_df = cdo_df.drop_duplicates(subset=['STATION'])[['STATION', 'LATITUDE', 'LONGITUDE']].set_index('STATION', drop=True)
cdo_loc_df.head()
Using the haversine formula from above, we can write a function that returns the ID of the closest weather station and the distance (in kilometers), given the coordinates of a fire:
def get_closest_station(fire_lon, fire_lat):
station_distance_series = cdo_loc_df.apply(lambda row: haversine(row['LONGITUDE'], row['LATITUDE'], fire_lon, fire_lat), axis='columns')
return (station_distance_series.idxmin(), station_distance_series.min())
Now we can apply this function along the entire wildfire dataframe, creating two new columns:
df['closest_station'], df['closest_station_dist'] = zip(*df.apply(lambda row: get_closest_station(row['lon'], row['lat']), axis='columns'))
df.head()
The mean distance from fires to the closest weather station is approximately 79 kilometers, or 49 miles:
df['closest_station_dist'].mean()
Turning our attention to the weather dataframe, we will convert its date values into a dayofyear column so that we can easily cross-reference from the wildfire dataset to determine the weather on the day of a fire. We will need to import Python's datetime module to help with date formatting.
import datetime
def cdo_date_to_dayofyear(cdo_date):
date = datetime.datetime.fromisoformat(cdo_date).date()
    return int(date.strftime('%j'))  # '%j' is the zero-padded day of the year; int() strips the padding
def cdo_date_to_year(cdo_date):
date = datetime.datetime.fromisoformat(cdo_date).date()
return int(date.strftime('%Y'))
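As a quick check of these helper functions, using a hypothetical example date:
cdo_date_to_dayofyear('2010-02-01')  # 32
cdo_date_to_year('2010-02-01')  # 2010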
Now we can apply these conversion functions to all of the date values in the CDO weather dataset:
cdo_df['dayofyear'] = cdo_df['DATE'].apply(cdo_date_to_dayofyear)
cdo_df['year'] = cdo_df['DATE'].apply(cdo_date_to_year)
cdo_df['dayofdataset'] = cdo_df.apply(lambda row: (row['year'] - min_year) * 365 + row['dayofyear'], axis='columns')
Next, we can compute periods of drought.
We can perform a similar process as before when computing groups of dates for the wildfires, assigning unique indices to groups of consecutive observations (from the same weather station) with zero values for the PRCP precipitation column.
cdo_df['drought_group_index'] = np.nan
curr_drought_index = 1
# For each station, compute periods of drought separately
for station_name, station_df in cdo_df.groupby(['STATION']):
# Assign a drought group index to each observation
for index, row in station_df.iterrows():
if pd.isna(row['PRCP']) or row['PRCP'] == 0.0:
cdo_df.at[index, 'drought_group_index'] = curr_drought_index
else:
# Change the index value if precipitation occurs to signify an end to the current drought
# but do not assign a drought index to this value, keep as NaN
curr_drought_index += 1
cdo_df.head()
We can create a new dataframe containing the start day, end day, and length of each drought period:
drought_columns = ['STATION', 'drought_group_index', 'dayofdataset_start', 'dayofdataset_end', 'length']
drought_rows = []  # collect one dict per drought period; build the dataframe after the loop
# For each drought group, append a single drought observation to the drought dataframe
for drought_group_index, drought_group_df in cdo_df.groupby(['drought_group_index']):
if pd.notnull(drought_group_index):
# Compute aggregate drought data
drought_group_df_index_vals = list(drought_group_df.index.values)
drought_group_start_day = drought_group_df.loc[drought_group_df_index_vals[0]]['dayofdataset']
drought_group_end_day = drought_group_df.loc[drought_group_df_index_vals[-1]]['dayofdataset']
drought_group_length = (drought_group_end_day - drought_group_start_day)
drought_group_station = drought_group_df.loc[drought_group_df_index_vals[0]]['STATION']
if drought_group_length > 0:
# Append the row representing one drought period
            drought_rows.append({
'STATION': drought_group_station,
'drought_group_index': drought_group_index,
'dayofdataset_start': drought_group_start_day,
'dayofdataset_end': drought_group_end_day,
'length': drought_group_length
            })
drought_df = pd.DataFrame(drought_rows, columns=drought_columns)
drought_df.head()
Finally we are ready to merge our drought data with our fire data. We will add a new column to indicate how long of a drought (if any) preceded the fire.
def get_drought_length(row):
fire_drought_df = drought_df.loc[
(drought_df['STATION'] == row['closest_station']) &
(drought_df['dayofdataset_start'] <= row['dayofdataset']) &
(drought_df['dayofdataset_end'] >= row['dayofdataset'])
]
if fire_drought_df.shape[0] == 1:
return (row['dayofdataset'] - fire_drought_df.reset_index(drop=True).loc[0]['dayofdataset_start'])
else:
return 0
df['drought_length'] = df.apply(get_drought_length, axis='columns')
df.head()
Next we can merge wind, sun, and weather type data with each fire observation. Perhaps fires are stronger or last longer during periods of high wind speed. Or fires may be more likely to occur on days with thunder (since thunder implies lightning).
def get_weather_during_fire(row):
weather_during_fire_df = cdo_df.loc[
(cdo_df['STATION'] == row['closest_station']) &
(cdo_df['dayofdataset'] == row['dayofdataset'])
]
weather_dict = {
# Temp
'TAVG': np.nan,
'TMIN': np.nan,
'TMAX': np.nan,
# Sun
'PSUN': np.nan,
'TSUN': np.nan,
# Wind
'AWND': np.nan,
'WSF2': np.nan,
'WSF5': np.nan,
# Weather type
'WT03': np.nan,
'WT07': np.nan,
'WT08': np.nan,
'WT11': np.nan,
'WT16': np.nan
}
if weather_during_fire_df.shape[0] == 1:
weather_row = weather_during_fire_df.reset_index(drop=True).loc[0]
for weather_dict_key in weather_dict.keys():
weather_dict[weather_dict_key] = weather_row[weather_dict_key]
return pd.Series(list(weather_dict.values()), index=list(weather_dict.keys()))
matched_weather_df = df.apply(get_weather_during_fire, axis='columns')
for matched_weather_column in list(matched_weather_df.columns.values):
df[matched_weather_column] = matched_weather_df[matched_weather_column]
df.head()
Using the "fire groups" we can create a new dataset containing one observation for each group:
grouped_df_columns = [
'fire_group_index',
'fire_group_size',
'day_start',
'day_end',
'frp_mean',
'frp_sum',
'lat_min',
'lat_max',
'lon_min',
'lon_max',
'month_start',
'month_end',
'year_start',
'year_end',
'dayofyear_start',
'dayofyear_end',
'dayofdataset_start',
'dayofdataset_end',
'x_min',
'x_max',
'y_min',
'y_max',
'vpd_mean',
'temp_mean',
'humidity_mean',
'drought_length',
# Temp
'TAVG_mean',
'TMIN_min',
'TMAX_max',
# Sun
'PSUN_mean',
'TSUN_mean',
# Wind
'AWND_mean',
'AWND_min',
'AWND_max',
'WSF2_mean',
'WSF2_min',
'WSF2_max',
'WSF5_mean',
'WSF5_min',
'WSF5_max',
# Weather type
'WT03_sum',
'WT07_sum',
'WT08_sum',
'WT11_sum',
'WT16_sum'
]
grouped_rows = []  # collect one dict per "fire group"; build the dataframe after the loop
# Append a single row for each "fire group"
for fire_group_index, fire_group_df in df.groupby(['fire_group_index']):
fire_group_df = fire_group_df.reset_index(drop=True)
start_row = fire_group_df.loc[0]
end_row = fire_group_df.loc[fire_group_df.shape[0]-1]
    grouped_rows.append({
'fire_group_index': fire_group_index,
'fire_group_size': fire_group_df.shape[0],
'day_start': start_row['day'],
'day_end': end_row['day'],
'frp_mean': fire_group_df['frp'].mean(),
'frp_sum': fire_group_df['frp'].sum(),
'lat_min': fire_group_df['lat'].min(),
'lat_max': fire_group_df['lat'].max(),
'lon_min': fire_group_df['lon'].min(),
'lon_max': fire_group_df['lon'].max(),
'month_start': start_row['month'],
'month_end': end_row['month'],
'year_start': start_row['year'],
'year_end': end_row['year'],
'dayofyear_start': start_row['dayofyear'],
'dayofyear_end': end_row['dayofyear'],
'dayofdataset_start': start_row['dayofdataset'],
'dayofdataset_end': end_row['dayofdataset'],
'x_min': fire_group_df['x'].min(),
'x_max': fire_group_df['x'].max(),
'y_min': fire_group_df['y'].min(),
'y_max': fire_group_df['y'].max(),
'vpd_mean': fire_group_df['vpd'].mean(),
'temp_mean': fire_group_df['temp'].mean(),
'humidity_mean': fire_group_df['humidity'].mean(),
'drought_length': start_row['drought_length'],
# Temp
'TAVG_mean': fire_group_df['TAVG'].mean(),
'TMIN_min': fire_group_df['TMIN'].min(),
'TMAX_max': fire_group_df['TMAX'].max(),
# Sun
'PSUN_mean': fire_group_df['PSUN'].mean(),
'TSUN_mean': fire_group_df['TSUN'].mean(),
# Wind
'AWND_mean': fire_group_df['AWND'].mean(),
'AWND_min': fire_group_df['AWND'].min(),
'AWND_max': fire_group_df['AWND'].max(),
'WSF2_mean': fire_group_df['WSF2'].mean(),
'WSF2_min': fire_group_df['WSF2'].min(),
'WSF2_max': fire_group_df['WSF2'].max(),
'WSF5_mean': fire_group_df['WSF5'].mean(),
'WSF5_min': fire_group_df['WSF5'].min(),
'WSF5_max': fire_group_df['WSF5'].max(),
# Weather type
'WT03_sum': fire_group_df['WT03'].sum(),
'WT07_sum': fire_group_df['WT07'].sum(),
'WT08_sum': fire_group_df['WT08'].sum(),
'WT11_sum': fire_group_df['WT11'].sum(),
'WT16_sum': fire_group_df['WT16'].sum()
    })
grouped_df = pd.DataFrame(grouped_rows, columns=grouped_df_columns)
grouped_df.head()
We can inspect the results of our grouping procedure by looking at the sizes of the groups: the maximum group size, and how many groups consist of only one observation versus how many consist of multiple observations:
# Maximum group size
grouped_df['fire_group_size'].max()
# Number of groups with one observation
grouped_df.loc[grouped_df['fire_group_size'] == 1].shape[0]
# Number of groups with multiple observations
grouped_df.loc[grouped_df['fire_group_size'] > 1].shape[0]
To begin our exploration, we can look at the distributions of some of our variables.
To visualize the data, we can load helpful Python visualization packages:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib import cm
from matplotlib.patches import Patch
from matplotlib.lines import Line2D
import seaborn as sns
# matplotlib notebook configuration options
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
First we will look at the distribution of the Fire Radiative Power variable. This will give us a sense of how strong the fires were between 2010 and 2016.
_ = plt.title("Distribution of Mean Fire Radiative Power")
_ = plt.xlabel("Mean Fire Radiative Power")
_ = plt.ylabel("Count")
_ = plt.hist(grouped_df['frp_mean'].values)
We can tell that the overwhelming majority of our data points come from fires with low frp values, but there is clearly some data that is not easily visible on this plot. Perhaps log-scaling the y-axis will make it more informative.
_ = plt.title("Distribution of Mean Fire Radiative Power Mean")
_ = plt.xlabel("Mean Fire Radiative Power")
_ = plt.ylabel("log(Count)")
_ = plt.hist(grouped_df['frp_mean'].values, log=True)
This log-scaled plot allows us to see the distribution of fires with low frp
values more easily.
Looking at the distribution of the summed frp values for each group shows a similar trend:
_ = plt.title("Distribution of Sum Fire Radiative Power")
_ = plt.xlabel("Sum Fire Radiative Power")
_ = plt.ylabel("Count")
_ = plt.hist(grouped_df['frp_sum'].values)
_ = plt.title("Distribution of Sum Fire Radiative Power Mean")
_ = plt.xlabel("Sum Fire Radiative Power")
_ = plt.ylabel("log(Count)")
_ = plt.hist(grouped_df['frp_sum'].values, log=True)
We can create these same histograms for the vpd, temp, and humidity variables:
_ = plt.title("Distribution of Mean Vapor Pressure Deficit")
_ = plt.xlabel("Mean Vapor Pressure Deficit")
_ = plt.ylabel("Count")
_ = plt.hist(grouped_df['vpd_mean'].values)
This plot tells us that most of our vpd values are between 0 and 1, and the distribution is right-skewed.
_ = plt.title("Distribution of Temperature")
_ = plt.xlabel("Temperature (degrees Kelvin)")
_ = plt.ylabel("Count")
_ = plt.hist(grouped_df['temp_mean'].values)
This plot shows that most of our temperature values are between 280 and 300 kelvin, with a drop-off after 300 and a left skew.
_ = plt.title("Distribution of Humidity")
_ = plt.xlabel("Humidity (percent)")
_ = plt.ylabel("Count")
_ = plt.hist(grouped_df['humidity_mean'].values)
This distribution of humidity values is centered near 50 percent, and fairly symmetric.
Next, we can look at the distribution over time using the year variable. To create this histogram we will want to bin by year rather than allowing matplotlib to choose the bins automatically, so we can use pandas groupby to get the counts per year explicitly:
year_df = grouped_df.groupby(['year_start']).size().reset_index(name='count')
year_df
Note that now that we have the explicit counts, we can use matplotlib's bar rather than hist.
_ = plt.title("Distribution of Year")
_ = plt.xlabel("Year")
_ = plt.ylabel("Count")
_ = plt.bar(year_df['year_start'].values, year_df['count'].values, 0.65)
Interestingly, we see only minor differences in the number of fires per year, with a slight drop in 2016.
Next, we can look at the distribution over the year using the dayofyear variable.
_ = plt.title("Distribution of Day of Year")
_ = plt.xlabel("Day of Year")
_ = plt.ylabel("Count")
_ = plt.hist(grouped_df['dayofyear_start'].values)
It looks like the majority of fires occur in the late summer or fall.
Maybe the distribution will be clearer if we view this data by month rather than by day.
Python's datetime can be used to format our month numbers into month names, by creating a date object with the month number and then specifying an output date string format that consists only of the month name.
def month_num_to_name(month_num):
date = datetime.date(2018, int(month_num), 1)
return date.strftime('%B')
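For example:
month_num_to_name(8)  # 'August'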
We will obtain counts of fires per month using pandas groupby.
month_df = grouped_df.groupby(['month_start']).size().reset_index(name='count')
month_df['month_start'] = month_df['month_start'].apply(month_num_to_name)
Once again we can create a plot using bar and the explicit count values.
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Distribution of Date")
_ = plt.xlabel("Month of Year")
_ = plt.ylabel("Count")
bar_width = 0.65
_ = plt.bar(month_df['month_start'].values, month_df['count'].values, bar_width)
This plot confirms what we saw above, and specifically that the months of August, September, and October see the most fires.
Finally we can look at the entire span of time from 2010 to 2016 and count the fires observed each month.
year_month_df = grouped_df.groupby(['year_start', 'month_start']).size().reset_index(name='count')
year_month_df['year-month'] = year_month_df.apply(lambda row: ("%s %i" % (month_num_to_name(row['month_start']), int(row['year_start']))), axis='columns')
year_month_df.head()
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Distribution of Year and Month")
_ = plt.xlabel("Year and Month")
_ = plt.ylabel("Count")
_ = plt.bar(year_month_df['year-month'].values, year_month_df['count'].values, 0.65)
_ = plt.xticks(year_month_df['year-month'].values[::3], rotation=70)
It looks like we see a similar pattern each year. 2012 deviates from this pattern slightly, with less of an increase in fires from the summer to the fall.
Because this wildfire data also contains location data, we can view it on a map as well. We will begin our multivariate analysis by looking at the location data in relationship to other variables.
To do so, we will load Python packages to assist with geographical data visualization:
import cartopy.crs as ccrs
from cartopy.io.img_tiles import Stamen
imagery = Stamen(style='terrain-background')
To set the extent (ranges) of our maps, we can compute the range of the latitude and longitude variables in the wildfire dataset.
min_lon = df['lon'].min()
max_lon = df['lon'].max()
min_lon -= (max_lon - min_lon) * 0.2
min_lat = df['lat'].min()
max_lat = df['lat'].max()
For our visualization, we will plot fires from different years using different colors. To do so, matplotlib's colormap module can be used to map each year to a color of the rainbow.
We will make this convenient for ourselves later by creating a function that generalizes this mapping to any sorted array.
def get_cmap_dict(vals, cmap_name='rainbow', numeric=True):
cmap = cm.get_cmap(cmap_name)
if numeric:
# assume vals already sorted
min_val = vals[0]
max_val = vals[-1]
return dict(zip(vals, [cmap((val - min_val) / (max_val - min_val)) for val in vals]))
else:
len_vals = len(vals)
return dict(zip(vals, [cmap(val / len_vals) for val in range(len_vals)]))
years = list(df['year'].unique())
year_to_color_map = get_cmap_dict(years)
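Each entry of the resulting dict maps a year to an RGBA color tuple:
year_to_color_map[years[0]]  # RGBA tuple assigned to the first year in the list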
Because the plot could become cluttered and difficult to interpret, we will first restrict the fires to those with the top strength (frp) for each year.
n_per_year = 30
df_top_frp = df.sort_values(['frp'], ascending=False)
df_top_frp_year_groupby = df_top_frp.groupby(['year'])
Next we will need to determine the minimum and maximum strength values, which will help us to make a legend.
min_frp = df_top_frp_year_groupby.min()['frp'].min()
max_frp = df_top_frp_year_groupby.max()['frp'].max()
frp_intervals = np.linspace(max_frp, min_frp, 4, endpoint=False)
Now we will create the plot.
grouped_df_top_frp = grouped_df.sort_values(['frp_sum'], ascending=False)
grouped_df_top_frp_year_groupby = grouped_df_top_frp.groupby(['year_start'])
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(1, 1, 1, projection=imagery.crs)
_ = ax.set_extent([min_lon, max_lon, min_lat, max_lat], ccrs.PlateCarree())
#_ = ax.add_image(imagery, 8)
frp_scaling_factor = 0.003
for year, df_top_frp_year in grouped_df_top_frp_year_groupby:
for group_index, group_row in df_top_frp_year.head(n_per_year).iterrows():
for fire_index, fire_row in df.loc[df['fire_group_index'] == group_row['fire_group_index']].sort_values(['frp'], ascending=False).head(100).iterrows():
_ = plt.plot(
fire_row['lon'],
fire_row['lat'],
marker='o',
color=year_to_color_map[fire_row['year']],
markersize=(frp_scaling_factor*fire_row['frp']),
alpha=0.5,
transform=ccrs.Geodetic()
)
_ = ax.set_title("West Coast Wildfires 2010-2016, Strongest %i per Year" % n_per_year)
year_legend_elements = [Patch(facecolor=year_color, label=year) for year, year_color in year_to_color_map.items()]
year_legend = ax.legend(handles=year_legend_elements, loc='upper left', title='Year')
_ = plt.gca().add_artist(year_legend)
frp_legend_elements = [Line2D([0], [0], marker='o', color=(0,0,0,0), label=np.floor(frp_val), markerfacecolor='#C0C0C0', markersize=frp_val*frp_scaling_factor) for frp_val in frp_intervals]
_ = ax.legend(handles=frp_legend_elements, loc='lower left', labelspacing=2, columnspacing=2, title='Fire Radiative Power')
And the same plot as above, with the add_image line uncommented to show the terrain:
grouped_df_top_frp = grouped_df.sort_values(['frp_sum'], ascending=False)
grouped_df_top_frp_year_groupby = grouped_df_top_frp.groupby(['year_start'])
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(1, 1, 1, projection=imagery.crs)
_ = ax.set_extent([min_lon, max_lon, min_lat, max_lat], ccrs.PlateCarree())
_ = ax.add_image(imagery, 8)
frp_scaling_factor = 0.003
for year, df_top_frp_year in grouped_df_top_frp_year_groupby:
for group_index, group_row in df_top_frp_year.head(n_per_year).iterrows():
for fire_index, fire_row in df.loc[df['fire_group_index'] == group_row['fire_group_index']].sort_values(['frp'], ascending=False).head(100).iterrows():
_ = plt.plot(
fire_row['lon'],
fire_row['lat'],
marker='o',
color=year_to_color_map[fire_row['year']],
markersize=(frp_scaling_factor*fire_row['frp']),
alpha=0.5,
transform=ccrs.Geodetic()
)
_ = ax.set_title("West Coast Wildfires 2010-2016, Strongest %i per Year" % n_per_year)
year_legend_elements = [Patch(facecolor=year_color, label=year) for year, year_color in year_to_color_map.items()]
year_legend = ax.legend(handles=year_legend_elements, loc='upper left', title='Year')
_ = plt.gca().add_artist(year_legend)
frp_legend_elements = [Line2D([0], [0], marker='o', color=(0,0,0,0), label=np.floor(frp_val), markerfacecolor='#C0C0C0', markersize=frp_val*frp_scaling_factor) for frp_val in frp_intervals]
_ = ax.legend(handles=frp_legend_elements, loc='lower left', labelspacing=2, columnspacing=2, title='Fire Radiative Power')
Next we can perform multivariate analysis of the rest of our data and ask ourselves whether we see relationships between any of our variables.
In the following plot we will look at temperature over the course of the year, coloring each point based on its year.
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Temperature by Day of Year")
_ = sns.scatterplot(x="dayofyear_start", y="temp_mean", hue="year_start", data=grouped_df, linewidth=0)
_ = plt.xlabel("Day of Year")
_ = plt.ylabel("Temperature (kelvin)")
year_month_mean_df = grouped_df.groupby(['year_start', 'month_start']).mean().reset_index()
month_tick_formatter = matplotlib.ticker.FuncFormatter(lambda x, p: month_num_to_name(int(x)) if x >= 1 and x <= 12 else "")
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Mean Temperature by Month")
ax = sns.lineplot(x="month_start", y="temp_mean", hue="year_start", data=year_month_mean_df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Mean Temperature (kelvin)")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)
Because we have many more variable pairs to evaluate, we can start to look at correlations.
corr_df = grouped_df.corr()
_ = plt.figure(figsize=(20,20))
_ = sns.heatmap(corr_df, annot=True, vmin=-1, vmax=1)
_ = plt.title("Correlation Matrix")
From this matrix, we can tell that many of our variables do not have a strong linear relationship.
However, there are a few correlations to note, which we should inspect in more detail:
- WT03_sum (thunder) and fire_group_size
- WT03_sum (thunder) and frp_sum

As well as some expected correlations:

- frp_sum and fire_group_size: a larger "fire group" is comprised of more fire observations and thus more frp values
- x and lon: the values for x are just converted lon values
- y and lat: the values for y are just converted lat values
- WT08_sum (smoke) and fire_group_size: larger fires generate more smoke
- WT08_sum (smoke) and frp_sum: larger fires generate more smoke
- TAVG and lat: climate changes based on distance from the equator
- TAVG and temp: both measure temperature, but one comes from the weather data and the other comes from the fire data
- vpd_mean and temp_mean/humidity_mean: vapor pressure deficit is a function of temperature and humidity (see the sketch below)

We can also tell from this heatmap that all of our sun weather observations (PSUN, TSUN) are NaN.
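As an aside, vapor pressure deficit can be approximated from temperature and relative humidity. Below is a minimal sketch using the Tetens approximation for saturation vapor pressure; the exact formula behind this dataset's vpd column is not documented here, so treat this as illustrative only.
import numpy as np
def approx_vpd(temp_kelvin, humidity_percent):
    # Saturation vapor pressure (kPa) via the Tetens approximation
    temp_c = temp_kelvin - 273.15
    svp = 0.6108 * np.exp(17.27 * temp_c / (temp_c + 237.3))
    # Deficit between the moisture the air could hold and the moisture it does hold
    return svp * (1 - humidity_percent / 100)
approx_vpd(295.0, 50.0)  # roughly 1.3 kPa at about 22 degrees Celsius and 50% humidity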
Now we will look at some relationships of interest. We can use seaborn's regplot function to easily look at scatterplots containing a line of best fit and corresponding confidence interval:
fig, axes = plt.subplots(4, 1, figsize=(6, 20))
_ = sns.regplot(x="WT03_sum", y="frp_sum", data=grouped_df, ax=axes[0])
_ = axes[0].set_xlabel("Thunder")
_ = axes[0].set_ylabel("Sum Fire Radiative Power")
_ = axes[0].set_title("Sum Fire Radiative Power vs Thunder")
_ = sns.regplot(x="WT16_sum", y="frp_sum", data=grouped_df, ax=axes[1])
_ = axes[1].set_xlabel("Rain")
_ = axes[1].set_ylabel("Sum Fire Radiative Power")
_ = axes[1].set_title("Sum Fire Radiative Power vs Rain")
_ = sns.regplot(x="WT11_sum", y="frp_sum", data=grouped_df, ax=axes[2])
_ = axes[2].set_xlabel("High Winds")
_ = axes[2].set_ylabel("Sum Fire Radiative Power")
_ = axes[2].set_title("Sum Fire Radiative Power vs High Winds")
_ = sns.regplot(x="WT07_sum", y="frp_sum", data=grouped_df, ax=axes[3])
_ = axes[3].set_xlabel("Dust")
_ = axes[3].set_ylabel("Sum Fire Radiative Power")
_ = axes[3].set_title("Sum Fire Radiative Power vs Dust")
These plots show that fires with higher summed fire radiative power values coincide with more instances of thunder. The relationships with rain and high winds are slightly weaker, and there is no apparent relationship with dust.
And as expected, a very strong relationship with smoke:
_ = sns.lmplot(x="WT08_sum", y="frp_sum", data=grouped_df)
_ = plt.xlabel("Smoke")
_ = plt.ylabel("Sum Fire Radiative Power")
_ = plt.title("Sum Fire Radiative Power vs Smoke")
And a closer look at some of the additional pairs of variables which do not have a clear linear relationship:
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Fire Radiative Power by Humidity")
_ = sns.scatterplot(x="humidity_mean", y="frp_mean", data=grouped_df)
_ = plt.xlabel("Humidity")
_ = plt.ylabel("Fire Radiative Power")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Fire Radiative Power by Vapor Pressure Deficit")
_ = sns.scatterplot(x="vpd_mean", y="frp_mean", data=grouped_df)
_ = plt.xlabel("Vapor Pressure Deficit")
_ = plt.ylabel("Fire Radiative Power")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Fire Radiative Power by Month")
ax = sns.scatterplot(x="month_start", y="frp_mean", data=grouped_df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Fire Radiative Power")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Humidity by Month")
ax = sns.scatterplot(x="month_start", y="humidity_mean", data=grouped_df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Humidity")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Fire Radiative Power by Preceding Drought Length")
_ = sns.scatterplot(x="drought_length", y="frp_mean", data=grouped_df)
_ = plt.xlabel("Preceding Drought Length")
_ = plt.ylabel("Fire Radiative Power")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Fire Group Size by Preceding Drought Length")
_ = sns.scatterplot(x="drought_length", y="fire_group_size", data=grouped_df)
_ = plt.xlabel("Preceding Drought Length")
_ = plt.ylabel("Fire Group Size")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Fire Group Size by Average Wind Speed")
_ = sns.scatterplot(x="AWND_mean", y="fire_group_size", data=grouped_df)
_ = plt.xlabel("Average Wind Speed")
_ = plt.ylabel("Fire Group Size")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Fire Group Size by Max Wind Speed")
_ = sns.scatterplot(x="AWND_max", y="fire_group_size", data=grouped_df)
_ = plt.xlabel("Max Wind Speed")
_ = plt.ylabel("Fire Group Size")
To perform hypothesis testing, we can load the statsmodels package, which contains functions for statistical analysis including t-tests, linear regression, ANOVA, and much more.
import statsmodels.api as sm
First we might want to test the hypothesis that a longer period of drought preceding a fire results in a more powerful fire.
To perform this test, we can set our null and alternative hypotheses. The null hypothesis will be that the coefficient of the linear model is not different from zero. $$ \text{H}_0 : \beta_1 = 0 \\ \text{H}_a : \beta_1 \neq 0 $$
X = grouped_df[['drought_length']].values
X = sm.add_constant(X)
y = grouped_df['frp_mean'].values
model = sm.OLS(y, X)
results = model.fit()
results.summary()
At significance level $\alpha = 0.05$, we cannot reject our null hypothesis, so we do not have evidence that the length of drought preceding a fire has a linear relationship with the strength of the fire. In addition, we get an extremely low R-squared value, so low that it rounds to zero, meaning that this model explains approximately none of the variation in our data.
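If we want these quantities programmatically rather than reading them off the summary table, the fitted results object exposes them as attributes:
results.pvalues[1]  # p-value for the drought_length coefficient
results.rsquared  # R-squared of the fitted model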
Next, we can test whether wind speed has any relationship to the size of the group of fire observations. Perhaps with higher wind speed a fire is able to grow larger.
We again need to set our null and alternative hypotheses. This will be the same as the previous model, in that we are testing whether we have evidence that the coefficient of the linear model is different from zero. $$ \text{H}_0 : \beta_1 = 0 \\ \text{H}_a : \beta_1 \neq 0 $$
grouped_df_filtered = grouped_df.dropna(subset=['AWND_max','fire_group_size'])
X = grouped_df_filtered[['AWND_max']].values
X = sm.add_constant(X)
y = grouped_df_filtered['fire_group_size'].values
model = sm.OLS(y, X)
results = model.fit()
results.summary()
At significance level $\alpha = 0.05$, we can reject our null hypothesis. This means that we have significant evidence that there is a linear relationship between maximum wind speed and the size of the fire group. However, our R-squared value is once again very low, so the model does not explain our data very well.
Finally, we will test for a linear relationship between humidity and fire radiative power. Our null hypothesis is once again that the coefficient of the regression will be equal to zero.
grouped_df_filtered = grouped_df.dropna(subset=['humidity_mean','frp_mean'])
X = grouped_df_filtered[['humidity_mean']].values
X = sm.add_constant(X)
y = grouped_df_filtered['frp_mean'].values
model = sm.OLS(y, X)
results = model.fit()
results.summary()
At significance level $\alpha = 0.05$, we can reject our null hypothesis, so we have evidence that fire radiative power does vary with humidity. But once again the R-squared value is very low.
Although our linear regression results have been poor, perhaps with the incorporation of additional features, a classifier will be able to distinguish between classes of fires or a regressor will be able to predict characteristics of fires.
We do not really have any categorical variables in our dataset to be able to perform a classification task, but we can stretch a little bit and hypothesize that perhaps we can predict the year of a fire from its observed characteristics, treating year as categorical. We might want to treat the year as categorical given that different years may have different climates.
To perform classification, we will first need to clean up our input array and remove rows with NaN feature values.
feature_columns = [
'fire_group_size',
'frp_mean',
'frp_sum',
'dayofyear_start',
'dayofyear_end',
'vpd_mean',
'temp_mean',
'humidity_mean',
'drought_length'
]
response_column = 'year_start'
classification_columns = feature_columns + [response_column]
classification_df = grouped_df[classification_columns].loc[pd.notnull(grouped_df[feature_columns]).all(axis=1)]
num_rows = grouped_df.shape[0]
num_na = grouped_df[feature_columns].loc[pd.isnull(grouped_df[feature_columns]).any(axis=1)].shape[0]
num_not_na = classification_df.shape[0]
print("Number of Rows with NaN:\t%i" % num_na)
print("Number of Rows without NaN:\t%i" % num_not_na)
print("Total Number of Rows:\t\t%i" % num_rows)
To treat the year as categorical, we will one-hot encode the year variable using the get_dummies function of pandas.
classification_df[response_column] = classification_df[response_column].astype(int)
y = pd.get_dummies(classification_df[response_column]).values
y_for_kfold = classification_df[response_column].values  # integer labels; scikit-learn's random forest accepts these directly, so this is the version we will fit on
Now that we have our y response array, we will create our X array of features.
X = classification_df[feature_columns].values
Now we can import a classifier and associated utility functions from scikit-learn. The first utility to import is a helpful model selection function called StratifiedKFold. This will allow us to split our data into folds while taking into account the proportion of data from each year in the dataset.
from sklearn.model_selection import StratifiedKFold
Next we will import a random forest classifier.
from sklearn.ensemble import RandomForestClassifier
For each of 10 folds, we will perform classification and then evaluate our classifier on the test set. We will append each score to an array.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_index, test_index in skf.split(X, y_for_kfold):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y_for_kfold[train_index], y_for_kfold[test_index]
clf = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=0)
clf.fit(X_train, y_train)
scores.append(clf.score(X_test, y_test))
scores
We can now plot the distribution of scores:
_ = sns.violinplot(scores, orient="v")
_ = plt.ylim(0.0, 1.0)
_ = plt.title("Random Forest Score Distribution")
_ = plt.ylabel("Score")
Unfortunately, the random forest was not able to accurately predict the year of a fire from the features we provided. As a future project you could perform parameter tuning or try a different classifier to see whether these scores improve.
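As a starting point for such tuning, here is a minimal sketch using scikit-learn's GridSearchCV; the grid values are arbitrary starting points, not recommendations:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 5, 10]
}
# Cross-validated search over the parameter grid
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y_for_kfold)
search.best_params_, search.best_score_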
Here, we will try to perform regression using multiple features.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
We will use a random forest regressor to try to predict the Fire Radiative Power from our other variables.
We will again need to clean up our input array and remove rows with NaN feature values.
feature_columns = [
'fire_group_size',
'dayofyear_start',
'dayofyear_end',
'vpd_mean',
'temp_mean',
'humidity_mean',
'drought_length',
'AWND_mean',
'AWND_min',
'AWND_max',
'WSF2_mean',
'WSF2_min',
'WSF2_max',
'WSF5_mean',
'WSF5_min',
'WSF5_max',
'WT03_sum',
'WT07_sum',
'WT08_sum',
'WT11_sum',
'WT16_sum'
]
response_column = 'frp_mean'
regression_columns = feature_columns + [response_column]
regression_df = grouped_df[regression_columns].loc[pd.notnull(grouped_df[feature_columns]).all(axis=1)]
num_rows = grouped_df.shape[0]
num_na = grouped_df[feature_columns].loc[pd.isnull(grouped_df[feature_columns]).any(axis=1)].shape[0]
num_not_na = regression_df.shape[0]
print("Number of Rows with NaN:\t%i" % num_na)
print("Number of Rows without NaN:\t%i" % num_not_na)
print("Total Number of Rows:\t\t%i" % num_rows)
We will create our X and y arrays.
X = regression_df[feature_columns].values
y = regression_df[response_column].values
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # shuffle so that random_state has an effect
scores = []
for train_index, test_index in kf.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
regr = RandomForestRegressor(max_depth=10, random_state=0, n_estimators=100)
regr.fit(X_train, y_train)
scores.append(regr.score(X_test, y_test))
scores
_ = sns.violinplot(scores, orient="v")
_ = plt.ylim(0.0, 1.0)
_ = plt.title("Random Forest Score Distribution")
_ = plt.ylabel("Score")
Unfortunately, all of this analysis has not really resulted in many new insights or inferences we can make about wildfires and the relationship to weather. Our initial regression and classification results were very poor. Perhaps if the fire data had spanned a longer period of time we would have been able to see trends more clearly. Or instead there might be inherent limitations in the fact that the wildfire data is derived from satellite imagery. Or the takeaway might simply be that there is no relationship between fires and weather.
But our analysis does support the idea that, at least on the west coast of the United States, wildfires occur both spontaneously and as a result of human activity such as machinery sparks, cigarette butts, or arson.
And regardless of the results, this tutorial has demonstrated the data lifecycle: