CyberGIS Link:
#Importing Modules
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
from matplotlib import pyplot as plt
import seaborn as sb
Downloaded from: https://geo.wa.gov/datasets/df827f607eb347d49a6cca071ce66d5e_11/explore
This data is sourced through Washington's Geospatial Open Data Portal. The shapefile provides an outline of the state and its major water resource inventory areas (i.e. watersheds). More information about how these boundaries were determined can be found here: https://geo.wa.gov/datasets/waecy::water-resource-inventory-areas-wria/about
#The WRIA shapefile is loaded into a geopandas dataframe as 'wria'
wria=gpd.read_file('WRIA.shp')
wria.describe
ax = wria.plot()
Downloaded from: https://apps.ecology.wa.gov/eim/search/SMP/RiverStreamSearch.aspx?MPLocationMPID=4
The Washington State Department of Ecology provides data for river/stream locations throughout the state in csv format. These datasets provide a large variety of information related to river flow, water quality, temperature, turbidity, and more. They also vary in the amount of time that their measurements cover. For the purpose of this project, only stations with long term and continuous runoff data (at least circa 500 flow measurements) spanning a minimum of four decades were considered. The initial selection of stations to include in this analysis was performed manually using the overview site provided here: https://apps.ecology.wa.gov/eim/search/SMP/RiverStreamSingleStationOverview.aspx?LocationUserIds=05B110&ResultType=RiverStreamOverviewList
Flow is provided in cubic feet per second (cfs), which represents the volume of water passing through a section of a stream at a particular time. Because of this, cfs can vary based on where on a river the measurement is taken. Cfs is defined by usgs.gov as "A unit expressing rates of discharge. One cubic foot per second is equal to the discharge of a stream of rectangular cross section, 1 foot wide and 1 foot deep, flowing water an average velocity of 1 foot per second." (https://water.usgs.gov/wsc/glossary.html#Cubicfeetpersecond)
When requesting data from the FIN, all watersheds were selected. However, the download portal limits the amount of data that may be requested at one time. For this reason, data was requested in two separate orders based on Monitoring Region: one for Northwest and Southwest WA, and one for Central and Eastern WA. The request was filtered further by setting the Location Status to "Active" and the Location Setting to "Long-term".
Data from these runoff stations were imported, filtered, and analyzed further within this project. A full overview of the stations that were finally selected to represent each watershed is provided below.
#Reading in the River Stream Results csv for eastern and western WA as pandas dataframes.
east=pd.read_csv('RiverStreamResults_2024Jul27_200373.csv') #Central & East WA
west=pd.read_csv('RiverStreamResults_2024Jul29_275611.csv') #Northwest & Southwest WA
riv=pd.concat([west,east], ignore_index=True) #Concatenating West and East data into one dataframe
#The River Stream Location Details provides more information about each station, including a description of their location.
#River details are read in for East and West, then concatenated.
eastDetails=pd.read_csv('RiverStreamLocationDetails_2024Jul27_24.csv') #Central & East WA
westDetails=pd.read_csv('RiverStreamLocationDetails_2024Jul29_35.csv') #Northwest & Southwest WA
rivDetails=pd.concat([westDetails,eastDetails], ignore_index=True)
rivDetails
#The riv dataset includes a variety of measurements. For this project, we filter out all data except that for river flow.
riv=riv[riv['Result_Parameter_Name']=='Flow']
#Creating a 'stations' dataframe. This drops duplicates from riv to show the Location_ID, Location Name and Lat/Lon of the individual stations.
stations=riv[['Location_ID','Location_Name','Calculated_Latitude_Decimal_Degrees_NAD83HARN','Calculated_Longitude_Decimal_Degrees_NAD83HARN']].drop_duplicates()
stations=stations.sort_values(by='Location_ID')
stations
#Based on a manual audit within the FIN portal, certain stations did not meet criteria for long-term flow data. These are dropped here.
#More details regarding how these were selected can be found in the data description above and in the project report.
to_drop=['03B050','05A090','05A110','05B110','07C070','07D050','09A080','16C090','24F070','27D090','34A170','37A205','48A075','49A070']
stations=stations[~stations['Location_ID'].isin(to_drop)] #~ flips a True statement to False, meaning we keep any rows that are not in the to_drop list.
stations=stations.drop([360,343]) #These stations were missed in the previous drop_duplicates function.
stations
print ('Total number of stations: ' + str(len(stations)))
#The stations dataset was transformed into a geopandas dataframe based on their latitude and longitude.
stations=gpd.GeoDataFrame(stations, geometry=gpd.points_from_xy(stations.Calculated_Longitude_Decimal_Degrees_NAD83HARN,
stations.Calculated_Latitude_Decimal_Degrees_NAD83HARN))
#The sortdata function is defined here. This prepares the data in the riv dataframe for analysis.
def sortdata(loc):
stations=riv[riv['Location_ID']==loc] #The station is defined by Location ID and selected from the riv dataframe
stations=stations[['Field_Collection_Start_Date','Result_Value']]
#Only the start date and measurement result of the measurement are of interest. Almost all measurements start and end on the same day
stations['Field_Collection_Start_Date']=pd.to_datetime(stations['Field_Collection_Start_Date']) #Start date data is transformed to datetime type
stations=stations.sort_values(by='Field_Collection_Start_Date')
stations=stations.set_index('Field_Collection_Start_Date') #Measurements are sorted chronologically and date is set as the index
stations=stations.rename(columns={'Result_Value':str(loc)}) #The Result column is renamed based on the location ID
stations=stations.resample('ME').mean() #Most stations only have one or two monthly values.
#The resample function groups measurements by month to determine the monthly mean. This allows for easier comparison with monthly runoff at other stations
return stations
#This for loop runs each station in the stations list through the sort data function
dict={} #All flow dataframes for each station will be collected within this dictionary
for x in list(stations['Location_ID']):
dict[f's{x}'] = sortdata(x)
#The coordinate reference systems for both geopandas dataframes are set to WGS84 (4326) so they can be plotted together.
stations=stations.set_crs('4326')
wria=wria.to_crs('4326')
map = wria.plot(figsize=(16, 12), label='WRIA_NM')
stations.plot(ax=map, marker='+', color='white',markersize=80)
for x, y, label in zip(wria.geometry.centroid.x, wria.geometry.centroid.y, wria['WRIA_NM']):
map.annotate(label, xy=(x, y), fontsize=8, ha='center') #Plotting annotations for each WRIA
plt.show()
#This plot shows the locations of runoff measurement stations throughout WA with white markers.
The above plot shows the selected stations with sufficient data and their placement within Washington's WRIA. We see that some watersheds do not have any measurements stations, while others have multiple.
Below, the runoff timeseries of several stations are plotted along with other's within the same watershed. This serves two purposes.
plt.figure(figsize=(12,4))
dict['s07A090']['07A090'].plot(label='Snohomish River at Snohomish')
dict['s07D130']['07D130'].plot(label='Snoqualmie River at Snoqualmie')
plt.legend()
plt.ylabel('cfs')
plt.title('Snohomish')
#Selecting the station at Snohomish to represent runoff as Snoqualmie River has lower runoff values and more gaps.
plt.figure(figsize=(12,4))
dict['s23A070']['23A070'].plot(label='at Porter')
dict['s23A160']['23A160'].plot(label='at Dryad')
plt.legend()
plt.ylabel('cfs')
plt.title('Upper Chehalis')
#Selecting the runoff at Porter to represent the Upper Chehalis watershed as it has a longer runoff series than that at Dryad.
plt.figure(figsize=(12,4))
dict['s45A070']['45A070'].plot(label='at Wenatchee')
dict['s45A110']['45A110'].plot(label='near Leavenworth')
plt.legend()
plt.ylabel('cfs')
plt.title('Wenatchee')
#The runoff stations for the Wenatchee River do show impressive visual correlation with each other.
#We can assume that the extreme runoff value at Leavenworth is likely erroneous; therefore, the Wenatchee station may be more reliable.
#For all other WRIAs that had more than one station, these were compared in a similar fashion to select one representative for that WRIA.
#The remaining stations are dropped here
to_drop=['01A050','05B070','07D130','08C110','23A160','34B110','45A110','49A190']
stations=stations[~stations['Location_ID'].isin(to_drop)]
stations
01A120 at North Cedarville
03A060 Skagit River near Mt Vernon
04A100 Skagit River at Marblemount
05A070 Stillaguamish River near Silvana
07A090 Snohomish at Snohomish
08C070 Cedar River at Logan St/Renton
09A190 Green River at Kanaskat
10A070 Puyallup River at Meridian St
11A070 Nisqually River at Nisqually
13A060 Deschutes River at East St Bridge
16A070 Skokomish River near Potlach
18B070 Elwah River near Port Angeles
https://apps.ecology.wa.gov/eim/search/Detail/Detail.aspx?DetailType=Location&SystemStationId=174040
20B070 Hoh River at DNR Campground
22A070 Humptulips River near Humptulips
https://apps.ecology.wa.gov/eim/search/Detail/Detail.aspx?DetailType=Location&SystemStationId=115040
23A070 Chehalis river at Porter
24B090 Willapa River near
26B070 Cowlitz River at Kelso
27B070 Kalama River near Kalama
https://apps.ecology.wa.gov/eim/search/Detail/Detail.aspx?DetailType=Location&SystemStationId=102140
32A070 Walla Walla River near Touchet
34A070 Palouse River at Hooper
35B060 Tuannon River at Powers
https://apps.ecology.wa.gov/eim/search/Detail/Detail.aspx?DetailType=Location&SystemStationId=364640
36A070 Columbia River near Vernita
37A090 Yakima River at Kiona
41A070 Crab Creek near Beverly
45A070 Wenatchee river at Wenatchee
46A070 Entiat River near Entiat
48A140 Methow River at Twisp
49B070 Similkameen River at Oroville
53A070 Columbia River at Grand Coulee
54A120 Spokane River at Riverside State Park
56A070 Hangman Creek at Mouth
57A150 Spokane River at Stateline Bridge
https://apps.ecology.wa.gov/eim/search/Detail/Detail.aspx?DetailType=Location&SystemStationId=222740
60A070 Kettle River near Barstow
61A070 Columbia River at Northport
Pictures are sourced via this search function: https://apps.ecology.wa.gov/eim/search/SMP/RiverStreamSingleStationOverview.aspx?FocusTab=True&ResultType=RiverStreamOverviewList&RiverStreamSearchResults&LocationUserIds=05B110&LocationUserIdSearchType=Equals&LocationUserIDAliasSearchFlag=True
#The sortdata function is run again on all remaining stations for further analysis
dict={}
for x in list(stations['Location_ID']):
dict[f's{x}'] = sortdata(x)
print ('Total number of stations: ' + str(len(stations)))
dict
#In order to run a correlation matrix among all stations in the dictionary, their dataframes need to be concatenated.
runoff=pd.concat(dict, axis=1)
runoff.columns=list(dict.keys()) #This resets their column name to represent the stations
runoff
#Because of the previous groupby month function, the monthly averages are compared for each station.
#NaNs represent months where certain stations did not have any measurements available. These will not be considered for the correlation.
#Running a seaborn correlation matrix on all runoff data
plt.figure(figsize=(15,15))
sb.heatmap(runoff.corr(), annot=False, fmt=".2f", cmap='Blues', vmin=0, vmax=1, square=True,linewidth=1)
The correlation coefficient is a simple statistical measurement that describes the linear relationship between two variables, i.e. the relationship between all runoff measurements at one station with those at another. While it can be used to represent a relationship, correlation alone cannot indicate cause and effect. According to the pandas documentation of DataFrame.corr, the corr function, "compute[s] pairwise correlation of columns, excluding NA/null values." https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
The Seaborn correlation matrix plots the correlation of all stations with all other stations using a heatmap. The diagonal line of 1.0 correlation values are to be expected as they represent the correlation of a station with itself. Since stations are numbered in a semi-spatial order, we can recognize higher correlation with "nearby" stations.
The next steps incorporate this information into our WRIA plot to better visualize this spatial correlation.
#Creating a new geopandas dataframe from wria that only includes those watersheds that have runoff data (and thereby, correlation data) available
wria_corr=wria
nrs=[1,3,4,5,7,8,9,10,11,13,16,18,20,22,23,24,26,27,32,34,35,36,37,41,45,46,48,49,53,54,56,57,60,61]
wria_corr=wria_corr[wria_corr['WRIA_NR'].isin(nrs)]
wria_corr=wria_corr.sort_values(by='WRIA_NR')
#This corrwith function calculates correlation for the station 01A120 with all other stations in the dataframe
corr=runoff.corrwith(runoff['s01A120'])
wria_corr['Correlation']=corr.values
#Plotting the correlation as a cloropleth
fig, ax = plt.subplots(figsize=(10, 6))
wria.plot(ax=ax, color='lightgrey')
wria_corr.plot(wria_corr['Correlation'], ax=ax, cmap='BuPu', legend=True)
stations[stations['Location_ID']=='01A120'].plot(ax=ax, marker='P', color='lightblue',markersize=80, label='01A120 - Nooksack')
stations[stations['Location_ID']=='03A060'].plot(ax=ax, marker='P', color='green',markersize=80, label='03A060 - Skagit')
stations[stations['Location_ID']=='48A140'].plot(ax=ax, marker='P', color='red',markersize=80, label='48A140 - Methow')
plt.legend()
plt.title('Correlation of Monthly Runoff Values with Station 01A120')
#Let's look at a 10 year time series for runoff at the Nooksack and Skagit rivers, which have a high correlation...
plt.figure(figsize=(12,4))
runoff['s01A120'].loc[((runoff.index >= '1995-01-01') & (runoff.index < '2015-01-01'))].plot(label='01A120 - Nooksack', color='blue')
runoff['s03A060'].loc[((runoff.index >= '1995-01-01') & (runoff.index < '2015-01-01'))].plot(label='03A060 - Skagit', color='green')
#vs. the Methow River, which has a low correlation with the Nooksack
runoff['s48A140'].loc[((runoff.index >= '1995-01-01') & (runoff.index < '2015-01-01'))].plot(label='48A140 - Methow', color='red')
plt.legend()
plt.ylabel('cfs')
plt.title('Runoff')
#Defining the stationcorr function, which produces a cloropleth map of correlation for a given station
def stationcorr(s):
corr=runoff.corrwith(runoff[s]) #The corrwith function is run on the given station label,
wria_corr['Correlation']=corr.values #which is used to redefine the 'Correlation' column of the wria geopandas dataframe
fig, ax = plt.subplots(figsize=(10, 6)) #A plot is set up to plot the Correlation values
wria.plot(ax=ax, color='lightgrey') #This plots the original wria with all stations underneath the cloropleth.
#Those WRIAs without runoff data will be grey.
wria_corr.plot(wria_corr['Correlation'], ax=ax, cmap='BuPu', legend=True)
m=s[1:] #This marks the location of the station for correlation
stations[stations['Location_ID']==m].plot(ax=ax, marker='+', color='white',markersize=80)
plt.title('Correlation of Monthly Runoff Values with Station '+m)
return plt
Now, any station name can be entered into the stationcorr() function to create a correlation cloropleth across Washington.
The WRIAs provide a suitable area to visualize the spatial correlation between runoff measurement stations. However, within this project each WRIA is only represented by a single station within it's bounds. Because of the nature of rivers and flow measurements, runoff data and the derived correlation is highly variable and cannot be considered uniform across all WRIAs.
stationcorr('s34A070')
Station 41A070 (Crab Creek near Beverly) is notable in that it shows extremely low and the only negative correlation with other stations across the state. This is seen both in the seaborn heatmap as well as in the cloropleths.
The reason for this lack of correlation is not immediately clear. However, when we plot this runoff with the runoff in the Columbia River, which is one of Washington's major rivers and immediately south of station 41A070, we see that the river flow at Lower Crab is neglible to that of the nearby station. Additionally, Crab Creek serves as an irrigation drain into the Columbia River and only flows intermittently from agricultural runoff, not natural water sources. This is a result of the Columbia Basin Project of the last century, during which the Grand Coulee Dam was built to provide hydroelectric power and irrigation water across central Washington. More information can be found here: https://www.usbr.gov/pn/grandcoulee/cbp/index.html
stationcorr('s41A070')
plt.figure(figsize=(12,4))
runoff['s41A070'].loc[((runoff.index >= '1995-01-01') & (runoff.index < '2015-01-01'))].plot(label='410A70 - Lower Crab Creek', color='blue')
runoff['s36A070'].loc[((runoff.index >= '1995-01-01') & (runoff.index < '2015-01-01'))].plot(label='360A70 - Alkali Squilchuck', color='red')
plt.legend()
plt.ylabel('cfs')
plt.title('Runoff')
#Running this cell will run the stationcorr function for all stations!
#runoff.columns
#for x in runoff.columns:
# stationcorr(x)
Based on this initial analysis of runoff correlation across measurement stations in Washington, we can assume a degree of spatial correlation for river runoff in nearby watersheds.
The initial project proposal also suggested looking at the changes in runoff over time. However, calculation of decadal runoff means for each station, as well as a visual analysis of runoff time series, do not show any noteworthy changes since the 1970's. Further research may consider looking at flooding events or correlation with precipitation to better understand the factors affecting runoff and how these may be changing.
Additionally, the correlation analysis could be further expanded upon by considering certain extremely high or low runoff events, bringing in data from more stations and other variables (i.e. precipitation), and comparing runoff at different stations along the same river.