Procedure for building files requested by NOAA CO-OPS.
Provided by Tim Hunter
Updated February 10, 2021

Step 1: Retrieve raw station data from NCEI.

We use two data sets as input.

A) The Global Historical Climatology Network - Daily (GHCND)
   https://www.ncdc.noaa.gov/ghcn-daily-description

*Revised in February 2021
The old instructions made use of the all-inclusive large tar file ghcnd_all.tar.gz
in order to simplify the process. But I now recommend using the master list option
as described below.

-----------------------------------------------------------------------------------
Method 1 (master list with station-specific downloading):

1) Edit your master list file as needed. This might mean adding a new station that
   you have decided you want to use, deleting some, or updating the excluded years
   because you have seen something in the data timeseries.

2) Use the Python code to extract and build met_*.csv files. Note that the Python
   script will auto-detect the format of the station list. This script will only
   download the specific stations that are listed.

      mkdir work
      cd work
      python ../process_ghcnd_by_stn.py ../master_list.txt

-----------------------------------------------------------------------------------
Method 2 (master list; using the giant tarball file):

1) Download the giant compressed tarball from NCDC named ghcnd_all.tar.gz
   Current location as of 1/12/2021 on the NCDC ftp site is:
      ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/
   Current size as of 1/12/2021 is 3.3 GB.

2) Edit your master list file as needed. This might mean adding a new station that
   you have decided you want to use, deleting some, or updating the excluded years
   because you have seen something in the data timeseries.

3) Uncompress the big tarball using whatever tool you wish. Do NOT extract the
   files; that will be done by the Python code. Most (all?) Linux systems will have
   gunzip installed. On a PC you might use something like 7-zip or another app.

      gunzip ghcnd_all.tar.gz

4) Use the Python code to extract and build met_*.csv files. Note that the Python
   script will auto-detect the format of the station list.

      mkdir work
      cd work
      python ../process_ghcnd.py ../ghcnd_all.tar ../master_list.txt

-----------------------------------------------------------------------------------
Method 3 (using the giant tarball file and the NCEI-supplied station history file):

This would still work, but I only recommend it if you are kind of "starting over"
with station selection. You will lose the ability to filter out years of bad data,
etc. But if you want to see all available station data, this is the way to do it.
The following set of instructions comes from notes to myself. Be aware that I
generally do most of this processing on our Linux system, so commands/tools may be
just slightly different if you are working on PCs only.

1) Download the giant compressed tarball from NCDC named ghcnd_all.tar.gz
   Current location as of 1/12/2021 on the NCDC ftp site is:
      ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/
   Current size as of 1/12/2021 is 3.3 GB.

2) Download the ghcnd-stations.txt file from NCDC (same location as above).

   [Optional, but highly recommended step]
   Edit the file so that it only contains entries for stations in the U.S. and
   Canada, then pare it further by a lat/long box. I found this easiest to do using
   the Windows "sort" command in a DOS Command window (a scripted alternative is
   sketched at the end of this section):

      Sort by latitude:            sort /+13 < ghcnd-stations.txt > t1.txt
      Use a text editor on t1.txt to remove entries outside the range [39.5, 51.6]

      Sort by longitude:           sort /+22 < t1.txt > t2.txt
      Use a text editor on t2.txt to remove entries outside the range [-93.5, -75.5]

      Sort by state/province code: sort /+39 < t2.txt > t3.txt
      Use a text editor on t3.txt to remove entries outside Ontario or the Great
      Lakes states.

      Now just put it back in ID order:
         sort < t3.txt > relevant-stations.txt

   There will still be a lot of stations that are not really relevant, but anything
   beyond this amount of sorting would be labor-intensive at this time (without
   other software), so I figure it's good enough for now. The GHCND processing will
   do the actual filtering down to use only the stations relevant to a particular
   lake basin. What we are doing here is just paring down the number of irrelevant
   met_*.csv files that will be created. We know that stations in Japan, for
   example, are irrelevant, so by eliminating them from the list now we avoid
   processing their data.

3) Uncompress the big tarball using whatever tool you wish. Do NOT extract the
   files; that will be done by the Python code. Most (all?) Linux systems will have
   gunzip installed. On a PC you might use something like 7-zip or another app.

      gunzip ghcnd_all.tar.gz

4) Use the Python code to extract and build met_*.csv files.

      mkdir work
      cd work
      python ../process_ghcnd.py ../ghcnd_all.tar ../relevant-stations.txt

-----------------------------------------------------------------------------------

You will now have a big set of met_*.csv files. They will be named similar to the
following, depending on what stations you have chosen to use:

   met_CA*.csv   (Canadian)
   met_USC*.csv  (U.S. Cooperative Network)
   met_US1*.csv  (U.S. CoCoRaHS network)
   met_USR*.csv  (U.S. RAWS network)
   met_USW*.csv  (U.S. stations with WBAN IDs, e.g. airport sites)
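As an aside on Method 3, step 2: if you would rather script the paring of
ghcnd-stations.txt than sort and hand-edit it, a short Python filter can do it in
one pass. This is only a sketch; the column positions follow the fixed-width layout
described in the GHCN-Daily readme (ID in columns 1-11, latitude 13-20, longitude
22-30, state/province 39-40), and the list of state/province codes to keep is my
assumption, so verify both against your copy of the file before relying on it.

   # pare_stations.py -- sketch of a scripted alternative to the manual sort/edit
   # steps in Method 3, step 2.  Column positions assume the standard fixed-width
   # layout of ghcnd-stations.txt; verify against the GHCN-Daily readme.

   KEEP_REGIONS = {"ON", "MN", "WI", "MI", "IL", "IN", "OH", "PA", "NY"}  # adjust as needed

   with open("ghcnd-stations.txt") as fin, open("relevant-stations.txt", "w") as fout:
       for line in fin:
           stn_id = line[0:11]
           if not (stn_id.startswith("US") or stn_id.startswith("CA")):
               continue                     # keep U.S. and Canadian stations only
           lat = float(line[12:20])
           lon = float(line[21:30])
           state = line[38:40].strip()
           if not (39.5 <= lat <= 51.6):
               continue                     # outside the latitude box
           if not (-93.5 <= lon <= -75.5):
               continue                     # outside the longitude box
           if state and state not in KEEP_REGIONS:
               continue                     # not Ontario or a Great Lakes state
           fout.write(line)                 # blank state fields are kept

The result can be used in place of relevant-stations.txt in Method 3, step 4. Since
this only filters lines (no re-sorting), the output stays in station-ID order.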
B) The Integrated Surface Database (ISD)
   https://www.ncdc.noaa.gov/isd

Note that this same dataset has also been referred to as the Integrated Surface
Hourly database (ISH), as well as some other names.

Similar to the GHCND dataset, you now have two options for specifying the list of
files. The old method was to use the station history file from NCEI, edited to
contain only the set of stations that are relevant. But I have now (2021) added the
option to use the same master list file format. It provides the additional
flexibility to exclude certain data types and/or years. In the case of this
dataset, that is pretty important. In looking at the precipitation timeseries data,
it became pretty obvious that while a good number of these ISD stations report
precipitation, much of it looks like garbage after converting to daily values.
There are numerous periods of zero (for years at a time) and other really
anomalous-looking stuff. Why that exists, I don't know for sure. I do know that I
have seen a lot of clearly erroneous precip data in the raw data for some stations.
The process for converting from hourly to daily is also pretty messy and may be a
contributing factor; that is discussed in some detail in the comments within the
processing source code (and a toy illustration appears at the end of this ISD
section). My recommendation at this time is to turn off precipitation for almost
all stations, particularly on the U.S. side of the basin.

Just like with the GHCND data set, the processing code will auto-detect the station
list format and process things accordingly.

The current process for getting this data is to use the make_met process, utilizing
the Python script. I have named it get_all_ish.py, but it's possible that you may
have it named something else (e.g. "make met"). In any case, the script reads a
listing file (either master-list format or NCEI station history format). This list
is used, along with specified start/end years, to retrieve the data from NCEI one
station at a time. The data is stored at NCEI in one file per year per station. The
Python script downloads all of the yearly files for the station, merges them, and
then uses the separate Fortran program named process_ish to build the met_*.csv
file for that station. The Python script just loops through all stations in the
list.

The steps here are:

1) Build the process_ish executable from the Fortran source code. I am supplying an
   appropriate Makefile to use with the make utility. You will need these source
   code files in the current directory along with the supplied Makefile:

      process_ish.f90
      readwritemetcsv.f90
      dailymetstationdata.f90
      metdatatypesandunits.f90
      glshfs_util.F90      (note the capital F in F90; needed by gfortran)
      cpp_util.cpp

   To create the executable, you merely need to type:

      make process_ish

2) Put the resulting process_ish (Linux) or process_ish.exe (Windows) executable
   file into your working directory, then use the Python code to retrieve the data
   and build the met_*.csv files. For example, suppose you had the Python script
   and master file in your current directory, and the process_ish Fortran stuff was
   in a subdirectory named fortran:

      mkdir work
      cd work
      cp ../fortran/process_ish .
      python ../get_all_ish.py ../masterlist_isd.txt
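To illustrate the hourly-to-daily conversion issue mentioned above: a daily total
has to decide what to do with hours that are missing or simply not reported, and a
station that never reports precipitation at all can end up looking like years of
legitimate zeros. The sketch below is NOT the logic used by process_ish (that lives
in the Fortran comments); it is just a toy example, with a made-up hourly record
structure, showing how the choice of missing-data rule changes the daily value.

   # Toy illustration only -- not the actual process_ish conversion rules.
   # "hours" is a hypothetical list of hourly precip values for one day, in mm,
   # with None meaning the station did not report that hour.

   def daily_precip(hours, max_missing=6):
       """Sum hourly precip into a daily total, or return None if too many
       hours are missing to trust the total."""
       missing = sum(1 for v in hours if v is None)
       if missing > max_missing:
           return None                      # too sparse; treat the day as missing
       return sum(v for v in hours if v is not None)

   # A station that reports temperature every hour but never reports precip:
   no_precip_station = [None] * 24
   print(daily_precip(no_precip_station))           # None with this rule...

   # ...but a careless rule (treat missing as zero) turns the same day into 0.0,
   # which is how multi-year runs of "zero" precipitation can appear in the daily
   # files even though the station never measured precipitation at all.
   print(sum(v or 0.0 for v in no_precip_station))  # 0.0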
Step 2: Build the GLSHFS files

You now have a choice between two options. I prefer Option A, only because it seems
a little "cleaner": you can be assured that any old invalid data that may have
gotten into your GLSHFS dataset is eliminated from consideration, and if something
goes horribly wrong, I haven't messed up my working GLSHFS installation. But it
takes a few extra steps. Option B is a little quicker and easier. You get to
choose.

A) Build all-new files and then, at the end, overwrite your existing GLSHFS files.

1) Create a working directory.

2) In that directory, create subdirectories for each lake (sup, mic, hur, etc.).

3) Also create a "stn" directory.

4) In each lake directory, copy over these 2 files from the corresponding directory
   in your GLSHFS installation:

      basininfo.txt
      ???bytcd.map      (where ??? = sup, mic, hur, etc.)

5) Dump all of the new met_*.csv files into the stn directory.

6) Edit a copy of your GLSHFS config to point at this new working directory and
   this stn directory. Turn off the model runs by setting things to "No". You
   should end up with a section that looks like this:

      AddStationData = Yes
      BuildSubbasinMet = Yes
      UpdateHistorical = No
      RunForecasts = No
      MakeSummaryFiles = No

7) Verify that your GLSHFS code is OK to use, i.e.:
   (a) Take note of how many stations you have in the stn directory.
   (b) In the GLSHFS source code used to create your running copy of GLSHFS, look
       at the glshfs_global.f90 file. It contains a parameter MaxNumberOfStations
       that needs to be something greater than the number you noted in (a). If not,
       change the value so it IS larger, then recompile GLSHFS.
   (A quick script for this check is sketched at the end of this step.)

8) Run GLSHFS, specifying the new config file you created in step 6. Note that this
   can take a long time to run. If you also specify that you want detailed met
   output, it may take multiple days (or weeks). At the end you will have
   stndata_*.csv and subdata_*.csv files in each lake directory.

9) Copy these new stndata and subdata files into your existing GLSHFS installation,
   overwriting the old versions of the files.

10) Run GLSHFS as you normally would. It will automatically re-run the models from
    the beginning. This will take longer than usual, of course. Once you are
    satisfied that everything is good, you can delete that entire working directory
    (maybe saving things like the config file if you want to use it the next time
    you do this).

B) Just add these station files to your existing GLSHFS setup.

1) Copy all of the new met_*.csv files into your GLSHFS station file directory.

2) Verify that your GLSHFS code is OK to use, i.e.:
   (a) Take note of how many stations you have in the stn directory.
   (b) Also note the max number of stations used for any lake by checking the
       stndata_*.csv files in each lake directory.
   (c) Note the BIGGEST number of all of these. Most likely it's the stn directory.
   (d) In the GLSHFS source code used to create your running copy of GLSHFS, look
       at the glshfs_global.f90 file. It contains a parameter MaxNumberOfStations
       that needs to be something greater than the number you noted in (c). If not,
       change the value so it IS larger, then recompile GLSHFS.

3) Run GLSHFS as normal. Note that this can take a long time to run. If you also
   specify that you want detailed met output, it may take multiple days.
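For the MaxNumberOfStations check in either option (Option A step 7, Option B
step 2), a few lines of Python can save some counting by hand. This is a sketch
only; the paths are assumptions, so point them at your own stn directory and source
tree. It simply counts met_*.csv files and prints any source line mentioning the
parameter so you can eyeball the declared value.

   # Sketch: count station files and show the current MaxNumberOfStations setting.
   # Paths below are assumptions -- point them at your own stn directory and at
   # the GLSHFS source tree you actually compiled from.
   import glob

   stn_dir = "work/stn"                       # directory holding the met_*.csv files
   src_file = "glshfs_src/glshfs_global.f90"  # GLSHFS source file with the parameter

   n_stations = len(glob.glob(stn_dir + "/met_*.csv"))
   print(f"{n_stations} station files in {stn_dir}")

   # Print every line of the source file that mentions MaxNumberOfStations so you
   # can compare the declared value against the count above.
   with open(src_file) as f:
       for line in f:
           if "maxnumberofstations" in line.lower():
               print(line.rstrip())

If the declared value is not comfortably larger than the station count, bump it and
recompile, as described above.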
Step 3: Extract the files that NOAA CO-OPS wants

Now that GLSHFS has processed all of that station data, we can extract the
meteorology data that they want. The primary files of interest are the
subdata_???NN.csv files, where ??? = sup, mic, hur, etc. and NN = 00, 01, 02, ...
These files contain the aggregated daily subbasin estimates for the seven
meteorology variables used as input to GLSHFS.

NOAA CO-OPS only cares about precipitation, and they only want monthly estimates
for the overlake and overland areas. By convention, GLSHFS uses subbasin 0 to refer
to the lake surface, so to get overlake estimates for Lake Superior we can just use
the subdata_sup00.csv file.

To get overland estimates we first need to aggregate subbasins 01..nn. I do that
with the lump_submet program. It's a simple command-line program. Once it is
compiled from the Fortran source code, you just cd to each lake directory and
execute the command, e.g.:

   cd sup
   lump_submet sup land

This will create a file named subdata_sup_land.csv

To create the desired monthly files, use the program glshfs_submet_to_tblm. Once
you compile it from the Fortran source code:

   cd sup
   glshfs_submet_to_tblm sup 00
   glshfs_submet_to_tblm sup land

This will create a set of files named:

   sup00_*_mon.csv
   sup_land_*_mon.csv

You only care about the *_prc_*.csv files. I *do* suggest that you rename the
overlake file from sup00_prc_mon.csv to sup_lake_prc_mon.csv so as not to confuse
CO-OPS. You could, optionally, edit these new files to contain only the last 5
years or so of data. Your choice.

Do this for each lake and I think you will be done. (A small script for gathering
and renaming the deliverables is sketched below.)
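As a convenience, here is a rough sketch for collecting the precipitation
deliverables from each lake directory into one hand-off folder, renaming the
overlake file along the way. The lake list, the delivery folder name, and the
directory layout are my assumptions; edit them to match your setup and the lakes
you actually ran.

   # Sketch: gather the monthly precipitation files for CO-OPS into one folder.
   # The lake codes and directory layout below are assumptions -- adjust as needed.
   import os
   import shutil

   lakes = ["sup", "mic", "hur"]      # extend with the rest of your lake codes
   out_dir = "coops_delivery"         # hypothetical hand-off folder
   os.makedirs(out_dir, exist_ok=True)

   for lake in lakes:
       # Overlake: subbasin 00, renamed so the purpose is obvious to CO-OPS.
       shutil.copy(os.path.join(lake, f"{lake}00_prc_mon.csv"),
                   os.path.join(out_dir, f"{lake}_lake_prc_mon.csv"))
       # Overland: output of lump_submet + glshfs_submet_to_tblm, name kept as-is.
       shutil.copy(os.path.join(lake, f"{lake}_land_prc_mon.csv"),
                   os.path.join(out_dir, f"{lake}_land_prc_mon.csv"))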