Procedure for building files requested by NOAA CO-OPS.
Provided by Tim Hunter
Updated February 10, 2021

Step 1: Retrieve raw station data from NCEI.

We use two data sets as input.

A) The Global Historical Climatology Network - Daily (GHCND)
   https://www.ncdc.noaa.gov/ghcn-daily-description

*Revised in February 2021
The old instructions made use of the all-inclusive large tar file ghcnd_all.tar.gz
in order to simplify the process. But I now recommend using the master list option
as described below.

-----------------------------------------------------------------------------------
Method 1 (master list with station-specific downloading):

1) Edit your master list file as needed. This might mean adding a new station that
   you have decided you want to use, deleting some, or updating the excluded years
   because you have seen something in the data timeseries.

2) Use the Python code to extract and build met_*.csv files. Note that the Python
   script will auto-detect the format of the station list. This script will only
   download the specific stations that are listed.

      mkdir work
      cd work
      python ../process_ghcnd_by_stn.py ../master_list.txt

-----------------------------------------------------------------------------------
Method 2 (master list; using the giant tarball file):

1) Download the giant compressed tarball from NCDC named ghcnd_all.tar.gz
   Current location as of 1/12/2021 on the NCDC ftp site is:
      ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/
   Current size as of 1/12/2021 is 3.3 GB.

2) Edit your master list file as needed. This might mean adding a new station that
   you have decided you want to use, deleting some, or updating the excluded years
   because you have seen something in the data timeseries.

3) Uncompress the big tarball using whatever tool you wish. Do NOT extract the
   files; that will be done by the Python code. Most (all?) Linux systems will have
   gunzip installed. On a PC you might use something like 7-zip or another app.

      gunzip ghcnd_all.tar.gz

4) Use the Python code to extract and build met_*.csv files. Note that the Python
   script will auto-detect the format of the station list.

      mkdir work
      cd work
      python ../process_ghcnd.py ../ghcnd_all.tar ../master_list.txt

-----------------------------------------------------------------------------------
Method 3 (using the giant tarball file and the NCEI-supplied station history file):

This would still work, but I only recommend it if you are kind of "starting over"
with station selection. You will lose the ability to filter out years of bad data,
etc. But if you want to see all available station data, this is the way to do it.
The following set of instructions comes from notes to myself. Be aware that I
generally do most of this processing on our Linux system, so commands/tools may be
just slightly different if you are working on PCs only.

1) Download the giant compressed tarball from NCDC named ghcnd_all.tar.gz
   Current location as of 1/12/2021 on the NCDC ftp site is:
      ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/
   Current size as of 1/12/2021 is 3.3 GB.

2) Download the ghcnd-stations.txt file from NCDC (same location as above).

   [Optional, but highly recommended step]
   Edit the file so that it only contains entries for stations in the U.S. and
   Canada, then pare it further by a lat/long box. I found this easiest to do using
   the Windows "sort" command in a DOS Command window (a scripted alternative is
   sketched at the end of this section):

      Sort by latitude:            sort /+13 < ghcnd-stations.txt > t1.txt
      Use a text editor on t1.txt to remove entries outside the range [39.5, 51.6]

      Sort by longitude:           sort /+22 < t1.txt > t2.txt
      Use a text editor on t2.txt to remove entries outside the range [-93.5, -75.5]

      Sort by state/province code: sort /+39 < t2.txt > t3.txt
      Use a text editor on t3.txt to remove entries outside Ontario or the Great
      Lakes states.

      Now just put it back in ID order:
         sort < t3.txt > relevant-stations.txt

   There will still be a lot of stations that are not really relevant, but anything
   beyond this amount of sorting would be labor-intensive at this time (without
   other software), so I figure it's good enough for now. The GHCND processing will
   do the actual filtering down to use only the stations relevant to a particular
   lake basin. What we are doing here is just paring down the number of irrelevant
   met_*.csv files that will be created. We know that stations in Japan, for
   example, are irrelevant, so by eliminating them from the list now we avoid
   processing their data.

3) Uncompress the big tarball using whatever tool you wish. Do NOT extract the
   files; that will be done by the Python code. Most (all?) Linux systems will have
   gunzip installed. On a PC you might use something like 7-zip or another app.

      gunzip ghcnd_all.tar.gz

4) Use the Python code to extract and build met_*.csv files.

      mkdir work
      cd work
      python ../process_ghcnd.py ../ghcnd_all.tar ../relevant-stations.txt

-----------------------------------------------------------------------------------

You will now have a big set of met_*.csv files. They will be named similar to the
following, depending on what stations you have chosen to use:

   met_CA*.csv   (Canadian)
   met_USC*.csv  (U.S. Cooperative Network)
   met_US1*.csv  (U.S. CoCoRaHS network)
   met_USR*.csv  (U.S. RAWS network)
   met_USW*.csv  (U.S. stations with WBAN IDs, e.g. airport sites)
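As an aside on Method 3, step 2: if you would rather script the paring of
ghcnd-stations.txt than sort and hand-edit it, a short Python filter can do it in
one pass. This is only a sketch; the column positions follow the fixed-width layout
described in the GHCN-Daily readme (ID in columns 1-11, latitude 13-20, longitude
22-30, state/province 39-40), and the list of state/province codes to keep is my
assumption, so verify both against your copy of the file before relying on it.

   # pare_stations.py -- sketch of a scripted alternative to the manual sort/edit
   # steps in Method 3, step 2.  Column positions assume the standard fixed-width
   # layout of ghcnd-stations.txt; verify against the GHCN-Daily readme.

   KEEP_REGIONS = {"ON", "MN", "WI", "MI", "IL", "IN", "OH", "PA", "NY"}  # adjust as needed

   with open("ghcnd-stations.txt") as fin, open("relevant-stations.txt", "w") as fout:
       for line in fin:
           stn_id = line[0:11]
           if not (stn_id.startswith("US") or stn_id.startswith("CA")):
               continue                     # keep U.S. and Canadian stations only
           lat = float(line[12:20])
           lon = float(line[21:30])
           state = line[38:40].strip()
           if not (39.5 <= lat <= 51.6):
               continue                     # outside the latitude box
           if not (-93.5 <= lon <= -75.5):
               continue                     # outside the longitude box
           if state and state not in KEEP_REGIONS:
               continue                     # not Ontario or a Great Lakes state
           fout.write(line)                 # blank state fields are kept

The result can be used in place of relevant-stations.txt in Method 3, step 4. Since
this only filters lines (no re-sorting), the output stays in station-ID order.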
B) The Integrated Surface Database (ISD)
   https://www.ncdc.noaa.gov/isd

Note that this same dataset has also been referred to as the Integrated Surface
Hourly database (ISH), as well as some other names.

Similar to the GHCND dataset, you now have two options for specifying the list of
files. The old method was to use the station history file from NCEI, edited to
contain only the set of stations that are relevant. But I have now (2021) added the
option to use the same master list file format. It provides the additional
flexibility to exclude certain data types and/or years. In the case of this
dataset, that is pretty important. In looking at the precipitation timeseries data,
it became pretty obvious that while a good number of these ISD stations report
precipitation, much of it looks like garbage after converting to daily values.
There are numerous periods of zero (for years at a time) and other really
anomalous-looking stuff. Why that exists, I don't know for sure. I do know that I
have seen a lot of clearly erroneous precip data in the raw data for some stations.
The process for converting from hourly to daily is also pretty messy and may be a
contributing factor; that is discussed in some detail in the comments within the
processing source code (and a toy illustration appears at the end of this ISD
section). My recommendation at this time is to turn off precipitation for almost
all stations, particularly on the U.S. side of the basin.

Just like with the GHCND data set, the processing code will auto-detect the station
list format and process things accordingly.

The current process for getting this data is to use the make_met process, utilizing
the Python script. I have named it get_all_ish.py, but it's possible that you may
have it named something else (e.g. "make met"). In any case, the script reads a
listing file (either master-list format or NCEI station history format). This list
is used, along with specified start/end years, to retrieve the data from NCEI one
station at a time. The data is stored at NCEI in one file per year per station. The
Python script downloads all of the yearly files for the station, merges them, and
then uses the separate Fortran program named process_ish to build the met_*.csv
file for that station. The Python script just loops through all stations in the
list.

The steps here are:

1) Build the process_ish executable from the Fortran source code. I am supplying an
   appropriate Makefile to use with the make utility. You will need these source
   code files in the current directory along with the supplied Makefile:

      process_ish.f90
      readwritemetcsv.f90
      dailymetstationdata.f90
      metdatatypesandunits.f90
      glshfs_util.F90      (note the capital F in F90; needed by gfortran)
      cpp_util.cpp

   To create the executable, you merely need to type:

      make process_ish

2) Put the resulting process_ish (Linux) or process_ish.exe (Windows) executable
   file into your working directory, then use the Python code to retrieve the data
   and build the met_*.csv files. For example, suppose you had the Python script
   and master file in your current directory, and the process_ish Fortran stuff was
   in a subdirectory named fortran:

      mkdir work
      cd work
      cp ../fortran/process_ish .
      python ../get_all_ish.py ../masterlist_isd.txt
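To illustrate the hourly-to-daily conversion issue mentioned above: a daily total
has to decide what to do with hours that are missing or simply not reported, and a
station that never reports precipitation at all can end up looking like years of
legitimate zeros. The sketch below is NOT the logic used by process_ish (that lives
in the Fortran comments); it is just a toy example, with a made-up hourly record
structure, showing how the choice of missing-data rule changes the daily value.

   # Toy illustration only -- not the actual process_ish conversion rules.
   # "hours" is a hypothetical list of hourly precip values for one day, in mm,
   # with None meaning the station did not report that hour.

   def daily_precip(hours, max_missing=6):
       """Sum hourly precip into a daily total, or return None if too many
       hours are missing to trust the total."""
       missing = sum(1 for v in hours if v is None)
       if missing > max_missing:
           return None                      # too sparse; treat the day as missing
       return sum(v for v in hours if v is not None)

   # A station that reports temperature every hour but never reports precip:
   no_precip_station = [None] * 24
   print(daily_precip(no_precip_station))           # None with this rule...

   # ...but a careless rule (treat missing as zero) turns the same day into 0.0,
   # which is how multi-year runs of "zero" precipitation can appear in the daily
   # files even though the station never measured precipitation at all.
   print(sum(v or 0.0 for v in no_precip_station))  # 0.0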
Step 2: Build the GLSHFS files

You now have a choice between two options. I prefer Option A, only because it seems
a little "cleaner": you can be assured that any old invalid data that may have
gotten into your GLSHFS dataset is eliminated from consideration, and if something
goes horribly wrong, I haven't messed up my working GLSHFS installation. But it
takes a few extra steps. Option B is a little quicker and easier. You get to
choose.

A) Build all-new files and then, at the end, overwrite your existing GLSHFS files.

1) Create a working directory.

2) In that directory, create subdirectories for each lake (sup, mic, hur, etc.).

3) Also create a "stn" directory.

4) In each lake directory, copy over these 2 files from the corresponding directory
   in your GLSHFS installation:

      basininfo.txt
      ???bytcd.map      (where ??? = sup, mic, hur, etc.)

5) Dump all of the new met_*.csv files into the stn directory.

6) Edit a copy of your GLSHFS config to point at this new working directory and
   this stn directory. Turn off the model runs by setting things to "No". You
   should end up with a section that looks like this:

      AddStationData = Yes
      BuildSubbasinMet = Yes
      UpdateHistorical = No
      RunForecasts = No
      MakeSummaryFiles = No

7) Verify that your GLSHFS code is OK to use, i.e.:
   (a) Take note of how many stations you have in the stn directory.
   (b) In the GLSHFS source code used to create your running copy of GLSHFS, look
       at the glshfs_global.f90 file. It contains a parameter MaxNumberOfStations
       that needs to be something greater than the number you noted in (a). If not,
       change the value so it IS larger, then recompile GLSHFS.
   (A quick script for this check is sketched at the end of this step.)

8) Run GLSHFS, specifying the new config file you created in step 6. Note that this
   can take a long time to run. If you also specify that you want detailed met
   output, it may take multiple days (or weeks). At the end you will have
   stndata_*.csv and subdata_*.csv files in each lake directory.

9) Copy these new stndata and subdata files into your existing GLSHFS installation,
   overwriting the old versions of the files.

10) Run GLSHFS as you normally would. It will automatically re-run the models from
    the beginning. This will take longer than usual, of course. Once you are
    satisfied that everything is good, you can delete that entire working directory
    (maybe saving things like the config file if you want to use it the next time
    you do this).

B) Just add these station files to your existing GLSHFS setup.

1) Copy all of the new met_*.csv files into your GLSHFS station file directory.

2) Verify that your GLSHFS code is OK to use, i.e.:
   (a) Take note of how many stations you have in the stn directory.
   (b) Also note the max number of stations used for any lake by checking the
       stndata_*.csv files in each lake directory.
   (c) Note the BIGGEST number of all of these. Most likely it's the stn directory.
   (d) In the GLSHFS source code used to create your running copy of GLSHFS, look
       at the glshfs_global.f90 file. It contains a parameter MaxNumberOfStations
       that needs to be something greater than the number you noted in (c). If not,
       change the value so it IS larger, then recompile GLSHFS.

3) Run GLSHFS as normal. Note that this can take a long time to run. If you also
   specify that you want detailed met output, it may take multiple days.
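For the MaxNumberOfStations check in either option (Option A step 7, Option B
step 2), a few lines of Python can save some counting by hand. This is a sketch
only; the paths are assumptions, so point them at your own stn directory and source
tree. It simply counts met_*.csv files and prints any source line mentioning the
parameter so you can eyeball the declared value.

   # Sketch: count station files and show the current MaxNumberOfStations setting.
   # Paths below are assumptions -- point them at your own stn directory and at
   # the GLSHFS source tree you actually compiled from.
   import glob

   stn_dir = "work/stn"                       # directory holding the met_*.csv files
   src_file = "glshfs_src/glshfs_global.f90"  # GLSHFS source file with the parameter

   n_stations = len(glob.glob(stn_dir + "/met_*.csv"))
   print(f"{n_stations} station files in {stn_dir}")

   # Print every line of the source file that mentions MaxNumberOfStations so you
   # can compare the declared value against the count above.
   with open(src_file) as f:
       for line in f:
           if "maxnumberofstations" in line.lower():
               print(line.rstrip())

If the declared value is not comfortably larger than the station count, bump it and
recompile, as described above.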
Step 3: Extract the files that NOAA CO-OPS wants

Now that GLSHFS has processed all of that station data, we can extract the
meteorology data that they want. The primary files of interest are the
subdata_???NN.csv files, where ??? = sup, mic, hur, etc. and NN = 00, 01, 02, ...
These files contain the aggregated daily subbasin estimates for the seven
meteorology variables used as input to GLSHFS.

NOAA CO-OPS only cares about precipitation, and they only want monthly estimates
for the overlake and overland areas. By convention, GLSHFS uses subbasin 0 to refer
to the lake surface, so to get overlake estimates for Lake Superior we can just use
the subdata_sup00.csv file.

To get overland estimates we first need to aggregate subbasins 01..nn. I do that
with the lump_submet program. It's a simple command-line program. Once it is
compiled from the Fortran source code, you just cd to each lake directory and
execute the command, e.g.:

   cd sup
   lump_submet sup land

This will create a file named subdata_sup_land.csv

To create the desired monthly files, use the program glshfs_submet_to_tblm. Once
you compile it from the Fortran source code:

   cd sup
   glshfs_submet_to_tblm sup 00
   glshfs_submet_to_tblm sup land

This will create a set of files named:

   sup00_*_mon.csv
   sup_land_*_mon.csv

You only care about the *_prc_*.csv files. I *do* suggest that you rename the
overlake file from sup00_prc_mon.csv to sup_lake_prc_mon.csv so as not to confuse
CO-OPS. You could, optionally, edit these new files to contain only the last 5
years or so of data. Your choice.

Do this for each lake and I think you will be done. (A small script for gathering
and renaming the deliverables is sketched below.)
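As a convenience, here is a rough sketch for collecting the precipitation
deliverables from each lake directory into one hand-off folder, renaming the
overlake file along the way. The lake list, the delivery folder name, and the
directory layout are my assumptions; edit them to match your setup and the lakes
you actually ran.

   # Sketch: gather the monthly precipitation files for CO-OPS into one folder.
   # The lake codes and directory layout below are assumptions -- adjust as needed.
   import os
   import shutil

   lakes = ["sup", "mic", "hur"]      # extend with the rest of your lake codes
   out_dir = "coops_delivery"         # hypothetical hand-off folder
   os.makedirs(out_dir, exist_ok=True)

   for lake in lakes:
       # Overlake: subbasin 00, renamed so the purpose is obvious to CO-OPS.
       shutil.copy(os.path.join(lake, f"{lake}00_prc_mon.csv"),
                   os.path.join(out_dir, f"{lake}_lake_prc_mon.csv"))
       # Overland: output of lump_submet + glshfs_submet_to_tblm, name kept as-is.
       shutil.copy(os.path.join(lake, f"{lake}_land_prc_mon.csv"),
                   os.path.join(out_dir, f"{lake}_land_prc_mon.csv"))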