Derived variables

This notebook shows how to use derived variables. A derived variable is a variable that is not available as an input dataset, but is instead computed from one or more input variables.

import pandas as pd
import yaml

import esmvalcore.preprocessor
from esmvalcore.cmor.table import get_tables
from esmvalcore.config import CFG
from esmvalcore.dataset import Dataset, DerivedDataset, datasets_to_recipe

pd.set_option("display.max_colwidth", None)

First, we configure ESMValCore so it searches the ESGF for data:

CFG["projects"]["CMIP6"].pop(
    "data",
    None,
)  # Clear existing CMIP6 configuration for finding input data
CFG.nested_update(
    {
        "projects": {
            "CMIP6": {
                "data": {
                    "intake-esgf": {
                        "type": "esmvalcore.io.intake_esgf.IntakeESGFDataSource",
                        "priority": 2,
                        "facets": {
                            "activity": "activity_drs",
                            "dataset": "source_id",
                            "ensemble": "member_id",
                            "exp": "experiment_id",
                            "institute": "institution_id",
                            "grid": "grid_label",
                            "mip": "table_id",
                            "project": "project",
                            "short_name": "variable_id",
                        },
                    },
                },
            },
        },
    },
)
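
The pop-then-update pattern above can be illustrated without ESMValCore. The sketch below assumes that a nested update merges dictionaries key by key rather than replacing them wholesale; the helper nested_update, the "local-files" data source, and the "other_setting" key are hypothetical stand-ins, not part of the ESMValCore API. It shows why the existing "data" entry is removed first: a merge alone would keep any previously configured data sources.

```python
def nested_update(target, updates):
    """Recursively merge ``updates`` into ``target`` in place (illustrative helper)."""
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            # Both sides are dictionaries: descend instead of overwriting.
            nested_update(target[key], value)
        else:
            # New key or non-dictionary value: set or replace it.
            target[key] = value
    return target


# Hypothetical starting configuration with one pre-existing data source.
config = {
    "projects": {
        "CMIP6": {
            "data": {"local-files": {"priority": 1}},
            "other_setting": "kept",
        },
    },
}

# Without this pop, "local-files" would survive the merge below.
config["projects"]["CMIP6"].pop("data", None)

nested_update(
    config,
    {"projects": {"CMIP6": {"data": {"intake-esgf": {"priority": 2}}}}},
)
print(config["projects"]["CMIP6"])
```

After the merge, only the "intake-esgf" source remains under "data", while unrelated settings are left untouched.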

Which variables can be derived?

The interface for working with derived variables from Python is not very polished yet. To list all available derived variables, we can run:

pd.DataFrame.from_dict(
    [
        {
            "short_name": short_name,
        }
        | {
            k: getattr(
                get_tables(CFG, project="CMIP6").get_variable(
                    table_name="x",
                    short_name=short_name,
                    derived=True,
                ),
                k,
                None,
            )
            for k in ["units", "long_name"]
        }
        for short_name in esmvalcore.preprocessor._derive.ALL_DERIVED_VARIABLES  # noqa: SLF001
    ],
).sort_values("short_name")
    short_name   units       long_name
29  alb          1           albedo at the surface
38  amoc         kg s-1      Atlantic Meridional Overturning Circulation
44  asr          W m-2       Absorbed shortwave radiation
32  chlora       kg m-3      chlorophyll concentration
46  clhmtisccp   %           ISCCP High Level Medium-Thickness Cloud Area Fraction
 2  clhtkisccp   %           ISCCP high level thick cloud area fraction
 7  cllmtisccp   %           ISCCP Low Level Medium-Thickness Cloud Area Fraction
11  clltkisccp   %           ISCCP low level thick cloud area fraction
 0  clmmtisccp   %           ISCCP Middle Level Medium-Thickness Cloud Area Fraction
36  clmtkisccp   %           ISCCP Middle Level Thick Cloud Area Fraction
40  co2s         1e-06       Atmosphere CO2
42  ctotal       kg m-2      Total Carbon Mass in Ecosystem
47  et           mm day-1    Evapotranspiration
 5  hfns         W m-2       Surface Net Heat Flux
 9  hurs         %           Near-Surface Relative Humidity
26  lapserate    K km-1      Lapse Rate
20  lvp          W m-2       Latent Heat Release from Precipitation
 8  lwcre        W m-2       TOA Longwave Cloud Radiative Effect
41  lwp          kg m-2      Liquid Water Path
31  netcre       W m-2       TOA Net Cloud Radiative Effect
23  ohc          J           Heat content in grid cell
43  qep          kg m-2 s-1  Net moisture flux into atmosphere
39  rlns         W m-2       Surface Net downward Longwave Radiation
13  rlnst        W m-2       Net Atmospheric Longwave Cooling
33  rlnstcs      W m-2       Net Atmospheric Longwave Cooling assuming clear sky
12  rlntcs       W m-2       TOA Net downward Longwave Radiation assuming clear sky
45  rlus         W m-2       Surface Upwelling Longwave Radiation
28  rsns         W m-2       Surface Net downward Shortwave Radiation
25  rsnst        W m-2       Heating from Shortwave Absorption
34  rsnstcs      W m-2       Heating from Shortwave Absorption assuming clear sky
22  rsnstcsnorm  %           Heating from Shortwave Absorption assuming clear sky normalized by incoming solar radiation
27  rsnt         W m-2       TOA Net downward Shortwave Radiation
 3  rsntcs       W m-2       TOA Net downward Shortwave Radiation assuming clear sky
10  rsus         W m-2       Surface Upwelling Shortwave Radiation
17  rtnt         W m-2       TOA Net downward Total Radiation
 1  sfcwind      NaN         NaN
30  siextent     1           Sea Ice Extent
14  sispeed      m s-1       Sea-Ice Speed
37  sithick      m           Sea Ice Thickness
15  sm           m3 m-3      Volumetric Moisture in Upper Portion of Soil Column
16  soz          m           Stratospheric Ozone Column (O3 mole fraction >= 125 ppb)
 4  swcre        W m-2       TOA Shortwave Cloud Radiative Effect
21  toz          m           Total Column Ozone
 6  troz         m           Tropospheric Ozone Column (O3 mole fraction < 125 ppb)
35  uajet        degrees     Jet position expressed as latitude of maximum meridional wind speed
19  vegfrac      %           Vegetation Fraction
24  xch4         1           Column-average Dry-air Mole Fraction of Atmospheric Methane
18  xco2         1           Column-average Dry-air Mole Fraction of Atmospheric Carbon Dioxide

Note that modules, functions, and variables starting with a single _ character should be considered internal, so there are no guarantees about the stability of this interface.

Finding available datasets

We define a dataset template to search for all CMIP6 models that provide every input dataset required to derive lwcre, the longwave cloud radiative effect at the top of the atmosphere, at monthly resolution for the historical experiment. Note that ESMValCore uses its own facet names so that naming is uniform across CMIP phases and other projects. The mapping to the facet names used on ESGF can be found in Facets.

dataset_template = DerivedDataset(
    short_name="lwcre",
    mip="Amon",
    project="CMIP6",
    exp="historical",
    dataset="*",
    institute="*",
    ensemble="r1i1p1f1",
    grid="gn",
)
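
To make the facet naming concrete, the following sketch translates ESMValCore facet names into the ESGF names listed in the "facets" mapping of the configuration at the top of this notebook. The function to_esgf_facets is a hypothetical helper written for illustration; the actual translation happens inside the data source and may differ in detail. The example uses rlut, one of the input variables of lwcre, since derived variables themselves are not stored on ESGF.

```python
# Mapping copied from the "facets" entry of the configuration above.
FACET_MAP = {
    "activity": "activity_drs",
    "dataset": "source_id",
    "ensemble": "member_id",
    "exp": "experiment_id",
    "institute": "institution_id",
    "grid": "grid_label",
    "mip": "table_id",
    "project": "project",
    "short_name": "variable_id",
}


def to_esgf_facets(facets):
    """Rename ESMValCore facets to ESGF facets (hypothetical helper)."""
    # Facets without an entry in the mapping keep their original name.
    return {FACET_MAP.get(name, name): value for name, value in facets.items()}


esmvalcore_facets = {
    "short_name": "rlut",
    "mip": "Amon",
    "project": "CMIP6",
    "exp": "historical",
    "ensemble": "r1i1p1f1",
    "grid": "gn",
}
esgf_facets = to_esgf_facets(esmvalcore_facets)
print(esgf_facets)
```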

Next, we use the DerivedDataset.from_files method to build a list of datasets from the available files. This may take a while, because searching the ESGF for many files can be slow. Search results are cached, so subsequent searches will be faster.

datasets = list(dataset_template.from_files())
print(f"Found {len(datasets)} datasets, showing the first 10:")
datasets[:10]
Found 37 datasets, showing the first 10:
[DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=TaiESM1, institute=AS-RCEC, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=AWI-CM-1-1-MR, institute=AWI, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=AWI-ESM-1-1-LR, institute=AWI, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=BCC-CSM2-MR, institute=BCC, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=BCC-ESM1, institute=BCC, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=CAMS-CSM1-0, institute=CAMS, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=CAS-ESM2-0, institute=CAS, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=FGOALS-g3, institute=CAS, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=IITM-ESM, institute=CCCR-IITM, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=CanESM5-1, institute=CCCma, ensemble=r1i1p1f1, grid=gn)]
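
The idea behind this search, that a model is only returned when every required input variable is available, can be sketched with plain sets. This is not how from_files is implemented: the catalog below is a made-up toy example with placeholder model names, and derivable_models is a hypothetical helper.

```python
# Input variables needed to derive lwcre.
required = ["rlut", "rlutcs"]

# Toy catalog (placeholder names, not real search results):
# model name -> set of variables it provides.
catalog = {
    "MODEL-A": {"rlut", "rlutcs", "tas"},
    "MODEL-B": {"rlut", "tas"},
    "MODEL-C": {"rlut", "rlutcs"},
}


def derivable_models(catalog, required):
    """Return the models that provide all required input variables."""
    needed = set(required)
    # ``needed <= available`` is True when needed is a subset of available.
    return sorted(
        model for model, available in catalog.items() if needed <= available
    )


models = derivable_models(catalog, required)
print(models)
```

Here MODEL-B is excluded because it lacks rlutcs, even though it provides rlut.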

Composing a recipe with derived variables

To use the datasets found above in a recipe, specify the name of the variable that needs to be derived, together with the derive: true option:

recipe_datasets = [
    Dataset(
        diagnostic="diagnostic_name",
        derive=True,
        **dataset.facets,
    )
    for dataset in datasets
]
print(yaml.safe_dump(datasets_to_recipe(recipe_datasets)))
datasets:
- dataset: ACCESS-CM2
  institute: CSIRO-ARCCSS
- dataset: ACCESS-ESM1-5
  institute: CSIRO
- dataset: AWI-CM-1-1-MR
  institute: AWI
- dataset: AWI-ESM-1-1-LR
  institute: AWI
- dataset: BCC-CSM2-MR
  institute: BCC
- dataset: BCC-ESM1
  institute: BCC
- dataset: CAMS-CSM1-0
  institute: CAMS
- dataset: CAS-ESM2-0
  institute: CAS
- dataset: CESM2
  institute: NCAR
- dataset: CESM2-FV2
  institute: NCAR
- dataset: CESM2-WACCM
  institute: NCAR
- dataset: CESM2-WACCM-FV2
  institute: NCAR
- dataset: CMCC-CM2-HR4
  institute: CMCC
- dataset: CMCC-CM2-SR5
  institute: CMCC
- dataset: CMCC-ESM2
  institute: CMCC
- dataset: CanESM5
  institute: CCCma
- dataset: CanESM5-1
  institute: CCCma
- dataset: FGOALS-g3
  institute: CAS
- dataset: FIO-ESM-2-0
  institute: FIO-QLNM
- dataset: GISS-E2-1-G
  institute: NASA-GISS
- dataset: GISS-E2-1-G-CC
  institute: NASA-GISS
- dataset: GISS-E2-1-H
  institute: NASA-GISS
- dataset: GISS-E2-2-G
  institute: NASA-GISS
- dataset: GISS-E2-2-H
  institute: NASA-GISS
- dataset: ICON-ESM-LR
  institute: MPI-M
- dataset: IITM-ESM
  institute: CCCR-IITM
- dataset: MIROC6
  institute: MIROC
- dataset: MPI-ESM-1-2-HAM
  institute: HAMMOZ-Consortium
- dataset: MPI-ESM1-2-HR
  institute: MPI-M
- dataset: MPI-ESM1-2-LR
  institute: MPI-M
- dataset: MRI-ESM2-0
  institute: MRI
- dataset: NESM3
  institute: NUIST
- dataset: NorCPM1
  institute: NCC
- dataset: NorESM2-LM
  institute: NCC
- dataset: NorESM2-MM
  institute: NCC
- dataset: SAM0-UNICON
  institute: SNU
- dataset: TaiESM1
  institute: AS-RCEC
diagnostics:
  diagnostic_name:
    variables:
      lwcre:
        derive: true
        ensemble: r1i1p1f1
        exp: historical
        grid: gn
        mip: Amon
        project: CMIP6

There is also a force_derivation option available for use in the recipe. When set to true, it causes the variable to be derived even if it is already available as a dataset.
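
The interplay of the two options can be summarized as a small decision function. This is a sketch of the behaviour described above, not ESMValCore code; the function name should_derive and its parameters are hypothetical, and the real logic may take further conditions into account.

```python
def should_derive(derive, force_derivation, already_available):
    """Decide whether a variable is computed from its inputs (illustrative).

    derive:            value of the ``derive`` option
    force_derivation:  value of the ``force_derivation`` option
    already_available: True if the variable exists as an input dataset
    """
    if not derive:
        # Derivation was not requested at all.
        return False
    # Derive when forced, or when there is no ready-made dataset to use.
    return force_derivation or not already_available


# derive: true only -> an existing dataset is used as-is.
print(should_derive(True, False, True))
# derive: true with force_derivation: true -> always derived.
print(should_derive(True, True, True))
```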

Computing the derived variable

Let’s load the input data and compute the derived variable for the first dataset:

dataset = datasets[0]
dataset
DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=TaiESM1, institute=AS-RCEC, ensemble=r1i1p1f1, grid=gn)
cubes = dataset.load()
cubes
WARNING:esmvalcore.cmor.check:There were warnings in variable rlut:
 rlut: attribute positive not present
loaded from file 
WARNING:esmvalcore.cmor.check:There were warnings in variable rlutcs:
 rlutcs: attribute positive not present
loaded from file 
Toa Longwave Cloud Radiative Effect (W m-2) time latitude longitude
Shape 1980 192 288
Dimension coordinates
time x - -
latitude - x -
longitude - - x
Attributes
Conventions 'CF-1.7 CMIP-6.2'
activity_drs 'CMIP'
activity_id 'CMIP'
branch_method 'Hybrid-restart from year 0671-01-01 of piControl'
branch_time 0.0
branch_time_in_child -674885
branch_time_in_parent 171550.0
cmor_version '3.5.0'
contact 'Dr. Wei-Liang Lee (leelupin@gate.sinica.edu.tw)'
data_specs_version '01.00.31'
experiment 'all-forcing simulation of the recent past'
experiment_id 'historical'
external_variables 'areacella'
forcing_index 1
frequency 'mon'
further_info_url 'https://furtherinfo.es-doc.org/CMIP6.AS-RCEC.TaiESM1.historical.none.r ...'
grid 'finite-volume grid with 0.9x1.25 degree lat/lon resolution'
grid_label 'gn'
initialization_index 1
institution 'Research Center for Environmental Changes, Academia Sinica, Nankang, Taipei ...'
institution_id 'AS-RCEC'
license 'CMIP6 model data produced by NCC is licensed under a Creative Commons Attribution ...'
member_id 'r1i1p1f1'
mip_era 'CMIP6'
model_id 'TaiESM1'
nominal_resolution '100 km'
original_units 'W/m2'
parent_activity_id 'CMIP'
parent_experiment_id 'piControl'
parent_mip_era 'CMIP6'
parent_source_id 'TaiESM1'
parent_sub_experiment_id 'none'
parent_time_units 'days since 1850-1-1 00:00:00'
parent_variant_label 'r1i1p1f1'
physics_index 1
positive 'down'
product 'model-output'
realization_index 1
realm 'atmos'
references '10.5194/gmd-2019-377'
run_variant 'N/A'
source 'TaiESM 1.0 (2018): \naerosol: SNAP (same grid as atmos)\natmos: TaiAM1 ...'
source_id 'TaiESM1'
source_type 'AOGCM AER BGC'
sub_experiment 'none'
sub_experiment_id 'none'
table_id 'Amon'
table_info 'Creation Date:(24 July 2019) MD5:0bb394a356ef9d214d027f1aca45853e'
title 'TaiESM1 output prepared for CMIP6'
variant_label 'r1i1p1f1'

Implementing your own derived variables

Guidance on adding new built-in derived variables to ESMValCore is available in Deriving a variable. However, if you are only using the Python interface, you can define an ad-hoc derived variable by subclassing the DerivedDataset class and implementing a custom required attribute and derive method. The required attribute defines the facets that describe the input data:

dataset.required
[{'short_name': 'rlut'}, {'short_name': 'rlutcs'}]

In this case, we see that lwcre is derived from the variables rlut and rlutcs. The derive method takes as its argument the iris cubes that result from loading the datasets described by the facets and the required attribute, and computes the derived variable.
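
As a rough illustration of the required/derive pairing, the sketch below uses a plain Python stand-in class and lists of numbers instead of DerivedDataset and iris cubes. The class name, the dictionary-based derive signature, and the sample flux values are assumptions made for this example; they do not reflect the actual ESMValCore interface. The arithmetic follows the usual definition of the longwave cloud radiative effect: clear-sky outgoing longwave radiation (rlutcs) minus all-sky outgoing longwave radiation (rlut).

```python
class LongwaveCloudRadiativeEffect:
    """Stand-in mimicking the shape of a derived variable (not ESMValCore API)."""

    # Facets describing the input data, as shown by dataset.required above.
    required = [{"short_name": "rlut"}, {"short_name": "rlutcs"}]

    def derive(self, inputs):
        """Compute lwcre = rlutcs - rlut element-wise.

        ``inputs`` maps each required short_name to a list of values in W m-2;
        the real method would receive iris cubes instead.
        """
        rlut = inputs["rlut"]
        rlutcs = inputs["rlutcs"]
        if len(rlut) != len(rlutcs):
            raise ValueError("rlut and rlutcs must have the same length")
        return [clear_sky - all_sky for clear_sky, all_sky in zip(rlutcs, rlut)]


# Made-up sample fluxes in W m-2, for illustration only.
inputs = {"rlut": [240.0, 235.5], "rlutcs": [265.0, 262.0]}
lwcre = LongwaveCloudRadiativeEffect().derive(inputs)
print(lwcre)
```

With iris cubes, the same subtraction can be written directly on the cube objects, since cubes support element-wise arithmetic.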