Derived variables

This notebook shows how to use derived variables. A derived variable is one that is not available as an input dataset but is computed from one or more input variables.

import pandas as pd
import yaml

import esmvalcore.preprocessor
from esmvalcore.cmor.table import get_tables
from esmvalcore.config import CFG
from esmvalcore.dataset import Dataset, DerivedDataset, datasets_to_recipe

pd.set_option("display.max_colwidth", None)

First, we configure ESMValCore so it searches the ESGF for data:

CFG["projects"]["CMIP6"].pop(
    "data",
    None,
)  # Clear existing CMIP6 configuration for finding input data
CFG.nested_update(
    {
        "projects": {
            "CMIP6": {
                "data": {
                    "intake-esgf": {
                        "type": "esmvalcore.io.intake_esgf.IntakeESGFDataSource",
                        "priority": 2,
                        "facets": {
                            "activity": "activity_drs",
                            "dataset": "source_id",
                            "ensemble": "member_id",
                            "exp": "experiment_id",
                            "institute": "institution_id",
                            "grid": "grid_label",
                            "mip": "table_id",
                            "project": "project",
                            "short_name": "variable_id",
                        },
                    },
                },
            },
        },
    },
)

Which variables can be derived?

The interface for working with derived variables from Python is not very polished yet. To list all available derived variables, we can run:

pd.DataFrame.from_dict(
    [
        {
            "short_name": short_name,
        }
        | {
            k: getattr(
                get_tables(CFG, project="CMIP6").get_variable(
                    table_name="x",
                    short_name=short_name,
                    derived=True,
                ),
                k,
                None,
            )
            for k in ["units", "long_name"]
        }
        for short_name in esmvalcore.preprocessor._derive.ALL_DERIVED_VARIABLES  # noqa: SLF001
    ],
).sort_values("short_name")

	short_name	units	long_name
29	alb	1	albedo at the surface
38	amoc	kg s-1	Atlantic Meridional Overturning Circulation
44	asr	W m-2	Absorbed shortwave radiation
32	chlora	kg m-3	chlorophyll concentration
46	clhmtisccp	%	ISCCP High Level Medium-Thickness Cloud Area Fraction
2	clhtkisccp	%	ISCCP high level thick cloud area fraction
7	cllmtisccp	%	ISCCP Low Level Medium-Thickness Cloud Area Fraction
11	clltkisccp	%	ISCCP low level thick cloud area fraction
0	clmmtisccp	%	ISCCP Middle Level Medium-Thickness Cloud Area Fraction
36	clmtkisccp	%	ISCCP Middle Level Thick Cloud Area Fraction
40	co2s	1e-06	Atmosphere CO2
42	ctotal	kg m-2	Total Carbon Mass in Ecosystem
47	et	mm day-1	Evapotranspiration
5	hfns	W m-2	Surface Net Heat Flux
9	hurs	%	Near-Surface Relative Humidity
26	lapserate	K km-1	Lapse Rate
20	lvp	W m-2	Latent Heat Release from Precipitation
8	lwcre	W m-2	TOA Longwave Cloud Radiative Effect
41	lwp	kg m-2	Liquid Water Path
31	netcre	W m-2	TOA Net Cloud Radiative Effect
23	ohc	J	Heat content in grid cell
43	qep	kg m-2 s-1	Net moisture flux into atmosphere
39	rlns	W m-2	Surface Net downward Longwave Radiation
13	rlnst	W m-2	Net Atmospheric Longwave Cooling
33	rlnstcs	W m-2	Net Atmospheric Longwave Cooling assuming clear sky
12	rlntcs	W m-2	TOA Net downward Longwave Radiation assuming clear sky
45	rlus	W m-2	Surface Upwelling Longwave Radiation
28	rsns	W m-2	Surface Net downward Shortwave Radiation
25	rsnst	W m-2	Heating from Shortwave Absorption
34	rsnstcs	W m-2	Heating from Shortwave Absorption assuming clear sky
22	rsnstcsnorm	%	Heating from Shortwave Absorption assuming clear sky normalized by incoming solar radiation
27	rsnt	W m-2	TOA Net downward Shortwave Radiation
3	rsntcs	W m-2	TOA Net downward Shortwave Radiation assuming clear sky
10	rsus	W m-2	Surface Upwelling Shortwave Radiation
17	rtnt	W m-2	TOA Net downward Total Radiation
1	sfcwind	NaN	NaN
30	siextent	1	Sea Ice Extent
14	sispeed	m s-1	Sea-Ice Speed
37	sithick	m	Sea Ice Thickness
15	sm	m3 m-3	Volumetric Moisture in Upper Portion of Soil Column
16	soz	m	Stratospheric Ozone Column (O3 mole fraction >= 125 ppb)
4	swcre	W m-2	TOA Shortwave Cloud Radiative Effect
21	toz	m	Total Column Ozone
6	troz	m	Tropospheric Ozone Column (O3 mole fraction < 125 ppb)
35	uajet	degrees	Jet position expressed as latitude of maximum meridional wind speed
19	vegfrac	%	Vegetation Fraction
24	xch4	1	Column-average Dry-air Mole Fraction of Atmospheric Methane
18	xco2	1	Column-average Dry-air Mole Fraction of Atmospheric Carbon Dioxide

Note that modules, functions, and variables starting with a single _ character should be considered internal, so there are no guarantees about the stability of this interface.

Finding available datasets

We define a dataset template to search for all CMIP6 models that provide every input dataset required to derive lwcre, the longwave cloud radiative effect at the top of the atmosphere, at monthly resolution for the historical experiment. Note that ESMValCore uses its own facet names to keep naming uniform across CMIP phases and other projects. The mapping to the facet names used on ESGF can be found in Facets.

dataset_template = DerivedDataset(
    short_name="lwcre",
    mip="Amon",
    project="CMIP6",
    exp="historical",
    dataset="*",
    institute="*",
    ensemble="r1i1p1f1",
    grid="gn",
)
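To make the facet naming concrete, the sketch below renames a set of ESMValCore facets to the ESGF names, using the same mapping as in the configuration above. It is plain Python for illustration only: `to_esgf_facets` is a hypothetical helper, not part of ESMValCore, and dropping wildcard facets is an assumption about how a search query might be built. The short name `rlut` is used because the search targets the input variables rather than lwcre itself.

```python
# ESMValCore facet name -> ESGF facet name, copied from the configuration above.
FACET_MAP = {
    "activity": "activity_drs",
    "dataset": "source_id",
    "ensemble": "member_id",
    "exp": "experiment_id",
    "institute": "institution_id",
    "grid": "grid_label",
    "mip": "table_id",
    "project": "project",
    "short_name": "variable_id",
}


def to_esgf_facets(facets):
    """Hypothetical helper: rename facets and skip wildcard values."""
    return {FACET_MAP[key]: value for key, value in facets.items() if value != "*"}


esgf = to_esgf_facets(
    {
        "short_name": "rlut",  # one of the inputs needed to derive lwcre
        "mip": "Amon",
        "project": "CMIP6",
        "exp": "historical",
        "dataset": "*",  # wildcard: left out of the translated query
        "ensemble": "r1i1p1f1",
        "grid": "gn",
    }
)
print(esgf)
```

The wildcard `dataset` facet is omitted from the result, so the translated query does not constrain `source_id`.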

Next, we use the DerivedDataset.from_files method to build a list of datasets from the available files. This may take a while, because searching the ESGF for many files can be slow. The search results are cached, so subsequent searches will be faster.

datasets = list(dataset_template.from_files())
print(f"Found {len(datasets)} datasets, showing the first 10:")
datasets[:10]

Found 37 datasets, showing the first 10:

[DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=TaiESM1, institute=AS-RCEC, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=AWI-CM-1-1-MR, institute=AWI, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=AWI-ESM-1-1-LR, institute=AWI, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=BCC-CSM2-MR, institute=BCC, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=BCC-ESM1, institute=BCC, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=CAMS-CSM1-0, institute=CAMS, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=CAS-ESM2-0, institute=CAS, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=FGOALS-g3, institute=CAS, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=IITM-ESM, institute=CCCR-IITM, ensemble=r1i1p1f1, grid=gn),
 DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=CanESM5-1, institute=CCCma, ensemble=r1i1p1f1, grid=gn)]

Composing a recipe with derived variables

To use the datasets found above in a recipe, we will want to use the name of the variable that needs to be derived, along with the derive: true option:

recipe_datasets = [
    Dataset(
        diagnostic="diagnostic_name",
        derive=True,
        **dataset.facets,
    )
    for dataset in datasets
]
print(yaml.safe_dump(datasets_to_recipe(recipe_datasets)))

datasets:
- dataset: ACCESS-CM2
  institute: CSIRO-ARCCSS
- dataset: ACCESS-ESM1-5
  institute: CSIRO
- dataset: AWI-CM-1-1-MR
  institute: AWI
- dataset: AWI-ESM-1-1-LR
  institute: AWI
- dataset: BCC-CSM2-MR
  institute: BCC
- dataset: BCC-ESM1
  institute: BCC
- dataset: CAMS-CSM1-0
  institute: CAMS
- dataset: CAS-ESM2-0
  institute: CAS
- dataset: CESM2
  institute: NCAR
- dataset: CESM2-FV2
  institute: NCAR
- dataset: CESM2-WACCM
  institute: NCAR
- dataset: CESM2-WACCM-FV2
  institute: NCAR
- dataset: CMCC-CM2-HR4
  institute: CMCC
- dataset: CMCC-CM2-SR5
  institute: CMCC
- dataset: CMCC-ESM2
  institute: CMCC
- dataset: CanESM5
  institute: CCCma
- dataset: CanESM5-1
  institute: CCCma
- dataset: FGOALS-g3
  institute: CAS
- dataset: FIO-ESM-2-0
  institute: FIO-QLNM
- dataset: GISS-E2-1-G
  institute: NASA-GISS
- dataset: GISS-E2-1-G-CC
  institute: NASA-GISS
- dataset: GISS-E2-1-H
  institute: NASA-GISS
- dataset: GISS-E2-2-G
  institute: NASA-GISS
- dataset: GISS-E2-2-H
  institute: NASA-GISS
- dataset: ICON-ESM-LR
  institute: MPI-M
- dataset: IITM-ESM
  institute: CCCR-IITM
- dataset: MIROC6
  institute: MIROC
- dataset: MPI-ESM-1-2-HAM
  institute: HAMMOZ-Consortium
- dataset: MPI-ESM1-2-HR
  institute: MPI-M
- dataset: MPI-ESM1-2-LR
  institute: MPI-M
- dataset: MRI-ESM2-0
  institute: MRI
- dataset: NESM3
  institute: NUIST
- dataset: NorCPM1
  institute: NCC
- dataset: NorESM2-LM
  institute: NCC
- dataset: NorESM2-MM
  institute: NCC
- dataset: SAM0-UNICON
  institute: SNU
- dataset: TaiESM1
  institute: AS-RCEC
diagnostics:
  diagnostic_name:
    variables:
      lwcre:
        derive: true
        ensemble: r1i1p1f1
        exp: historical
        grid: gn
        mip: Amon
        project: CMIP6

There is also a force_derivation option available for use in the recipe. When set to true, it causes the variable to be derived even if it is already available as a dataset.
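As a rough illustration of where this option would go, the variable entry of the recipe printed above could be extended as follows. The sketch uses a plain Python dictionary that mirrors the recipe structure and does not call ESMValCore. Placing force_derivation next to derive is an assumption based on the description above.

```python
# Plain-dict mirror of the "variables" section of the recipe printed above.
# Assumption: force_derivation sits at the same level as derive.
recipe_variables = {
    "lwcre": {
        "derive": True,
        "force_derivation": True,  # derive even if lwcre exists as a dataset
        "ensemble": "r1i1p1f1",
        "exp": "historical",
        "grid": "gn",
        "mip": "Amon",
        "project": "CMIP6",
    },
}

for name, facets in recipe_variables.items():
    print(name, "force_derivation =", facets["force_derivation"])
```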

Computing the derived variable

Let’s load the data to derive the first dataset:

dataset = datasets[0]
dataset

DerivedDataset(short_name=lwcre, mip=Amon, project=CMIP6, exp=historical, dataset=TaiESM1, institute=AS-RCEC, ensemble=r1i1p1f1, grid=gn)

cubes = dataset.load()
cubes

WARNING:esmvalcore.cmor.check:There were warnings in variable rlut:
 rlut: attribute positive not present
loaded from file 
WARNING:esmvalcore.cmor.check:There were warnings in variable rlutcs:
 rlutcs: attribute positive not present
loaded from file 

Toa Longwave Cloud Radiative Effect (W m-2)	time	latitude	longitude
Shape	1980	192	288
Dimension coordinates
time	x	-	-
latitude	-	x	-
longitude	-	-	x
Attributes
Conventions	'CF-1.7 CMIP-6.2'
activity_drs	'CMIP'
activity_id	'CMIP'
branch_method	'Hybrid-restart from year 0671-01-01 of piControl'
branch_time	0.0
branch_time_in_child	-674885
branch_time_in_parent	171550.0
cmor_version	'3.5.0'
contact	'Dr. Wei-Liang Lee (leelupin@gate.sinica.edu.tw)'
data_specs_version	'01.00.31'
experiment	'all-forcing simulation of the recent past'
experiment_id	'historical'
external_variables	'areacella'
forcing_index	1
frequency	'mon'
further_info_url	'https://furtherinfo.es-doc.org/CMIP6.AS-RCEC.TaiESM1.historical.none.r ...'
grid	'finite-volume grid with 0.9x1.25 degree lat/lon resolution'
grid_label	'gn'
initialization_index	1
institution	'Research Center for Environmental Changes, Academia Sinica, Nankang, Taipei ...'
institution_id	'AS-RCEC'
license	'CMIP6 model data produced by NCC is licensed under a Creative Commons Attribution ...'
member_id	'r1i1p1f1'
mip_era	'CMIP6'
model_id	'TaiESM1'
nominal_resolution	'100 km'
original_units	'W/m2'
parent_activity_id	'CMIP'
parent_experiment_id	'piControl'
parent_mip_era	'CMIP6'
parent_source_id	'TaiESM1'
parent_sub_experiment_id	'none'
parent_time_units	'days since 1850-1-1 00:00:00'
parent_variant_label	'r1i1p1f1'
physics_index	1
positive	'down'
product	'model-output'
realization_index	1
realm	'atmos'
references	'10.5194/gmd-2019-377'
run_variant	'N/A'
source	'TaiESM 1.0 (2018): \naerosol: SNAP (same grid as atmos)\natmos: TaiAM1 ...'
source_id	'TaiESM1'
source_type	'AOGCM AER BGC'
sub_experiment	'none'
sub_experiment_id	'none'
table_id	'Amon'
table_info	'Creation Date:(24 July 2019) MD5:0bb394a356ef9d214d027f1aca45853e'
title	'TaiESM1 output prepared for CMIP6'
variant_label	'r1i1p1f1'

Implementing your own derived variables

Guidance on adding new built-in derived variables to ESMValCore is available in Deriving a variable. However, if you are only using the Python interface, you can define an ad-hoc derived variable by subclassing the DerivedDataset class and implementing a custom required attribute and derive method. The required attribute defines the facets that describe the input data:

dataset.required

[{'short_name': 'rlut'}, {'short_name': 'rlutcs'}]

In this case, we see that lwcre is derived from the variables rlut and rlutcs. The derive method is a function that takes as its argument the iris cubes obtained by loading the datasets described by the facets in the required attribute, and computes the derived variable from them.
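The pattern of a required attribute plus a derive method can be sketched without ESMValCore at all. The following is a minimal, self-contained illustration, not the library's implementation: `ToyLwcre` is a hypothetical stand-in for a DerivedDataset subclass, plain lists of floats stand in for iris cubes, and passing the inputs to `derive` as a dictionary keyed by short name is an assumption made for readability. The formula assumes that lwcre is the clear-sky minus the all-sky outgoing longwave radiation (rlutcs - rlut); check this against the built-in derivation before relying on it.

```python
class ToyLwcre:
    """Hypothetical stand-in for a DerivedDataset subclass (not ESMValCore API)."""

    # Facets describing the input variables, as shown by `dataset.required`.
    required = [{"short_name": "rlut"}, {"short_name": "rlutcs"}]

    def load_inputs(self, source):
        """Pick one input per entry in `required` (stand-in for loading cubes)."""
        return {
            facets["short_name"]: source[facets["short_name"]]
            for facets in self.required
        }

    def derive(self, inputs):
        """Compute lwcre as clear-sky minus all-sky outgoing longwave radiation."""
        return [
            clear_sky - all_sky
            for clear_sky, all_sky in zip(inputs["rlutcs"], inputs["rlut"])
        ]


# Made-up values in W m-2, for illustration only.
fake_data = {"rlut": [240.0, 235.5], "rlutcs": [265.0, 262.5]}

toy = ToyLwcre()
lwcre = toy.derive(toy.load_inputs(fake_data))
print(lwcre)  # [25.0, 27.0]
```

Keeping the input description (`required`) separate from the computation (`derive`) means the loading step can be generic, while each derived variable only supplies its own formula.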