xarray-contrib · VeckoTheGecko · Jun 17, 2026 · Jul 3, 2026 · VeckoTheGecko · Jul 3, 2026
diff --git a/intermediate/cataloging.ipynb b/intermediate/cataloging.ipynb
@@ -0,0 +1,202 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "0",
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "source": [
+    "# Data cataloguing for Xarray\n",
+    "\n",
+    "**Goals:** At the end of this tutorial, you'll have an overview of what data cataloguing is, why it is done, and what tools are available. You'll also know how to open and browse some data catalogues.\n",
+    "\n",
+    "> A data catalog is a detailed inventory of data assets within an organization. It helps users easily discover, understand, manage, curate and access data. -[IBM](https://www.ibm.com/think/topics/data-catalog)\n",
+    "\n",
+    "Within an organisation, you may work with a lot of different datasets. As someone managing this data, you might want the following things:\n",
+    "\n",
+    "- **An organized, easily browsable collection**: By easing discovery and loading of your datasets you can reduce the friction for those analysing the data. Users can easily get an overview of available datasets, discover datasets of interest, and load them for analysis.\n",
+    "- **Access control and logging**: You can limit who is able to access these datasets, and track when datasets are being accessed - keeping the data private, and providing metrics that can inform data management (e.g., by removing little-used datasets).\n",
+    "- **Combined data assets**: You can combine similar data assets into a larger object (e.g., combining individual time snapshots into a larger data cube).\n",
+    "\n",
+    "There are several solutions available in the Xarray ecosystem that meet some of these needs - we'll broadly call these \"data catalogs\". These solutions differ in functionality, and the right data cataloguing solution depends on your needs as an institute.\n",
+    "\n",
+    "If you feel there are data cataloguing tools missing from this page, please submit a PR.\n",
+    "\n",
+    "## intake v2\n",
+    "\n",
+    "Intake is a lightweight Python package which allows one to specify data catalogues (or a hierarchy of catalogues) via YAML files. These YAML files describe:\n",
+    "\n",
+    "- how to load the dataset. This is done via \"readers\", which support different file formats, and support lazily defining additional transformations to the dataset using Dask\n",
+    "- additional metadata for the dataset\n",
+    "\n",
+    "Uploading the catalogue file to a Git forge like GitHub allows for versioning of this file.\n",
+    "\n",
+    "This catalogue can be browsed in Python, or via third-party tools which parse the catalogue, allowing the user to select and open a dataset. This tool has been adopted in the geosciences community for working with Xarray datasets, but is also flexible enough to work with other dataset types.\n",
+    "\n",
+    "There is no separate authentication layer in Intake, hence access control and logging depend entirely on the way the data is accessed. If the dataset is loaded from a file location on a mounted drive, then you need to have access to that mounted drive. If the dataset is located in an S3 bucket, you need to have the appropriate access credentials for the resource.\n",
+    "\n",
+    "Intake was originally developed within Anaconda by the team behind Dask.\n",
+    "\n",
+    "### Resources\n",
+    "\n",
+    "Project Pythia has a [cookbook on Intake](https://projectpythia.org/intake-cookbook/notebooks/creating-catalogs/).\n",
+    "\n",
+    "### Example code\n",
+    "\n",
+    "This example is based on [an Intake tutorial on easy.gems!](https://easy.gems.dkrz.de/Processing/intake.html) from the DKRZ (German Climate Supercomputing Center). Note that the tutorial is based on v1 of Intake (which also uses `intake-xarray`).\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Browse an existing catalogue\n",
+    "\n",
+    "import intake\n",
+    "\n",
+    "cat = intake.open_catalog(\"https://data.nextgems-h2020.eu/online.yaml\")\n",
+    "# TIP: It's just a YAML file - open it in your browser and you'll see!\n",
+    "\n",
+    "# view the available subcatalogues or datasets\n",
+    "list(cat)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Access catalogue elements using `[\"attr\"]` or `.attr` notation\n",
+    "# NOTE: The `.attr` notation only works if the entry name is a valid Python attribute name - which is not always the case\n",
+    "\n",
+    "# Let's see what the `tutorial` item is...\n",
+    "cat[\"tutorial\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# it's a subcatalogue!\n",
+    "\n",
+    "print(list(cat[\"tutorial\"]))\n",
+    "\n",
+    "# Let's look at an entry within...\n",
+    "cat['tutorial']['ICON.native.2d_PT6H_inst']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Let's open it as an xarray dataset\n",
+    "\n",
+    "ds = cat['tutorial']['ICON.native.2d_PT6H_inst'].to_dask()\n",
+    "ds"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5",
+   "metadata": {},
+   "source": [
+    "That's it for the example! Feel free to poke around further to see what other datasets are available in this catalogue, or to see which other catalogues are available online."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6",
+   "metadata": {},
+   "source": [
+    "## STAC\n",
+    "\n",
+    "> SpatioTemporal Asset Catalogs\n",
+    "> The STAC specification is a common language to describe geospatial information, so it can more easily be worked with, indexed, and discovered.\n",
+    "> -stacspec.org\n",
+    "\n",
+    "Let's set the scene a bit: During satellite data processing, many image files (predominantly TIFF) get generated. Each of these files can have a different spatial and temporal extent, and over time these individual files can number in the millions, containing terabytes of data in total. When analysing these files, the first thing one needs to know is whether a file is even relevant to the analysis at hand (i.e., at least \"what is the spatial and temporal extent of the file?\"). Downloading the file just to check this metadata is very wasteful. This is the context from which the STAC project was born.\n",
+    "\n",
+    "The goal of the STAC project is to make geospatial data more accessible by providing a specification to describe assets and their metadata. Adding this metadata alongside the asset (or lifting it out above the asset itself) allows:\n",
+    "\n",
+    "- The combining of assets into larger collections representing a larger dataset (e.g., a dataset with multiple time slices)\n",
+    "- More efficient workflows, as the whole asset doesn't need to be downloaded to check the metadata.\n",
+    "\n",
+    "\n",
+    "The STAC project consists of:\n",
+    "\n",
+    "- **The STAC Specification**: A set of specifications describing (from most atomic to least) STAC Items, STAC Catalogs, and STAC Collections. There is also a spec for the STAC API.\n",
+    "- **Tooling for working with STAC**: More on this below.\n",
+    "- **The community**\n",
+    "\n",
+    "There are various tools in the STAC ecosystem. [`odc-stac`](https://github.com/opendatacube/odc-stac), [`stackstac`](https://github.com/gjoseph92/stackstac) and [`xpystac`](https://github.com/stac-utils/xpystac) are all projects that allow the opening of STAC items as Xarray datasets.\n",
+    "\n",
+    "### More resources\n",
+    "\n",
+    "- https://stacspec.org/en\n",
+    "- https://stacindex.org/\n",
+    "\n",
+    "### Example using odc-stac\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7",
+   "metadata": {},
+   "source": [
+    "## Commercial offering - Arraylake\n",
+    "\n",
+    "There are also commercial offerings available that have data cataloguing functionality.\n",
+    "\n",
+    "Arraylake is Earthmover's managed cloud platform built on top of the open-source Icechunk data format\n",
+    "(where Icechunk is a project offering versioning of Zarr-like datasets).\n",
+    "\n",
+    "See [here](https://icechunk.io/en/latest/arraylake/) for a full comparison between Icechunk and Arraylake.\n",
+    "\n",
+    "A notable feature is that Arraylake integrates with [Earthmover's data marketplace](https://www.earthmover.io/marketplace/),\n",
+    "powering data discovery and allowing for the dissemination of datasets (or subsets of datasets) to other interested users.\n",
+    "\n",
+    "\n",
+    "\n",
+    "## More resources\n",
+    "\n",
+    "https://guide.cloudnativegeo.org/cookbooks/zarr-stac-report/data-consumers/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8",
+   "metadata": {},
+   "source": []
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/pixi.lock b/pixi.lock
diff --git a/pyproject.toml b/pyproject.toml
@@ -105,3 +105,5 @@ flox = ">=0.10.4,<0.11"
 numbagg = ">=0.9.0,<0.10"
 rich = ">=14.0.0,<15"
 jupyterlab_vim = ">=4.1.4,<5"
+intake = "<2"
+intake-xarray = "<2"