Add data cataloging tutorial #359
Open
VeckoTheGecko wants to merge 2 commits into main from push-tswyomyqkutn
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,202 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "0", | ||
| "metadata": { | ||
| "vscode": { | ||
| "languageId": "plaintext" | ||
| } | ||
| }, | ||
| "source": [ | ||
| "# Data cataloguing for Xarray\n", | ||
| "\n", | ||
| "**Goals:** At the end of this tutorial, you'll have an overview of what data cataloguing is, why it is done, and what tools are available. You'll also know how to open and browse some data catalogues.\n", | ||
| "\n", | ||
| "> A data catalog is a detailed inventory of data assets within an organization. It helps users easily discover, understand, manage, curate and access data. -[IBM](https://www.ibm.com/think/topics/data-catalog)\n", | ||
| "\n", | ||
| "Within an organisation, you may work with a lot of different datasets. As someone managing this data, you might want the following things:\n", | ||
| "\n", | ||
| "- **An organized, easily browsable collection**: By easing discovery and loading of your datasets, you can reduce the friction for those analysing the data. Users can easily get an overview of available datasets, discover datasets of interest, and load them for analysis.\n", | ||
| "- **Access control and logging**: You can limit who is able to access these datasets, and track when datasets are being accessed - keeping the data private, and providing metrics that can inform data management (e.g., by removing little-used datasets).\n", | ||
| "- **Combined data assets**: Similar individual data assets can be combined into a larger object (e.g., combining individual time snapshots into a larger data cube).\n", | ||
| "\n", | ||
| "There are several solutions in the Xarray ecosystem that meet some of these needs - we'll broadly call these \"data catalogs\". These solutions differ in functionality, and the right data cataloguing solution depends on your needs as an institute.\n", | ||
| "\n", | ||
| "If you feel there are data cataloguing tools missing from this page, please submit a PR.\n", | ||
| "\n", | ||
| "## intake v2\n", | ||
| "\n", | ||
| "Intake is a lightweight Python package which allows one to specify data catalogues (or a hierarchy of catalogues) via YAML files. These YAML files describe:\n", | ||
| "\n", | ||
| "- how to load the dataset. This is done via \"readers\", which support different file formats, and support lazily defining additional transformations to the dataset using Dask\n", | ||
| "- additional metadata for the dataset\n", | ||
| "\n", | ||
| "Uploading the catalogue file to a Git forge like GitHub allows for versioning of this file.\n", | ||
| "\n", | ||
| "This catalogue can be browsed in Python, or via third-party tools which parse the catalogue, allowing the user to select and open a dataset. This tool has been adopted in the geosciences community for working with Xarray datasets, but is also flexible enough to work with other dataset types.\n", | ||
| "\n", | ||
| "There is no separate authentication layer in Intake, hence access control and logging depend entirely on the way the data is accessed. If the dataset is loaded from a file location on a mounted drive, then you need to have access to that mounted drive. If the dataset is located in an S3 bucket, you need to have the appropriate access credentials for the resource.\n", | ||
| "\n", | ||
| "Intake was originally developed within Anaconda by the team behind Dask.\n", | ||
| "\n", | ||
| "### Resources\n", | ||
| "\n", | ||
| "Project Pythia has a [cookbook on Intake](https://projectpythia.org/intake-cookbook/notebooks/creating-catalogs/).\n", | ||
| "\n", | ||
| "### Example code\n", | ||
| "\n", | ||
| "The example below is based on [an Intake tutorial on easy.gems](https://easy.gems.dkrz.de/Processing/intake.html) from the DKRZ (German Climate Supercomputing Center). Note that the tutorial is based on v1 of Intake (which also uses `intake-xarray`).\n" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "1", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Browse an existing catalogue\n", | ||
| "\n", | ||
| "import intake\n", | ||
| "\n", | ||
| "cat = intake.open_catalog(\"https://data.nextgems-h2020.eu/online.yaml\")\n", | ||
| "# TIP: It's just a YAML file - open it in your browser and you'll see!\n", | ||
| "\n", | ||
| "# view the available subcatalogues or datasets\n", | ||
| "list(cat)" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "2", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Access catalogue elements using `[\"attr\"]` or `.attr` notation\n", | ||
| "# NOTE: The `.attr` notation only works if the entry is a valid Python attribute name - which the entry might not be\n", | ||
| "\n", | ||
| "# Let's see what the `tutorial` item is...\n", | ||
| "cat[\"tutorial\"]" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "3", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# it's a subcatalogue!\n", | ||
| "\n", | ||
| "print(list(cat[\"tutorial\"]))\n", | ||
| "\n", | ||
| "# Let's look at an entry within...\n", | ||
| "cat['tutorial']['ICON.native.2d_PT6H_inst']" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "id": "4", | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# Let's open it as an xarray dataset\n", | ||
| "\n", | ||
| "ds = cat['tutorial']['ICON.native.2d_PT6H_inst'].to_dask()\n", | ||
| "ds" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "5", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "That's it for the example! Feel free to poke around further to see what other datasets are available in this catalogue, or to see which other catalogues are available online." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "6", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## STAC\n", | ||
| "\n", | ||
| "> SpatioTemporal Asset Catalogs\n", | ||
| "> The STAC specification is a common language to describe geospatial information, so it can more easily be worked with, indexed, and discovered.\n", | ||
| "> -stacspec.org\n", | ||
| "\n", | ||
| "Let's set the scene a bit: During satellite data processing, many image files (predominantly TIFF) get generated. Each of these files can have different spatial and temporal extents, and over time these individual files can number in the millions, containing terabytes of data in total. When doing analyses on these files, the first thing one needs to know is whether the file is even relevant to the analysis at hand (i.e., at least \"what is the spatial and temporal extent of the file?\"). However, downloading the file just to check its metadata is very wasteful. This is the context from which the STAC project was born.\n", | ||
| "\n", | ||
| "The goal of the STAC project is to make geospatial data more accessible by providing a specification to describe assets and their metadata. Adding this metadata alongside the asset (or lifting it out above the asset itself) allows:\n", | ||
| "\n", | ||
| "- Combining assets into larger collections that represent a larger dataset (e.g., a dataset with multiple time slices).\n", | ||
| "- More efficient workflows, as the whole asset doesn't need to be downloaded to check the metadata.\n", | ||
| "\n", | ||
| "\n", | ||
| "The STAC project comprises:\n", | ||
| "\n", | ||
| "- **The STAC Specification**: A set of specifications describing (from most atomic to least) STAC Items, STAC Catalogs, and STAC Collections. There is also a spec for the STAC API.\n", | ||
| "- **Tooling for working with STAC**: More on this below.\n", | ||
| "- **The community**\n", | ||
| "\n", | ||
| "There are various tools in the STAC ecosystem. [`odc-stac`](https://github.com/opendatacube/odc-stac), [`stackstac`](https://github.com/gjoseph92/stackstac) and [`xpystac`](https://github.com/stac-utils/xpystac) are all projects that allow the opening of STAC items as Xarray datasets.\n", | ||
| "\n", | ||
| "### More resources\n", | ||
| "\n", | ||
| "- https://stacspec.org/en\n", | ||
| "- https://stacindex.org/\n", | ||
| "\n", | ||
| "### Example using odc-stac\n", | ||
| "\n" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "7", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Commercial offering - Arraylake\n", | ||
| "\n", | ||
| "There are also commercial offerings available that have data cataloguing functionality.\n", | ||
| "\n", | ||
| "Arraylake is Earthmover's managed cloud platform built on top of the open-source Icechunk data format\n", | ||
| "(where Icechunk is a project offering versioning of Zarr-like datasets).\n", | ||
| "\n", | ||
| "See [here](https://icechunk.io/en/latest/arraylake/) for a full comparison between Icechunk and Arraylake.\n", | ||
| "\n", | ||
| "A notable feature is that Arraylake integrates with [Earthmover's data marketplace](https://www.earthmover.io/marketplace/),\n", | ||
| "powering data discovery and allowing for the dissemination of datasets (or subsets of datasets) to other interested users.\n", | ||
| "\n", | ||
| "\n", | ||
| "\n", | ||
| "## More resources\n", | ||
| "\n", | ||
| "https://guide.cloudnativegeo.org/cookbooks/zarr-stac-report/data-consumers/" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "id": "8", | ||
| "metadata": {}, | ||
| "source": [] | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "language_info": { | ||
| "codemirror_mode": { | ||
| "name": "ipython", | ||
| "version": 3 | ||
| }, | ||
| "file_extension": ".py", | ||
| "mimetype": "text/x-python", | ||
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 5 | ||
| } |
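The note in cell `2` that `.attr` access only works for valid Python attribute names can be checked with the standard library alone. Below is a minimal sketch using the entry names that appear in the notebook cells; `supports_attr_access` is a hypothetical helper for illustration and is not part of Intake.

```python
import keyword


def supports_attr_access(name: str) -> bool:
    """Return True if `name` could be written as `cat.<name>` in Python source."""
    # A usable attribute name must be a valid identifier and not a reserved keyword.
    return name.isidentifier() and not keyword.iskeyword(name)


# Entry names taken from the notebook cells in this diff.
for entry in ["tutorial", "ICON.native.2d_PT6H_inst"]:
    style = '.attr or ["key"]' if supports_attr_access(entry) else '["key"] only'
    print(f"{entry!r}: {style}")
```

This only checks Python's syntax rule. It does not cover names that shadow attributes the catalogue object already defines, so the `["key"]` form remains the safer default.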
@keewis what do you think would be a good example to show off odc-stac? I was looking at the example notebook "Access Sentinel 2 Data from AWS" in their docs. It seems like a lot of setup, making me wonder if there's an example better suited to this notebook.
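Until a suitable odc-stac example is settled on, the metadata-first idea from the STAC section could be illustrated without network access or extra dependencies. Below is a minimal sketch, assuming records that carry the `bbox` and `properties.datetime` fields of a STAC Item. The item ids, hrefs, and the `find_relevant` helper are made up for illustration, and none of this is the odc-stac API.

```python
from datetime import datetime, timezone

# Hypothetical records shaped like STAC Items (only the fields used here).
ITEMS = [
    {
        "id": "scene-a",
        "bbox": [4.0, 51.0, 6.0, 53.0],  # [west, south, east, north]
        "properties": {"datetime": "2023-06-01T10:00:00Z"},
        "assets": {"data": {"href": "https://example.com/scene-a.tif"}},
    },
    {
        "id": "scene-b",
        "bbox": [10.0, 40.0, 12.0, 42.0],
        "properties": {"datetime": "2023-06-02T10:00:00Z"},
        "assets": {"data": {"href": "https://example.com/scene-b.tif"}},
    },
    {
        "id": "scene-c",
        "bbox": [4.5, 51.5, 5.5, 52.5],
        "properties": {"datetime": "2022-01-15T10:00:00Z"},
        "assets": {"data": {"href": "https://example.com/scene-c.tif"}},
    },
]


def parse_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def boxes_overlap(a, b) -> bool:
    """Check whether two [west, south, east, north] boxes intersect.

    Boxes that cross the antimeridian are not handled in this sketch.
    """
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def find_relevant(items, bbox, start, end):
    """Select items by metadata alone; no asset is opened or downloaded."""
    return [
        item
        for item in items
        if boxes_overlap(item["bbox"], bbox)
        and start <= parse_time(item["properties"]["datetime"]) <= end
    ]


hits = find_relevant(
    ITEMS,
    bbox=[5.0, 52.0, 5.5, 52.5],
    start=datetime(2023, 6, 1, tzinfo=timezone.utc),
    end=datetime(2023, 7, 1, tzinfo=timezone.utc),
)
print([item["id"] for item in hits])
```

Only the hrefs of the matching items would then need to be opened, which is the step tools such as odc-stac take care of.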