Zarr Encoding Specification

In implementing support for the Zarr storage format, Xarray developers made some ad hoc choices about how to store NetCDF data in Zarr. Future versions of the Zarr spec will likely include a more formal convention for the storage of the NetCDF data model in Zarr; see Zarr spec repo for ongoing discussion.

First, Xarray can only read and write Zarr groups. There is currently no support for reading / writing individual Zarr arrays. Zarr groups are mapped to Xarray Dataset objects.

Second, from Xarray’s point of view, the key difference between NetCDF and Zarr is that all NetCDF arrays have dimension names while Zarr arrays do not. Therefore, in order to store NetCDF data in Zarr, Xarray must somehow encode and decode the name of each array’s dimensions.

To accomplish this, Xarray developers decided to define a special Zarr array attribute: _ARRAY_DIMENSIONS. The value of this attribute is a list of dimension names (strings), for example ["time", "lon", "lat"]. When writing data to Zarr, Xarray sets this attribute on all variables based on the variable dimensions. When reading a Zarr group, Xarray looks for this attribute on all arrays, raising an error if it can’t be found. The attribute is used to define the variable dimension names and then removed from the attributes dictionary returned to the user.

Because of these choices, Xarray cannot read arbitrary array data, but only Zarr data with valid _ARRAY_DIMENSIONS or NCZarr attributes on each array (NCZarr dimension names are defined in the .zarray file).

After decoding the _ARRAY_DIMENSIONS or NCZarr attribute and assigning the variable dimensions, Xarray proceeds to [optionally] decode each variable using its standard CF decoding machinery used for NetCDF data (see decode_cf()).

Finally, it’s worth noting that Xarray writes (and attempts to read) “consolidated metadata” by default (the .zmetadata file), which is another non-standard Zarr extension, albeit one implemented upstream in Zarr-Python. You do not need to write consolidated metadata to make Zarr stores readable in Xarray, but because Xarray can open these stores much faster, users will see a warning about poor performance when reading non-consolidated stores unless they explicitly set consolidated=False. See Consolidated Metadata for more details.

As a concrete example, here we write a tutorial dataset to Zarr and then re-open it directly with Zarr:

In [1]: import os

In [2]: import xarray as xr

In [3]: import zarr

In [4]: ds = xr.tutorial.load_dataset("rasm")
---------------------------------------------------------------------------
gaierror                                  Traceback (most recent call last)
File /usr/lib/python3/dist-packages/urllib3/connection.py:174, in HTTPConnection._new_conn(self)
    173 try:
--> 174     conn = connection.create_connection(
    175         (self._dns_host, self.port), self.timeout, **extra_kw
    176     )
    178 except SocketTimeout:

File /usr/lib/python3/dist-packages/urllib3/util/connection.py:73, in create_connection(address, timeout, source_address, socket_options)
     69     return six.raise_from(
     70         LocationParseError("'%s', label empty or too long" % host), None
     71     )
---> 73 for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
     74     af, socktype, proto, canonname, sa = res

File /usr/lib/python3.12/socket.py:964, in getaddrinfo(host, port, family, type, proto, flags)
    963 addrlist = []
--> 964 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    965     af, socktype, proto, canonname, sa = res

gaierror: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

NewConnectionError                        Traceback (most recent call last)
File /usr/lib/python3/dist-packages/urllib3/connectionpool.py:716, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    715 # Make the request on the httplib connection object.
--> 716 httplib_response = self._make_request(
    717     conn,
    718     method,
    719     url,
    720     timeout=timeout_obj,
    721     body=body,
    722     headers=headers,
    723     chunked=chunked,
    724 )
    726 # If we're going to release the connection in ``finally:``, then
    727 # the response doesn't need to know about the connection. Otherwise
    728 # it will also try to release it and we'll have a double-release
    729 # mess.

File /usr/lib/python3/dist-packages/urllib3/connectionpool.py:405, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    404 try:
--> 405     self._validate_conn(conn)
    406 except (SocketTimeout, BaseSSLError) as e:
    407     # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.

File /usr/lib/python3/dist-packages/urllib3/connectionpool.py:1059, in HTTPSConnectionPool._validate_conn(self, conn)
   1058 if not getattr(conn, "sock", None):  # AppEngine might not have  `.sock`
-> 1059     conn.connect()
   1061 if not conn.is_verified:

File /usr/lib/python3/dist-packages/urllib3/connection.py:363, in HTTPSConnection.connect(self)
    361 def connect(self):
    362     # Add certificate verification
--> 363     self.sock = conn = self._new_conn()
    364     hostname = self.host

File /usr/lib/python3/dist-packages/urllib3/connection.py:186, in HTTPConnection._new_conn(self)
    185 except SocketError as e:
--> 186     raise NewConnectionError(
    187         self, "Failed to establish a new connection: %s" % e
    188     )
    190 return conn

NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f8419d10500>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

MaxRetryError                             Traceback (most recent call last)
File /usr/lib/python3/dist-packages/requests/adapters.py:667, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    666 try:
--> 667     resp = conn.urlopen(
    668         method=request.method,
    669         url=url,
    670         body=request.body,
    671         headers=request.headers,
    672         redirect=False,
    673         assert_same_host=False,
    674         preload_content=False,
    675         decode_content=False,
    676         retries=self.max_retries,
    677         timeout=timeout,
    678         chunked=chunked,
    679     )
    681 except (ProtocolError, OSError) as err:

File /usr/lib/python3/dist-packages/urllib3/connectionpool.py:800, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    798     e = ProtocolError("Connection aborted.", e)
--> 800 retries = retries.increment(
    801     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    802 )
    803 retries.sleep()

File /usr/lib/python3/dist-packages/urllib3/util/retry.py:592, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    591 if new_retry.is_exhausted():
--> 592     raise MaxRetryError(_pool, url, error or ResponseError(cause))
    594 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)

MaxRetryError: HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /pydata/xarray-data/raw/master/rasm.nc (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f8419d10500>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
Cell In[4], line 1
----> 1 ds = xr.tutorial.load_dataset("rasm")

File /usr/src/packages/BUILD/xarray/tutorial.py:213, in load_dataset(*args, **kwargs)
    176 def load_dataset(*args, **kwargs) -> Dataset:
    177     """
    178     Open, load into memory, and close a dataset from the online repository
    179     (requires internet).
   (...)
    211     load_dataset
    212     """
--> 213     with open_dataset(*args, **kwargs) as ds:
    214         return ds.load()

File /usr/src/packages/BUILD/xarray/tutorial.py:165, in open_dataset(name, cache, cache_dir, engine, **kws)
    162 downloader = pooch.HTTPDownloader(headers=headers)
    164 # retrieve the file
--> 165 filepath = pooch.retrieve(
    166     url=url, known_hash=None, path=cache_dir, downloader=downloader
    167 )
    168 ds = _open_dataset(filepath, engine=engine, **kws)
    169 if not cache:

File /usr/lib/python3/dist-packages/pooch/core.py:239, in retrieve(url, known_hash, fname, path, processor, downloader, progressbar)
    236 if downloader is None:
    237     downloader = choose_downloader(url, progressbar=progressbar)
--> 239 stream_download(url, full_path, known_hash, downloader, pooch=None)
    241 if known_hash is None:
    242     get_logger().info(
    243         "SHA256 hash of downloaded file: %s\n"
    244         "Use this value as the 'known_hash' argument of 'pooch.retrieve'"
   (...)
    247         file_hash(str(full_path)),
    248     )

File /usr/lib/python3/dist-packages/pooch/core.py:807, in stream_download(url, fname, known_hash, downloader, pooch, retry_if_failed)
    803 try:
    804     # Stream the file to a temporary so that we can safely check its
    805     # hash before overwriting the original.
    806     with temporary_file(path=str(fname.parent)) as tmp:
--> 807         downloader(url, tmp, pooch)
    808         hash_matches(tmp, known_hash, strict=True, source=str(fname.name))
    809         shutil.move(tmp, str(fname))

File /usr/lib/python3/dist-packages/pooch/downloaders.py:208, in HTTPDownloader.__call__(self, url, output_file, pooch, check_only)
    206     output_file = open(output_file, "w+b")
    207 try:
--> 208     response = requests.get(url, **kwargs)
    209     response.raise_for_status()
    210     content = response.iter_content(chunk_size=self.chunk_size)

File /usr/lib/python3/dist-packages/requests/api.py:73, in get(url, params, **kwargs)
     62 def get(url, params=None, **kwargs):
     63     r"""Sends a GET request.
     64 
     65     :param url: URL for the new :class:`Request` object.
   (...)
     70     :rtype: requests.Response
     71     """
---> 73     return request("get", url, params=params, **kwargs)

File /usr/lib/python3/dist-packages/requests/api.py:59, in request(method, url, **kwargs)
     55 # By using the 'with' statement we are sure the session is closed, thus we
     56 # avoid leaving sockets open which can trigger a ResourceWarning in some
     57 # cases, and look like a memory leak in others.
     58 with sessions.Session() as session:
---> 59     return session.request(method=method, url=url, **kwargs)

File /usr/lib/python3/dist-packages/requests/sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    584 send_kwargs = {
    585     "timeout": timeout,
    586     "allow_redirects": allow_redirects,
    587 }
    588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
    591 return resp

File /usr/lib/python3/dist-packages/requests/sessions.py:703, in Session.send(self, request, **kwargs)
    700 start = preferred_clock()
    702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
    705 # Total elapsed time of the request (approximately)
    706 elapsed = preferred_clock() - start

File /usr/lib/python3/dist-packages/requests/adapters.py:700, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    696     if isinstance(e.reason, _SSLError):
    697         # This branch is for urllib3 v1.22 and later.
    698         raise SSLError(e, request=request)
--> 700     raise ConnectionError(e, request=request)
    702 except ClosedPoolError as e:
    703     raise ConnectionError(e, request=request)

ConnectionError: HTTPSConnectionPool(host='github.com', port=443): Max retries exceeded with url: /pydata/xarray-data/raw/master/rasm.nc (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f8419d10500>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))

In [5]: ds.to_zarr("rasm.zarr", mode="w")
Out[5]: <xarray.backends.zarr.ZarrStore at 0x7f84194dea70>

In [6]: zgroup = zarr.open("rasm.zarr")

In [7]: print(os.listdir("rasm.zarr"))
['longitude', '.zattrs', '.zmetadata', 'latitude', '.zgroup']

In [8]: print(zgroup.tree())
/
 ├── latitude (50,) float64
 └── longitude (50,) float64

In [9]: dict(zgroup["Tair"].attrs)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[9], line 1
----> 1 dict(zgroup["Tair"].attrs)

File /usr/lib/python3/dist-packages/zarr/hierarchy.py:511, in Group.__getitem__(self, item)
    509         raise KeyError(item)
    510 else:
--> 511     raise KeyError(item)

KeyError: 'Tair'