Zarr Encoding Specification
In implementing support for the Zarr storage format, Xarray developers made some ad hoc choices about how to store NetCDF data in Zarr. Future versions of the Zarr spec will likely include a more formal convention for storing the NetCDF data model in Zarr; see the Zarr spec repo for ongoing discussion.
First, Xarray can only read and write Zarr groups. There is currently no support
for reading or writing individual Zarr arrays. Each Zarr group is mapped to an
Xarray Dataset object.
Second, from Xarray’s point of view, the key difference between NetCDF and Zarr is that all NetCDF arrays have dimension names while Zarr arrays do not. Therefore, in order to store NetCDF data in Zarr, Xarray must somehow encode and decode the name of each array’s dimensions.
To accomplish this, Xarray developers decided to define a special Zarr array
attribute: _ARRAY_DIMENSIONS. The value of this attribute is a list of
dimension names (strings), for example ["time", "lon", "lat"]. When writing
data to Zarr, Xarray sets this attribute on all variables based on the variable
dimensions. When reading a Zarr group, Xarray looks for this attribute on all
arrays, raising an error if it can’t be found. The attribute is used to define
the variable dimension names and then removed from the attributes dictionary
returned to the user.
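The read-side behaviour described above can be sketched in plain Python. This is only an illustration of the convention, not Xarray's actual implementation: split_dimension_names is a hypothetical helper, and the "units" attribute is made-up sample data.

```python
# Illustrative sketch only: split_dimension_names is a hypothetical helper,
# not part of Xarray's API.
DIMENSION_KEY = "_ARRAY_DIMENSIONS"


def split_dimension_names(zarr_attrs):
    """Return (dims, user_attrs) for one Zarr array's attribute dictionary."""
    if DIMENSION_KEY not in zarr_attrs:
        # Without the attribute there is no way to name the array's axes.
        raise KeyError(
            f"array has no {DIMENSION_KEY!r} attribute, so its dimensions are unknown"
        )
    # Copy first so the caller's dictionary is left untouched.
    user_attrs = dict(zarr_attrs)
    # The attribute is consumed here and hidden from the user-facing attrs.
    dims = tuple(user_attrs.pop(DIMENSION_KEY))
    return dims, user_attrs


# Sample attributes as they might be stored alongside an array.
stored = {"_ARRAY_DIMENSIONS": ["time", "lon", "lat"], "units": "K"}
dims, attrs = split_dimension_names(stored)
print(dims)
print(attrs)
```

The dimension names come back as a tuple and the remaining attributes no longer contain the special key, which mirrors how the attribute is hidden from the user after decoding.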
Because of these choices, Xarray cannot read arbitrary array data, but only
Zarr data with valid _ARRAY_DIMENSIONS or NCZarr attributes on each array
(NCZarr dimension names are defined in the .zarray file).
After decoding the _ARRAY_DIMENSIONS or NCZarr attribute and assigning the variable
dimensions, Xarray optionally decodes each variable using the same CF decoding
machinery it applies to NetCDF data (see decode_cf()).
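To give a rough idea of what CF decoding involves, the sketch below applies the CF "mask and scale" rule to single values. It is a simplified illustration, not Xarray's code: decode_cf_value is a hypothetical helper, the attribute names _FillValue, scale_factor and add_offset come from the CF conventions rather than from this page, and real CF decoding also covers other things, such as time units.

```python
import math


def decode_cf_value(raw, attrs):
    """Hypothetical helper: mask-and-scale one stored value using CF attributes."""
    fill = attrs.get("_FillValue")
    if fill is not None and raw == fill:
        # Stored fill values stand for missing data.
        return math.nan
    # CF packing convention: unpacked = packed * scale_factor + add_offset.
    return raw * attrs.get("scale_factor", 1.0) + attrs.get("add_offset", 0.0)


# Made-up attributes and stored integers, for illustration only.
attrs = {"_FillValue": -9999, "scale_factor": 0.5, "add_offset": 10.0}
decoded = [decode_cf_value(value, attrs) for value in [0, 4, -9999]]
print(decoded)
```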
Finally, it’s worth noting that Xarray writes (and attempts to read)
“consolidated metadata” by default (the .zmetadata file), which is another
non-standard Zarr extension, albeit one implemented upstream in Zarr-Python.
You do not need to write consolidated metadata to make Zarr stores readable in
Xarray, but because Xarray can open consolidated stores much faster, users will
see a warning about poor performance when reading non-consolidated stores unless
they explicitly set consolidated=False. See Consolidated Metadata
for more details.
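The idea behind consolidated metadata can be sketched with the standard library alone. The example below builds a toy, metadata-only directory store by hand (no chunk data), gathers its per-array JSON files into one .zmetadata document, and then recovers the dimension names with a single read. The consolidate function, the store name and the "temperature" array are hypothetical. The layout of the consolidated document follows the Zarr v2 format as written by Zarr-Python, to the best of our understanding.

```python
import json
import tempfile
from pathlib import Path

METADATA_FILES = {".zgroup", ".zarray", ".zattrs"}


def consolidate(store):
    """Hypothetical helper: gather all metadata files under store into one dict."""
    metadata = {}
    for path in sorted(store.rglob("*")):
        if path.name in METADATA_FILES:
            # Keys are store-relative paths such as "temperature/.zattrs".
            metadata[path.relative_to(store).as_posix()] = json.loads(path.read_text())
    return {"zarr_consolidated_format": 1, "metadata": metadata}


with tempfile.TemporaryDirectory() as tmp:
    # Toy store containing one group and one array, metadata only.
    store = Path(tmp) / "example.zarr"
    (store / "temperature").mkdir(parents=True)
    (store / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
    (store / "temperature" / ".zarray").write_text(json.dumps({
        "zarr_format": 2, "shape": [2, 3, 4], "chunks": [2, 3, 4],
        "dtype": "<f8", "compressor": None, "fill_value": None,
        "filters": None, "order": "C",
    }))
    (store / "temperature" / ".zattrs").write_text(
        json.dumps({"_ARRAY_DIMENSIONS": ["time", "lon", "lat"]})
    )

    # Write the consolidated document next to the per-array files.
    (store / ".zmetadata").write_text(json.dumps(consolidate(store)))

    # A reader now needs one file instead of one request per array.
    loaded = json.loads((store / ".zmetadata").read_text())
    dims = loaded["metadata"]["temperature/.zattrs"]["_ARRAY_DIMENSIONS"]
    print(dims)
```

On a local disk the saving is small, but on object stores each metadata file is a separate request, which is why a single consolidated read is so much faster.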
As a concrete example, here we write a tutorial dataset to Zarr and then re-open it directly with Zarr:
In [1]: import os
In [2]: import xarray as xr
In [3]: import zarr
In [4]: ds = xr.tutorial.load_dataset("rasm")
In [5]: ds.to_zarr("rasm.zarr", mode="w")
Out[5]: <xarray.backends.zarr.ZarrStore at 0x7f84194dea70>
In [6]: zgroup = zarr.open("rasm.zarr")
In [7]: print(os.listdir("rasm.zarr"))
In [8]: print(zgroup.tree())
In [9]: dict(zgroup["Tair"].attrs)