i7aof.io, i7aof.io_zarr and i7aof.download¶
Purpose: Shared I/O helpers for NetCDF writing, Zarr finalization, and robust file downloads.
Public Python API (by module)¶
Import paths and brief descriptions by module:
Module: i7aof.io

- read_dataset(): Open a dataset with package defaults and normalized CF time metadata (cftime decoding, matching units/calendar on time/time_bnds).
- write_netcdf(): Write an xarray.Dataset to NetCDF with per-variable fill values, optional compression, time-encoding normalization, and an optional progress bar; supports a conversion path for 64-bit NetCDF3 (CDF5).
Note
Why use i7aof.io.write_netcdf() instead of xarray.Dataset.to_netcdf?

- Typed fill values, not NaNs: xarray's default can leave NaNs for missing values; many tools (e.g., NCO) expect a typed _FillValue. This wrapper maps NumPy dtypes to NetCDF fill values (via netCDF4.default_fillvals), skips string fills, and only writes _FillValue when a variable actually contains NaNs (unless explicitly directed).
- Per-variable control: enable/disable _FillValue and compression for all variables or a selected list via simple flags.
- Unlimited time by default: automatically marks time as an unlimited dimension when present (and clears unlimited dims when not), improving compatibility with tools that append or concatenate over time.
- Deterministic CF-time encoding: normalizes time/time_bnds encodings to use units='days since 1850-01-01', sets calendar, and ensures a numeric dtype for consistent serialization across backends.
- Reliable CDF5 output: format='NETCDF3_64BIT_DATA' is inconsistently supported by backends. Here, data are written as NETCDF4 and converted to CDF5 using NCO (ncks -5), then the temporary file is removed. If engine='scipy' is requested for this path, it is coerced to netcdf4 to avoid failures.
- Better user experience for large writes: optional Dask progress bar without changing user code.
Note: CDF5 conversion uses ncks (NCO). In the conda-forge environment defined by dev-spec.txt, NCO is included and available on PATH.
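The lazy _FillValue decision described above (skip strings, only write a fill when NaNs are actually present, pick a typed default by dtype) can be sketched as follows. This is an illustrative helper, not i7aof's implementation: the function name lazy_fill_value is hypothetical, and the numeric defaults mirror a few entries of netCDF4.default_fillvals.

```python
import numpy as np

# A few typed defaults matching netCDF4.default_fillvals entries,
# keyed by NumPy dtype kind + itemsize (e.g. 'f4', 'f8').
DEFAULT_FILLVALS = {
    'f4': 9.96921e36,
    'f8': 9.969209968386869e36,
}


def lazy_fill_value(values):
    """Return a typed fill value, or None when none should be written.

    Hypothetical sketch of the documented has_fill_values=None path:
    string dtypes are skipped, and _FillValue is only chosen when the
    variable actually contains NaNs.
    """
    arr = np.asarray(values)
    if arr.dtype.kind in ('U', 'S', 'O'):
        return None  # string fills are intentionally skipped
    if arr.dtype.kind == 'f' and np.isnan(arr).any():
        return DEFAULT_FILLVALS.get(f'f{arr.dtype.itemsize}')
    return None  # no missing values, so no _FillValue needed
```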
Module: i7aof.io_zarr

- append_to_zarr(): Append one or more chunked Datasets to a temporary Zarr store, creating it on the first write; idempotent when a previous run already completed the same segment(s).
- finalize_zarr_to_netcdf(): Open the Zarr store (non-consolidated), optionally postprocess the Dataset, then write the final NetCDF atomically and clean up the Zarr store.
Module: i7aof.download

- download_file(): Download a single file to a directory or explicit path, with a progress bar.
- download_files(): Download many files from a base URL to a directory, preserving subpaths.
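How a base URL plus relative subpaths map onto destination files can be sketched with the standard library. This helper is hypothetical (not part of i7aof's API); it only illustrates the documented "preserving subpaths" behavior of download_files().

```python
from pathlib import Path
from urllib.parse import urljoin


def plan_downloads(base_url, rel_paths, dest_dir):
    """Pair each relative path with its source URL and destination file.

    Hypothetical sketch: subpaths under base_url are preserved under
    dest_dir, mirroring download_files()' documented behavior.
    """
    base = base_url.rstrip('/') + '/'
    return [(urljoin(base, rel), str(Path(dest_dir) / rel))
            for rel in rel_paths]
```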
Required config options¶
None. Functions accept explicit arguments and use no global config.
Outputs¶
- NetCDF files written via i7aof.io.write_netcdf() to the provided filename (a temporary file may be used during CDF5 conversion).
- When using i7aof.io_zarr: a temporary Zarr store during chunked appends, and a single final NetCDF produced at finalize time. The Zarr store is removed after a successful finalize.
- Downloaded files saved to the requested paths; parent directories are created as needed.
Data model¶
- Fill values: _FillValue decisions are made per variable, using NumPy-dtype → NetCDF mappings (string types are skipped). Behavior is controlled by the has_fill_values argument:
  - True: apply _FillValue to all variables using type-appropriate defaults
  - False: explicitly suppress _FillValue (sets None to disable backend defaults)
  - list[str]: apply only to the named variables
  - None (default): lazily scan each variable for missing values and set _FillValue only when needed
- Compression: per-variable compression is controlled by the compression argument with the same forms as above (True/False/list[str]/None). Default compression options (when enabled as a boolean) are: {'zlib': True, 'complevel': 4, 'shuffle': True}. If compression is requested and no engine is specified, h5netcdf is preferred. The scipy engine does not support compression, and compression will be ignored with a warning.
- Time encoding: write_netcdf normalizes CF-time metadata for time and time_bnds to use numeric days since 1850-01-01 with a declared calendar (defaulting to proleptic_gregorian), clears conflicting attrs, and sets a numeric dtype to ensure consistent serialization.
- Unlimited dimension: when a time dimension exists, it is marked as unlimited; otherwise, any unlimited dims are cleared.
- CDF5 conversion: for format='NETCDF3_64BIT_DATA', the dataset is written as NETCDF4, converted using ncks -5, and the temporary file is removed.
- Zarr workflow: i7aof.io_zarr appends chunked results to a Zarr store and later finalizes to a single NetCDF. An internal ready-marker file (.i7aof_zarr_ready) inside the store enables idempotent reruns (appends are skipped if the store is already ready). Finalization writes to out.nc.tmp first, atomically moves it to out.nc on success, then removes the Zarr store and any legacy .complete marker.
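The ready-marker check and the atomic tmp-then-rename finalize can be sketched with the standard library. The marker name and the .tmp suffix follow the documented convention; the function names here are illustrative, not i7aof's internals.

```python
import os
from pathlib import Path

READY_MARKER = '.i7aof_zarr_ready'  # documented marker file name


def store_is_ready(zarr_store):
    """Idempotency check: True when a previous run completed the store."""
    return (Path(zarr_store) / READY_MARKER).exists()


def atomic_finalize(out_nc, write_fn):
    """Write to out_nc + '.tmp', then atomically rename to out_nc.

    write_fn is a stand-in for the real NetCDF writer; os.replace is
    atomic on POSIX when source and destination share a filesystem.
    """
    tmp = out_nc + '.tmp'
    write_fn(tmp)
    os.replace(tmp, out_nc)
```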
Runtime and external requirements¶
- Core: xarray, netCDF4, numpy, dask (for the optional progress bar).
- Tools: ncks from NCO if converting to NETCDF3_64BIT_DATA.
- Zarr: zarr (via xarray's Zarr backend) for append/finalize workflows.
- Downloads: requests, tqdm for progress.
- For the authoritative conda-forge environment, see dev-spec.txt (note that pyproject.toml lists a PyPI-only subset).
Usage¶
import xarray as xr
from i7aof.io import write_netcdf
ds = xr.Dataset({'a': ('x', [1, 2, 3])})
write_netcdf(ds, 'out.nc')
from i7aof.download import download_file
download_file('https://example.com/file.txt', 'data/', quiet=True)
Per-variable fill values and compression¶
import xarray as xr
from i7aof.io import write_netcdf
ds = xr.tutorial.load_dataset('air_temperature').isel(time=slice(0, 12))
# Apply fill values to only selected variables and compress those variables
write_netcdf(
    ds,
    'air.nc',
    has_fill_values=['air'],  # apply _FillValue only to 'air'
    compression=['air'],  # compress only 'air'
    compression_opts={'zlib': True, 'complevel': 4, 'shuffle': True},
)
# Enable defaults for all variables
write_netcdf(
    ds,
    'air_all.nc',
    has_fill_values=True,
    compression=True,  # uses default options
)
Zarr append and atomic finalize¶
import xarray as xr
from i7aof.io_zarr import append_to_zarr, finalize_zarr_to_netcdf
zstore = 'tmp.zarr'
first = True
for chunk in range(3):
    ds = xr.Dataset({'x': ('t', [chunk])}, coords={'t': [chunk]})
    first = append_to_zarr(ds=ds, zarr_store=zstore, first=first, append_dim='t')
def post(ds: xr.Dataset) -> xr.Dataset:
    ds.attrs['history'] = 'finalized'
    return ds
finalize_zarr_to_netcdf(
    zarr_store=zstore,
    out_nc='final.nc',
    postprocess=post,
    # pass-through to write_netcdf:
    compression=True,
)
Internals (for maintainers)¶
read_datasetopens datasets withdecode_times=CFDatetimeCoder(use_cftime=True)by default and normalizestime/time_bndsmetadata.write_netcdfbuilds an encoding dict over all data variables and coords, applies_FillValueand compression decisions, normalizes time encodings, and manages unlimited dims.When converting to
NETCDF3_64BIT_DATA, a temporary NETCDF4 file is created and NCO is invoked withncks -O -5to produce the final output.i7aof.io_zarrwrites a hidden.i7aof_zarr_readymarker inside the Zarr store upon successful open to enable idempotent reruns, then writes NetCDF toout.nc.tmpand atomically moves it toout.ncon success.Downloads stream content in 1KB chunks and update a
tqdmprogress bar.
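The 1 KB streaming loop can be sketched generically. This is not i7aof's code: stream_chunks and its on_chunk parameter are hypothetical, with on_chunk standing in for a tqdm bar's update() callback.

```python
from io import BytesIO


def stream_chunks(src, dest, chunk_size=1024, on_chunk=None):
    """Copy src to dest in fixed-size chunks, reporting progress.

    Hypothetical sketch of the documented 1 KB streaming download loop;
    on_chunk(len) models a tqdm progress-bar update per chunk.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dest.write(chunk)
        total += len(chunk)
        if on_chunk is not None:
            on_chunk(len(chunk))
    return total
```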
Edge cases / validations¶
- Fill values: when has_fill_values is None, variables are lazily scanned for missing values and _FillValue is only written when necessary; string dtypes intentionally have no _FillValue.
- Compression: if engine='scipy' is selected, compression directives are ignored with a warning (backend limitation). When compression is requested and the engine is unspecified, h5netcdf is preferred.
- Time dtype: writing with numpy.datetime64 time/time_bnds is rejected; use read_dataset (cftime) or numeric CF time with supported units ('days since' or 'seconds since').
- If engine='scipy' is requested for a CDF5 conversion path, it is forced to netcdf4 to avoid incompatibilities.
- Zarr: append_to_zarr is idempotent: it skips appends when a segment or the entire store is already complete/ready. Finalization always removes the Zarr store and cleans up any legacy external .complete marker.
- Download functions do nothing if destination files exist unless overwrite=True is passed.
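For the numeric CF-time case above, 'days since 1850-01-01' is plain date arithmetic. The sketch below (a hypothetical helper, valid only for the standard/proleptic_gregorian case, where Python's datetime agrees) shows the conversion; real code should use cftime to honor other calendars.

```python
from datetime import datetime

EPOCH = datetime(1850, 1, 1)  # documented reference date


def to_days_since_1850(dt):
    """Convert a datetime to numeric CF time ('days since 1850-01-01').

    Illustrative only: correct for the proleptic_gregorian calendar,
    which matches Python's datetime; use cftime for other calendars.
    """
    return (dt - EPOCH).total_seconds() / 86400.0
```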
Extension points¶
- Add retry/backoff to downloads; allow custom chunk sizes.
- Make compression options configurable globally; add per-variable chunk/codec control if needed.