Working with Parquet Outputs

The cross-sensor-cal pipeline writes Parquet files for each ENVI product it generates. These tables contain one row per pixel and are optimized for high-performance analytics with DuckDB, pandas, or xarray.


Why Parquet?

  • Supports efficient columnar reads
  • Compresses well for large datasets
  • Easily queryable using SQL (DuckDB)
  • Allows out-of-core or streaming access

File structure

Typical Parquet file naming:

  • *_brdfandtopo_corrected.parquet
  • *_landsat_convolved.parquet
  • *_merged_pixel_extraction.parquet
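
If you are unsure where these files live, one way to locate them is to glob for the suffixes above. The sketch below assumes a hypothetical output directory, output/flightline_001; point it at wherever your pipeline run writes its products.

```python
from pathlib import Path

# Hypothetical output directory for one flight line; adjust to your run.
output_dir = Path("output/flightline_001")

# Typical Parquet products written by the pipeline.
patterns = (
    "*_brdfandtopo_corrected.parquet",
    "*_landsat_convolved.parquet",
    "*_merged_pixel_extraction.parquet",
)

for pattern in patterns:
    for path in sorted(output_dir.glob(pattern)):
        print(path)
```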

Each file contains columns for:

  • reflectance values
  • band metadata
  • masks
  • pixel coordinates
  • optional ancillary variables
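
The exact column set varies by product, so it is worth checking the schema of a given file before writing queries against it. A minimal sketch using DuckDB's DESCRIBE (the placeholder path follows the same convention as the examples below):

```python
import duckdb

# List column names and types without reading any data rows.
duckdb.query("""
    DESCRIBE SELECT *
    FROM '..._brdfandtopo_corrected.parquet'
""").df()
```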

Quick preview using DuckDB

```python
import duckdb

# Preview the first few rows of the corrected-reflectance table.
duckdb.query("""
    SELECT *
    FROM '..._brdfandtopo_corrected.parquet'
    LIMIT 5
""").df()
```

DuckDB provides efficient SQL queries without needing to load the entire dataset into memory.

Checking dimensions

```python
# Count rows (pixels) in the convolved product.
duckdb.query("""
    SELECT COUNT(*) AS nrows
    FROM '..._landsat_convolved.parquet'
""").df()
```

Loading with pandas

```python
import pandas as pd

# read_parquet loads the entire table into memory.
df = pd.read_parquet("..._merged_pixel_extraction.parquet")
df.head()
```

Use with caution for large flight lines.

Loading with xarray

Parquet → xarray workflows work best after pivoting or aggregating the data. For full spatial cubes, the ENVI files remain easier to load.

Streaming large files

DuckDB can scan files lazily:

```python
# Only the NIR and Red columns are read; the averages are computed while scanning.
duckdb.query("""
    SELECT AVG(NIR), AVG(Red)
    FROM '..._landsat_convolved.parquet'
""").df()
```

This avoids loading the full table.

Next steps

  • CLI usage
  • Pipeline outputs