Working with Parquet Outputs¶
The cross-sensor-cal pipeline writes Parquet files for each ENVI product it generates. These tables contain one row per pixel and are optimized for high-performance analytics with DuckDB, pandas, or xarray.
Why Parquet?¶
- Supports efficient columnar reads (see the column-selection sketch after this list)
- Compresses well for large datasets
- Easily queryable using SQL (DuckDB)
- Allows out-of-core or streaming access
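The first bullet is worth a concrete example. The sketch below is a minimal illustration of a column-restricted read with pandas; the file path follows the truncated placeholder style used elsewhere on this page, and the column names are hypothetical, so substitute columns that actually exist in your file (see File structure below).

```python
# Minimal sketch of a column-restricted read: only the named columns are
# read from disk, not the full table. Path and column names are placeholders.
import pandas as pd

subset = pd.read_parquet(
    "..._brdfandtopo_corrected.parquet",
    columns=["pixel_id", "band_1_reflectance"],  # hypothetical column names
)
print(subset.head())
```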
File structure¶
Typical Parquet file naming:
- `*_brdfandtopo_corrected.parquet`
- `*_landsat_convolved.parquet`
- `*_merged_pixel_extraction.parquet`
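If you are unsure where the pipeline wrote these files, a suffix search is usually enough to find them. The snippet below is a minimal sketch assuming a hypothetical output directory; point `output_dir` at the location your run actually used.

```python
# Minimal sketch: find the pipeline's Parquet outputs by filename suffix.
# "output_dir" is a placeholder for the directory your run wrote to.
from pathlib import Path

output_dir = Path("output_dir")
suffixes = (
    "_brdfandtopo_corrected.parquet",
    "_landsat_convolved.parquet",
    "_merged_pixel_extraction.parquet",
)
for suffix in suffixes:
    for path in sorted(output_dir.rglob(f"*{suffix}")):
        print(path)
```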
Each file contains columns for the following (a schema-inspection sketch follows the list):
- reflectance values
- band metadata
- masks
- pixel coordinates
- optional ancillary variables
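Exact column names differ between products, so it helps to confirm a file's schema before writing queries against it. The sketch below uses DuckDB's `DESCRIBE` to list column names and types without reading any pixel data; the truncated path is a placeholder in the same style as the examples that follow.

```python
# List column names and types without reading any pixel data.
# The truncated path is a placeholder -- use a real output file.
import duckdb

duckdb.query("""
    DESCRIBE SELECT * FROM '..._brdfandtopo_corrected.parquet'
""").df()
```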
Quick preview using DuckDB¶
```python
import duckdb

# Preview the first few rows of the corrected-reflectance table
duckdb.query("""
    SELECT *
    FROM '..._brdfandtopo_corrected.parquet'
    LIMIT 5
""").df()
```

DuckDB runs SQL directly against the file, so you get a preview without loading the entire dataset into memory.

Checking dimensions¶

```python
# Count the number of pixel rows in the convolved product
duckdb.query("""
    SELECT COUNT(*) AS nrows
    FROM '..._landsat_convolved.parquet'
""").df()
```

Loading with pandas¶

```python
import pandas as pd

# Reads the whole table into memory
df = pd.read_parquet("..._merged_pixel_extraction.parquet")
df.head()
```

pandas loads the full table into memory, so use it with caution for large flight lines.

Loading with xarray¶

Parquet → xarray workflows work best after pivoting or aggregating the data. For full spatial cubes, the ENVI files remain easier to load.

Streaming large files¶

DuckDB can scan files lazily:

```python
# Aggregate over the whole file without materializing it in memory
duckdb.query("""
    SELECT AVG(NIR), AVG(Red)
    FROM '..._landsat_convolved.parquet'
""").df()
```

This avoids loading the full table into memory.

Next steps¶

- CLI usage
- Pipeline outputs