Overview
At this stage you sample pixels from the sorted scenes and write the values to a tabular file.
The table becomes the input to spectral unmixing and other downstream analyses.
Sampling rules
- Define a consistent random seed so repeated runs draw the same pixels.
- Sample within each land-cover class or tile to avoid geographic bias.
- Drop any pixel flagged by a quality mask or falling outside the region of interest (a sampling sketch follows this list).
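A minimal sketch of these rules with NumPy, assuming a hypothetical `class_map` array of land-cover labels aligned with the pixel grid and a `valid_mask` that is True where a pixel passes the quality mask and region of interest; the per-class count is illustrative.

```python
import numpy as np

def sample_pixels(class_map, valid_mask, n_per_class=500, seed=42):
    """Draw up to n_per_class pixel indices per land-cover class, reproducibly."""
    rng = np.random.default_rng(seed)          # fixed seed -> identical draws on reruns
    rows, cols = [], []
    for cls in np.unique(class_map[valid_mask]):
        # candidate pixels: this class AND not rejected by the mask / ROI
        r, c = np.nonzero((class_map == cls) & valid_mask)
        take = min(n_per_class, r.size)
        pick = rng.choice(r.size, size=take, replace=False)
        rows.append(r[pick])
        cols.append(c[pick])
    return np.concatenate(rows), np.concatenate(cols)

# toy example: 3 classes on a 100x100 grid, roughly 10% of pixels masked out
labels = np.random.default_rng(0).integers(1, 4, size=(100, 100))
valid = np.random.default_rng(1).random((100, 100)) > 0.1
rr, cc = sample_pixels(labels, valid, n_per_class=50)
print(rr.size, "pixels sampled")
```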
Handling nodata and masks
- Treat nodata values (`-9999` by default) as missing and skip those records.
- Apply cloud, shadow, and water masks before sampling so invalid pixels never reach the table.
- Keep a boolean `is_masked` column to track which values were rejected (see the masking sketch below).
Tile vs full scene
- Tiles scale better for large mosaics and let you parallelize extraction.
- Full scenes are faster when memory allows and ensure contiguous coverage.
Choose the approach that matches your hardware and scene size; the output format is identical.
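For the tiled route, a sketch of windowed reading with rasterio; the 512-pixel tile edge and the `scene.tif` path are placeholders. Each window is independent, which is what makes per-tile parallel extraction straightforward.

```python
import rasterio
from rasterio.windows import Window

TILE = 512  # tile edge in pixels; tune to your RAM and scene size

def iter_tiles(path):
    """Yield (tile_id, window, band_stack) for one scene, one tile at a time."""
    with rasterio.open(path) as src:
        for row_off in range(0, src.height, TILE):
            for col_off in range(0, src.width, TILE):
                win = Window(col_off, row_off,
                             min(TILE, src.width - col_off),
                             min(TILE, src.height - row_off))
                tile_id = f"r{row_off}_c{col_off}"
                yield tile_id, win, src.read(window=win)  # (bands, rows, cols)

# each tile is independent, so the loop body can be handed to a process pool
for tile_id, win, data in iter_tiles("scene.tif"):  # "scene.tif" is a placeholder path
    print(tile_id, data.shape)
```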
Output tables
Each row represents one pixel.
Columns typically include `scene_id`, `tile_id`, `x`, `y`, band values, and `is_masked`.
Write tables as CSV for quick inspection or Parquet for efficient storage.
Partition by scene and tile so you can read subsets without loading the whole dataset.
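A short sketch of both output formats with pandas and the pyarrow engine; the band columns `b1`/`b2` and the file paths are illustrative.

```python
import pandas as pd

# one row per sampled pixel; band columns named b1/b2 here for illustration
df = pd.DataFrame({
    "scene_id":  ["S1"] * 4,
    "tile_id":   ["r0_c0", "r0_c0", "r0_c512", "r0_c512"],
    "x":         [101, 102, 640, 641],
    "y":         [55, 55, 90, 91],
    "b1":        [0.12, 0.15, 0.09, 0.11],
    "b2":        [0.20, 0.22, 0.18, 0.19],
    "is_masked": [False, False, False, True],
})

# CSV for quick inspection, Parquet partitioned by scene and tile for storage
df.to_csv("sample.csv", index=False)
df.to_parquet("samples/", engine="pyarrow",
              partition_cols=["scene_id", "tile_id"], compression="snappy")

# read back only one partition instead of loading the whole dataset
subset = pd.read_parquet("samples/", filters=[("tile_id", "=", "r0_c0")])
print(len(subset), "rows in tile r0_c0")
```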
Memory tips
- Process one tile at a time and release arrays with `del` to free RAM.
- When writing CSV, stream rows with a generator instead of building one huge DataFrame (sketched after this list).
- Prefer Parquet with compression to reduce disk use and load times.
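A sketch of the generator approach with the standard csv module; `tiles` is a hypothetical iterable of (tile_id, band stack, validity mask) produced by the extraction step.

```python
import csv
import numpy as np

def pixel_rows(tiles):
    """Yield one CSV row per valid pixel so no full-size DataFrame is ever built."""
    for tile_id, data, valid in tiles:          # data: (bands, rows, cols); valid: bool mask
        _, n_rows, n_cols = data.shape
        for r in range(n_rows):
            for c in range(n_cols):
                if valid[r, c]:
                    yield [tile_id, c, r, *data[:, r, c].tolist(), False]

def write_csv(path, tiles, band_names):
    """Stream rows straight to disk as they are generated."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["tile_id", "x", "y", *band_names, "is_masked"])
        writer.writerows(pixel_rows(tiles))

# toy usage: one 4x4 tile with three bands, every pixel valid
tiles = [("r0_c0", np.random.default_rng(0).random((3, 4, 4)), np.ones((4, 4), dtype=bool))]
write_csv("sample.csv", tiles, ["b1", "b2", "b3"])
```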
Quick integrity checks
- Confirm row counts match the number of valid pixels expected per tile.
- Scan for remaining nodata values: `rg -n -- "-9999" sample.csv` (the `--` keeps ripgrep from parsing `-9999` as a flag).
- Plot a histogram of one band to detect obvious outliers before moving on (a quick check script follows).
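The three checks above, sketched with pandas and matplotlib; the file name `sample.csv` and the band column `b1` are placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sample.csv")

# 1. row count per tile, to compare against the number of valid pixels expected
print(df.groupby("tile_id").size())

# 2. any nodata values that slipped through the masks
print("rows containing -9999:", int((df == -9999).any(axis=1).sum()))

# 3. quick histogram of one band to spot gross outliers
df["b1"].hist(bins=50)
plt.xlabel("b1 value")
plt.ylabel("pixel count")
plt.savefig("b1_hist.png")
```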
Next steps
Continue to Stage 04 to build the spectral library from the extracted pixels.
Last updated: 2025-08-18