ETL for multiband data

An overview of an AWS pipeline that processes TIF images for downstream analytics.

AWS Pipeline for Processing TIF Images (with Vegetation Masking)

  1. Ingest
    • Amazon S3: Store raw TIF images in a designated input bucket.
    • AWS Lambda: Trigger the processing pipeline when a new TIF is uploaded to the input bucket.
  2. Preprocess
    • AWS Lambda: Read the TIF images using libraries like Rasterio
    • Data Normalization: Apply the necessary corrections (atmospheric correction, band normalization)
  3. NDVI/NDWI Calculation
    • Use the Near-Infrared (NIR) and Red bands to compute NDVI: NDVI = (NIR - Red) / (NIR + Red)
  4. Vegetation Masking
    • The Lambda function fetches the current NDVI threshold at run time (via API Gateway) and applies the mask
    • The threshold used will vary from site to site
      • Pixels above the threshold are classified as vegetation
      • Generate a binary mask to overlay on the original image
  5. Data Transformation
    • Convert the masked TIF data into Parquet format using PyArrow, with one record per pixel (row, col)
    • Also record each pixel's (x, y) coordinates, derived from its indices and the TIF file’s georeferencing (CRS and affine transform)
  6. AWS Glue Crawler
    • Scans the processed Parquet files and updates the Glue Data Catalog
    • The schema can change over time, e.g., when new features are added
  7. AWS Glue ETL Jobs (Further Transformation)
    • Merge elevation and texture features from the associated TIF file
  8. Storage (Load)
    • Load the results back to S3 in a “processed” subfolder
  9. Analyst Interaction
    • Amazon API Gateway + Lambda: Provide an API where analysts/modelers can update the NDVI threshold via a POST request
  10. Monitoring
    • Amazon CloudWatch: Monitor the Lambda functions
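
Steps 3 to 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function names (`compute_ndvi`, `vegetation_mask`, `pixels_to_columns`, `write_parquet`), the Parquet column names, and the sample threshold and transform values are all hypothetical. It assumes the NIR and Red bands have already been read into NumPy arrays (for example with Rasterio) and that the first six affine coefficients of the raster's geotransform are available.

```python
import numpy as np


def compute_ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red); NaN where the denominator is zero."""
    nir = np.asarray(nir, dtype="float64")
    red = np.asarray(red, dtype="float64")
    denom = nir + red
    ndvi = np.full(denom.shape, np.nan)
    # Only divide where the denominator is non-zero; other cells stay NaN.
    np.divide(nir - red, denom, out=ndvi, where=denom != 0)
    return ndvi


def vegetation_mask(ndvi, threshold):
    """Binary mask: True where NDVI is above the (site-specific) threshold."""
    with np.errstate(invalid="ignore"):
        # NaN compares as False, so no-data pixels are never vegetation.
        return ndvi > threshold


def pixels_to_columns(ndvi, mask, transform):
    """Flatten the raster into per-pixel columns, one entry per (row, col).

    `transform` holds six affine coefficients (a, b, c, d, e, f) such that
    x = a*col + b*row + c and y = d*col + e*row + f. Adding 0.5 to the
    indices gives pixel-centre coordinates (an assumption of this sketch).
    """
    a, b, c, d, e, f = transform
    rows, cols = np.indices(ndvi.shape)
    rows, cols = rows.ravel(), cols.ravel()
    x = a * (cols + 0.5) + b * (rows + 0.5) + c
    y = d * (cols + 0.5) + e * (rows + 0.5) + f
    return {
        "row": rows,
        "col": cols,
        "x": x,
        "y": y,
        "ndvi": ndvi.ravel(),
        "is_vegetation": mask.ravel(),
    }


def write_parquet(columns, path):
    """Write the per-pixel columns to a Parquet file with PyArrow."""
    # Imported here so the NDVI helpers above work without PyArrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.table(columns), path)


# Tiny synthetic example; real bands would come from the uploaded TIF.
nir_band = np.array([[0.8, 0.2], [0.0, 0.5]])
red_band = np.array([[0.2, 0.2], [0.0, 0.1]])
site_threshold = 0.3  # hypothetical; the pipeline fetches this per site

ndvi = compute_ndvi(nir_band, red_band)
mask = vegetation_mask(ndvi, site_threshold)
# Hypothetical geotransform: 10 m pixels, north-up, arbitrary origin.
columns = pixels_to_columns(ndvi, mask, (10.0, 0.0, 500000.0, 0.0, -10.0, 4000000.0))
```

Calling `write_parquet(columns, "example.parquet")` would then produce a file that a Glue crawler could pick up. One row per pixel keeps the schema flat and easy to extend with further features, at the cost of much larger files than the source raster.
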
This post is licensed under CC BY 4.0 by the author.