ETL Showcase: From Scrape ➜ Clean ➜ Insight

A static, no-backend walkthrough of the pipeline that ingests job listings, cleans and enriches them, and surfaces key insights — designed for portfolio viewing.

Overview

This page summarizes the end-to-end ETL pipeline for job postings: scraping raw data, cleaning and normalization, and computing high-level metrics. Everything here is static and rendered in the browser — no backend required.

Architecture

```mermaid
flowchart LR
    A[Scrape Jobs Lambda\n`src/ingestion/scrape_jobs_lambda.py`] -->|SerpAPI JSON| B[(S3 Raw Bucket\n`demo/*_RAW.json`)]
    B --> C[Clean Jobs Lambda\n`src/cleaning/clean_jobs_lambda.py`]
    C --> C1[Text Cleaning\n`text_cleaning.py`]
    C --> C2[Salary Parse\n`salary_parser.py`]
    C --> C3[Skill Extract\n`skill_extractor.py`]
    C1 --> D[(S3 Clean Bucket\n`demo/*_CLEAN.json`)]
    C2 --> D
    C3 --> D
    D --> E[Notebook Analysis\n`notebooks/analysis.ipynb`]
    D --> F["Static Showcase (this page)"]
```

Implementation refs: `src/ingestion/`, `src/cleaning/`, `src/utils/s3_io.py`.
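
Below is a minimal sketch of what the ingestion handler in `src/ingestion/scrape_jobs_lambda.py` might look like. The SerpAPI query parameters, environment variable names, bucket name, and key pattern are illustrative assumptions, not the project's actual implementation.

```python
# Hypothetical sketch of the scrape step; the real handler in
# src/ingestion/scrape_jobs_lambda.py may differ in names and error handling.
import os
from datetime import datetime, timezone

import boto3
import requests

s3 = boto3.client("s3")

SERPAPI_URL = "https://serpapi.com/search.json"             # SerpAPI endpoint
RAW_BUCKET = os.environ.get("RAW_BUCKET", "jobs-raw-demo")   # assumed env var / bucket


def handler(event, context):
    """Fetch job listings from SerpAPI and store the raw JSON payload in S3."""
    query = event.get("query", "data engineer")
    resp = requests.get(
        SERPAPI_URL,
        params={
            "engine": "google_jobs",               # Google Jobs engine on SerpAPI
            "q": query,
            "api_key": os.environ["SERPAPI_KEY"],  # assumed secret location
        },
        timeout=30,
    )
    resp.raise_for_status()

    # Key pattern mirrors the demo/*_RAW.json convention from the diagram above.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    key = f"demo/{query.replace(' ', '_')}_{stamp}_RAW.json"
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=resp.text.encode("utf-8"))
    return {"raw_key": key, "listings": len(resp.json().get("jobs_results", []))}
```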

Process

  1. Ingestion: Scrape job listings via SerpAPI and store raw JSON.
  2. Cleaning: Normalize fields, remove boilerplate, and standardize locations (sketched below).
  3. Parsing & Enrichment: Extract salary ranges and skills from text (sketched below).
  4. Outputs: Save cleaned JSON for analysis and reporting.
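
The cleaning step (2 above) could look roughly like the sketch below; the boilerplate pattern and location alias map are illustrative stand-ins for whatever `src/cleaning/text_cleaning.py` actually does.

```python
# Illustrative sketch of field normalization; the real logic in
# text_cleaning.py may use different patterns and a larger location map.
import html
import re

BOILERPLATE = re.compile(
    r"(equal opportunity employer|click apply now).*$",   # assumed boilerplate cues
    re.IGNORECASE | re.DOTALL,
)

LOCATION_ALIASES = {"nyc": "New York, NY", "sf": "San Francisco, CA"}  # assumed map


def clean_description(text: str) -> str:
    """Unescape HTML, strip tags, drop trailing boilerplate, collapse whitespace."""
    text = html.unescape(text or "")
    text = re.sub(r"<[^>]+>", " ", text)
    text = BOILERPLATE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def normalize_location(raw: str) -> str:
    """Map informal location strings onto a consistent 'City, ST' form."""
    key = (raw or "").strip().lower()
    return LOCATION_ALIASES.get(key, (raw or "").strip())
```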

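The parsing & enrichment step (3 above) might be implemented along these lines; the salary regex, skill vocabulary, and return shapes are assumptions rather than the actual `salary_parser.py` / `skill_extractor.py` behavior.

```python
# Illustrative sketch of salary parsing and skill extraction; names and
# heuristics are hypothetical stand-ins for salary_parser.py / skill_extractor.py.
import re

SALARY_RANGE = re.compile(
    r"\$(\d[\d,]*)\s*(k)?\s*[-–]\s*\$?(\d[\d,]*)\s*(k)?", re.IGNORECASE
)
SKILLS = {"python", "sql", "aws", "spark", "airflow", "docker"}  # assumed vocabulary


def _to_dollars(raw, k_flag):
    """Convert a matched number (optionally suffixed with 'k') to whole dollars."""
    value = int(raw.replace(",", ""))
    return value * 1000 if k_flag or value < 1000 else value


def parse_salary(text: str):
    """Return (low, high) in dollars for ranges like '$90k - $120k', else None."""
    m = SALARY_RANGE.search(text or "")
    if not m:
        return None
    return _to_dollars(m.group(1), m.group(2)), _to_dollars(m.group(3), m.group(4))


def extract_skills(text: str):
    """Match description tokens against the skill vocabulary."""
    tokens = set(re.findall(r"[a-z+#]+", (text or "").lower()))
    return sorted(SKILLS & tokens)
```
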
Computed Insights

Top Skills

Salary Distribution

Top Locations
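
As a rough illustration, the top-skill and top-location tallies shown here could be derived from the cleaned records like this; the example file name and the `skills` / `location` field names are assumptions about the clean schema.

```python
# Hypothetical tally over cleaned listings (demo/*_CLEAN.json); field names
# "skills" (list) and "location" (string) are assumed, not confirmed.
import json
from collections import Counter
from pathlib import Path


def top_counts(clean_path, field, n=10):
    """Count values of `field` across cleaned listings and return the top n."""
    records = json.loads(Path(clean_path).read_text())
    counter = Counter()
    for rec in records:
        value = rec.get(field)
        if isinstance(value, list):   # e.g. "skills" holds a list per listing
            counter.update(value)
        elif value:                   # e.g. "location" holds a single string
            counter[value] += 1
    return counter.most_common(n)


# e.g. top_counts("demo/data_engineer_CLEAN.json", "skills")  # hypothetical file name
```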

Data Coverage

Before / After

Examples comparing raw vs cleaned records for a few listings.

Code & Data