Overview
This page summarizes the end-to-end ETL pipeline for job postings: scraping raw data, cleaning and normalizing it, and computing high-level metrics. Everything here is static and rendered in the browser; no backend is required.
Architecture
```mermaid
flowchart LR
A[Scrape Jobs Lambda\n`src/ingestion/scrape_jobs_lambda.py`] -->|SerpAPI JSON| B[(S3 Raw Bucket\n`demo/*_RAW.json`)]
B --> C[Clean Jobs Lambda\n`src/cleaning/clean_jobs_lambda.py`]
C --> C1[Text Cleaning\n`text_cleaning.py`]
C --> C2[Salary Parse\n`salary_parser.py`]
C --> C3[Skill Extract\n`skill_extractor.py`]
C1 --> D[(S3 Clean Bucket\n`demo/*_CLEAN.json`)]
C2 --> D
C3 --> D
D --> E[Notebook Analysis\n`notebooks/analysis.ipynb`]
D --> F["Static Showcase (this page)"]
```
Implementation refs: `src/ingestion/`, `src/cleaning/`, `src/utils/s3_io.py`.
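As a reference point, here is a minimal sketch of what the shared helper in `src/utils/s3_io.py` could look like; the function names and bucket/key handling are assumptions for illustration, not the module's actual API.

```python
# Hypothetical sketch of src/utils/s3_io.py: shared helpers the Lambdas could
# use to move JSON documents in and out of the raw and clean buckets.
import json

import boto3

_s3 = boto3.client("s3")


def read_json(bucket: str, key: str) -> dict:
    """Load the JSON object stored at s3://bucket/key."""
    response = _s3.get_object(Bucket=bucket, Key=key)
    return json.loads(response["Body"].read())


def write_json(bucket: str, key: str, payload: dict) -> None:
    """Serialize payload and upload it to s3://bucket/key."""
    _s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        ContentType="application/json",
    )
```

With helpers along these lines, the clean Lambda would read each `demo/*_RAW.json` key and write the matching `demo/*_CLEAN.json` object to the clean bucket.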
Process
- Ingestion: Scrape job listings via SerpAPI and store the raw JSON (see the first sketch after this list).
- Cleaning: Normalize fields, remove boilerplate, and standardize locations.
- Parsing & Enrichment: Extract salary ranges and skills from the posting text (see the second sketch after this list).
- Outputs: Save cleaned JSON for analysis and reporting.
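To make the ingestion step concrete, here is a hedged sketch of a scrape handler. It assumes the Lambda queries SerpAPI's Google Jobs engine and stores the response unmodified; the query parameters, environment variable names, and key naming are illustrative rather than copied from `scrape_jobs_lambda.py`.

```python
# Illustrative ingestion sketch (not the real scrape_jobs_lambda.py handler).
import json
import os

import boto3
import requests

RAW_BUCKET = os.environ.get("RAW_BUCKET", "jobs-raw")  # assumed env var


def scrape_jobs(query: str, location: str, api_key: str) -> dict:
    """Fetch one page of job results from SerpAPI's Google Jobs engine."""
    response = requests.get(
        "https://serpapi.com/search.json",
        params={
            "engine": "google_jobs",
            "q": query,
            "location": location,
            "api_key": api_key,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def handler(event, context):
    query = event.get("query", "data engineer")
    raw = scrape_jobs(query, event.get("location", "Remote"), os.environ["SERPAPI_KEY"])
    # Key layout assumed to follow the demo/*_RAW.json pattern from the diagram.
    key = f"demo/{query.replace(' ', '_')}_RAW.json"
    boto3.client("s3").put_object(
        Bucket=RAW_BUCKET,
        Key=key,
        Body=json.dumps(raw).encode("utf-8"),
        ContentType="application/json",
    )
    # "jobs_results" is the listing array in SerpAPI Google Jobs responses.
    return {"bucket": RAW_BUCKET, "key": key, "jobs": len(raw.get("jobs_results", []))}
```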
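Likewise, a sketch of the parsing and enrichment step. The regex, the skill vocabulary, and the return shapes are stand-ins; the real logic lives in `salary_parser.py` and `skill_extractor.py`.

```python
# Illustrative parsing/enrichment sketch (stand-in for salary_parser.py / skill_extractor.py).
import re

# Tiny stand-in vocabulary; the real extractor presumably carries a much larger list.
KNOWN_SKILLS = {"python", "sql", "spark", "aws", "docker", "airflow"}

# Matches patterns like "$90,000 - $120,000" or "$90K-$120K".
SALARY_RANGE = re.compile(
    r"\$\s*(\d[\d,]*)\s*(k)?\s*[-\u2013]\s*\$?\s*(\d[\d,]*)\s*(k)?",
    re.IGNORECASE,
)


def parse_salary(text: str):
    """Return (min, max) in dollars if a salary range is found, else None."""
    match = SALARY_RANGE.search(text)
    if not match:
        return None
    low, low_k, high, high_k = match.groups()
    low = int(low.replace(",", "")) * (1000 if low_k else 1)
    high = int(high.replace(",", "")) * (1000 if high_k else 1)
    return (low, high)


def extract_skills(text: str):
    """Return the known skills mentioned in a job description."""
    words = set(re.findall(r"[a-zA-Z+#]+", text.lower()))
    return sorted(KNOWN_SKILLS & words)


print(parse_salary("Compensation: $90,000 - $120,000 plus equity"))  # (90000, 120000)
print(extract_skills("We use Python, SQL and Airflow on AWS."))      # ['airflow', 'aws', 'python', 'sql']
```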
Computed Insights
The page renders the following aggregates in the browser from the cleaned sample:
- Top Skills
- Salary Distribution
- Top Locations
- Data Coverage
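A minimal sketch of how these aggregates could be computed from the cleaned sample. It assumes the file is a JSON array of job records with `skills`, `salary_min`, and `location` fields; the actual cleaned schema may differ.

```python
# Hypothetical aggregation over the cleaned sample; field names are assumed.
import json
from collections import Counter
from statistics import median

with open("frontend/data/clean.json", encoding="utf-8") as f:
    jobs = json.load(f)

top_skills = Counter(skill for job in jobs for skill in job.get("skills", []))
top_locations = Counter(job.get("location", "Unknown") for job in jobs)
salaries = [job["salary_min"] for job in jobs if job.get("salary_min") is not None]

print("Top skills:   ", top_skills.most_common(10))
print("Top locations:", top_locations.most_common(5))
if salaries:
    print("Salary (min): ", min(salaries), median(salaries), max(salaries))
print("Coverage:     ", f"{len(salaries)}/{len(jobs)} postings with a parsed salary")
```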
Before / After
Examples comparing raw vs cleaned records for a few listings.
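To reproduce the comparison locally, here is a small sketch that prints the first raw record next to its cleaned counterpart; it assumes both bundled samples are JSON arrays in matching order, which may not hold for the actual files.

```python
# Print the first raw record and its cleaned counterpart side by side.
# Assumes both sample files are JSON arrays in matching order.
import json

with open("frontend/data/raw.json", encoding="utf-8") as f:
    raw_jobs = json.load(f)
with open("frontend/data/clean.json", encoding="utf-8") as f:
    clean_jobs = json.load(f)

raw, clean = raw_jobs[0], clean_jobs[0]
for field in sorted(set(raw) | set(clean)):
    print(f"{field:20} raw={raw.get(field)!r}  clean={clean.get(field)!r}")
```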
Code & Data
- GitHub Repository
- Cleaned sample: `frontend/data/clean.json`
- Raw sample: `frontend/data/raw.json`