Overview
This page summarizes the end-to-end ETL pipeline for job postings: scraping raw data, cleaning and normalizing it, and computing high-level metrics. Everything here is static and rendered in the browser; no backend is required.
Architecture
```mermaid
flowchart LR
A[Scrape Jobs Lambda\n`src/ingestion/scrape_jobs_lambda.py`] -->|SerpAPI JSON| B[(S3 Raw Bucket\n`demo/*_RAW.json`)]
B --> C[Clean Jobs Lambda\n`src/cleaning/clean_jobs_lambda.py`]
C --> C1[Text Cleaning\n`text_cleaning.py`]
C --> C2[Salary Parse\n`salary_parser.py`]
C --> C3[Skill Extract\n`skill_extractor.py`]
C1 --> D[(S3 Clean Bucket\n`demo/*_CLEAN.json`)]
C2 --> D
C3 --> D
D --> E[Notebook Analysis\n`notebooks/analysis.ipynb`]
D --> F["Static Showcase (this page)"]
```
Implementation refs: `src/ingestion/`, `src/cleaning/`, `src/utils/s3_io.py`.
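As a reference point, here is a minimal sketch of what the shared helper in `src/utils/s3_io.py` could look like; the function names and bucket/key handling are assumptions for illustration, not the module's actual API.

```python
# Hypothetical sketch of src/utils/s3_io.py: shared helpers the Lambdas could
# use to move JSON documents in and out of the raw and clean buckets.
import json

import boto3

_s3 = boto3.client("s3")


def read_json(bucket: str, key: str) -> dict:
    """Load the JSON object stored at s3://bucket/key."""
    response = _s3.get_object(Bucket=bucket, Key=key)
    return json.loads(response["Body"].read())


def write_json(bucket: str, key: str, payload: dict) -> None:
    """Serialize payload and upload it to s3://bucket/key."""
    _s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        ContentType="application/json",
    )
```

With helpers along these lines, the clean Lambda would read each `demo/*_RAW.json` key and write the matching `demo/*_CLEAN.json` object to the clean bucket.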
Process
- Ingestion: Scrape job listings via SerpAPI and store the raw JSON (see the first sketch after this list).
- Cleaning: Normalize fields, remove boilerplate, and standardize locations.
- Parsing & Enrichment: Extract salary ranges and skills from the posting text (see the second sketch after this list).
- Outputs: Save cleaned JSON for analysis and reporting.
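To make the ingestion step concrete, here is a hedged sketch of a scrape handler. It assumes the Lambda queries SerpAPI's Google Jobs engine and stores the response unmodified; the query parameters, environment variable names, and key naming are illustrative rather than copied from `scrape_jobs_lambda.py`.

```python
# Illustrative ingestion sketch (not the real scrape_jobs_lambda.py handler).
import json
import os

import boto3
import requests

RAW_BUCKET = os.environ.get("RAW_BUCKET", "jobs-raw")  # assumed env var


def scrape_jobs(query: str, location: str, api_key: str) -> dict:
    """Fetch one page of job results from SerpAPI's Google Jobs engine."""
    response = requests.get(
        "https://serpapi.com/search.json",
        params={
            "engine": "google_jobs",
            "q": query,
            "location": location,
            "api_key": api_key,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def handler(event, context):
    query = event.get("query", "data engineer")
    raw = scrape_jobs(query, event.get("location", "Remote"), os.environ["SERPAPI_KEY"])
    # Key layout assumed to follow the demo/*_RAW.json pattern from the diagram.
    key = f"demo/{query.replace(' ', '_')}_RAW.json"
    boto3.client("s3").put_object(
        Bucket=RAW_BUCKET,
        Key=key,
        Body=json.dumps(raw).encode("utf-8"),
        ContentType="application/json",
    )
    # "jobs_results" is the listing array in SerpAPI Google Jobs responses.
    return {"bucket": RAW_BUCKET, "key": key, "jobs": len(raw.get("jobs_results", []))}
```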
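Likewise, a sketch of the parsing and enrichment step. The regex, the skill vocabulary, and the return shapes are stand-ins; the real logic lives in `salary_parser.py` and `skill_extractor.py`.

```python
# Illustrative parsing/enrichment sketch (stand-in for salary_parser.py / skill_extractor.py).
import re

# Tiny stand-in vocabulary; the real extractor presumably carries a much larger list.
KNOWN_SKILLS = {"python", "sql", "spark", "aws", "docker", "airflow"}

# Matches patterns like "$90,000 - $120,000" or "$90K-$120K".
SALARY_RANGE = re.compile(
    r"\$\s*(\d[\d,]*)\s*(k)?\s*[-\u2013]\s*\$?\s*(\d[\d,]*)\s*(k)?",
    re.IGNORECASE,
)


def parse_salary(text: str):
    """Return (min, max) in dollars if a salary range is found, else None."""
    match = SALARY_RANGE.search(text)
    if not match:
        return None
    low, low_k, high, high_k = match.groups()
    low = int(low.replace(",", "")) * (1000 if low_k else 1)
    high = int(high.replace(",", "")) * (1000 if high_k else 1)
    return (low, high)


def extract_skills(text: str):
    """Return the known skills mentioned in a job description."""
    words = set(re.findall(r"[a-zA-Z+#]+", text.lower()))
    return sorted(KNOWN_SKILLS & words)


print(parse_salary("Compensation: $90,000 - $120,000 plus equity"))  # (90000, 120000)
print(extract_skills("We use Python, SQL and Airflow on AWS."))      # ['airflow', 'aws', 'python', 'sql']
```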
Computed Insights
The page renders the following aggregates in the browser from the cleaned sample:
- Top Skills
- Salary Distribution
- Top Locations
- Data Coverage
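A minimal sketch of how these aggregates could be computed from the cleaned sample. It assumes the file is a JSON array of job records with `skills`, `salary_min`, and `location` fields; the actual cleaned schema may differ.

```python
# Hypothetical aggregation over the cleaned sample; field names are assumed.
import json
from collections import Counter
from statistics import median

with open("frontend/data/clean.json", encoding="utf-8") as f:
    jobs = json.load(f)

top_skills = Counter(skill for job in jobs for skill in job.get("skills", []))
top_locations = Counter(job.get("location", "Unknown") for job in jobs)
salaries = [job["salary_min"] for job in jobs if job.get("salary_min") is not None]

print("Top skills:   ", top_skills.most_common(10))
print("Top locations:", top_locations.most_common(5))
if salaries:
    print("Salary (min): ", min(salaries), median(salaries), max(salaries))
print("Coverage:     ", f"{len(salaries)}/{len(jobs)} postings with a parsed salary")
```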
Before / After
Examples comparing raw vs cleaned records for a few listings.
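To reproduce the comparison locally, here is a small sketch that prints the first raw record next to its cleaned counterpart; it assumes both bundled samples are JSON arrays in matching order, which may not hold for the actual files.

```python
# Print the first raw record and its cleaned counterpart side by side.
# Assumes both sample files are JSON arrays in matching order.
import json

with open("frontend/data/raw.json", encoding="utf-8") as f:
    raw_jobs = json.load(f)
with open("frontend/data/clean.json", encoding="utf-8") as f:
    clean_jobs = json.load(f)

raw, clean = raw_jobs[0], clean_jobs[0]
for field in sorted(set(raw) | set(clean)):
    print(f"{field:20} raw={raw.get(field)!r}  clean={clean.get(field)!r}")
```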
Code & Data
- GitHub Repository
- Cleaned sample: `frontend/data/clean.json`
- Raw sample: `frontend/data/raw.json`