Data Engineering Case Study

Automated ELT
CNPJ Cloud Sync

A data pipeline that orchestrates the ingestion of CNPJ information from BrasilAPI directly into a bucket on Google Cloud Storage.

Autonomous
Daily morning execution
Hive-Style
Partitioning
Python | GCS | GitHub Actions
Main Tech Stack

Pipeline Architecture

Ingestion
Source: Brasil API
Action: Python Script (buscar_cnpj)
Format: RAW JSON
Landing Zone
Store: GCS Bucket
Path: /raw/cnpj/
Partitioned: cnpj_YYYY-MM-DD
Processing & ETL
Engine: GitHub Actions + Pandas
Logic: Flattening (QSA/CNAE)
Type Cast: datetime64[us]
Output: Parquet
Analytics Zone
Silver: GCS Parquet Files
Gold: BigQuery Tables
Goal: Ready for BI & Monitoring

End-to-end automated pipeline: From BrasilAPI to GCS and BigQuery

Extraction

Targeted API Ingestion

Python-based engine designed to consume the BrasilAPI, retrieving registration data for 55 target CNPJs.

  • Request Management: Robust retry logic and error handling for API resilience (see the sketch after this list).
  • Local Isolation: Environment managed via Python venv to ensure dependency consistency.
  • Raw Capture: Data extracted in its original JSON format to preserve the source of truth.
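
A minimal sketch of the extraction step, assuming the public BrasilAPI CNPJ endpoint (brasilapi.com.br/api/cnpj/v1). The function name buscar_cnpj mirrors the script referenced in the architecture; the retry parameters and the local landing path are illustrative.

```python
import json
import time
from pathlib import Path

import requests

# Public BrasilAPI CNPJ endpoint (assumed); each request returns one company's registration data
BRASILAPI_URL = "https://brasilapi.com.br/api/cnpj/v1/{cnpj}"


def buscar_cnpj(cnpj: str, retries: int = 3, backoff: float = 2.0) -> dict:
    """Fetch raw registration data for a single CNPJ, retrying on transient failures."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(BRASILAPI_URL.format(cnpj=cnpj), timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == retries:
                raise  # give up after the last attempt so the failure is visible in the run logs
            time.sleep(backoff * attempt)  # simple linear backoff between retries


def salvar_raw(payload: dict, destino: Path) -> None:
    """Persist the untouched JSON payload to the local landing folder (source of truth)."""
    destino.parent.mkdir(parents=True, exist_ok=True)
    destino.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
```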
Processing

Refining Raw to Silver

Implementation of data engineering best practices to transform semi-structured data into analytics-ready assets.

  • Flattening: Normalizing nested JSON structures (Partners/QSA and CNAEs) into tabular format (see the sketch after this list).
  • Standardization: Converting timestamps to datetime64[us] for BigQuery compatibility and São Paulo timezone alignment.
  • Format Optimization: Transitioning from JSON to Parquet for better compression and query performance.
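
A possible shape for this step, assuming the raw payload exposes partners under a qsa key and company-level fields such as cnpj and razao_social (the field names are assumptions); pandas.json_normalize handles the flattening, and the date columns are cast to datetime64[us] (pandas 2.x) before the Parquet write.

```python
import pandas as pd


def flatten_qsa(payload: dict) -> pd.DataFrame:
    """Explode the nested partners list (QSA) into one row per partner, repeating company-level fields."""
    return pd.json_normalize(
        payload,
        record_path="qsa",              # assumed key holding the partners list
        meta=["cnpj", "razao_social"],  # assumed company-level fields to carry along
        record_prefix="socio_",
        errors="ignore",                # tolerate payloads where some meta fields are missing
    )


def to_silver(df: pd.DataFrame, date_cols: list[str], path: str) -> None:
    """Standardize timestamps to microsecond precision and persist the table as Parquet."""
    for col in date_cols:
        # BigQuery does not accept nanosecond precision, so cast down to datetime64[us]
        df[col] = pd.to_datetime(df[col], errors="coerce").astype("datetime64[us]")
    df.to_parquet(path, index=False)  # requires pyarrow (or fastparquet) to be installed
```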
Cloud Loading

Partitioning & Data Lake Hydration

Automated loading of processed files into Google Cloud Storage with a focus on governance.

  • Hive Partitioning: Organizing the bucket by year/month/day prefixes to reduce scan costs (see the sketch after this list).
  • Security: Integration via GCP Service Accounts and GitHub Secrets for secure key management.
  • Integrity: Ensuring the "Silver" layer is strictly typed before the BigQuery load.
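
A sketch of the upload step using the google-cloud-storage client, which authenticates through the service-account key exposed to the runner; the bucket name and the silver/cnpj prefix are illustrative.

```python
from datetime import datetime, timezone
from pathlib import Path

from google.cloud import storage  # authenticates via GOOGLE_APPLICATION_CREDENTIALS


def upload_silver(local_path: str, bucket_name: str = "cnpj-cloud-sync") -> str:
    """Upload a Parquet file into a Hive-style year/month/day prefix on GCS."""
    now = datetime.now(timezone.utc)
    blob_name = f"silver/cnpj/year={now:%Y}/month={now:%m}/day={now:%d}/{Path(local_path).name}"
    client = storage.Client()  # credentials resolved from the service-account key
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(local_path)
    return blob_name  # the Hive-style object path written to the bucket
```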
Automation

GitHub Actions Workflow

Deployment of a CI/CD pipeline that turns the local project into an autonomous serverless engine.

  • Scheduling: Fully automated daily runs at 08:50 AM BRT via cron expressions (a workflow sketch follows this list).
  • Environment Specs: Automated setup of the virtual environment and requirements on ephemeral runners.
  • Hands-off Ops: Zero manual intervention from extraction to GCS upload.
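
A workflow sketch of how the schedule and environment setup might be declared; the file name, step names, entry point, and secret name (GCP_SA_KEY) are assumptions, and the cron expression converts 08:50 BRT (UTC-3) to UTC.

```yaml
# .github/workflows/cnpj_sync.yml (illustrative name)
name: CNPJ Cloud Sync

on:
  schedule:
    - cron: "50 11 * * *"   # 08:50 BRT (UTC-3) expressed in UTC
  workflow_dispatch:         # allow manual runs for debugging

jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run pipeline
        env:
          GCP_SA_KEY: ${{ secrets.GCP_SA_KEY }}   # service-account key stored as a GitHub Secret
        run: python main.py                        # assumed entry point
```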
Next Steps

Quality Metrics & Visualization

Finalizing the pipeline with data health monitoring and business-ready dashboards.

  • Data Quality: Implementation of Completeness and Validity checks (Data Governance); a minimal sketch follows this list.
  • Looker Studio: Planned integration for real-time monitoring of ingestion health and CNPJ statuses.
  • Scalability: Modular code ready for migration to Apache Airflow.
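
A minimal sketch of what the planned completeness and validity checks could look like on the silver table; the column names (e.g. cnpj) are assumptions.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, required_cols: list[str]) -> dict:
    """Compute simple completeness and validity metrics for the silver CNPJ table."""
    # Completeness: share of non-null values per required column
    completeness = {col: float(df[col].notna().mean()) for col in required_cols}
    # Validity: a CNPJ must contain exactly 14 digits once punctuation is stripped
    digits = df["cnpj"].astype(str).str.replace(r"\D", "", regex=True)  # "cnpj" column name is assumed
    validity = float((digits.str.len() == 14).mean())
    return {"completeness": completeness, "cnpj_validity": validity}
```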

See the repository