Data Engineering Case Study

Automated ELT
CNPJ Cloud Sync

A data pipeline that orchestrates the ingestion of CNPJ information from BrasilAPI directly into a bucket on Google Cloud Storage.

Autonomous
Daily morning execution
Hive-Style
Partitioning
Python | GCS | GitHub Actions
Main Tech Stack

Pipeline Architecture

Ingestion
Source: Brasil API
Action: Python Script (buscar_cnpj)
Format: RAW JSON
Landing Zone
Store: GCS Bucket
Path: /raw/cnpj/
Partitioned: cnpj_YYYY-MM-DD
Processing & ETL
Engine: GitHub Actions + Pandas
Logic: Flattening (QSA/CNAE)
Type Cast: datetime64[us]
Output: Parquet
Analytics Zone
Silver: GCS Parquet Files
Gold: BigQuery Tables
Goal: Ready for BI & Monitoring

End-to-end automated pipeline: From BrasilAPI to GCS and BigQuery

Extraction

Targeted API Ingestion

Python-based engine designed to consume the BrasilAPI, retrieving registration data for 55 target CNPJs.

  • Request Management: Robust retry logic and error handling for API resilience (see the sketch after this list).
  • Local Isolation: Environment managed via Python venv to ensure dependency consistency.
  • Raw Capture: Data extracted in its original JSON format to preserve the source of truth.
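
A minimal sketch of the extraction step, assuming the public BrasilAPI CNPJ endpoint (brasilapi.com.br/api/cnpj/v1). The function name buscar_cnpj mirrors the script referenced in the architecture; the retry parameters and the local landing path are illustrative.

```python
import json
import time
from pathlib import Path

import requests

# Public BrasilAPI CNPJ endpoint (assumed); each request returns one company's registration data
BRASILAPI_URL = "https://brasilapi.com.br/api/cnpj/v1/{cnpj}"


def buscar_cnpj(cnpj: str, retries: int = 3, backoff: float = 2.0) -> dict:
    """Fetch raw registration data for a single CNPJ, retrying on transient failures."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(BRASILAPI_URL.format(cnpj=cnpj), timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == retries:
                raise  # give up after the last attempt so the failure is visible in the run logs
            time.sleep(backoff * attempt)  # simple linear backoff between retries


def salvar_raw(payload: dict, destino: Path) -> None:
    """Persist the untouched JSON payload to the local landing folder (source of truth)."""
    destino.parent.mkdir(parents=True, exist_ok=True)
    destino.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
```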
Processing

Refining Raw to Silver

Implementation of data engineering best practices to transform semi-structured data into analytics-ready assets.

  • Flattening: Normalizing nested JSON structures (Partners/QSA and CNAEs) into tabular format (see the sketch after this list).
  • Standardization: Converting timestamps to datetime64[us] for BigQuery compatibility and São Paulo timezone alignment.
  • Format Optimization: Transitioning from JSON to Parquet for better compression and query performance.
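
A possible shape for this step, assuming the raw payload exposes partners under a qsa key and company-level fields such as cnpj and razao_social (the field names are assumptions); pandas.json_normalize handles the flattening, and the date columns are cast to datetime64[us] (pandas 2.x) before the Parquet write.

```python
import pandas as pd


def flatten_qsa(payload: dict) -> pd.DataFrame:
    """Explode the nested partners list (QSA) into one row per partner, repeating company-level fields."""
    return pd.json_normalize(
        payload,
        record_path="qsa",              # assumed key holding the partners list
        meta=["cnpj", "razao_social"],  # assumed company-level fields to carry along
        record_prefix="socio_",
        errors="ignore",                # tolerate payloads where some meta fields are missing
    )


def to_silver(df: pd.DataFrame, date_cols: list[str], path: str) -> None:
    """Standardize timestamps to microsecond precision and persist the table as Parquet."""
    for col in date_cols:
        # BigQuery does not accept nanosecond precision, so cast down to datetime64[us]
        df[col] = pd.to_datetime(df[col], errors="coerce").astype("datetime64[us]")
    df.to_parquet(path, index=False)  # requires pyarrow (or fastparquet) to be installed
```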
Cloud Loading

Partitioning & Data Lake Hydration

Automated loading of processed files into Google Cloud Storage with a focus on governance.

  • Hive Partitioning: Organizing the bucket by year/month/day prefixes to reduce scan costs (see the sketch after this list).
  • Security: Integration via GCP Service Accounts and GitHub Secrets for secure key management.
  • Integrity: Ensuring the "Silver" layer is strictly typed before the BigQuery load.
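
A sketch of the upload step using the google-cloud-storage client, which authenticates through the service-account key exposed to the runner; the bucket name and the silver/cnpj prefix are illustrative.

```python
from datetime import datetime, timezone
from pathlib import Path

from google.cloud import storage  # authenticates via GOOGLE_APPLICATION_CREDENTIALS


def upload_silver(local_path: str, bucket_name: str = "cnpj-cloud-sync") -> str:
    """Upload a Parquet file into a Hive-style year/month/day prefix on GCS."""
    now = datetime.now(timezone.utc)
    blob_name = f"silver/cnpj/year={now:%Y}/month={now:%m}/day={now:%d}/{Path(local_path).name}"
    client = storage.Client()  # credentials resolved from the service-account key
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(local_path)
    return blob_name  # the Hive-style object path written to the bucket
```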
Automation

GitHub Actions Workflow

Deployment of a CI/CD pipeline that turns the local project into an autonomous serverless engine.

  • Scheduling: Fully automated daily runs at 08:50 AM BRT via cron expressions (a workflow sketch follows this list).
  • Environment Specs: Automated setup of the virtual environment and requirements on ephemeral runners.
  • Hands-off Ops: Zero manual intervention from extraction to GCS upload.
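
A workflow sketch of how the schedule and environment setup might be declared; the file name, step names, entry point, and secret name (GCP_SA_KEY) are assumptions, and the cron expression converts 08:50 BRT (UTC-3) to UTC.

```yaml
# .github/workflows/cnpj_sync.yml (illustrative name)
name: CNPJ Cloud Sync

on:
  schedule:
    - cron: "50 11 * * *"   # 08:50 BRT (UTC-3) expressed in UTC
  workflow_dispatch:         # allow manual runs for debugging

jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run pipeline
        env:
          GCP_SA_KEY: ${{ secrets.GCP_SA_KEY }}   # service-account key stored as a GitHub Secret
        run: python main.py                        # assumed entry point
```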
Next Steps

Quality Metrics & Visualization

Finalizing the pipeline with data health monitoring and business-ready dashboards.

  • Data Quality: Implementation of Completeness and Validity checks (Data Governance); a minimal sketch follows this list.
  • Looker Studio: Planned integration for real-time monitoring of ingestion health and CNPJ statuses.
  • Scalability: Modular code ready for migration to Apache Airflow.
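
A minimal sketch of what the planned completeness and validity checks could look like on the silver table; the column names (e.g. cnpj) are assumptions.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, required_cols: list[str]) -> dict:
    """Compute simple completeness and validity metrics for the silver CNPJ table."""
    # Completeness: share of non-null values per required column
    completeness = {col: float(df[col].notna().mean()) for col in required_cols}
    # Validity: a CNPJ must contain exactly 14 digits once punctuation is stripped
    digits = df["cnpj"].astype(str).str.replace(r"\D", "", regex=True)  # "cnpj" column name is assumed
    validity = float((digits.str.len() == 14).mean())
    return {"completeness": completeness, "cnpj_validity": validity}
```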

See the repository