Corim Distribution: Route Optimization Pipeline
Built a Python automation pipeline processing 1,916 real-world customer records for a live distribution company, calculating driving distances and travel times via the Google Maps Distance Matrix API. Engineered robust dirty-data handling for malformed and invalid address entries, exponential backoff retry logic with a 3-attempt ceiling on failed geocoding calls, and a checkpoint-resume pattern writing incremental Excel output every 100 records to prevent data loss and allow mid-run restarts across multi-hour processing jobs.
Pipeline Architecture
Customer records flow in · addresses are normalized · Distance Matrix calls go out in batches · failures retry with backoff · output is checkpointed every 100 records · final result is an Excel-ready distance matrix.
What’s actually happening at each stage
Each stage is explained twice, first for the finance reader, then for the engineer.
1. Dirty-Data Address Normalization
Finance lens
Real-world business data is messy. Distribution customer lists have typos, missing zip codes, abbreviated street names, and entries that look right but aren't. Before any optimization runs, the pipeline cleans every record so downstream API calls don't waste budget on garbage inputs and so unsalvageable records are flagged rather than silently corrupting the result.
Engineering lens
A normalization layer strips typos, expands abbreviations, infers missing fields where possible, and routes unsalvageable records to a separate exception queue rather than failing the whole batch. Validation happens before the first paid API call, not after.
2. Distance Matrix Computation at Scale
Finance lens
For each of the 1,916 customers, the model needs driving distance and travel time to every other relevant location, that's tens of thousands of pairwise queries to Google Maps. The output is the substrate for any route optimization the company runs on top of it (last-mile planning, capacity allocation, contract bid responses).
Engineering lens
Asynchronous batched calls to Google Maps Distance Matrix API. Calls are throttled to stay under quota and batched for cost efficiency, Google charges per element, so packing matters. Results are stored as a sparse matrix in Excel for downstream consumption by non-technical stakeholders.
3. Exponential Backoff & Resilient Retries
Finance lens
APIs fail. Networks flap. Across multi-hour runs, a single transient error shouldn't kill an 8-hour job. The pipeline retries failed calls with progressively longer waits before giving up after 3 attempts, and logs every failure with full context so the operator can review them after the fact instead of losing the data silently.
Engineering lens
Exponential backoff (1s, 2s, 4s) with a hard 3-attempt ceiling. Failures past that are logged with full context (input row, response code, message) to a separate exception table; the run continues. Success/failure rates are tracked across error categories so systemic issues surface quickly.
4. Checkpoint-Resume Pattern
Finance lens
If the run crashes at hour 5, you don't want to start over. Every 100 records, the pipeline saves its progress to disk so it can pick up exactly where it left off, no wasted API calls, no lost work, no second invoice from Google for the same data.
Engineering lens
Incremental Excel writes every 100 records via openpyxl. On restart, the pipeline reads the existing output, resumes from the next unprocessed record, and continues. Idempotent, safe to restart any number of times. The same pattern handles partial-failure runs and operator-induced kills.
Methodology notes
Address normalization runs before any paid API call, dirty inputs are routed to an exception queue rather than corrupting the main run
Asynchronous batched calls to Google Maps Distance Matrix API, throttled under quota and packed for per-element cost efficiency
Exponential backoff (1s / 2s / 4s) with a 3-attempt ceiling; failures are logged with full context to a separate exception table while the run continues
Checkpoint-resume pattern: incremental Excel writes every 100 records via openpyxl; the pipeline is idempotent and safe to restart at any point in a multi-hour run