After I joined Blue Hex Software in the spring of 2021, our first ever client was from Miami, Florida. An insurance provider connected with us and shared their problem statement.
Problem Statement
Their business revolves around selling customized insurance plans to their customers, for which they have crafted an in-house solution that helps them calculate the right price for the end customer.
The problem, you ask? Well, the issue was that the in-house solution came in the form of a Microsoft Excel sheet. Now you get it, right?
What they were looking for was a way to migrate the calculations and business logic embedded inside that Excel sheet to a modern application where customers could be onboarded easily and given the same value, with an MIS layer added on top for them to monitor growth.
Solution
We proposed a 3-tier architecture that can grow with their business and can be maintained easily with minimal support.
Tech Stack
Django (Backend)
NextJS (Frontend)
Postgres (Database, served on AWS RDS)
Docker (Containerisation)
RabbitMQ (Message Broker)
Redis (Cache)
Static Files (AWS S3)
GitHub Actions (CI/CD with AWS ECS or EC2)
With that said, we started the development.
Data Model
The schema is simple and straightforward: there are Users, Customers, Plans, Insurance Companies, Price, Family, Invoice, and Payments. This is a high-level overview of how the schema looked, without exposing the actual fields and metadata kept to store the business logic.
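To give a rough idea of the shape, here is a minimal sketch of a couple of those Django models. The field names and relationships are illustrative assumptions, not the client's actual schema.

from django.db import models

class InsuranceCompany(models.Model):
    # Illustrative fields only; the real models carry the pricing metadata
    name = models.CharField(max_length=255)

class Plan(models.Model):
    company = models.ForeignKey(InsuranceCompany, on_delete=models.CASCADE, related_name="plans")
    name = models.CharField(max_length=255)
    is_active = models.BooleanField(default=True)

class Customer(models.Model):
    full_name = models.CharField(max_length=255)
    email = models.EmailField(unique=True)

class Price(models.Model):
    # A price quote for a given customer on a given plan (assumed relationship)
    plan = models.ForeignKey(Plan, on_delete=models.CASCADE, related_name="prices")
    customer = models.ForeignKey(Customer, on_delete=models.CASCADE, related_name="prices")
    amount = models.DecimalField(max_digits=12, decimal_places=2)
    valid_from = models.DateField()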
The Migration
You see, when we started the migration, we thought it would be simple. I mean, how bad could it be? Well, that turned out to be a bad estimation on our end.
The spreadsheet was horrific, with hidden columns and conditional statements, and it had hundreds of different worksheets. This required serious R&D on the sheets, since a basic Pandas or openpyxl read wouldn't work.
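To give a sense of the digging involved, here is a minimal sketch of how openpyxl can be used to surface the hidden columns and raw formulas that a plain value read silently skips over. The file name is a hypothetical placeholder.

from openpyxl import load_workbook

# data_only=False keeps the formula strings ("=IF(...)") instead of cached values
wb = load_workbook("pricing_workbook.xlsx", data_only=False)  # hypothetical file name

for ws in wb.worksheets:
    # Hidden columns are easy to miss with a naive pandas.read_excel()
    hidden_cols = [letter for letter, dim in ws.column_dimensions.items() if dim.hidden]
    # Cells whose value starts with "=" hold the conditional business logic
    formulas = [
        (cell.coordinate, cell.value)
        for row in ws.iter_rows()
        for cell in row
        if isinstance(cell.value, str) and cell.value.startswith("=")
    ]
    print(ws.title, "hidden columns:", hidden_cols, "formula cells:", len(formulas))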
The Parsing Engine and MapReduce
About MapReduce
MapReduce is a programming model used for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It's commonly used in big data processing, especially in Hadoop. Here's an example:
from collections import defaultdict
from multiprocessing import Pool

# Sample data: List of lines from a document
data = [
    "This is a sample document",
    "This document is a sample"
]

def map_function(line):
    # Map step: emit a (word, 1) pair for every word in the line
    words = line.split()
    return [(word, 1) for word in words]

def reduce_function(mapped_data):
    # Reduce step: sum the counts for each word
    word_count = defaultdict(int)
    for word, count in mapped_data:
        word_count[word] += count
    return word_count

if __name__ == "__main__":
    # Map step: Apply the map function to each line in parallel
    with Pool() as pool:
        mapped_data = pool.map(map_function, data)

    # Flatten the list of lists
    mapped_data = [item for sublist in mapped_data for item in sublist]

    # Reduce step: Apply the reduce function to combine results
    reduced_data = reduce_function(mapped_data)
    print(reduced_data)
We used a similar concept and created an algorithm that generated artifacts from the worksheets and then mapped them to the parsed results, followed by a loading script that would load the content into the Postgres database, selecting the right table name. The function was not 100% perfect, but it did the job.
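To make the idea concrete, here is a heavily simplified sketch of that pipeline, not our production code: the worksheet-to-table mapping, file name, and helper names are assumptions. Each worksheet is "mapped" to candidate rows in parallel, the results are "reduced" by grouping them per target table, and a loader inserts each group into the right Postgres table.

from collections import defaultdict
from multiprocessing import Pool

import psycopg2
from openpyxl import load_workbook

# Assumed mapping from worksheet name to Postgres table
TABLE_FOR_SHEET = {"Plans": "plans", "Prices": "prices"}

def map_worksheet(args):
    # Map step: turn one worksheet into (table_name, row_dict) pairs,
    # assuming the first row holds clean column names
    path, sheet_name = args
    ws = load_workbook(path, data_only=True)[sheet_name]
    header = [cell.value for cell in next(ws.iter_rows(max_row=1))]
    table = TABLE_FOR_SHEET.get(sheet_name)
    if table is None:
        return []
    return [(table, dict(zip(header, (cell.value for cell in row))))
            for row in ws.iter_rows(min_row=2)]

def reduce_rows(mapped_chunks):
    # Reduce step: group all extracted rows by their target table
    grouped = defaultdict(list)
    for chunk in mapped_chunks:
        for table, row in chunk:
            grouped[table].append(row)
    return grouped

def load_into_postgres(grouped, dsn):
    # Load step: insert each group into its table (identifiers come from our
    # own mapping above, so string formatting is acceptable for this sketch)
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for table, rows in grouped.items():
            for row in rows:
                columns = ", ".join(row)
                placeholders = ", ".join(["%s"] * len(row))
                cur.execute(f"INSERT INTO {table} ({columns}) VALUES ({placeholders})",
                            list(row.values()))

if __name__ == "__main__":
    path = "pricing_workbook.xlsx"  # hypothetical file
    sheet_names = load_workbook(path, read_only=True).sheetnames
    with Pool() as pool:
        mapped = pool.map(map_worksheet, [(path, name) for name in sheet_names])
    load_into_postgres(reduce_rows(mapped), dsn="postgresql://user:pass@localhost/insurance")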
The migration was completed well within the deadline, with less than 5% data loss. The lost data could not be recovered through the functions we wrote, so we resorted to some manual data entry, and then created seeder scripts that would help the client in the future if they ever want to move from AWS RDS or Postgres to any other database.
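As a rough sketch of what such a seeder could look like (the app label and file name are assumptions), Django's own dumpdata/loaddata commands can be wrapped so the data lands in database-agnostic JSON fixtures and can be reloaded into whatever database the settings point at:

from django.core.management import call_command

# Assumes this runs inside the Django project (e.g. from a management command)

def export_seed(path="seed.json"):
    # Dump the app's data to JSON fixtures ("insurance" is an assumed app label)
    with open(path, "w") as fh:
        call_command("dumpdata", "insurance", indent=2, stdout=fh)

def load_seed(path="seed.json"):
    # Reload the fixtures into whichever database the current settings use
    call_command("loaddata", path)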
Final Thoughts
Not all solutions are found on the internet or in threads on Stack Overflow. Sometimes, we have to think outside the box, create our own solution, and then optimise it.
Till next time, happy coding :)