Enterprise Data Governance Without the Vendor Tax: PII Masking at Scale for BFSI
Commercial governance platforms cost ₹2–5 Cr annually and still need heavy customisation for BFSI-specific PII. Here is how to build enterprise-grade governance on open-source foundations.
Data governance has a vendor problem. The commercial platforms — Collibra, Informatica Data Governance, Alation, IBM Knowledge Catalog — are genuinely capable tools. They are also expensive, complex to implement, and built primarily for global enterprise use cases that do not always map cleanly onto the specific requirements of Indian BFSI firms.
The result is a familiar pattern: a BFSI firm invests heavily in a commercial governance platform, spends twelve months in implementation, achieves partial coverage of its actual governance requirements, and then runs a tool that costs ₹2–5 crore annually in licensing for a subset of what it was promised.
There is a better path. Over the past few years, I have built enterprise-grade data governance capability on open-source foundations — including a custom governance frontend — that delivers full BFSI-specific requirements at a fraction of the commercial cost. This article is a candid account of how.
What BFSI Governance Actually Requires
Before talking about solutions, let us be specific about the problem. In a BFSI context, data governance is not an abstract data management exercise. It is driven by concrete regulatory and operational requirements:
- PII identification and classification: Know exactly where personal data — PAN, Aadhaar references, account numbers, mobile numbers, email addresses, date of birth — exists across your entire data estate
- Access control at the data level: Ensure that only authorised roles can access specific data categories, with column-level and row-level granularity
- Dynamic masking: PII should be visible in its raw form only to those with explicit business justification — others see masked values that preserve data utility without exposing sensitive information
- Audit trails: Every access to sensitive data must be logged with sufficient detail for regulatory inspection — who accessed what, when, and why
- Data lineage: Understand where sensitive data originated, how it has been transformed, and where it flows across downstream systems
- Retention enforcement: Regulatory data retention schedules must be enforced programmatically, not managed through manual processes that inevitably have gaps
The Digital Personal Data Protection Act 2023 has materially raised the stakes for BFSI data governance. Firms that previously treated governance as an IT function are now dealing with obligations that carry meaningful penalties and senior management accountability. This has driven renewed focus on governance capability and exposed how heavily most firms still rely on manual controls for PII, controls that cannot scale.
The Open-Source Governance Stack
A complete open-source governance capability for BFSI is built on three layers: classification, enforcement, and transparency. Each layer has well-established open-source components that, assembled correctly, provide enterprise-grade capability.
Figure: Enterprise PII masking architecture (data flow), from source ingestion through Apache Ranger's dynamic masking engine to role-differentiated views with immutable audit trails.
Layer 1: Data Classification and Cataloguing
Apache Atlas provides the metadata management and governance layer — effectively the registry of what data exists, where it lives, and what it means. Combined with a custom classification schema developed for BFSI-specific PII categories, it becomes the single source of truth for the governance estate.
The classification taxonomy we developed for BFSI includes:
- Identity PII: PAN, Aadhaar reference, passport, driving licence, voter ID
- Financial PII: Account numbers, IFSC codes, demat account numbers, UCC codes
- Contact PII: Mobile, email, residential address, communication preferences
- Transaction sensitive: Trade details, portfolio holdings, advisory interactions
- Derived sensitive: Client-level P&L, risk profiles, behavioural scoring outputs
Classification is automated using pattern matching (regex for structured PII like PAN format) and NLP-based scanning for semi-structured fields. New tables are automatically scanned at onboarding and classified before they are made available for analytical access.
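The pattern-matching half of that classification step can be sketched in a few lines of Python. This is a minimal illustration rather than the production scanner: the label names, the `classify_column` helper, and the 0.8 match threshold are assumptions made for the example, and the NLP-based scanning of semi-structured fields is out of scope here.

```python
import re

# Format rules for structured identifiers. The PAN rule follows the published
# format (5 letters, 4 digits, 1 letter); label names are illustrative only.
PII_PATTERNS = {
    "identity.pan": re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]"),
    "financial.ifsc": re.compile(r"[A-Z]{4}0[A-Z0-9]{6}"),
    "contact.mobile": re.compile(r"(?:\+91)?[6-9][0-9]{9}"),
    "contact.email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def classify_column(sample_values, min_match_ratio=0.8):
    """Return the PII label matching the largest share of sampled values.

    min_match_ratio is a hypothetical threshold: the share of non-empty
    sample values that must match before the column receives a label.
    Returns None when no pattern reaches the threshold.
    """
    values = [str(v).strip() for v in sample_values if v not in (None, "")]
    if not values:
        return None
    best_label, best_ratio = None, 0.0
    for label, pattern in PII_PATTERNS.items():
        matches = sum(1 for v in values if pattern.fullmatch(v))
        ratio = matches / len(values)
        if ratio > best_ratio:
            best_label, best_ratio = label, ratio
    return best_label if best_ratio >= min_match_ratio else None
```

Calling `classify_column(["ABCDE1234F", "PQRST5678K"])` labels the column as `identity.pan`, while a column of free text returns `None` and would fall through to the NLP-based scan. Sampling values and applying a threshold, rather than requiring every value to match, tolerates the dirty rows that real onboarding data tends to contain.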
Layer 2: Access Enforcement and Masking
Apache Ranger provides the policy enforcement layer: the system that intercepts data access requests and enforces the governance rules. For the lakehouse architecture, Ranger integrates with Trino and Spark to provide column-level masking and row-level filtering at query time.
The masking logic is applied dynamically — meaning the same table serves different views to different roles without duplicating data:
- Analysts: PAN displayed as XXXXX1234X. Account numbers masked to last 4 digits. Mobile numbers masked to XXXXXX7890.
- Data engineers: Actual values visible for debugging purposes, with all access logged and flagged for periodic review
- Compliance and risk teams: Full access with enhanced audit logging — every row accessed is recorded
- External system integrations: Row-level filtering ensures downstream systems receive only the data they are authorised to process
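The role-differentiated views above can be illustrated with a small Python sketch. It is not Ranger configuration (Ranger policies are defined in Ranger itself and applied by the query engine); it only mimics the masking outcome so the rules are concrete. The role names, column names, and helper functions are assumptions made for the example.

```python
def mask_pan(pan: str) -> str:
    """Mask the letters of a 10-character PAN and keep its four digits."""
    return "XXXXX" + pan[5:9] + "X"


def mask_keep_last4(value: str) -> str:
    """Mask every character except the last four."""
    return "X" * max(len(value) - 4, 0) + value[-4:]


# Hypothetical column-to-masker mapping for PII-classified columns.
MASKERS = {
    "pan": mask_pan,
    "account_number": mask_keep_last4,
    "mobile": mask_keep_last4,
}

# Roles that see raw values in this sketch; their access is still audited.
UNMASKED_ROLES = {"data_engineer", "compliance"}


def apply_masking(row: dict, role: str) -> dict:
    """Return the view of a row that the given role is allowed to see."""
    if role in UNMASKED_ROLES:
        return dict(row)
    return {
        col: MASKERS[col](val) if col in MASKERS else val
        for col, val in row.items()
    }
```

For an analyst, a row with PAN `ABCDE1234F` and mobile `9876547890` comes back with `XXXXX1234X` and `XXXXXX7890`, while non-PII columns pass through unchanged. The same underlying row serves every role, which is the point of dynamic masking: there is no second, pre-masked copy of the table to keep in sync.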
Layer 3: Audit Trails and Lineage
The audit layer is built on top of Delta Lake's transaction log capabilities, extended with a custom audit schema. Every query that accesses a PII-classified column generates an audit record: user identity, timestamp, query text (parameterised, not raw), data classification categories accessed, and approximate row count touched.
This audit log is stored in an append-only Delta table that even administrators cannot modify — enforced through the storage-level write-once configuration. The result is an audit trail that can be presented to regulators with confidence in its integrity.
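As a rough sketch of what one such audit record might look like, the Python below builds a record with the fields listed above and strips literal values from the query text before it is stored. The field names, the `parameterise` helper, and its regex-based approach are assumptions for illustration; a production system would more likely rely on the query engine's parser than on regular expressions.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone

# Literal values are replaced so the audit log itself never stores PII
# that appeared in a WHERE clause.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
_NUMBER_LITERAL = re.compile(r"\b\d+(?:\.\d+)?\b")


def parameterise(sql: str) -> str:
    """Replace string and numeric literals in a SQL statement with '?'."""
    sql = _STRING_LITERAL.sub("?", sql)
    return _NUMBER_LITERAL.sub("?", sql)


@dataclass(frozen=True)
class AuditRecord:
    """One row of the append-only audit table (hypothetical schema)."""
    user: str
    accessed_at: str          # ISO 8601 timestamp in UTC
    query_text: str           # parameterised, never raw
    classifications: tuple    # PII categories touched by the query
    approx_rows: int


def build_audit_record(user, sql, classifications, approx_rows):
    """Assemble an immutable audit record for a PII-touching query."""
    return AuditRecord(
        user=user,
        accessed_at=datetime.now(timezone.utc).isoformat(),
        query_text=parameterise(sql),
        classifications=tuple(sorted(classifications)),
        approx_rows=approx_rows,
    )
```

With this helper, `SELECT pan FROM clients WHERE client_id = 42 AND city = 'Pune'` is stored as `SELECT pan FROM clients WHERE client_id = ? AND city = ?`. The frozen dataclass mirrors the append-only intent at the application level; the actual immutability guarantee still comes from the storage configuration described above.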
The Custom Governance Frontend
The open-source components above — Atlas, Ranger, Delta Lake — are powerful but not designed for use by compliance officers, data stewards, or business stakeholders who are not technically proficient. This is where commercial tools like Collibra genuinely earn their price tag: they provide a polished, accessible interface.
Our solution was a custom web frontend built specifically for the BFSI governance use case. The design principle was simple: every function that a compliance officer, data steward, or auditor needs must be accessible without any knowledge of the underlying technical stack.
The frontend provides:
- Data catalog: Browse and search all tables and columns with their classification labels. Click through to see who has access, when it was last accessed, and the lineage of how it arrived in the lakehouse.
- Access management: Request and approve elevated access to sensitive data categories through a workflow. Access grants are time-limited and require business justification.
- PII heatmap: A visual dashboard showing the concentration of PII across the data estate — which domains have the highest PII density, which teams access it most frequently, where the highest compliance risk sits.
- Audit reports: One-click generation of audit reports for any time period — formatted for regulatory submission or internal compliance review.
- Retention calendar: Visual timeline of data categories against their retention schedules, with automated alerts when data is approaching or past its retention window.
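The rule behind the retention calendar's alerts is simple enough to sketch. The Python below is an illustration only: the function name, the status labels, and the 90-day warning window are assumptions, and real retention periods would come from the firm's regulatory retention schedule rather than from code.

```python
from datetime import date, timedelta


def retention_status(
    created_on: date,
    retention_days: int,
    today: date,
    warn_days: int = 90,
) -> str:
    """Classify a dataset against its retention window.

    Returns "past" once the retention period has elapsed, "approaching"
    inside the warning window before expiry, and "within" otherwise.
    warn_days is a hypothetical alert lead time.
    """
    expires_on = created_on + timedelta(days=retention_days)
    if today > expires_on:
        return "past"
    if today >= expires_on - timedelta(days=warn_days):
        return "approaching"
    return "within"
```

Running this check on a schedule across every classified data category is what turns the calendar from a static picture into an alerting mechanism. Passing `today` explicitly, rather than reading the clock inside the function, keeps the rule easy to test and to replay for a past audit date.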
The custom frontend took approximately 3 months to build and represents a one-time capital investment. At current commercial governance platform pricing, that investment pays back in under 18 months in licensing costs alone — before accounting for the superior fit to BFSI-specific requirements.
Integrating AI Access into the Governance Model
The governance architecture faces a new challenge as AI tools enter the data environment. When an analyst uses an AI assistant to query data, the access pattern is different from a direct SQL query — but the governance requirements are identical.
We extended the governance model to cover AI-mediated data access by routing all AI tool requests through the same Ranger policy enforcement layer that governs direct queries. An AI assistant accessing data on behalf of a user inherits exactly the access rights of that user — the masking rules apply, the audit log records the access, and the row-level filters enforce data scope.
This is a non-trivial architecture requirement that most AI implementations overlook. The result, if you do not address it deliberately, is an AI layer that bypasses your governance controls — which is precisely the scenario that creates regulatory exposure.
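The design idea can be sketched as follows: the AI tool holds no data credentials of its own and must pass the requesting user through to the same enforcement point that direct queries use. The Python below is a toy stand-in, not the Ranger integration itself; `GovernedGateway`, `AIAssistantTool`, the role names, and the single masking rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    role: str


@dataclass
class GovernedGateway:
    """Stand-in for the policy enforcement point shared by all channels."""
    audit_log: list = field(default_factory=list)

    def fetch(self, user: User, row: dict, channel: str) -> dict:
        # The rule depends only on the user, never on the channel.
        # In this sketch, only the 'compliance' role sees a raw PAN.
        result = dict(row)
        if user.role != "compliance" and "pan" in result:
            result["pan"] = "XXXXX" + result["pan"][5:9] + "X"
        # Every access is logged against the human user, with the channel noted.
        self.audit_log.append(
            {"user": user.name, "channel": channel, "columns": sorted(row)}
        )
        return result


class AIAssistantTool:
    """Hypothetical AI tool wrapper that never queries data directly."""

    def __init__(self, gateway: GovernedGateway):
        self._gateway = gateway

    def lookup(self, on_behalf_of: User, row: dict) -> dict:
        # The assistant inherits the caller's rights by delegating to the gateway.
        return self._gateway.fetch(on_behalf_of, row, channel="ai_assistant")
```

An analyst who asks the assistant for a client record receives the same masked PAN they would get from a direct SQL query, and the audit log attributes the access to the analyst rather than to a shared service account. The failure mode to avoid is the opposite arrangement, where the AI layer connects with its own privileged credentials and the user's identity is lost before the policy check.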
Implementation Path and Realistic Timelines
A realistic implementation of this governance architecture for a mid-size BFSI firm looks like this:
1. Months 1–2: Governance inventory. Catalogue all data assets, identify PII-containing tables and columns, develop the classification taxonomy. This is primarily a business and compliance exercise, not a technical one. It is also where most firms realise they have far more PII scattered far more widely than they thought.
2. Months 3–4: Core infrastructure. Deploy Atlas, Ranger, and integrate with existing query engines. Implement basic masking policies for highest-risk PII categories. Establish audit logging.
3. Months 5–6: Custom frontend. Build the governance UI for data stewards and compliance teams. Deploy the access request and approval workflow. Initial training for non-technical governance stakeholders.
4. Months 7–9: Full classification coverage. Extend automated classification to the full data estate. Implement retention enforcement. Integrate AI tool access governance.
5. Ongoing: Governance operations. Monthly review of access grants, quarterly audit report generation, annual classification review.
“Governance is not a project with a completion date. It is an operational capability. The difference between firms with effective governance and those without is not the sophistication of the tool — it is whether governance is treated as a sustained operational discipline.”
The highest-value first step for most BFSI firms is a PII inventory — simply knowing where all personal data exists across the data estate. That exercise alone, even without any technical governance infrastructure, creates the clarity needed to prioritise where governance investment will have the greatest impact.