Why Data Governance Should Include Analytical Models

Posted in: Technical Track

Our previous discussion focused on the data retention aspects of our data governance programs. Policies must be defined early, shared through data literacy programs, and enforced by technical controls that automate the retention, protection, and purging of data. These policies are backed by processes that automate the storage and destruction of data, creating a strong audit trail that compliance teams can use to refine methods and tooling.


Our conversation to this point has focused entirely on the data within our environments, or the traditional scope of data governance programs. Many organizations are moving rapidly to deploy analytical models to enhance human decision making. This shift pushes us to define our programs as data and analytics governance, identifying and capturing the need to govern our analytical models, training sets, and outputs. This model governance ensures repeatability and the elimination of bias from decision making while protecting organizational intellectual property. In this post, we’ll continue to reference data governance, but shift our definition to include both data and the analytical models that power our organizations.

Today’s analytical models can span a multitude of technologies, including R, Python, Go, SQL extensions, and SPSS. These models can be deployed via custom ML Ops frameworks or vendor-supplied stacks including Google Vertex AI, AWS SageMaker, or Azure Machine Learning. While this technology will facilitate the implementation of governance policies, the organization remains responsible for setting the conditions and boundaries of model deployment, testing, usage, and retention.

Model governance is still new, as many organizations are early in their journey of deploying models for production use. One industry leading the definition of standards is financial services. The use of analytical models for planning activities like investments, cash reserves, stress testing, and modeling consumer behavior has stood out as an early area where a high level of confidence is required. This process started as early as 2011, when the Federal Reserve issued SR Letter 11-7, establishing standards for models used in various aspects of our financial systems.

Building a strong framework of policies, backed by systems automation for enforcement and auditing, ensures consistency in model governance. The key elements of this framework include:

  • Reproducibility: The ability to recreate specific results under specific conditions. While not always required, policies should define where reproducibility is required.
  • Traceability: The complexity of modern business processes demands capabilities to determine how and where key decisions were made or influenced. Organizational standards on model traceability will enable determinations of how processes acted with varying inputs.
  • Accuracy: Accuracy is the measure of the model’s results against a standard set of organizational benchmarks. Model accuracy must be tracked over time to ensure outputs can be relied upon for decision making and interventions made when performance drops below defined thresholds.
  • Performance: Model responsiveness is a key measure of user experience. Monitoring for performance over time, benchmarking minimum levels, and alerting for outlier behavior ensures high levels of user satisfaction.
  • Testing frameworks: Many models will be deployed in complex environments with multiple versions running at a single time. When this type of deployment is accepted policy, frameworks should ensure that telemetry is captured to clearly associate behaviors and outcomes with model versions. This data enables data science teams to make quick decisions about model versions for future use, refinement, or retirement.
  • Bias: In areas where models could reinforce biases present in the data, including human resources, credit scoring, ad targeting, and service delivery, standards for bias testing should be defined and tests run regularly.
  • Retention and Revision Control: As with any software asset, the ability to reproduce past code bases or models should be part of automated tools and processes. Retention policies should be associated with source code and software build assets to identify how long specific pieces need to be retained.
  • Dependency Mapping: Many analytical models will work collectively to produce actionable results. We must build a holistic view of our model supply chain to ensure we can adjust to changes in vendor supplied capabilities and manage complexities of traceability across assets built by different organizations or vendors.
  • Data Set Lineage: The lineage of our training data sets must be tracked and associated with model versions through automated revision control. This association ensures models can be evaluated if later problems are identified with performance or bias. This association is key to meeting growing regulatory requirements for model reproducibility.
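As a minimal sketch of how several of these elements (revision control, data set lineage, and accuracy benchmarking) might be automated, the snippet below records a model version alongside a fingerprint of its training set and a benchmark score at registration time. All names here (`ModelRecord`, `register_model`, the `credit_risk` model) are hypothetical illustrations, not the API of any particular ML Ops platform.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ModelRecord:
    """Governance metadata captured when a model version is registered."""
    model_name: str
    model_version: str
    training_data_hash: str    # associates this version with its training set
    benchmark_accuracy: float  # measured against organizational benchmarks
    registered_at: str

def hash_training_data(rows: list[str]) -> str:
    """Deterministic fingerprint of the training set, used for lineage."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
    return digest.hexdigest()

def register_model(name: str, version: str,
                   training_rows: list[str], accuracy: float) -> ModelRecord:
    """Capture the lineage and benchmark data required by policy."""
    return ModelRecord(
        model_name=name,
        model_version=version,
        training_data_hash=hash_training_data(training_rows),
        benchmark_accuracy=accuracy,
        registered_at=datetime.now(timezone.utc).isoformat(),
    )

# Hypothetical registration of a new model version with sample rows.
record = register_model("credit_risk", "1.4.0", ["row1,0.2", "row2,0.7"], 0.91)
print(json.dumps(asdict(record), indent=2))
```

In a real platform, a record like this would land in a model registry so that a later bias or performance finding can be traced back to the exact training set that produced the affected version.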

Each of these elements of model governance requires specific technical features for implementation and automation. Greater use of automation will enhance reproducibility and minimize the human error in model creation and deployment that could negatively affect an organization’s risk posture. Analytical models don’t stand on their own in today’s complex data landscapes; they have unique needs that must be captured in policy, automated through our ML Ops platforms, and regularly reviewed and updated to adjust for changing technology capabilities and market conditions.
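One concrete form this automation can take is an accuracy check that alerts when a model drops below a defined threshold, as described under the accuracy element above. The sketch below uses a rolling average so a single noisy evaluation doesn’t trigger an intervention; the 0.85 floor and three-evaluation window are assumed values standing in for whatever an organization’s policy actually specifies.

```python
from statistics import mean

ACCURACY_FLOOR = 0.85  # hypothetical policy threshold
WINDOW = 3             # evaluations averaged before alerting

def accuracy_breached(history: list[float], floor: float = ACCURACY_FLOOR,
                      window: int = WINDOW) -> bool:
    """Return True once the rolling average accuracy falls below the floor."""
    if len(history) < window:
        return False  # not enough evaluations to judge a trend
    return mean(history[-window:]) < floor

# Simulated benchmark scores collected after each scheduled evaluation.
scores = [0.92, 0.90, 0.88, 0.84, 0.82, 0.80]
for i in range(1, len(scores) + 1):
    if accuracy_breached(scores[:i]):
        print(f"ALERT: rolling accuracy below {ACCURACY_FLOOR} "
              f"after evaluation {i}")
        break
```

In practice a check like this would run inside the ML Ops platform’s monitoring pipeline and feed the same alerting systems used for operational telemetry, so interventions happen before degraded outputs reach decision makers.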

The next post will explore the risks and rewards of data governance in legacy environments. Many organizations continue to rely on mainframe or UNIX technology from the 1970s or 1980s, creating added risk to data governance programs. Risks can manifest through skills attrition, lack of integration, or gaps in the automation of policy enforcement. We’ll explore these risks, mitigation techniques, and methods to modernize these platforms over time, with data governance as the accelerating driver.

Make sure to sign up for updates so you don’t miss the next post.

Want to talk with an expert? Schedule a call with our team to get the conversation started.

About the Author

Joey Jablonski is VP of Analytics at Pythian, where he leads strategic engagements that assist customers in developing their data strategy, defining and executing data governance programs, and building analytical models to power the modern data-driven organization. Prior to Pythian, Joey was VP of Product at Manifold, where he brought a product mindset to all engagements, allowing for rapid delivery of value in any project and building over time to drive adoption of new data-centric capabilities in an organization. Joey has led engagements across industries including high tech, pharmaceuticals, and the federal government. Before Manifold, Joey held executive leadership positions at Northwestern Mutual, iHeartMedia, and Cloud Technology Partners. He brings 20+ years of experience in software engineering, high-performance computing, cybersecurity, data governance, and data engineering.
