Team Structure for Successful BI / Data Warehouse / Assisted Intelligence and Machine Learning

Geoff Leigh · Published in Analytics Vidhya · Apr 29, 2020


What would work for an enterprise that understands the continuum of data in the business and the value of machine learning.

Image: ID 51836680 © Mopic | Dreamstime.com

The first issue is the data that needs to be available so that the business can use it. There are various ways of collecting it from the transactional systems that hold structured data, and there are many other sources: streaming interaction logs, call center content, website interactions, unstructured data such as market sentiment, and data that may have an impact on the business, such as overall economic factors, commodity and index prices, and business and general news.

A balanced team for an AI project will include three key people, says AI expert Monte Zweben, CEO and co-founder at Splice Machine.

First, a data engineer who can take the information a company collects and turn it into data that is ingestible by AI and ML systems, or a team that has prepared an organized and available enterprise common data platform. This team may also include resources for data preparation, database design and support, and reporting and visualization, with data movement specialties. There may also be a business need to incorporate content management, such as document optical character recognition, and to automate data movement steps beyond interfaces, APIs, or data pipelines such as Extraction, Transformation, and Loading (ETL).
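
As a minimal sketch (assuming a pandas/SQLAlchemy stack; the connection strings, table, and column names are all hypothetical), a single ETL step of the kind this team owns might look like this:

```python
# A minimal, hypothetical ETL step: extract from a transactional source,
# reshape into an ingestible form, and load into the common data platform.
import pandas as pd
from sqlalchemy import create_engine

source = create_engine("postgresql://user:pass@oltp-host/sales")    # hypothetical source
target = create_engine("postgresql://user:pass@dwh-host/platform")  # hypothetical target

# Extract: pull yesterday's invoices from the transactional system.
df = pd.read_sql("SELECT * FROM invoices WHERE invoice_date = CURRENT_DATE - 1", source)

# Transform: standardize types and derive the fields the ML side will consume.
df["invoice_date"] = pd.to_datetime(df["invoice_date"])
df["net_amount"] = df["gross_amount"] - df["tax_amount"]

# Load: append to a staging table in the enterprise common data platform.
df.to_sql("stg_invoices", target, if_exists="append", index=False)
```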

Second, a data scientist with domain expertise who knows that, say, weather can affect delivery schedules, or that particular mechanical issues can affect maintenance schedules. The data scientist will also need to be able to test out different algorithms to see which ones perform best, and then adapt them if needed to get worthwhile predictions.
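
To illustrate the "test out different algorithms" point, a minimal scikit-learn sketch (using a bundled dataset as a stand-in for real business data) compares candidates by cross-validation:

```python
# Compare candidate algorithms with cross-validation to see which performs best.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the business dataset

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```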

Finally, a software developer is required who can incorporate all this into actual applications.

“These are the kinds of skill sets that we are looking for,” Zweben says.

For many organizations, success with AI is more a factor of balance in these three key areas than it is the number of PhDs that have been hired.

Additional resources to 'round out' the team would not necessarily be full time, and would include a Business Analyst; a DevOps specialist with organizational technical infrastructure expertise; and Quality Control, Security Assurance, and Internal Process Audit Control.

The technical DevOps resource ensures alignment with, and use of, version control and continuous deployment into development, test, QA, and production environments. This role may not be necessary if the organization has a well-defined infrastructure architecture and adequate training in its use, so that the team already has representatives with an adequate level of that skillset.

For the Testing and Quality Assurance role, part of the Design Thinking approach includes Test Driven Development (TDD), to ensure that at each release the objectives set by the business and agreed with the team are met. Again, if there are common architectural assets that are fully understood, this need not be a separate person.
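
As a minimal sketch of TDD here, a business rule agreed at inception is written as a failing test before the implementation exists (the `billing` module and `net_amount` function are hypothetical):

```python
# test_invoice_rules.py: written first, per TDD, so each release can be
# checked against the objectives agreed with the business.
import pytest

from billing import net_amount  # hypothetical module under test

def test_net_amount_is_gross_minus_tax():
    assert net_amount(gross=100.0, tax=20.0) == 80.0

def test_net_amount_rejects_negative_result():
    with pytest.raises(ValueError):
        net_amount(gross=10.0, tax=20.0)
```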

A Squad or Iteration Lead provides project management oversight and attends to various 'administrivia', such as booking meeting schedules, ensuring that process and project documentation is completed by the appropriate team member in the Agile management platform (Atlassian Jira, Confluence, etc.), and ensuring that any necessary time recording against capital or overhead activities is completed.

A line management structure should exist, perhaps more matrix-enabled than hierarchical, aligning resources to senior advisers, role mentors, and managers who can agree personal goals with each team member and who have the expertise in the domains and skills being managed, to encourage staff development and meaningful performance reviews.

Enterprise security, security-driven programmatic development techniques, certifying the security of a system before it becomes a production solution, and meeting any internal auditing controls and regulatory requirements can all be introduced as part of the iteration and agile process, with specialists involved at inception and showcase events, and with activities such as penetration testing prior to the production release of any externally facing applications.

The Business Analyst role does not require a unique skillset in the team unless the three key roles lack business domain knowledge. The analyst may double up as the Iteration Lead or Squad Lead.

Then the five perennial problems of business and IT must be addressed, especially as they relate to realizing that data is one of an enterprise's chief assets.

Problem #1: The one constant is change

The business world is always changing. An internal organization is successful when it constantly strives for innovation and efficiency, so internal structures will change; the external environment changes too, as supplies change, consumer choice changes, local and global economic factors change, seasons change, and unexpected impacts such as a global pandemic change the world. So the creation of a data warehouse or central data platform is never completed: new opportunities need to be reflected, the focus of insights shifts, and the need to update models with new data and the latest trends makes analysis concepts such as rolling averages less sufficient.
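
To make the rolling-average point concrete, here is a small synthetic sketch: a fixed 28-day window keeps weighting pre-change history, so it lags badly when conditions shift suddenly.

```python
# A 28-day rolling average after a sudden regime change (synthetic data):
# two weeks after demand halves, the average still sits far above reality.
import pandas as pd

dates = pd.date_range("2020-01-01", periods=120, freq="D")
demand = pd.Series([100.0] * 60 + [40.0] * 60, index=dates)  # demand halves on day 61

rolling = demand.rolling(window=28).mean()
print(rolling.loc["2020-03-15"])  # ~67.9, not 40.0
```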

Problem #2: Really Big Data

The amount of information collected by an organization grows over time, and can grow exponentially as other data sources are shown to be relevant and therefore need to be tracked. As a result, some technologies and approaches to collecting and keeping the data are no longer viable, or carry such an overhead in support and expertise that a number of 'cloud' technologies offer additional benefits and reduce the total cost of ownership compared to keeping these physical data assets in-house.

Problem #3: Complexity!

There is no longer just the sales transaction as the total basis of business information, and there are fewer purely hierarchical relationships that dictate what data must look like. The supply chain is becoming more integrated, and customer interactions, whether in consumer or purely business relationships, have many facets. It is better understood that few actions occur in isolation, so the motivation for why a product or service is purchased from vendor A by customer B is not simply that a sales relationship has been established and vendor A has understood all of customer B's needs. How all this data can be related to form a consistent and accurate picture, or be used as a trial set for analysis and modeling such as classification, PCA, or trend mapping, is not simple either. Tools such as Fivetran have made moving data somewhat easier, but there are still challenges in labelling, cataloging, profiling, and cleaning all required data, and these should always be the main focus of the data domain practitioners.
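
A minimal sketch of the profiling and cleaning work that stays central regardless of tooling (the file and column names are hypothetical):

```python
# Profile an incoming dataset before it is labelled and catalogued:
# shape, types, null rates, and duplicates, then simple cleaning steps.
import pandas as pd

df = pd.read_csv("customer_interactions.csv")  # hypothetical extract

print(df.shape)
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False))  # null rate per column
print(df.duplicated().sum())                          # exact duplicate rows

df = df.drop_duplicates()
df["region"] = df["region"].str.strip().str.title()   # normalize a messy category
df = df.dropna(subset=["customer_id"])                # rows are unusable without a key
```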

Problem #4: The Business Domain — Data needs to fit business needs

Technical requirements and software features matter little to a business user. The time previously spent on BD-50-style business design requirements, which only went to the level of detail that a junior resource such as a Business Analyst could determine and document, is better spent adopting the Behavior Driven Development (BDD) approach of design thinking as an accelerated way of mapping the business journey, key value proposition, and value chain, to determine the importance and potential returns to the business if the solution is done right.
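
As a hypothetical sketch of the BDD style, the scenario is captured in the business's own language and then mirrored as an executable check (the `credit` module and `check_exposure` function are invented for illustration):

```python
# Scenario, as agreed with the business:
#   Given a customer with a credit limit of 1,000
#   When an invoice of 1,200 is raised
#   Then the order is flagged for credit review
from credit import check_exposure  # hypothetical module under test

def test_invoice_over_credit_limit_is_flagged():
    # Given
    customer = {"credit_limit": 1000}
    # When
    result = check_exposure(customer, invoice_amount=1200)
    # Then
    assert result.flagged_for_review is True
```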

The underlying concepts are still the mapping of the current process and the future process, but more holistically from the user's perspective, rather than internally and rigorously following a cult-like convention such as Business Process Model and Notation (BPMN), which may or may not aid in computer-assisted software engineering. That approach risks over-engineering the data and technical process relative to the business need.

Problem #5: Flexibility, or the lack of it - data sources need constant attention

Data sources will change over time: new ones will be added, and some will no longer be available, so they need constant attention. Some data sources are outside the organization's direct control, so the solution must adapt to times when the data is unavailable or access is disabled for any of many reasons, with 'self-healing' approaches to minimize disruption, such as retrying an API connection a few times, or requesting a data source more than once if the request is rejected within certain parameters. One of the first Robotic Process Automation solutions I worked on had to deal with a market-indices data source being disconnected whenever another user wanted the organization's single credential; the automated script running against an open Excel workbook errored because the connection to the source had been reset, which would otherwise require manual intervention.
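
A minimal sketch of the "retry a few times" self-healing approach (the endpoint, attempt count, and backoff are illustrative):

```python
# Retry an API connection a few times, with an increasing pause between
# attempts, before surfacing the failure for manual intervention.
import time
import requests

def fetch_with_retry(url: str, attempts: int = 3, pause_seconds: float = 2.0) -> dict:
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == attempts:
                raise  # out of retries: escalate instead of failing silently
            time.sleep(pause_seconds * attempt)  # wait longer on each retry

data = fetch_with_retry("https://example.com/market-indices")  # hypothetical endpoint
```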

So there are four main approaches in business intelligence to keeping the data, in terms of how to structure it:

· Fully Normalized Database (Original approach published by Bill Inmon.)

· Denormalized Star Schema with optional snowflake and the OLAP cube (Updated approach from Ralph Kimball and others.)

· NoSQL key-value and columnar data stores from the Hadoop Big Data movement

· Hybrid 'Hub and Link' Data Vault introduced by Dan Linstedt

Frankly, reliance upon Inmon's relational 3NF and Kimball's Star Schema strategies simply no longer applies. The complexities of MapReduce and Hadoop have largely been removed from the major 'Big Data' repositories for data lakes, such as Redshift from Amazon Web Services or Snowflake. In Machine Learning, the focus should be on Feature Engineering, not necessarily on views that are 'stars' with a subset of data organized to support the topic being modelled.
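
To illustrate the feature-engineering focus, a minimal sketch (with hypothetical invoice-level columns) derives features for a specific modelling topic directly from the raw data rather than from a pre-built star view:

```python
# Derive per-customer features for a churn-style model straight from
# invoice-level data, instead of reading from a star schema view.
import pandas as pd

invoices = pd.read_parquet("invoices.parquet")  # hypothetical lake extract

features = (
    invoices
    .assign(invoice_month=invoices["invoice_date"].dt.to_period("M"))
    .groupby("customer_id")
    .agg(
        total_spend=("net_amount", "sum"),
        avg_invoice=("net_amount", "mean"),
        months_active=("invoice_month", "nunique"),
        last_invoice=("invoice_date", "max"),
    )
)
features["recency_days"] = (pd.Timestamp("2020-04-29") - features["last_invoice"]).dt.days
```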

The fully normalized database is good for storage optimization, and therefore somewhat faster if designed with single storage nodes and replication backups. It gets a little messy to handle distributed nodes and high availability, but RDBMSs such as Oracle and DB2 have sophisticated components that make this much easier, although they still require dedicated specialist resources to manage if they remain 'in-house'. Where it starts to fail is the constant need for data table joins or refreshes of materialized views to reflect how the business really sees the data, which is somewhat dimensional. A businessperson will often say, "I need to see this sales figure by week, by region, by product group", and that starts to look very much like a Star Schema or OLAP cube.
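
That "by week, by region, by product group" request maps directly onto a pivoted view, which is the shape a Star Schema or cube serves. A quick sketch (columns hypothetical):

```python
# The businessperson's request, expressed directly over the sales data.
import pandas as pd

sales = pd.read_parquet("sales.parquet")  # hypothetical extract
sales["week"] = sales["sale_date"].dt.to_period("W")

by_week_region_group = sales.pivot_table(
    index="week",
    columns=["region", "product_group"],
    values="sales_amount",
    aggfunc="sum",
)
```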

How the Data Vault can be the best compromise: what does a Data Vault look like?

A Data Vault approach uses the concept of a Hub component, holding the main details around a topic, and a Link that shows how a number of Hubs are related. The Satellite holds the additional attributes of the Hub topic that may change over time, such as frequently changing dimension values, so that historical views can be shown without impacting the structure of a dimension or requiring additional coding to resolve. Views similar to an OLAP cube can be easily assembled.

HUB (blue): contains a list of unique business keys, each having its own surrogate key. Metadata describing the origin of the business key (the record 'source') is also stored, to track where and when the data originated.

LNK (red): establishes relationships (Links) between business keys (typically hubs, but links can link to other links), essentially describing a many-to-many relationship. Links are often used to deal with changes in data granularity, reducing the impact of adding a new business key to a linked Hub.

SAT (yellow): holds descriptive attributes (Satellites) that can change over time (similar to a Kimball Type II slowly changing dimension). Where Hubs and Links form the structure of the data model, Satellites contain temporal and descriptive attributes, including metadata linking them to their parent Hub or Link tables. Metadata attributes within a Satellite table, containing a date the record became valid and a date it expired, provide powerful historical capabilities, enabling queries that can go 'back in time'.
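
As a minimal sketch of these three structures as physical tables (SQLAlchemy; the customer/invoice names are hypothetical):

```python
# Hub, Link, and Satellite tables for a simple customer/invoice relationship.
from sqlalchemy import (
    Column, DateTime, ForeignKey, Integer, MetaData, Numeric, String, Table,
)

metadata = MetaData()

hub_customer = Table(
    "hub_customer", metadata,
    Column("customer_hk", Integer, primary_key=True),            # surrogate key
    Column("customer_bk", String, nullable=False, unique=True),  # business key
    Column("load_dts", DateTime, nullable=False),
    Column("record_source", String, nullable=False),             # origin metadata
)

hub_invoice = Table(
    "hub_invoice", metadata,
    Column("invoice_hk", Integer, primary_key=True),
    Column("invoice_bk", String, nullable=False, unique=True),
    Column("load_dts", DateTime, nullable=False),
    Column("record_source", String, nullable=False),
)

# Link: a many-to-many relationship between the two hubs.
lnk_customer_invoice = Table(
    "lnk_customer_invoice", metadata,
    Column("link_hk", Integer, primary_key=True),
    Column("customer_hk", ForeignKey("hub_customer.customer_hk"), nullable=False),
    Column("invoice_hk", ForeignKey("hub_invoice.invoice_hk"), nullable=False),
    Column("load_dts", DateTime, nullable=False),
    Column("record_source", String, nullable=False),
)

# Satellite: descriptive attributes over time, with validity dates that
# enable the 'back in time' queries described above.
sat_invoice_detail = Table(
    "sat_invoice_detail", metadata,
    Column("invoice_hk", ForeignKey("hub_invoice.invoice_hk"), primary_key=True),
    Column("load_dts", DateTime, primary_key=True),  # one row per change
    Column("load_end_dts", DateTime),                # null means 'current'
    Column("status", String),
    Column("amount", Numeric(12, 2)),
)
```

A point-in-time query then filters on load_dts/load_end_dts rather than restructuring dimensions or adding code.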

There are several key advantages to the Data Vault approach:

- Simplifies the data ingestion process

- Removes the cleansing requirement of a Star Schema

- Instantly provides auditability for HIPAA, data privacy and protection, and other regulations

- Puts the focus on the real problem instead of programming around it

- Easily allows for the addition of new data sources without disruption to existing schema

Simply put, the Data Vault is both a data modeling technique and methodology which accommodates historical data, auditing, and tracking of data.

- It adapts to a changing business environment

- It supports very large data sets

- It simplifies the EDW/BI design complexities

- It increases usability by business users because it is modeled after the business domain

- It allows for new data sources to be added without impacting the existing design

So, to take an example, a Design Thinking session would map out the key business components and relationships to be considered in the topic of 'Invoice', where the business needs are to accurately record sales and to manage cash flow and customer credit exposure, the main concerns the business needs resolved.

With a BDD relationship and value chain proposition agreed with the business SMEs and stakeholders around the topic represented below, a structure to represent the data in a common data platform can be determined.

An example of a Data Vault logical model that applies to the Invoice domain is shown here:

Figure from Dan Linstedt's blog article: Initial Logical Data Model for the Invoice Domain

A completely different set of Data Vault objects may be determined for a production, manufacturing, or warehousing concern, if that component of the organization exists and is of concern, reusing some of the common items classified as Hubs, such as the Product Hub, but extending to locations, distribution patterns, shelf life, product costs, and assets.

Another set of Data Vault objects may be assembled to address the issues and concerns around pricing and marketing effects, as well as customer analysis and geographic trends for identifying new markets or consolidating existing positions.

This aligns perfectly with the Agile framework, in that short, definable chunks of work can be planned and shown to be effective and add value, in contrast to attempting to make all data align and resolve to a solution set in a more traditional Star Schema approach. A topic should easily be completed in a typical two-week sprint with the right roles and team abilities, including the Data Engineer and a specialist to ensure that the data is right and available.


Geoff Leigh · Analytics Vidhya
Making data into actionable information and insight. Over 30 years of data and systems engineering, development, consulting, and implementation.