Data Science Data Architecture

From Olav Laudy Data Science

Introduction

This article describes the data architecture that allows data scientists to do what they do best: “drive the widespread use of data in decision-making”. It is intended for various audiences: for IT admins to better understand the needs of data scientists, for data scientists to better articulate their needs, and in general for companies that are looking to set up a data science work stream.

Data scientists are a rare breed. Apart from data science, they need to understand the business and have IT hacking skills (i.e. the ability to get things working in an IT landscape; not to be confused with penetration/exploit-type hacking). A data scientist understands more business than an IT person and more IT than a business person. The flip side: a data scientist understands less IT than an IT person and less business than a business person. With this set of skills comes the request for a specific workflow and data architecture.

IT versus Data Science terminology

IT landscapes can be as extensive as DTAP (Development, Testing, Acceptance and Production environments), but more often IT architectures follow a subset of those. From a data science perspective, there is a model development environment and a model production environment (i.e. a model scoring environment). In both worlds, production environment means the same thing: a stable, auditable environment that interfaces with the business under known conditions (workload, response time, escalation routes, etc.). Model development environment, however, means something different to IT than to data scientists. Table 1 spells out the criteria for the different environments and shows that the data science model development environment is neither an IT development environment nor an IT production environment. Note that not all companies have requirements as strict as those outlined below, but they are a good starting point for an inventory.


Ds1.png

A model development environment needs to have production-grade availability in multiple aspects:

  • The daily business of the data scientists takes place on this platform; when it is unavailable, all model development stops.
  • The model development environment will, over time, contain a great deal of (analytical) assets. In that sense, its lifetime cannot be restricted, nor can it easily be reinstalled and started from scratch.
  • A model development environment may have its own backup or testing environment to test the application of bug fixes and patches.
  • Number crunching requires a lot of computational power and storage, and the environment needs to be sized specifically for the expected data and model requirements.
  • The model development environment needs formal backup and escalation routes in case of disruptions.
  • The model development environment comes with production-level requirements regarding data availability. It is unfortunate that this needs to be pointed out: a data scientist does not build models on test data. It is amazing how often I’m asked to build a model on 2000 rows of artificially created data with the same column names as the real data. Such a strategy works when one writes an API that returns a specific data request; in data science, however, one learns from the data, and artificially created data does not contain any interesting structure.

A model development environment needs to have development status in the following aspects:

  • A data scientist needs to work against a database with the ability to create, fill and drop tables. A data scientist is able to create queries that hang the system. That is part of experimentation and may happen once in a while. It will become a lesson learned.
  • A data scientist is not a DBA. Creating tables happens on the fly, with the fullest disregard to proper database management such as naming conventions, indexing, partitioning and database normalization. Restricting a data scientist to work along those lines will kill productivity.
  • A DBA companion may help out by doing the proper thing to the database, such as writing clean-up scripts, indexing, etc. In addition, the data scientist may request a DBA to set up database schemas, users, archiving, etc.
  • The data scientist needs to have fairly unrestricted access to a command prompt and OS-level capabilities. It will not be the first time that data is delivered in the shape of 100,000 zip files or that a job needs to be set up to scrape some data from the (intra)web. Although source data and temporary files preferably go in the database, sometimes it’s just simpler to be able to store data in a CSV file on disk.
  • Unrestricted installation of software doesn’t have to be among the requirements; however, not having to go through a three-month approval process helps productivity a lot.
  • A data scientist should not need to have access to privacy sensitive data. The data repository containing the historic data can be created under referential integrity (i.e. you can still join tables) with hashed or encrypted sensitive fields.
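The last point can be illustrated with a minimal sketch. It assumes IT replaces a sensitive key column with a keyed hash (HMAC-SHA256 here, chosen for illustration; the article only asks for "hashed or encrypted" fields). The key, table contents and column names are all hypothetical. Because the same input always yields the same token, the tables can still be joined.

```python
import hashlib
import hmac

# Hypothetical secret held by IT; the data scientist never sees it.
SECRET_KEY = b"replace-with-a-secret-managed-by-IT"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest (hex) of a sensitive value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Illustrative rows from two source tables sharing a customer number.
customers = [
    {"customer_no": "C-1001", "segment": "retail"},
    {"customer_no": "C-1002", "segment": "business"},
]
transactions = [
    {"customer_no": "C-1001", "amount": 25.0},
    {"customer_no": "C-1001", "amount": 40.0},
    {"customer_no": "C-1002", "amount": 310.0},
]

# IT applies the same transformation to the key column in every table.
safe_customers = [
    {**row, "customer_no": pseudonymize(row["customer_no"])} for row in customers
]
safe_transactions = [
    {**row, "customer_no": pseudonymize(row["customer_no"])} for row in transactions
]

# The data scientist can still join on the hashed key.
segment_by_key = {row["customer_no"]: row["segment"] for row in safe_customers}
joined = [
    {**t, "segment": segment_by_key[t["customer_no"]]} for t in safe_transactions
]
```

A keyed hash is used rather than a plain hash so that someone who knows the format of the customer numbers cannot simply hash all candidates and reverse the mapping without the key.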

The need for a separate model development and production environment

Not all analytical models are intended to make it to a production environment. The most valuable models, however, are not one-time executions but embedded, repeatable score generators that the business can act upon. Model development takes place in a relatively unstructured environment that gives the possibility to play with data and experiment with modeling approaches. Embedding an analytical model in the business means it migrates from this loosely defined environment to a location of rigor and structure. Not separating the environments leads to a series of issues:

  • An ad-hoc query for a model under development can disrupt the scoring of a production model.
  • A data scientist can manually alter scores (e.g. credit scores).
  • Privacy-sensitive data is visible to the data scientist (as production data is not censored).
  • The model development cycle is likely to be forced to align with the production scoring cycle.
  • Archiving needs are different for model generated scores and models.

Ds2.png

Figure 1 shows the difference between the cycles for model development and model scoring. In the development environment, the data scientist comes up with an idea and slowly works towards a finished model. Once it has taken the right shape, it is placed in the pre-production environment (more on this later), where it is thoroughly inspected. Upon approval, and with the proper controls in place, the model is moved to production, where it is scored at a set interval. Note that developing the model in the same environment as the scoring frequently implies that a new version of the model needs to be ready for the upcoming scoring moment, i.e. the new model needs to be developed in between the scoring moments. This rushes the process and is error-prone due to the lack of auditability and of a formal model migration process. In separate environments, as shown in Figure 1, after some time the data scientist has a new idea to improve the model. The current approved model is taken from the pre-production environment and worked on. Once ready, it is placed back into pre-production for approval, but as the figure shows, it cannot be approved due to lacking functionality. The data scientist repairs the defect, after which, upon approval, the new model can be placed in production.
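The migration cycle of Figure 1 can be sketched as a small state machine with an audit trail. This is an illustration only: the stage names follow the text, while the class, the action names and the actors are assumptions, not part of any particular tool.

```python
from enum import Enum


class Stage(Enum):
    DEVELOPMENT = "development"
    PRE_PRODUCTION = "pre-production"
    PRODUCTION = "production"


# Allowed moves: (action, current stage) -> next stage.
TRANSITIONS = {
    ("submit", Stage.DEVELOPMENT): Stage.PRE_PRODUCTION,
    ("reject", Stage.PRE_PRODUCTION): Stage.DEVELOPMENT,
    ("approve", Stage.PRE_PRODUCTION): Stage.PRODUCTION,
}


class ModelVersion:
    """One version of a model, with a log of every stage change."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.stage = Stage.DEVELOPMENT
        self.audit_log = []  # (actor, action, from_stage, to_stage, note)

    def move(self, action, actor, note=""):
        key = (action, self.stage)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action} a model in {self.stage.value}")
        previous = self.stage
        self.stage = TRANSITIONS[key]
        self.audit_log.append(
            (actor, action, previous.value, self.stage.value, note)
        )


# The second cycle of Figure 1: submitted, rejected, repaired, approved.
model = ModelVersion("churn", version=2)
model.move("submit", actor="data scientist")
model.move("reject", actor="IT reviewer", note="lacking functionality")
model.move("submit", actor="data scientist", note="defect repaired")
model.move("approve", actor="IT reviewer")
```

Note that there is no direct move from development to production: every model passes through pre-production, and every move leaves a record.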

The data flow

Figure 2 shows the data flow for analytical applications. The left side of the picture describes the database, the right side the analytics stack: the red dots are the scheduling instances, the blue dots the actual analytical processes. The top box “Dev” indicates the model development environment, while the lower box “Prod” indicates the model production environment.

Ds3.png

In the model development environment, the database is divided up into three parts (or schemas):

  • A staging area or (for the data scientists) read-only environment where IT can make data available. This has to be read-only in order for IT to guarantee there is no confusion on what has been delivered (from a quality and a quantity point of view).
  • The data science playground (or sandbox). This is the free area where model experimentation takes place, where ad-hoc questions are answered and where reports and insights are developed.
  • The lower part of the model development environment indicates the pre-production stage. This is an area where the data scientist works closely with the IT department. The discussions between the data scientist and the IT operator revolve around the hand-over process of the model. In addition, the IT operator needs to understand the data requirements of the model and needs to prepare the operational environment for the model. The hand-over of a model to an operational team needs to come with an audit structure.
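The split between a read-only staging area and a writable playground can be sketched with SQLite from the Python standard library. This is an assumption made for the sake of a runnable example: a real environment would more likely use schemas and grants in a server database, and the file, table and column names below are hypothetical. The pre-production area is left out for brevity.

```python
import sqlite3
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
staging_path = workdir / "staging.db"
playground_path = workdir / "playground.db"

# IT delivers data into the staging area using a writable connection.
it_conn = sqlite3.connect(staging_path)
it_conn.execute("CREATE TABLE customers (customer_id INTEGER, signup_month TEXT)")
it_conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "2015-01"), (2, "2015-02"), (3, "2015-02")],
)
it_conn.commit()
it_conn.close()

# The data scientist works in the playground (read-write) and attaches
# the staging area read-only through an SQLite URI filename.
ds_conn = sqlite3.connect(playground_path.as_uri(), uri=True)
ds_conn.execute(
    "ATTACH DATABASE ? AS staging", (staging_path.as_uri() + "?mode=ro",)
)

# Transformed data goes into the playground, not a copy of the original.
ds_conn.execute(
    "CREATE TABLE main.signups_per_month AS "
    "SELECT signup_month, COUNT(*) AS n "
    "FROM staging.customers GROUP BY signup_month"
)

# Any attempt to change what IT delivered is refused by the database.
try:
    ds_conn.execute("DELETE FROM staging.customers")
    staging_writable = True
except sqlite3.OperationalError:
    ds_conn.rollback()
    staging_writable = False
```

The point of the design is that the guarantee is enforced by the database rather than by convention: IT can always show what was delivered, regardless of what happens in the playground.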

The data flow described below supports the full workflow of data scientists: from ad-hoc reports to models supporting multiple departments. As mentioned, the journey starts with data being made available in the read-only (staging) area. The data available here is a mix of first-time deliveries (a data scientist is curious by nature and always on the lookout for new data sources) and regularly scheduled data deliveries (e.g. monthly new customers, usage, transactions, etc.). Initially, the data comes in raw and is explored as such. Further collaboration between IT and the data scientists may lead to requests for certain aggregations or selections of data. The regular data delivery is picked up by scheduled tasks that prepare the data for the data science data mart. Ideally, this is a change-history-based data mart that contains the data to answer 90% of the ad-hoc questions and from which the modeling data can be generated. An alternative to change history is the storage of monthly snapshots; however, that makes time-based selections and models much more difficult. Note that the playground is intended to contain transformed staging area data, not a copy of the original. Moreover, the playground should ideally only contain data from the staging area in order to prevent non-replicable models.
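Why a change-history data mart makes time-based selections easier can be shown with a small sketch. It assumes one row per period in which a value held, with hypothetical valid_from and valid_to columns; the customers and plans are made up.

```python
from datetime import date

# Hypothetical change-history rows. valid_to=None means "still current".
history = [
    {"customer_id": 1, "plan": "basic",
     "valid_from": date(2015, 1, 10), "valid_to": date(2015, 4, 3)},
    {"customer_id": 1, "plan": "premium",
     "valid_from": date(2015, 4, 3), "valid_to": None},
    {"customer_id": 2, "plan": "basic",
     "valid_from": date(2015, 2, 20), "valid_to": None},
]


def state_as_of(rows, as_of):
    """Return {customer_id: plan} as it was on the given date."""
    result = {}
    for row in rows:
        started = row["valid_from"] <= as_of
        not_ended = row["valid_to"] is None or as_of < row["valid_to"]
        if started and not_ended:
            result[row["customer_id"]] = row["plan"]
    return result


mid_march = state_as_of(history, date(2015, 3, 15))
early_feb = state_as_of(history, date(2015, 2, 1))
```

With change history, the state on any arbitrary date can be reconstructed with one filter. With monthly snapshots, only the snapshot dates are available, and a change such as the upgrade on 3 April can only be located to within a month.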

From the data mart, the data scientist creates two types of data for the modeling: analytical data and operational data. Analytical data refers to the data used to build the model. It is historic data, properly split into train, test and validation sets. Operational data refers to the data that is needed for scoring. Note that, since the playground only contains historic data, the term operational data here refers to the format of the data only, not to its recency. This is an important point, as I’ve encountered multiple situations where the data scientists imagined that they needed the most recent data in order to score the model (in the development environment). This placed unreasonable pressure on IT to deliver data at a high frequency in a development environment, with all the undesirable consequences that come from not separating model development from model production.
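A minimal sketch of the two types of data, assuming a 60/20/20 split and a hypothetical target column named churned. The rows are placeholders standing in for a data mart extract; they only serve to show the mechanics, not to build a model on.

```python
import random


def split_rows(rows, train_pct=60, test_pct=20, seed=42):
    """Shuffle reproducibly and split into train, test and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_test = len(shuffled) * test_pct // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validate = shuffled[n_train + n_test:]
    return train, test, validate


# Placeholder historic rows; in practice these come from the data mart.
historic = [
    {"customer_id": i, "months_active": 3 * i, "churned": i % 2}
    for i in range(1, 11)
]

# Analytical data: historic rows including the target.
train, test, validate = split_rows(historic)

# Operational data: the same historic rows in the format used for scoring,
# i.e. without the target. It is no more recent than the analytical data.
operational = [
    {k: v for k, v in row.items() if k != "churned"} for row in validate
]
```

The fixed seed makes the split reproducible, which matters when the model is later inspected in pre-production.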

Once the model is built, that is, trained, tested, validated and confirmed to score on the operational data, the model can be placed in pre-production. Rather than this being a separate environment, it turns out to be more practical to reserve an area in the development environment specifically for this. For the storage of the model, this can be a folder in the model repository. For the database, it is best practice not to let the data scientists create the required tables, but to have them provide the table create statements to IT in order to discuss naming conventions and such. After table creation by IT, the data scientists can insert the operational test data into the tables in order to show that the model scores in pre-production. This is important in order to identify any overlooked dependencies.
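The pre-production check can be sketched as follows, using an in-memory SQLite database so the example is self-contained. The table names, columns and the scoring function are all hypothetical; the scoring function is a stand-in for whatever the real model is.

```python
import sqlite3

# Table definitions proposed by the data scientists and, after agreeing
# on naming conventions, created by IT.
DDL = """
CREATE TABLE churn_model_input (
    customer_id   INTEGER PRIMARY KEY,
    months_active INTEGER NOT NULL,
    monthly_usage REAL    NOT NULL
);
CREATE TABLE churn_model_score (
    customer_id INTEGER PRIMARY KEY,
    score       REAL NOT NULL
);
"""


def score(months_active, monthly_usage):
    """Stand-in for the real model: any function returning a number."""
    return round(1.0 / (1.0 + months_active * monthly_usage / 100.0), 4)


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)  # done by IT

# The data scientist inserts the operational test data...
operational_test_rows = [(1, 12, 30.0), (2, 3, 5.0), (3, 48, 80.0)]
conn.executemany(
    "INSERT INTO churn_model_input VALUES (?, ?, ?)", operational_test_rows
)

# ...and runs the scoring job against the pre-production tables.
rows = conn.execute(
    "SELECT customer_id, months_active, monthly_usage FROM churn_model_input"
).fetchall()
conn.executemany(
    "INSERT INTO churn_model_score VALUES (?, ?)",
    [(cid, score(months, usage)) for cid, months, usage in rows],
)
conn.commit()

# Every input row should have received a score.
unscored = conn.execute(
    "SELECT COUNT(*) FROM churn_model_input i "
    "LEFT JOIN churn_model_score s USING (customer_id) "
    "WHERE s.score IS NULL"
).fetchone()[0]
```

If the scoring job silently depended on a playground table or a local file, it would fail here rather than in production, which is the purpose of this step.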

In order to have the model run in production, IT needs to make the operational data available in the production environment. There are two routes for that. In the first route, since IT knows exactly what they placed in the read-only staging area, they make the same data available in the production environment. All the data preparation for the model then comes down to the data scientists, who build this as part of the model scoring job. The second route usually comes up when this scenario is explained to IT: they invariably want to take over and provide the exact data needed for the model using their preferred ETL tooling. The data scientists are then tasked with documenting the data transformations in a way that allows IT to rebuild them. Typically this is not without challenge, as data scientists come up with very creative ways to transform the data, which might not be easy to achieve in ETL tools. In practice, it comes down to choosing the middle road: IT provides semi-manufactured data, upon which the data science work stream completes the remainder and subsequently scores the model.

It is best practice not to have the data scientists migrate the model to production. There may be one member of the data science team with a strong IT background and awareness of the IT policies who becomes the IT-data science liaison and is able to assist with the migration.

Data science requires a close interplay between IT and the data scientists. It’s a bottom-up process and it’s agile. That means that, prior to doing the analyses, the work cannot be written out as a list of specifications to be followed to the letter. Typically, data scientists start by investigating samples of data in combination with understanding the business, after which requirements for model building and data delivery follow. This happens in an iterative way, and with advancing insights come new or altered data requirements. An IT department that understands this process and can play this game can greatly contribute to the success of data science and the enhancements it brings to the business.

Concluding remarks

This article discussed how IT architecture can support the workflow of data scientists. I’ve found this architecture to hold for many companies that do not have data science as their core business (most industries, such as financial institutions, retail, telecoms and manufacturing, as opposed to companies that, say, specialize in deep learning). I’m also aware of rapidly changing technology, which is replacing traditional databases with Hadoop and the like. In those cases I’ve found that modeling against a database in a training environment (or at least having a database as part of the model development environment) often offers the greatest flexibility. The value of data science comes from the ability to play with data to determine the next steps in the analysis. Any architecture that enhances that ability will result in better outcomes for data science and, hence, better decision making.