We are happy to start new exciting category of our blog - Business Intelligence articles.
After a list of successfully designed and implemented BI solutions I want to share our experience and note all complex moments and issues that we have been faced.
Business Intelligence (BI) - the set of techniques and tools for data transformation into useful information for business analysis purposes.
In this article I want to share our experience with ETL (Extract, Transform and Load) optimization.
Extract, Transform, Load (ETL) - is the base for a data warehousing, these 3 processes manage any data warehouse. Extract - acquire data from data sources, Transform - change data in the proper format, Load - push data to final target (data mart, data warehouse).
We were designing and implementing BI solution for insurance brokerage company Willis Towers Watson.
It was common BI architecture following all general rules:
According to general Rules it was OLTP Database, DW Database, Cubes, SSIS dataflows between OLTP and DW Databases, Cubes and dataflows that should process Cubes.
In our case, production database was with 800 000 members and all related data and was growing on the daily basis. We needed to build reports based on production data and last update date for reports should be not more than 24 hours. So we started to implement BI reports.
In result our BI solution was a combination of three projects:
- Database project with a DW database schema and post-deployment insert scripts (Date Warehouse schema);
- Analysis Services project with OLAP Cubes schema (OLAP Cubes Schema);
- Integration Services project that combines all the steps (Extract and transform data from OLTP Database, Move it into DW database solution, Process Cubes, Update Access) (Integration Service project);
Date Warehouse schema
OLAP Cubes Schema
Integration Service project (ETL)
As you can see I have marked with a red color step Update Access and also you can see this step in the image above.
Further we will speak about this “Update Access” step, what was the issue and how we solved them. So now when that task it is clear in general let me go deeper, explain the issue and how we solved it.
In our BI solution we applied Role based Security. We have different Cubes (For example Claims, Transactions, Managements and so on). Particular Cube has some business items (In Cube Claims we have data about the claims) from all the Companies inside our Application and from all the Policies inside the Company. Reports are based on this data. For example report “Claims by Mode of Payment” or “Claims Value by Territory” and so on (Example or Reports). Each Claim (and report item in general) has exact referent to some Policy from some Company.
Each user in the system has specific Role (HR or Broker) with allowed sets of companies or policies inside Company depend from Role. So each user can see just some items in the system but not all of them.
Let's say we have company A with 5 claims, company B with 7 claims and company C with 11 claims. So if user A is a broker and has access just to Company A and C – we will see just that 16 claims from this two companies. So when some system user will go to Reports section in our Application – he should see just items for which ones he has access for.
For this reason, we implemented Role Based Security (Security architecture of BI solution).
Example or Reports
Security architecture of BI solution
As a Role based architecture – we decided to create one OLAP Database Role per one user and set access to list of Companies or Policies:
Role Based security
IMPORTANT! Generation these roles -is a part of ETL (Integration Service project (ETL) – Update Access). So ETL should be running every 24 hours because last update date for Reports should not be more than 24 hours. And in this step “Update Access” it was .NET code that reads all the users from DW database, reads all the accesses per user and in result, these create Roles under OLAP database. Each Role has some ID (Role Based security). In connection string to OLAP database we are transferring this Role ID – and in result it can see the data with allowed access.
And here was the problem. This ETL step Update Access takes around 10 hours to do that for few thousand users and sometimes it goes to infinity running. But it was not acceptable.
We started to investigate this issue. The delay was here. It was too long operation create Role by Role in OLAP database. Because there are not present some Bulk operation and we needed to create Role one by one.
Our idea was next. In result ETL generates this 4 files:
ETL files structure
File with extension asdatabase contains this Roles description and all the information about the roles and access. This file this is XML. In the "Role Based security" image we can see – we have some role with ID “0455b7f3-0020-42c0-9695-5004663c414a” and this role has access just to some Policy Categories (In our case Policy Category – under some Policy. So basically, HR has access to the list of Policy Categories inside some Policy. But this is too deep of our business behavior and this is not important in the scope of this article). If we open file asdatabase with Notepad and try to find this Role – we will see allowed set of Policy Categories:
Allowed set of Policy Categories per Role
So now we can say – issue is simple and clear. All we need to do – modify file asdatabase manually instead of add Roles to OLAP Database one by one.
In next step we very carefully investigated asdatabase file and understood XML tree. For this purpose we compared two asdatabase files for two OLAP databases – one with Roles, another database - without Roles. We have found all the places that we need to modify and where we need to add tags in this XML and tested it carefully:
Modifying asdatabase file to set the OLAP Roles
In last step our ETL reprocess the Cube and use this modified asdatabase file.
During deploy we receive exactly the same result and exactly the same OLAP Roles but it takes from us few minutes instead of 10 hours.
In the very last step we set up SQL Server job that will execute the dtsx (ETL) package on a daily basis.
Also check out:
Best Regards. Andrew Tsopych CTO of Diceus