
CryspIQ - my first thoughts

· 3 min read
Phlippie Smit
CryspIQ Technical Owner

"We're building something amazing"

When I spoke to Vaughan a couple of years ago, he told me that he and Dan were building something amazing. He spoke about a data model that could take in any type of data and report on it.

Having worked in large corporate and government data warehouse environments, I was naturally sceptical, as you may be too. I had also fiddled with universal data models while studying at university, so the project and product had my attention.

I really wanted to see it in action and after spending some time on the East Coast in another BI role and trying my hand as an IT Manager, I joined the CryspIQ team.

My first thoughts

The first few months were hard - very hard. I had to let go of the Data Warehousing knowledge ingrained in my mind from having worked with both Kimball and Inmon implementations. I kept trying to use the traditional ways, and my mind refused to make the shift to what CryspIQ was.

After some late nights and a few proofs of concept for potential clients, things started making sense - you could say the penny eventually dropped. I saw new opportunities and how CryspIQ simplifies the traditional end-to-end data warehouse process.

And it did not stop there: the services offered new, exciting ways of managing data quality while loading data, as well as ways to look up data against external data sources and APIs.

Further to this, CryspIQ could generate exceptions for the exact row of source data that refused to load because of a data issue, pointing me to the column and the reason the data would not load. Gone were the days of spending hours figuring out why a pesky multi-gigabyte flat file wouldn't load.
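
To give a feel for what that kind of row-level feedback looks like, here is a minimal sketch in Python of a load step that records the row, column and reason for each rejection. The structure, rules and field names are purely illustrative assumptions, not CryspIQ's actual exception mechanism.

```python
# Hypothetical sketch of row-level load validation -- not CryspIQ's implementation.
from dataclasses import dataclass

@dataclass
class LoadException:
    row_number: int   # position of the rejected row in the source file
    column: str       # column that failed validation
    reason: str       # why the value was rejected

def validate_rows(rows, rules):
    """Split incoming rows into loadable rows and per-row load exceptions."""
    loadable, exceptions = [], []
    for i, row in enumerate(rows, start=1):
        problems = []
        for col, check in rules.items():
            ok, reason = check(row.get(col))
            if not ok:
                problems.append(LoadException(i, col, reason))
        if problems:
            exceptions.extend(problems)   # reject the row, keep the diagnostics
        else:
            loadable.append(row)
    return loadable, exceptions

# Illustrative rules: a reading must be numeric, a timestamp must be present.
rules = {
    "reading": lambda v: (str(v).replace(".", "", 1).isdigit(), f"'{v}' is not numeric"),
    "timestamp": lambda v: (bool(v), "missing timestamp"),
}

rows = [
    {"timestamp": "2023-07-01T10:00", "reading": "42.5"},
    {"timestamp": "", "reading": "oops"},
]
good, bad = validate_rows(rows, rules)
for e in bad:
    print(f"row {e.row_number}: column '{e.column}' rejected ({e.reason})")
```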

We did a lot of development, testing, and prototyping in 2023, really throwing every data scenario we could think of at CryspIQ. It handled every scenario well (after thinking it through with my new-found understanding of CryspIQ).

My thoughts now

CryspIQ is not only a technology, it is also a methodology. The potential is huge for businesses to speed up access to insights from the ever-growing mountain of data that is generated every day.

I'm extremely excited about what 2024 will bring, and particularly about how we will use AI to make it even easier for people to interact with their data.

Big Data

· 7 min read
Vaughan Nothnagel
CryspIQ Inventor

A pragmatic business problem resolution perspective

The hype cycle of big data has brought a number of single-stream suggestions to bear in resolving the world’s insatiable appetite for information. There are three primary solution focus areas that show intent in solving the big data problem, each with its own specific benefit:

  • Technology – Database fragmentation, Multi thread parallelism, Logical and Physical Partitioning;
  • Infrastructure – Processor Componentisation, Tiered Storage Provisioning, Hardwired Distributed Storage; and
  • Application – Data De-Normalisation & Aggregation, Read-ahead logic, Pre Process aggregation, Performance re-engineering.

Truth be told, each specific solution in isolation provides an improvement opportunity of its own, and each will bring varying degrees of success to an organisation’s ability to handle the big data problem if, and only if, the right questions are asked. Only through rigorous analysis and functional decomposition can one select the right method(s) to provide long-term resolution as opposed to short-term ‘disguising’ of the issue.

Let’s pause for a moment and consider the phrase ‘Big Data’. Does it mean lots of the same stuff, or a wide variety of lesser volumes? Does it mean a need to visualise the detail or an aggregated view, or do we need to trawl/farm the data continuously with the intent of detecting deviation from the norm or threshold triggers? All are valid questions, and of course all have different answers and possible solutions, but few people and organisations recognise the breadth of the problem, and herein lies the rub.

Big data is a business problem and not an IT one, yet technology has replied with many solutions, some of which are lightly referenced below:

  • Storage engines (viz. Amazon), the cloud and thin provisioning provide potentially limitless (reasonably priced) storage for an organisation to grow into, so available storage is typically no longer a severely limiting factor;
  • Rapid load and retrieval parallelism (initiated by the Teradatas of the world) and in-memory storage (SAP HANA, Hadoop HDFS, columnar DBs) are not new and allow massive volumes of data to be processed in the blink of an eye, so data retrieval rate is no longer a limitation;
  • Improved buffering, dynamic index determination (query access path analysis) and internal optimisation for standard relational database technology allow databases (MS SQL, DB2, Oracle, SYBASE, MySQL and others) to read huge amounts of data seemingly instantaneously, meaning that databases are less and less the limiting factor; and
  • Solid State Disks (SSDs) provide a way around the disk IO factor, so IO is said to be less of a problem than it used to be.

In essence, I believe that one should be asking different questions, or combinations thereof, focussed on the business user if you truly want to implement the correct solution to the problem. Things like:

  • ‘Is time relevance of data important?’,
  • ‘Are your business actions going to be affected by providing an aggregated view of the whole?’,
  • ‘Is your historical data in fact relevant if the parameters under which they were collected have changed?’,
  • ‘Is the level of granularity of your data of business relevance and should there be a level of roll-up?’ and
  • ‘What is your data life-cycle policy and is it working for you?’.

These and other data-usage-related discussions will allow a data bigot (such as me) to deliver a long-term, sustainable solution. In the rest of this paper I cover the business-directed questions at a high level, in a ‘how does this affect the solution to be provided’ manner.

Is time of relevance in your data?

By asking this question we allow determination of time relevance boundaries that could impact data availability. The reality is that, while CIOs may want to boast about the amount of data they have, in my experience data older than a year is seldom used to influence near real-time or tactical decision making. This data could be aggregated (by week/month/year) to reduce volumes and provide the historical trending required without forcing decision engines to trawl billions of rows. Also, older data usually applies to a different set of operational conditions (fewer tills, fewer branches, less AMS, different product ranges etc.); consequently, without some intense rationale around ratios for ‘levelling the playing field’, context is often incorrectly represented, and one may be better served by shortening the historical time horizon against which queries run.
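
As a rough illustration of the roll-up idea, the sketch below keeps the last year of data at full grain and aggregates anything older into monthly totals. The record shape, window and column meanings are assumptions made for illustration only.

```python
# Hedged sketch: keep recent detail, roll older data up to monthly totals.
from collections import defaultdict
from datetime import date, timedelta

def split_and_rollup(records, today, detail_window_days=365):
    """Keep recent rows at full grain; roll older rows up to monthly totals."""
    cutoff = today - timedelta(days=detail_window_days)
    detail, monthly = [], defaultdict(float)
    for event_date, amount in records:
        if event_date >= cutoff:
            detail.append((event_date, amount))              # last year stays detailed
        else:
            monthly[(event_date.year, event_date.month)] += amount
    return detail, dict(monthly)

records = [
    (date(2022, 3, 14), 120.0),
    (date(2022, 3, 20), 80.0),
    (date(2024, 1, 5), 55.0),
]
detail, rollup = split_and_rollup(records, today=date(2024, 2, 1))
print(detail)   # [(datetime.date(2024, 1, 5), 55.0)]
print(rollup)   # {(2022, 3): 200.0}
```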

Is your data life-cycle working for you?

Similar to the time relevance question, stale data is often the clogging factor for queries that require a rapid response. It pays to have a decent data design/lifecycle and a methodology for staged archival (logical and physical), as this allows true analysis of how frequently you need to access the old data, since you will be able to track physical access. In determining actual usage, many organisations I have worked with have achieved results exceeding a 2000% improvement in queries executed purely by applying a slightly more informed data life-cycle.
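
One simple way to make those archival decisions measurable is to track how often each historical partition is actually read. The sketch below is a hypothetical illustration of that idea; the partition labels and threshold are assumptions, not a prescribed design.

```python
# Hypothetical sketch: drive archival decisions from measured access counts.
from collections import Counter

access_log = Counter()          # partition label -> number of reads

def read_partition(partition: str):
    """Record the access, then fetch the partition's data (fetch omitted here)."""
    access_log[partition] += 1

def archival_candidates(partitions, min_reads=1):
    """Partitions read fewer than `min_reads` times are candidates for archive."""
    return [p for p in partitions if access_log[p] < min_reads]

# Simulated query traffic: recent months get hit, old ones do not.
for p in ["2024-01", "2024-01", "2023-12"]:
    read_partition(p)

all_partitions = ["2022-06", "2023-01", "2023-12", "2024-01"]
print(archival_candidates(all_partitions))   # ['2022-06', '2023-01']
```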

Have you analysed what you are asking the queries to do?

Many say that this is IT 101; however, never in my 30 years in the industry have I come across an organisation that consistently or iteratively applies this performance consideration to its solutions. Truth be told, databases are funny beasts, and in reality a single record added to a relational database may send a query from seconds to hours, or perhaps even to deadlock. There is an independent practice developed by a former colleague, called Cost Effective Computing (CEC), which is loosely associated with the Zachman architecture framework and will very quickly pinpoint areas of ‘congestion’ and opportunities for improvement in a problematic solution’s processing path.

Is your data at the correct level of granularity?

Incorrect granularity is more often than not a legacy of a poor architectural and/or business decision made with good intent at the time but little consideration for the future. A case in point is a road counter that counts wheels crossing it in pairs. The relevance disappears when one considers that large multi-wheel trucks and motorcycles are all counted in the same way. Without the relevant associated context of weight, time between triggers, speed and so on, these counts serve little purpose other than to be just that, a count.

Is your data suitably isolated so as to allow exploitation of parallelism?

If our solutions attempt to farm information from transactional systems in their raw form, some challenges will be encountered, not the least of which is contention with operational requirements. In transactional systems, data is arranged to allow rapid single-transaction processing to improve the user experience, and seldom with consideration for parallelism or record isolation to allow multi-stream retrieval. This situation gave rise to the Data Warehousing of old, where data was re-arranged (typically in a monthly cycle), and to the Operational Data Store concepts. I strongly urge you not to knock this old-school approach; instead, consider changing the refresh cycle to allow for more current information. You will in most cases notice that when farming Big Data there is seldom a requirement for sub-second responses, and minutes will normally suffice.

Summary

In summary, the above is by no means an exhaustive list of business questions to ask, and one should realise that although there may be nuggets of improvement opportunity, ongoing smooth operations and information provisioning need to be approached differently; it is, after all, data, so why not start there. A more pragmatic and complete approach requires a passion for performance and data collectively, a rare trait indeed. I would urge all organisations: if you find individuals who possess these traits, hold onto them, as they will be your saviours as the volumes of data continue to explode.

Data Preparation – The New Way

· 6 min read
Vaughan Nothnagel
CryspIQ Inventor

For two and a half decades or more, organisations have followed one of two data warehousing approaches for the preparation of data: Inmon’s (CIF, GIF and DW 2.0) or Kimball’s (Star Schema fact models). Both methods have intrinsic benefits but leave organisations with some challenges in accessing warehoused data. More recently, Amazon S3 has provided data lake capability, which similarly requires deep IT knowledge and a somewhat prescriptive understanding of the data to be of downstream value.

By taking the best of both data warehousing practices, a new paradigm is possible; however, one fundamental mind shift is required to bring the value of data closer to the surface for business self-service, analytics and reporting. This mind shift is breaking the human habit of clustering data in a format that represents the data source (transactions remain transactions, readings remain readings and functional records remain functional records), as this is what restricts data from being used in a more abstracted way.

To break this mould, we need to consider the incoming data as just that... DATA, rather than clustered, structured content. When we do that, the most granular elements of the data are stored as independent pieces of data of a specific type. These types of data are finite in number, irrespective of your business or the source system from which the data is obtained.

This decomposition of source records allows one to store the incoming data at the granular level, clustered with data of like type from other inputs. This means that the underlying data structures used to store the data remain static by nature and, with an element of business training, available for a business user to consume.
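
As a purely illustrative sketch of this decomposition idea, the Python below breaks two very different source records (a sale and a sensor reading) into independent, typed elements that land in the same structure. The element type names, mappings and field names are assumptions for illustration and do not reflect the actual CryspIQ schema.

```python
# Illustrative sketch of "transactional de-composition" into typed elements.
from datetime import datetime
from typing import Any, NamedTuple

class Element(NamedTuple):
    element_type: str    # one of a small, finite set of types (assumed names)
    context: str         # business meaning in the single organisational context
    value: Any
    occurred_at: datetime

def decompose(source_record, mapping, occurred_at):
    """Break one source record into independent, typed, granular elements."""
    return [
        Element(element_type, context, source_record[field], occurred_at)
        for field, (element_type, context) in mapping.items()
        if field in source_record
    ]

# A point-of-sale transaction and a sensor reading land as the same element types.
pos_mapping = {"amount": ("Measure", "Sale value"), "store": ("Descriptor", "Store")}
sensor_mapping = {"temp_c": ("Measure", "Temperature"), "site": ("Descriptor", "Site")}

now = datetime(2024, 1, 5, 10, 30)
elements = (
    decompose({"amount": 19.95, "store": "Perth CBD"}, pos_mapping, now)
    + decompose({"temp_c": 41.2, "site": "Pump 7"}, sensor_mapping, now)
)
for e in elements:
    print(e)
```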

CryspIQ (Patent AU2016900704) utilises a combined practice of:

  • A ‘single organisational context’ to overcome nomenclature differences across different areas of the organisation (normalisation and standardisation of context);
  • A ‘transactional de-composition’ philosophy forcing the deconstruction of individual elements from source systems and thereby disassociating them from their original structural format constraints and their source of origination (information decomposition);
  • Recording only specific elements of the source data (as opposed to the whole record) within your organisation (One-off build) as one or more of the finite types of data;
  • Retention of time-based sensitivity for event driven recording; and
  • Multi-Dimensional representation of the generalised data types allowing cross business domain analysis and reporting.

Kudos must be given to the originators of both methodologies for their contribution to the CryspIQ design, as elements of each are evident:

  • From Inmon – a single organisational context; and
  • From Kimball – a fact-based Star/Snowflake schema model.

As with these forerunning concepts, the CryspIQ product and methodology have been designed, implemented and made commercially available to enable greater consumption of enterprise data. Some elements of big data practice have also enabled this evolution in data warehousing, specifically MPP (Massively Parallel Processing), big data engines such as Hadoop, and multi-threaded/columnar data access; these now-common practices enable the full capability of the new product(s).

To explain the above product in one sentence, one could label it as: ‘A functionally agnostic, fine-grained Operational Data Store of factual detail (past, present and potentially future) that represents any data source’s specific elements in a single business context, irrespective of business type, source system or desired downstream use.’

Through adoption of the methodology, a number of benefits are immediately realised by organisations, namely:

  • Stabilisation of the underlying data storage model means less discovery time for data engineers, data scientists and business users, improving time to value for common functions to minutes as opposed to days or even weeks;
  • Having a single structure filled with multi-faceted data allows an educated business user from any area of the business to self-serve reporting, analytics and dashboards through any of the common Business Intelligence / Reporting tool sets, even if they have no understanding of the original source system’s data ecosystem;
  • With a single structure representing the entire organisation and data no longer constrained to source system structures, stored data never becomes redundant; in the event that operational source systems are changed, data from the new source system is stored side by side with that from the old system, meaning zero redundancy over time and no migration needed at the time of change;
  • Having all of your data represented across the common type definitions results in a single time perspective irrespective of granularity, and the ability to remove a specific time dimension association from each record; and
  • With all data conforming to a single context, business alignment across multiple business domains is achieved; this enables cross-functional analysis and reporting to be performed with no data discovery time, improving business turnaround.

How the integration is achieved derives from modern data interchange mechanisms; however, an organisation’s maturity in the data exchange domain may drive differences in implementation to achieve the same goals.

Simply put:

  • a message structure is defined for the source data expected, and captured/loaded into the mapping engine;
  • the input structure is then mapped, using the organisation’s customised CIntelligence GUI mapping tool, to its requisite destination(s) in the repository; and
  • the CryspIQ custom services in the operating system initiate the processing of that data on receipt.

The data interchange is a nuanced change from existing ETL/ELT in that the separation of function is more explicit: data is ‘Pushed’ (P) or ‘Pulled’ from the source for delivery to the mapping ‘Prepare’ (P) engine, which in turn delivers ‘Load’ (L) ready data elements. Implicitly, this PPL isolation means that the ‘L’ component is only ever developed once, the Prepare ‘P’ is an administrator-level, GUI-controlled function requiring little or no IT contribution, and the Push is the only system-specific development required for new data entering the repository. This process isolation helps organisations deliver new data to the repository structure in a time that is an order of magnitude faster than what business is used to. Experience has shown improvements of up to 85% in turnaround time from source identification to active use of the data in the solution. We now talk of as little as hours to realise new content in the repository, where history has typically been days, weeks or even months.
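
A hedged sketch of that PPL separation follows: only the push step is source-specific, the prepare step is driven by a mapping definition (the kind of thing a GUI tool would maintain), and the load step is written once. All function, field and mapping names here are illustrative assumptions, not CryspIQ's actual services.

```python
# Illustrative Push / Prepare / Load separation -- names are assumptions only.
import csv

def push_from_csv_export(path):
    """Source-specific step: read raw records from one particular source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def prepare(raw_records, mapping):
    """Generic step: apply the (GUI-maintained) mapping to produce load-ready rows."""
    return [
        {target: record[source] for source, target in mapping.items() if source in record}
        for record in raw_records
    ]

def load(prepared_rows, repository):
    """Built once: append load-ready rows to the repository structure."""
    repository.extend(prepared_rows)
    return len(prepared_rows)

# Adding a brand-new source only needs a new push function and a mapping entry.
mapping = {"txn_amount": "Sale value", "store_name": "Store"}
repository = []
prepared = prepare([{"txn_amount": "19.95", "store_name": "Perth CBD"}], mapping)
print(load(prepared, repository), repository)
```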

Since the product is not dependent on a specific technology, it can be implemented with consideration of the existing investment in any current Business Intelligence / Reporting Tool sets from Teradata, Informatica, Cognos, Microsoft PowerBI, SAP Business Objects / Business Warehouse and even Amazon’s S3 data lake.

Data Warehousing – Time for a New Paradigm

· 7 min read
Vaughan Nothnagel
CryspIQ Inventor

For more than 25 years now, businesses have followed a pattern of mapping business transactions in their own context into a data warehouse. Realistically, this practice, driven by the ideals and methodologies of either Bill Inmon or Ralph Kimball, has, to date, provided a great ability to re-format transactional data into a model that enables rapid retrieval and consistent enterprise reporting platforms. I believe that through this analytics evolution cycle we have become victims of our own design, in that the value delivered from a data warehouse built in this manner is, by and large, not conducive to encouraging new questions to be asked of your business data. This stifled pattern of use typically leads us to use complex data warehouses to answer questions for which we could just as well have built a static report from the operational system.

What’s the point, you might ask? Well, simplistically, if the business transaction from one area of your organisation looks like an apple and another looks like an orange, the reality is that it is difficult to compare or analyse the two together. There are a number of anomalies, of which timeliness of information, granularity and the absence of common associative information are but a few, that restrict us from doing this effectively. So, if we insist on making each warehouse entry look like the apple or orange from which it was sourced, the warehouse is, by design, failing to deliver much if any value other than migrating the reporting platform to a different system.

Some organisations have gone some distance toward standardising the context of their data warehouse(s) and have reaped the associated rewards on the data management and governance side, not to mention having a somewhat common context and dimensional view across the business reports produced. But, and it is a big but, at the end of the day we are still joining apples and oranges of varying granularity levels as well as different structures. With this in mind, our approach in CryspIQ to resolving this has been to look into the truly successful operational application solutions and start analysing what makes them so good.

The new paradigm explained

Successful business applications have, without exception, one thing in common: they are designed to deliver their primary function well. Makes sense, doesn’t it? For example, a financial system, irrespective of brand, is great at finance transactions and can probably be configured to carry some other finance-related functions, but it can never effectively cover, say, inventory management without significant extension. So why have we, in the Business Intelligence world, persisted in designing our data warehouse(s) to attempt analytical processing whilst keeping the records on a one-to-one mapping with the source? Surely we are just carrying the problems experienced in reporting across multiple different business transactions from multiple systems into one, albeit single, system, where we then try to apply some smarts to untie the knots? Ultimately, in most circumstances, we still end up with the same apples and oranges explained earlier.

With this in mind, we at Crysp Pty Ltd, from the isolated city of Perth, Western Australia, have categorically started to turn the thought processes for Business Analytics and Information Management around. What if we were to ignore the existing source systems as a point of design for the data warehouse and instead look at the analytical requirements of our business as the starting point? What if we could build a system that delivers what we need directly from the repository; would that make a difference? The simple answer is ‘Yes’. Try thinking of it this way, in order to simplify and get to the light-bulb moments that my partners and I reached a couple of years back now:

  • Do the current typical analytics design methodologies, and derivatives thereof, have any inherent flaws? – Answer – No – so there is no need to re-invent the wheel here; use what works!;
  • Is the delivery platform of importance for delivering analytics to where it’s needed? – Answer – Somewhat yes, but it is not restricted to any one tool or vendor – so don’t discard investments in delivery channels and mobility; you can re-purpose the technology for use in a new paradigm;
  • Is the analytical structure of an organisation’s data needs already known? – Answer – Mostly yes – so you can normally re-build a successful analytics system using your existing people and knowledge, with some guidance, and often with your existing dimensional views;
  • Is the data, in its operational form and context, able to deliver true analytics? – Answer – Mostly no, as it is too contextually bound to its source and does not easily match other data in the organisation; and finally, the coup de grâce
  • Is there an opportunity to rebuild/restructure my data warehouse to enable my organisation to achieve true information analytics in the business’s hands? – Answer – Yes, yes and yes again…

If you’ve seen it already, good for you; otherwise, here it is:

“Restructuring an existing data warehouse whilst maintaining the organisation’s already understood dimensional context delivers a system capable of providing analytics as a function, rather than simply collecting records (often for the wrong reasons) and then trying to make sense of them.”

Some will say, ‘we already do that’, and I would lay down one challenge to these nay-sayers to see if this is really true. If the business wants to add brand-new content to the current front-end analytics, what has to be done to make this happen? If your answer involves more than one item, then chances are you are not yet where you need to be in your restructuring efforts. The game really changes when you are able to truly drive your organisation’s analytical capability from the user side, not from often expensive and time-consuming IT technical competency. We at Crysp Intelligence recognised this need and have delivered a flexible methodology where your organisation needs to do only one thing to add brand-new content to the analytics engine. Just think how powerful that would be, measured in financial and time-to-value terms.

Let me answer that with some actual cost and effort figures to show you the financial and time difference between the current model(s) and the new idea:

  • Current costs of adding new content to a large corporate data warehouse environment ~$50,000.00 vs proven cost of ~$6,000.00 to deliver new data to the warehouse in the new paradigm.
  • Current time to add new content to a large corporate environment ~ 3 Months (12 Weeks) vs timed delivery of 5 Days for the same content using the new paradigm.
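
For readers who want to see where the headline percentages in the next paragraph come from, here is a quick back-of-the-envelope check (assuming roughly 60 working days in 12 weeks):

```python
# Quick check of the headline numbers quoted below.
cost_saving = 1 - 6000 / 50000        # -> 0.88, i.e. an 88% cost reduction
time_saving = 1 - 5 / 60              # ~12 weeks ≈ 60 working days -> ~0.92
print(f"{cost_saving:.0%} cost reduction, {time_saving:.0%} time reduction")
```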

Simplified into pure business speak: would you like to reduce your cost for new data in your Business Analytics / data warehouse engine by up to 88% or more? And would you like to cut the time for new data content to arrive in your analytics core by up to 92%? I would not hesitate to say that the answer for any business has to be a resounding ‘YES’.

Couple this with the results of moving the business analytics responsibility into the hands of the business users and removing the constraints typically applied by current database structures (remember the apples and oranges discussion), and we believe that our new methodology will be of interest to you.