Building our Data Platform: Why DeNexus chose Databricks to solve cyber risk
At DeNexus, we are solving the cyber risk challenge that is plaguing critical infrastructure across the globe, and the insurers that underwrite it.
The Challenge: Reliable Data
Critical infrastructure is much more than airports, power lines and energy distribution centers. It involves ecosystems of complex operational technology (OT) and IT networks, ICS, enterprise software and people.
In addition, critical infrastructure faces increasing cybersecurity risk. Energy, renewables, manufacturing, transportation, supply chain ecosystems and more: all have fallen victim to increasing ransomware and malware attacks over the last five years. Ransomware attacks targeting ICS and OT have grown over 500% over the last two years alone. The cyber risk challenge is an expensive one, where the sheer dollars at stake are perhaps only eclipsed by the monumental amount of data that this challenge presents.
DeNexus has created its proprietary Data Lake, DeNexus Knowledge Center, to collect, store and use the data necessary to develop and train our proprietary risk quantification algorithms and to productize its use by our clients from a secure, efficient, flexible, and highly scalable risk quantification SaaS platform – DeRISK.
Figure 1: DeNexus Knowledge Center
We defined the following requirements for the platform:
- Allow data access to different types of users: advanced users (who can take advantage of complex transformations) and basic users (only familiar with SQL).
- Strong data versioning control: the ability to guarantee that a given file/table version will not change, which is essential for advanced modeling. All past transactions within the Data Lake should be recorded, so accessing and using previous versions of the data should be easy.
- Disparate data support: Structured, unstructured, or semi-structured data and different data formats, from DeRISK’s inside-out, outside-in and business intel data sources.
- High scalability: We want to be ready for petabytes of data volume as we are growing rapidly. "Design for Infinity: Big enough will not always be Good enough."
- Infrastructure delegation: The solution should be (as far as possible) SaaS or PaaS to optimize the use of in-house human capital.
- ML-OPS: One of the main uses of our data lake is to leverage data through the application of models (the main customers of the data platform are our data scientists). The ML-OPS paradigm should be followed as a best practice measure, to facilitate development and speed up production deployment.
- Strong security and compliance constraints: Data storage must be flexible and dynamic enough to satisfy them.
After evaluating existing solutions on the market, we discarded those based on the well-known Data Warehouse concept. With the fast access to data provided by formats such as Parquet or Avro and tools such as Spark or Presto/Trino, does it still make sense to differentiate between a Data Lake and a Data Warehouse? It depends on the use case; in ours, it does not. Why? Our main data sources are not RDBMS systems such as SQL Server, PostgreSQL, MySQL, etc. In fact, our data did not originally come from anywhere: we are not migrating, we created our data platform from scratch.
Bill Inmon has recently made a similar observation:
“At first, we threw all of this data into a pit called the ‘data lake’. But we soon discovered that merely throwing data into a pit was a pointless exercise. To be useful - to be analyzed - data needed to be related to each other and have its analytical infrastructure carefully arranged and made available to the end user. Unless we meet these two conditions, the data lake turns into a swamp, and swamps start to smell after a while. A data lake that does not meet the criteria for analysis is a waste of time and money.” - Building the Data Lakehouse, by Bill Inmon
The Solution: Data Lakehouse
We needed a data management system that combined the key benefits of data warehouses (ACID transactions, data versioning and optimization features) with those of data lakes (low-cost storage and support for a wide variety of data formats). In other words, we needed a Data Lakehouse platform. We chose Databricks because it natively implements the Data Lakehouse concept, and for the reasons described in the following section.
Figure 2: Data Warehouse vs. Data Lake vs. Data Lakehouse comparison
Machine Learning (ML) algorithms do not work well with Data Warehouses (DWH). Unlike BI queries, which typically extract small amounts of data, ML frameworks (XGBoost, PyTorch, TensorFlow, etc.) process large datasets without using SQL. Reading this data is inefficient due to the constant conversion of data types (performed by the JDBC/ODBC connectors), and there is often no direct compatibility with the internal proprietary data formats used by data warehouses. For these use cases, data could be exported to files in open data formats, but this adds another ETL step, which increases complexity and staleness.
It could be argued that “cloud-native” data warehouses like Snowflake, which can read external tables stored in open data lake formats, implement the Data Lakehouse approach too. However:
- Their main data source is their internal data (with much higher storage costs), so ETL pipelines will still be needed in some cases, adding processes to maintain and more potential points of failure.
- They cannot provide, over the data in data lakes, the same management features that they provide for their internal data (e.g. transactions, indexes).
- Their SQL engines are mostly optimized for querying the data in their internal format.
Furthermore, the single-purpose tools for querying data directly in Data Lake storage (like the previously mentioned Presto/Trino, or AWS Athena) do not solve the whole data problem, as they lack the basic management features that a normal Data Warehouse provides (ACID transactions, indexes, etc.).
Figure 3: DeNexus Data Platform
Why does it meet our needs?
Allow data access to different types of users: before anyone can use SQL to query the data, someone has to process the raw data and give it structure. Can those transformations be done with SQL alone? The answer is yes, but good luck with that...
SQL is great for many things but not for everything (it is very difficult to implement recursion or loops, and awkward to use variables) because it is not a general-purpose programming language. Since we do not know the full scope of the transformations that will be necessary during the implementation of our DeRISK product roadmap, we opted for versatility: Databricks allows the execution of Spark code in Python, Scala, Java, R and, yes, SQL! So it can be used by many types of users. Voila!
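As a small illustration of the kind of transformation that is painful in plain SQL but trivial in a general-purpose language, here is a hypothetical example of recursively flattening nested JSON records before loading them into a table (the field names are invented for the sketch):

```python
# Hypothetical example: recursively flatten nested records into a single
# level with dotted keys -- easy in Python, awkward in plain SQL.
def flatten(record, prefix=""):
    """Flatten nested dicts into a flat dict with dotted key paths."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the key prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

row = {"site": "plant-1", "risk": {"score": 0.8, "vector": {"ot": 3, "it": 5}}}
print(flatten(row))
# {'site': 'plant-1', 'risk.score': 0.8, 'risk.vector.ot': 3, 'risk.vector.it': 5}
```

On Databricks, a function like this can run as-is in a notebook cell, while a SQL-only warehouse would force us to express the same logic in far more convoluted ways.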
Strong data versioning control -> Delta. Databricks natively implements the Delta format, which solves one of Spark's historical gaps: the lack of ACID transactions (Delta is fully ACID compliant). Delta also provides a recovery mechanism in case of errors in the pipelines and a simple way to guarantee, for example, that the data used to develop a model will always be the same (data versioning). Furthermore, Delta is fully open source.
Databricks (like Spark) allows us to deal with all types of data (structured, semi-structured and unstructured) and data formats:
In addition, if Spark does not support a specific format, a connector can be developed manually (Spark is fully open source), or native Python, Scala, R or Java libraries can be used (Databricks is much more than managed Spark).
“Superior technology: Until we see technology leaders like Google, Netflix, Uber, and Facebook transition from open source to proprietary systems, you can rest assured that systems based on open source, like Databricks, are superior from a technology perspective. They’re far more versatile.”
Scalability is linked to the PaaS nature of Databricks and the use of cloud resources throughout our Data Platform. Additional storage needs are met by S3, while one-off needs, such as new environments (when a new Data Scientist joins the team, it is no longer necessary to configure his/her computer locally) or more processing power, are met directly by Databricks. This frees up much of the time of very valuable resources: software engineers. At all times we have detailed control over how many machines are running and what specifications they have (also useful to avoid billing surprises!).
In addition, DBR (Databricks' optimized implementation of Spark) is much faster than regular Spark, which makes the extra price paid for Databricks Runtimes worth it.
Figure 4: Spark Open-Source vs Spark DBR (via YouTube)
Databricks + managed MLflow as the full ML-Ops solution, providing environments for model development and a platform for the whole machine learning lifecycle.
GitHub - mlflow/mlflow: Open source platform for the machine learning lifecycle
It allows our Data Scientists to easily track the table versions used in an experiment and to reproduce that version of the data later. Furthermore, it gives them a collaborative environment in which they can easily share their models and code with colleagues.
It can even be used in conjunction with other ML platforms like Azure-ML or AWS SageMaker: a model registered in Databricks Managed MLflow can be easily used in Azure-ML or AWS SageMaker.
Furthermore, Databricks Managed MLflow allows our Data Scientists to easily parallelize their algorithms using Spark ML and Koalas (an implementation of the pandas API on Spark).
Data storage and processing layers fully decoupled. Data can live anywhere, in any format, and Databricks can be used to process it (separate compute and storage). No proprietary formats or tools are being used, so we have high flexibility to migrate our data.
The Data Platform diagram earlier in this article (Figure 3) shows the three stages of the data and the tools used in each of them:
- Data Processing: Databricks, Python+ AWS Lambda, EC2.
- Data Discoverability: Databricks, AWS Athena.
- ML-Ops: Databricks, AWS SageMaker.
There is one thing in common: the use of Databricks.
Furthermore, there is no vendor lock-in (the AWS Glue Data Catalog is used to externalize the metastore), and the pay-per-use model allows us to adopt alternatives for scenarios that we do not face today but could face in the future.
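For reference, pointing a Databricks cluster at the AWS Glue Data Catalog instead of the default internal metastore is a one-line Spark configuration (a sketch of the documented setting; IAM permissions for Glue also need to be in place, which is not shown here):

```
spark.databricks.hive.metastore.glueCatalog.enabled true
```

Because the metastore lives in Glue rather than inside Databricks, other engines such as AWS Athena can query the same table definitions, which is what removes the lock-in.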
Figure 5: A Data Platform to rule them all
“If you know that with a good architecture and data model you can still achieve most of your data consistency, governance, and schema enforcement concerns … and you want more power and flexibility on top of that data … then pick Databricks .. there is very little that Spark and DeltaLake cannot do”
If you want to learn more about DeRISK and the way DeNexus leverages its proprietary DeNexus Knowledge Center, request a demo!