Iván Gómez Arnedo · Jan 5, 2023
A detailed analysis of why DeNexus rejected an 'all-in-one' data integration tool
Created with Dall-E — Image by author
Introduction
This is the third article in a series that started with Building our Data Platform: Why we have chosen Databricks over Snowflake and continued with Building our Data Platform: Why we have chosen FluentD.
This article is different from the previous ones because we are not describing why we chose a tool or technology; instead, we are describing why we discarded one. We are not trying to sell any tool or technology, nor is this a detailed tutorial on how we implemented a solution. It is a summary of the analysis we performed on the available ETL (Extract, Transform, and Load) tools to see whether adopting one made sense given our use cases, which are driven by the sensitive nature of our clients' data.
ETL tools
Extract, transform, load tools are software packages that facilitate performing ETL tasks. There is a lot of hype in the data engineering world about these kinds of tools, but are they really going to replace most of the data engineering work? Here are some good reasons why it could be a clever idea to use them:
API (Application Programming Interface) providers change their specifications, sometimes with notice and sometimes without. These tools maintain compatibility with many popular data sources (and destinations) and apply the needed fixes whenever those source systems change.
At the end of the day, what is better? Paying a third party to free up your data engineers' time, so they can devote themselves to more value-added tasks? Or hiring more data engineers (a more expensive option, even if you do not have to pay a third party) and having custom pipelines where everything is configurable because your team developed the code?
But what if our use cases are not based on known sources (connectors)? Does it still make sense to use these solutions? Does it make sense to adopt their standard, and all of their overhead, when their main advantages (pre-defined connectors maintained by the community…) are not going to be taken advantage of? Too soon for that; let us first jump to our requirements.
DeNexus Trusted Ecosystem (DTE) and DeNexus Knowledge Center (DKC) requirements
We are following the ELT (Extract, Load, Transform) paradigm; in other words, we do not transform the data when we first store it in our data lake. We are not going to modify the RAW data; instead, we are going to create different versions of it according to the different use cases of internal and external (customer) stakeholders. This means we are not going to use these ETL tools to process data (that will be done at a more advanced layer of our data platform), so all the data processing functionality of these tools is of no interest to us.
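As a rough illustration of what this looks like in practice, here is a minimal PySpark sketch of the ELT flow; the bucket paths, field names, and Delta format are assumptions for the example, not a description of our actual platform:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-raw-landing").getOrCreate()

# Extract + Load: land the source payload in the RAW zone exactly as received.
# No casting, renaming, or filtering happens at this stage.
raw_df = spark.read.json("s3://example-landing-zone/source_x/2023/01/05/")
raw_df.write.mode("append").format("delta").save("s3://example-raw-zone/source_x/")

# Transform (later, per use case): derive a curated version from RAW on demand,
# leaving the original data untouched for future use cases.
curated_df = (
    spark.read.format("delta").load("s3://example-raw-zone/source_x/")
    .select("site_id", "event_type", "event_ts")  # only the fields this use case needs
)
curated_df.write.mode("overwrite").format("delta").save("s3://example-curated-zone/use_case_a/")
```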
The requirements of our DeNexus Trusted Ecosystem are specific, combining data extracted for each client (Inside-Out) with data taken from different external public, private and proprietary sources (Outside-In).
For more information about how DeRISK is building the global standard in industrial cyber risk quantification, click here.
Tool review
In the ETL tools section, we have seen why it can be a clever idea to use data integration / ETL tools, as well as several comparisons between the existing options in the market.
Since discussing their differences is out of this article's scope, for the rest of the article we will focus on the tool we consider the best available option: Airbyte.
Why?
Airbyte architecture — Image by Airbyte
Installation
First, we are not able to use Airbyte Cloud because it does not meet one of our security requirements: “Inside-Out extractions have to be performed in the customers' environments.”
Therefore, we could only use its open-source version.
sudo docker ps --size --format "table {{.Names}}\t{{.Image}}\t{{.Size}}"
The “size” information shows the amount of data (on disk) that is used for the writable layer of each container.
The “virtual size” is the total amount of disk space used for the read-only image data used by the container and the writable layer (see the GitHub issue “Explain the SIZE column in ‘docker ps -s’ and what ‘virtual’ keyword means”, Issue #1520).
All of these installed Docker containers have a total deployment size of 2,092 MB:
Docker containers (and their size) on a normal Airbyte installation. Image by author.
A hypothetical DeNexus data platform using Airbyte — Image by author
As we have described in our requirements, Inside-Out data will be extracted from our customer’s environments (decentralized extraction) and Outside-In data will be extracted from private, public and proprietary sources (centralized extraction). In this case we would be using Airbyte to manage all our data extractions, so we would need:
Custom connectors
When using Airbyte, to add a new connector it is necessary to follow the structure defined in its CDK (Connector Development Kit). That is, even if our connector covers quite a simple use case, it will always be necessary to structure it in the way defined at the previous link.
Generating an empty connector using the code generator. Image by author.
So, these templates set a number of good practices to follow and also define the structure to be used when adding a new connector to the tool.
Whether this is a good practice that simplifies the development or whether it introduces unnecessary overhead is up to the reader to decide.
“A thing is well-designed if it adapts to the people who use it. For code, that means it must adapt by changing. So, we believe in the ETC principle: Easier to Change. ETC. That is it.” ---The Pragmatic Programmer
According to Airbyte's official documentation, for REST APIs (Application Programming Interfaces), each stream corresponds to a resource in the API. For example, if the API contains the endpoints:
- GET v1/customers
- GET v1/employees
Then you should have three classes:
- `class YYYStream(HttpStream, ABC)`, which is the current class
- `class Customers(YYYStream)`, which contains the behavior to pull data for customers using v1/customers
- `class Employees(YYYStream)`, which contains the behavior to pull data for employees using v1/employees
All “streams” must be defined as classes, and we must represent the structure of their data with a JSON schema. What happens if a single connector has more than X streams with a changing structure? We will have to modify the code as many times as that structure changes (if we use the dynamic schema definition) or update the JSON files in which those schemas are defined (if we use the static schema definition).
If a stream has more than 150 fields and we want to have all of that data available in our DeNexus Trusted Ecosystem (for example, because we do not yet know the possible use cases we may have for that data in the future), we will have to define a schema for each “stream”, and if one of those fields changes, our schema will also have to change.
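To make that overhead concrete, here is a minimal sketch of what a single stream can look like with Airbyte's Python CDK; the `Customers` class, the API URL, and the fields are hypothetical, and the method signatures are simplified, so treat it as an illustration of the required structure rather than a working connector:

```python
from typing import Any, Iterable, Mapping, Optional

import requests
from airbyte_cdk.sources.streams.http import HttpStream


class Customers(HttpStream):
    """One class per stream; its JSON schema must be kept in sync with the source."""

    url_base = "https://api.example.com/v1/"  # hypothetical API
    primary_key = "id"

    def path(self, **kwargs) -> str:
        return "customers"  # maps this stream to GET v1/customers

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        return None  # no pagination in this sketch

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
        yield from response.json()

    def get_json_schema(self) -> Mapping[str, Any]:
        # Static schema: every field exposed by the source (possibly 150+ of them)
        # has to be declared here and updated whenever the source structure changes.
        return {
            "$schema": "http://json-schema.org/draft-07/schema#",
            "type": "object",
            "properties": {
                "id": {"type": "string"},
                "name": {"type": ["null", "string"]},
                # ... one entry per remaining field
            },
        }
```

Multiply this by every stream whose structure keeps changing and the maintenance cost of those schemas becomes clear.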
Why we decided not to use Airbyte
The approach we follow is different because, as we have seen above, we implement the ELT paradigm: we extract all the data available by default and make it accessible in our DeNexus Trusted Ecosystem. By doing so, we can later define processes that read and transform (T) ONLY the fields we are interested in, and we will NOT have to modify our pipelines or processes unless one of those fields changes.
Spark SQL can automatically infer the schema of a JSON dataset and load it as a `Dataset[Row]`
Spark's schema inference is powerful because it allows us to delegate that work. Of course, we can run into type conflicts, but we do not have to deal with and resolve those conflicts every time we load data into our DeNexus Trusted Ecosystem; instead, we can resolve them only when they really affect our processes or use cases, in other words, when we really use that data.
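A minimal sketch of that delegation, assuming PySpark and an illustrative path and field names (not our actual layout):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-inference").getOrCreate()

# Spark samples the JSON records and infers the schema for us: no hand-written
# JSON schema has to be maintained when the source adds or renames fields.
inferred_df = spark.read.json("s3://example-raw-zone/source_x/")
inferred_df.printSchema()

# Type conflicts (e.g. a field that arrives sometimes as a string and sometimes
# as a number) only need to be resolved for the fields a use case actually reads.
subset_df = inferred_df.selectExpr("cast(site_id as string) as site_id", "event_ts")
```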
In the previous sections, we explained that DeNexus used Airbyte in this review exercise because it represents the best data integration tool of its kind in the market, and we wanted to see whether it could fit our needs and requirements. While some of the following points are specific to Airbyte, they are also applicable to the other data integration / ETL tools in the marketplace:
Conclusions
“The success formula: solve your own problems and freely share the solutions.” ---Naval Ravikant
This article has described some of the benefits of using data integration / ETL tools to standardize, centralize, and avoid reinventing the wheel in data pipelines; but it has also described why these tools are not a good fit for every use case.
When to use a data integration/ETL tool?
When not to use a data integration/ETL tool?
Due to our extensive knowledge of cyber threats, thanks to the DeNexus Knowledge Center, we know the sensitivity of the data we are extracting from our customers, so we enforce highly restrictive requirements before using that data in our Data Lake (to improve our models, to calculate our customers' cyber-risk, or to offer them a list of suggested mitigations to implement and decrease their cyber-risk):
Click here to see how DeNexus' use of data integration tools helps our customers with cyber risk quantification and cyber risk mitigation.
If you would like to see DeRISK in action, click here.