This article is different from the previous ones because we are not describing why we chose a tool/technology, instead we are describing why we discarded another one: We are not trying to sell any tool/technology, nor is it a detailed tutorial on how we implement a solution, instead it is the summary of the analysis performed on the available ETL (Extract, Transform, and Load) tools to see if their implementation made sense given our use cases, driven by the sensitive nature of our client's data.
Extract, transform, load tools are software packages that facilitate performing ETL tasks. There is a lot of hype in the data engineering world about these kinds of tools, but are they really going to replace most of the data engineering work? Here are some good reasons why it could be a clever idea to use them:
Using these tools makes a lot of sense in terms productivity: if more than X people have a need (i.e., extract data from Google Analytics) and each of those people implements a different solution (i.e., a Python script) then X-1 resources have been lost.
Benefits of scale + specialization.
It is hard to find great data engineers so it could be a clever idea to pay for tools that already solved your problems and dedicate data engineers to do something that brings more value
API (Application Programming Interface) providers change their spec/standard, sometimes with notice, sometimes without notice and these tools maintain compatibility with many popular data sources (and destinations) and apply the needed fixes in case of changes in these source systems.
At the end of the day, what is better? Paying a third party to free up the time of your data engineers, so they can devote themselves to more value-added tasks? Or hire more data engineers (a more expensive solution even if you do not have to pay a third party) but have your custom pipelines where everything can be configurable because your team has developed the code?
But what if our use cases are not based on known sources (connectors), does it still make sense to use these solutions? Does it make sense to implement the standard and all the overhead of them when its main advantages (pre-defined connectors maintained by the community…) are not going to be taken advantage of? Too soon for that, let us jump first to our requirements.
We are following the ELT (Extract Load Transform) paradigm, in other words, we are not transforming the data when we first store it in our data lake. So, we are not going to modify the RAW data, instead, we are going to create different versions of it according to the different use cases for internal and external (customers) stakeholders. This means that we are not going to use these ETL tools to process data (This will be done at a more advanced level of our data platform), so all the possible data processing functionalities of these tools are of no interest to us.
The requirements of our DeNexus Trusted Ecosystem are specific, combining data extracted for each client (Inside-Out) with data taken from different external public, private and proprietary sources (Outside-In).
Data cannot leave the customer environment without crypto methodologies such as encryption and in some cases anonymization, so the extraction process must be done in the customer environment.
All sensitive customer data will be de-identified and handled in a secure manner. For more information, visit our DeNexus SOC2 Type 2 blog.
As some data extraction processes must be performed in the customer environments, the size of these deployments must be minimum.
If data is not customer-proprietary (ex: public sources) it could be extracted using centralized tools and it could leave the region in which they were produced.
There is a common structure for all the data extraction processes.
It will speed up the development of new connectors and it will help in the on-boarding of new hires. Furthermore, it will help us follow the DRY principle.
For more information about how DeRISK is building the global standard in industrial cyber risk quantificationclick here.
In the ETL Tools section, we have seen why it can be a clever idea to use Data Integration / ETL tools and several comparisons between the existing options in the market. As to discuss their differences is out of this article's scope, we will focus on the rest of the article, on the tool we consider the best option available: Airbyte.
Plenty of already developed connectors (Increasing number)
Easy to on-board new custom connectors (it enforces a common structure).
Motorization + a nice UI.
It could avoid the need for data engineers in the first stages of your data platform or if your use cases are common ones, thus making it easy to follow
Airbyte architecture — Image by Airbyte
First, we are not able to use Airbyte cloudbecause it does not meet one of our security requirements: “Inside-Out extractions have to be performed on the customer environments.”
Therefore, we could only use its Open Source Version.
All these Docker containers installed have a total deployment size of 2092 Mb:
Docker containers (and their size) on a normal Airbyte installation. Image by author.
A hypothetical DeNexus data platform using Airbyte — Image by author
As we have described in our requirements, Inside-Out data will be extracted from our customer’s environments (decentralized extraction) and Outside-In data will be extracted from private, public and proprietary sources (centralized extraction). In this case we would be using Airbyte to manage all our data extractions, so we would need:
Same number of Airbyte deployments as our number of customers.
Another Airbyte deployment to manage the centralized data extraction (Outside-In).
When using Airbyte, to add a new connector it is necessary to follow the structure defined in its CKD (Connector Development Kit). That is, even if our connector is quite a simple use case, it will always be necessary to structure it in the way defined on the previous link.
Generating an empty connector using the code generator. Image by author.
So, these templates are setting a set of good practices to follow and are also defining the structure to be used while adding a new connector to their tool.
Whether this is a good practice that simplifies the development or whether it introduces unnecessary overhead is up to the reader to decide.
“A thing is well-designed if it adapts to the people who use it. For code, that means it must adapt by changing. So, we believe in the ETC principle: Easier to Change. ETC. That is it. ---The Pragmatic Programmer
According to AirByte's official documentation, for REST APIs (Application Programming Interface), each stream corresponds to a resource in the API. For example, if the APIcontains the endpoints: - GET v1/customers - GET v1/employees Then you should have three classes: - class YYYStream(HttpStream, ABC)` which is the current class - class Customers(YYYStream)` contains behavior to pull data for customers using v1/customers - class Employees(YYYStream)` contains behavior to pull data for employees using v1/employees
All “streams” must be defined as a class, and we must represent with a JSON_SCHEMA the structure of that data. What happens if from a single connector we have more than X streams with a changing structure? We will have to modify the code as many times as that structure changes (if we use the dynamic schema definition) or the JSON files in which these schemas are defined (if we use the static schema definition).
If a stream has more than 150 fields and we want to have all the data available in our DeNexus Trusted Ecosystem (for example, because we do not know the possible use cases that we can give to that data in the future) we will have to define a schema for each “stream” and if one of those fields' changes, our schema will also have to change.
Why we decided not to use Airbyte?
The approach we follow is different because, as we have seen above, we implement theELT paradigm, in other words, we extract all the data available by default and make it accessible in our DeNexus Trusted Ecosystem. By doing so, we can later define processes that will read and transform(T) ONLY the fields we are interested in, and we will NOT have to modify our pipelines or processes unless one of those fields' changes.
Spark SQL can automatically infer the schema of a JSON dataset and load it as a `Dataset[Row]`
Spark’s Schema inference is so powerful because it allows us to delegate that work. Of course, we can have conflicts in the types, but we do not have to deal with and resolve those conflicts every time we load data to our DeNexus Trusted Ecosystem, instead we can solve them only when they really affect our processes or use cases. In other words, when we really use that data.
In the previous section, we explained that DeNexus used Airbyte in this review excersise because it represents the best data integration tool of its kind, in the market to see if it could fit our needs and requirements. While some of the next points are specific to Airbyte , they are also applicable to the other data integration/ ETL tools in the marketplace:
Airbyte does not solve all the data extraction use cases, of course, we could create a connector with that functionality but why reinvent the wheel when there are specific tools to do so?
An Airbyte installation needs 2092 Mb of disk. Every time we would like to onboard a new customer, we would need 2 GB and +5 containers in that customer environment just to be able to start extracting data (i.e., to extract data from a simple API).
Strict rules to add new connectors. It is not clear how to add non-core data pipelines (i.e., how to add as a connector a web scrapping process).
We depend on third parties as we are using their suite.
They could change them pricing model and stop maintaining the open-source version.
A lot of work maintaining, deploying, and managing Airbyte by the DevOps team. Lots of components without use in our current model that we must deploy and maintain. Updates could be a nightmare due to the number of interrelations between components. More resources, more points of failure, and more code to maintain.
Actual version: v0.39.32-alpha.
What happens when a connector is not valid anymore because the source API (or equivalent way of extracting data) has changed? Do we have to wait until they implement the changes and release the latest version with the fix? Or will we have to develop our own connector in the meantime? They are not offering any kind of SLA.
“The success formula: solve your own problems and freely share the solutions.” ---Naval Ravikant
This article has described some of the benefits of potentially using data integration / ETL tools to standardize, centralize, and avoid reinventing the wheel in data pipelines; but it has been also described why these tools would not be a good fit for all the use cases.
When to use a data integration/ETL tool?
Your data sources are standard and well-know
You do not expect scalability to be a key factor in your data pipelines.
You have a centralized server from which to execute your data pipelines.
Your team has few Data Engineers.
When not to use a data integration/ETL tool?
Your data sources are not so standard.
Scalability is a key factor in your data pipeline.
Data pipelines are executed from different environments.
You have enough experienced Data Engineers in your team.
Your data security standards are highly restricted. Source, transit and at rest
Due to our extensive knowledge of cyber threats, thanks to the DeNexus Knowledge Center, we know the sensitivity of the data that we are extracting from our customers, so we implement highly restrictive requirements before using that data in our Data Lake (to improve our models, to calculate our customer’s cyber-risk or to offer them a list of suggested mitigations to implement and decrease their cyber-risk):
All customer data is encrypted before leaving their site.
Each customer has its own uniqueencryption key, created and managed with AWS Key Management Server(KMS).All customer data is de-identified before making it available in our DeNexus Trusted Ecosystem.
Click here to see how DeNexus' use of data integration tools helps our customers with cyber risk quantification and cyber risk mitigation.
If you would like to see DeRISK in action, click here.