Best practices for open source data catalogs

by
Mike Linthe
March 18, 2022

Today there is a suitable solution for almost every data challenge, and a growing number of them are open source (OS) tools. Although some of these tools (such as Linux, Python and Postgres) are now part of almost every tech stack, many organizations still have concerns about using other OS solutions. Typical concerns are:

  • security
  • the effort required for implementation and updates
  • a lack of experience with maintenance
  • the long-term availability of the solution
  • the lack of a central contact person

These are the same concerns that exist (or should exist!) with any other software implementation. In contrast to commercial solutions, however, OS solutions often do not receive a critical review at all and are wrongly knocked out early. Yet OS solutions can bring many advantages. Some of them are:

  • Low or no licensing costs
  • More flexibility and better possibilities for individualization
  • Faster development and better compatibility with new solutions
  • Free exchange of best practices in the community

To make sure you are well prepared for your tool evaluation and the discussions around it, we present our best practices for the introduction and maintenance of OS data catalogs below. With their help, you will be able to realistically assess how compatible an OS data catalog is with your processes and how much additional work it will mean.

Central best practices for OS tools

Initial evaluation

Every tool procurement starts with an evaluation of the tool. The questions to answer are:

  • To what extent does the tool meet your requirements?
  • Does it fit into the tool stack? 
  • Is the solution trustworthy, stable and likely to be on the market for the long term?

____________________________________________________________________________________________

Not all OS projects are truly suitable for enterprise use. For more certainty about trustworthiness, stability and sustainability, we recommend checking the following four core criteria.

1. Size of the project community: Project communities vary greatly in size. They can consist of a single person or of tens of thousands of members. Accordingly, communities also vary in organization and structure. For applications in an enterprise context, look for a large, well-organized community. This also makes it more likely that the tool will exist in the long term.

Our tip: Typically, communities organize themselves either in forums or modern communication platforms like Slack. These are also good starting points for first steps and questions.

2. Published software development process: Large projects usually follow a defined software development process, which is typically documented in the community guidelines. A cleanly set up development process is another important criterion.

3. Speed and frequency of improvements: Since the code is freely accessible, potential risks and vulnerabilities in projects with large communities (and many pairs of eyes) are usually identified and fixed quickly. Use the code history to get an impression of which problems were found and how quickly they were fixed. In most cases, the shorter the time between report and fix, the better for security.

4. Adoption of the solution: If the solution is already in use at other companies, this is a good indication of its suitability and at the same time another driver for its long-term availability. After all, these companies also have an interest in continuous further development.

____________________________________________________________________________________________
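
The third criterion in the box above, the speed of fixes, can be estimated with very little code. The following is a minimal sketch, not a definitive method: the issue dates are made-up sample data, and the function name `median_days_to_fix` is hypothetical. In practice, the dates would have to be taken from the project's issue tracker or changelog.

```python
from datetime import date
from statistics import median

# Hypothetical sample data: (date reported, date fixed) for resolved issues.
# Real values would come from the project's issue tracker or changelog.
resolved_issues = [
    (date(2022, 1, 3), date(2022, 1, 5)),
    (date(2022, 1, 10), date(2022, 1, 11)),
    (date(2022, 2, 1), date(2022, 2, 15)),
]


def median_days_to_fix(issues):
    """Return the median number of days between report and fix."""
    if not issues:
        raise ValueError("no resolved issues to evaluate")
    return median((fixed - reported).days for reported, fixed in issues)


print(median_days_to_fix(resolved_issues))  # prints 2
```

The median is used here instead of the mean so that a single long-running issue does not distort the picture.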

In the case of open source solutions, two questions need to be examined particularly carefully in the evaluation step:

  • Does the tool fit the expected user group and their skills? 
  • Does the tech team have the required technical background knowledge? 

For more tips, see the articles "Advantages and disadvantages of an open source data catalog" and "Open source data catalogs for large enterprises?"

Transparent communication

There is no denying that commercial and OS solutions differ in some respects. You should clearly communicate the advantages and disadvantages to all stakeholders. In our experience, it pays to invest more time in clarification at the beginning of a project. A veto that comes late in the process because stakeholders were not well informed is annoying at best.

Planning

It is worthwhile to establish a tool management system. This does not have to be an elaborate tool; a simple inventory is enough. It not only helps to maintain an overview, but also makes maintenance processes easier to manage.
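
A simple inventory of this kind could look like the following sketch. All field names, tool names, versions and the 90-day threshold are illustrative assumptions, not a recommendation from the text above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ToolEntry:
    """One row of a simple tool inventory (fields are illustrative)."""
    name: str
    version: str
    owner: str
    license: str
    last_updated: date


# Hypothetical example entries.
inventory = [
    ToolEntry("example-catalog", "1.4.0", "data-platform-team",
              "Apache-2.0", date(2022, 1, 20)),
    ToolEntry("example-scheduler", "2.2.3", "data-platform-team",
              "Apache-2.0", date(2021, 9, 2)),
]


def stale_tools(entries, today, max_age_days=90):
    """Return names of tools not updated within max_age_days."""
    return [
        entry.name
        for entry in entries
        if (today - entry.last_updated).days > max_age_days
    ]


print(stale_tools(inventory, today=date(2022, 3, 18)))
# prints ['example-scheduler']
```

Even a list this small answers the two questions that matter most for maintenance: who owns a tool and when it was last updated.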

If broader use of open source solutions is planned, it is advisable to address deployment guidelines early on. The guidelines provide the framework for various questions: In which areas can OS tools be used and in which not? What is the approval process for them? What is the procedure for proofs of concept (POCs)? Which setups are possible and which are not? In our experience, companies should not set these guidelines in stone, but revise them regularly. OS tools have the advantage of a high pace of innovation, and after all, companies want to benefit from this, not block it internally.

Deployment and updates

Typically, open source tools are deployed using a CI/CD (continuous integration and continuous delivery or deployment) approach. Deployment is often to a Kubernetes cluster. Helm, the package manager for Kubernetes, has proven itself very well for this.

New releases are usually available at a much higher frequency than with commercial, proprietary solutions, e.g. once a month. Each release is packaged as a Docker image and hosted in a container registry.

Once a company has set up the release process cleanly, around 95% of it can be automated. The whole process then takes only a few minutes.
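
As a rough illustration of what such an automated step might contain, the sketch below assembles a `helm upgrade --install` command for a release. The release name, chart reference and namespace are hypothetical, and the `image.tag` value is an assumption about the chart: which values a chart accepts depends on the chart itself. The function only builds the command; a pipeline would then execute it.

```python
import shlex


def build_helm_upgrade_command(release, chart, namespace,
                               chart_version, image_tag):
    """Assemble the argument list for a Helm upgrade (not executed here)."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace,
        "--version", chart_version,         # version of the Helm chart
        "--set", f"image.tag={image_tag}",  # assumes the chart has this value
        "--wait",                           # wait until resources are ready
    ]


# Hypothetical names, for illustration only.
command = build_helm_upgrade_command(
    release="data-catalog",
    chart="example-repo/data-catalog",
    namespace="data-tools",
    chart_version="1.5.0",
    image_tag="1.5.0",
)
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids quoting problems when the pipeline passes it to a process runner.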

The following applies to updates: install them as often as possible, but wait about two weeks after the release date so that early bugs in a new release can surface and be fixed first. This supports both stability and security. It is also advisable to read the release notes carefully and adjust individual elements if necessary.
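
The waiting rule can be expressed as a small selection function. This is a sketch under assumptions: the version numbers and dates are made up, the function name `newest_settled_release` is hypothetical, and the 14-day default simply mirrors the "about two weeks" mentioned above.

```python
from datetime import date, timedelta


def newest_settled_release(releases, today, waiting_days=14):
    """Return the most recent release that is at least waiting_days old.

    releases maps a version string to its release date.
    Returns None if no release is old enough yet.
    """
    cutoff = today - timedelta(days=waiting_days)
    settled = [
        (released, version)
        for version, released in releases.items()
        if released <= cutoff
    ]
    if not settled:
        return None
    # max() compares the release date first, so the latest date wins.
    return max(settled)[1]


# Hypothetical release history.
releases = {
    "1.4.0": date(2022, 2, 1),
    "1.5.0": date(2022, 3, 1),
    "1.6.0": date(2022, 3, 10),
}
print(newest_settled_release(releases, today=date(2022, 3, 18)))
# prints 1.5.0
```

Here version 1.6.0 is skipped because it is only eight days old on the chosen date; it would be picked up by a later run.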

Tests

Make sure you test key features before the update goes into production. In most cases, organizations have separate development and production environments as well as clearly defined procedures and documentation guidelines. If yours does not, we highly recommend setting this up now. Automating tests also reduces the workload of IT staff, especially in the long term.

In general, the scope of tests to be performed always depends on the application, the surrounding infrastructure and the function. The greater the risk in the event of a malfunction, the more meticulous the testing should be. If your company relies on CI/CD, as recommended under "Deployment and updates", the amount of testing required is usually manageable.
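
An automated check of key features before an update goes into production could be sketched as follows. The three checks are hypothetical stubs; real checks would call the tool in the development or staging environment. One stub deliberately fails to show how a failure blocks the rollout.

```python
def run_smoke_tests(checks):
    """Run each named check and return the names of those that failed.

    A check fails if it returns a falsy value or raises an exception.
    """
    failures = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            passed = False
        if not passed:
            failures.append(name)
    return failures


# Hypothetical stand-ins for checks against a non-production environment.
def search_returns_results():
    return True


def dataset_page_loads():
    return True


def login_works():
    raise ConnectionError("environment unreachable")  # simulated failure


checks = {
    "search returns results": search_returns_results,
    "dataset page loads": dataset_page_loads,
    "login works": login_works,
}

failures = run_smoke_tests(checks)
ready_for_production = not failures
print(failures)              # prints ['login works']
print(ready_for_production)  # prints False
```

The list of checks can grow with the risk: a tool whose failure would be costly gets more checks, a low-risk tool only a few.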

Conclusion

You will have noticed that these best practices a) are not rocket science and b) can be transferred very well to all OS tools and to many areas of commercial software. However, depending on how software deployment processes are currently set up in your company, an OS tool may well require you to modernize current procedures. Our experience is that it is worthwhile to modernize these processes now. Tool development cycles are getting shorter and data architectures more individual. Consider the introduction of an OS tool as an opportunity to enable innovation and to try out future ways of working.

Are you interested in more information? Let's talk about your current topics and challenges!