Data warehousing and analytics are undergoing an intense metamorphosis, driven by a rapid increase in the quantity and size of data. Data volume is growing exponentially in nearly every aspect of everyday life, including the business world, stimulated by a variety of factors. Nowadays, the decision process in every business unit of an enterprise rests on data analysis and sound data management. Data-driven decisions accompany your every step.
For businesses to manage their massive amounts of data effortlessly and effectively while keeping security at a high level, moving to cloud technology is essential. This realization is one of the forces shaping current trends in the data world. Managing data and controlling data flows is becoming more and more important, and companies are investing in cloud data warehouses more than ever before in order to handle their data and make the most of their valuable data assets.
Identifying the optimal cloud DWH is a challenge
The question of which cloud DWH is suitable for a company is a central topic for many of our customers. Drawing on our extensive hands-on experience with various DWHs, we have conducted an in-depth analysis of the functions, capabilities and suitability of various providers. In this article we share the results of our comparison between Amazon's data warehouse (AWS Redshift) and the rising technology Snowflake.
Snowflake and Redshift are among the leaders in the field of cloud data warehouses. Both offer solutions to similar problems: they shift the focus in decision making towards data-based and data-driven decisions. Redshift is part of the larger cloud-computing platform Amazon Web Services. Snowflake is a fast-growing software company that managed to separate storage from compute and enable independent scaling before Google, Amazon, and Microsoft did.
More scalability and performance in the cloud
Although each has unique features, there are significant commonalities. Let us look at four key features they have in common:
Scaling and performance
First of all, both offer independent scaling by separating storage and compute. This separation gives a company far more flexibility to match resources to its actual data needs. Vertical and horizontal scaling make it possible to provision the right number, size, and type of resources and to avoid the extra costs of running an oversized data center. Moreover, both offer concurrency scaling. This feature allows multiple users to run parallel queries with fast, consistent query performance: resources are added to clusters as required, providing extra processing power during peak times and withdrawing it in quieter moments.
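As a minimal sketch of what concurrency scaling looks like on the Snowflake side (account, warehouse name and sizes are placeholders, and multi-cluster warehouses assume an edition that supports them), a warehouse can be told to add clusters under load and suspend itself when idle:

```python
# Sketch: configure a Snowflake multi-cluster warehouse for concurrency scaling.
# Account, credentials and warehouse name are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="analyst",
    password="***",
)

# Snowflake adds clusters (up to MAX_CLUSTER_COUNT) when concurrent queries
# start to queue and removes them again when the load drops.
conn.cursor().execute("""
    CREATE WAREHOUSE IF NOT EXISTS reporting_wh
      WAREHOUSE_SIZE = 'MEDIUM'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 4      -- extra clusters only during peak load
      SCALING_POLICY = 'STANDARD'
      AUTO_SUSPEND = 300         -- suspend compute after 5 idle minutes
      AUTO_RESUME = TRUE
""")
conn.close()
```

On Redshift, the comparable knob is the concurrency scaling setting in the cluster's workload management configuration rather than a warehouse definition.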
ETL tool and BI tool connection
Second, both integrate with a wide range of ETL and BI tools, and SQL is the primary language on both. This wide variety offers the flexibility to choose the most powerful, useful and appropriate tool for each job. The range of use cases in the business world is so broad that it is unrealistic to cover all of them with a single solution; where companies try, it generally leads to additional costs and poor adoption. The broader the tool ecosystem a DWH supports, the easier it is to match each use case with the right tool.
Native streaming
Third, neither offers native streaming functionality; however, both are able to micro-batch. Micro-batch processing is the practice of collecting data in small groups ("batches") for the purpose of acting on (processing) that data. Contrast this with traditional batch processing, which usually implies acting on a large group of data. Micro-batch processing is a variant of traditional batch processing in which processing occurs more frequently, so that smaller groups of new data are processed. In both cases, data is collected based on a predetermined threshold or frequency before any processing occurs.
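To make the pattern concrete, here is a rough, platform-neutral sketch of a micro-batch loader (thresholds, the table and the `load_batch` target are illustrative placeholders, not an API of either product): records are buffered until a size or time threshold is reached and then flushed to the warehouse in one go.

```python
# Sketch: a generic micro-batch loader. Thresholds and the load_batch()
# target are illustrative placeholders.
import time

BATCH_SIZE = 500          # flush after this many records ...
BATCH_SECONDS = 60        # ... or after this many seconds, whichever comes first

def load_batch(records):
    """Placeholder: e.g. stage the records and run COPY INTO / INSERT on the DWH."""
    print(f"loading {len(records)} records")

def micro_batch(stream):
    buffer, last_flush = [], time.monotonic()
    for record in stream:
        buffer.append(record)
        if len(buffer) >= BATCH_SIZE or time.monotonic() - last_flush >= BATCH_SECONDS:
            load_batch(buffer)
            buffer, last_flush = [], time.monotonic()
    if buffer:                # flush whatever is left when the stream ends
        load_batch(buffer)
```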
Data security
Finally, both DWHs keep data confidential. Both offer a high level of security with industry-leading features based on strong encryption. In Snowflake, the security characteristics are tied to its editions, so the level of protection can be matched to the data. There are four editions: Standard Edition, Enterprise Edition, Business Critical Edition, and Virtual Private Snowflake (VPS). The Standard Edition is the introductory level with a standard degree of protection. The Enterprise Edition offers the same features as the Standard Edition plus features particularly required by large-scale companies. With the Business Critical Edition, you get a high level of protection for your most sensitive data, for example data that must comply with HIPAA and HITRUST CSF regulations. The last edition, VPS, provides the highest level of security and is recommended for companies with strict regulatory requirements, such as the financial or banking sector.
The decisive power of individual criteria
As we have seen, Redshift and Snowflake have many features in common. For customers, however, the differences play the decisive role. Our analysis identified differences in the following aspects:
- Auto-Scaling
- Pricing models
- Maintenance
- Level of automation
- Integrations
1. Auto-Scaling
Redshift doesn’t scale up and down fully automatically: adding or removing nodes means resizing the cluster, which can take anywhere from minutes to hours depending on cluster size. Snowflake, with its auto-scaling feature, scales seamlessly without delay or interruption.
If many queries with highly variable query plans have to be executed in a short period of time, Snowflake is the option that provides higher performance. If, on the other hand, you have simpler workloads and consistently long usage patterns, Redshift will perform very well. Another decisive factor in selecting a cloud DWH is the nature of the queries: for complex queries or big data analytics, Snowflake tends to be the better fit thanks to its more sophisticated SQL dialect.
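To make the contrast tangible, the sketch below (cluster identifier, node type and warehouse name are hypothetical) shows that scaling Redshift is an explicit resize operation you trigger and wait for, while resizing a Snowflake warehouse is a single statement that takes effect almost immediately.

```python
# Sketch: manual scaling in Redshift vs. a one-statement change in Snowflake.
# All identifiers are placeholders.
import boto3

redshift = boto3.client("redshift", region_name="eu-central-1")

# Redshift: scaling means resizing the cluster; the operation runs in the
# background and can take minutes to hours depending on data volume.
redshift.resize_cluster(
    ClusterIdentifier="analytics-cluster",
    NodeType="ra3.4xlarge",
    NumberOfNodes=4,
    Classic=False,             # elastic resize where supported
)

# Snowflake: resizing a warehouse is a single statement, executed e.g. via the
# Python connector shown earlier.
RESIZE_SQL = "ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';"
```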
2. Pricing structure
One of the main differences between the two platforms is the pricing structure. Snowflake uses a time-based, usage-driven model: compute is billed for the time warehouses actually spend executing queries, and storage is charged separately from compute.
Redshift, in contrast, calculates the price per node per hour, covering storage and computational power together. Customers pay per cluster or for a particular amount of capacity. With the on-demand pricing model, costs can be lowered by using the pause and resume actions: while a cluster is paused, on-demand compute billing stops, and it resumes on a per-second basis once the cluster is running again. Pause and resume can be automated and scheduled according to the customer's needs, which can save a noticeable amount of money. If you choose the on-demand pricing model, Redshift can therefore come out cheaper by virtue of its pause and resume feature.
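As a small sketch of the pause-and-resume approach (the cluster identifier is a placeholder, and in practice these calls would typically be triggered by a scheduler rather than run by hand):

```python
# Sketch: stop on-demand compute billing for a provisioned Redshift cluster
# outside business hours. The cluster identifier is a placeholder.
import boto3

redshift = boto3.client("redshift", region_name="eu-central-1")

def pause_for_the_night(cluster_id="analytics-cluster"):
    # While paused, only storage is billed; compute billing stops.
    redshift.pause_cluster(ClusterIdentifier=cluster_id)

def resume_in_the_morning(cluster_id="analytics-cluster"):
    # Billing resumes on a per-second basis once the cluster is available again.
    redshift.resume_cluster(ClusterIdentifier=cluster_id)
```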
3. Maintenance
How fast and easy maintenance is depends largely on whether compute and storage can operate independently. With Redshift, users share the same cluster and compete for the available resources, and queries have to be managed via WLM queues, which adds operational workload. WLM (workload management) is a flexible mechanism for setting priorities within workloads so that fast-running queries are processed ahead of long-running ones and are not unnecessarily delayed.
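To give an idea of what that workload looks like, the sketch below (parameter group name, queue names and values are hypothetical, and the exact set of supported keys should be checked against the WLM documentation) defines two manual WLM queues so that short dashboard queries are not stuck behind long-running ETL jobs:

```python
# Sketch: define manual WLM queues for a Redshift cluster so that short
# queries get their own slots. Names and values are placeholders.
import json
import boto3

wlm_config = [
    {"query_group": ["dashboards"], "query_concurrency": 5,
     "memory_percent_to_use": 30},            # fast, interactive queries
    {"query_group": ["etl"], "query_concurrency": 2,
     "memory_percent_to_use": 70},            # long-running batch loads
]

boto3.client("redshift", region_name="eu-central-1").modify_cluster_parameter_group(
    ParameterGroupName="analytics-params",
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)
```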
With Snowflake, multiple virtual warehouses of different sizes can be started smoothly and can all query the same data without duplicating or copying it. Each warehouse can thus be dedicated to a specific set of tasks and assigned to a particular user or team.
4. Level of automation
Snowflake is operated as a service. There is no hardware to manage: customers simply connect to the service, set up their data, and run queries.
Redshift, by contrast, requires configuration. For a Redshift implementation, clusters have to be configured and the allocation of storage and compute has to be set up by the appropriate experts in the engineering team.
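The difference shows up in the very first steps. A hedged sketch (all identifiers, credentials and sizes are placeholders): Redshift needs a cluster to be provisioned and sized up front, whereas Snowflake only needs a connection before queries can run.

```python
# Sketch: first steps on both platforms. All names and credentials are placeholders.
import boto3
import snowflake.connector

# Redshift: provision and size a cluster before anything can be queried.
boto3.client("redshift", region_name="eu-central-1").create_cluster(
    ClusterIdentifier="analytics-cluster",
    NodeType="ra3.xlplus",
    ClusterType="multi-node",
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="***",
    DBName="sales",
)

# Snowflake: connect to the service and run a query; compute is chosen per session.
conn = snowflake.connector.connect(
    account="my_account", user="analyst", password="***",
    warehouse="reporting_wh", database="sales", schema="public",
)
for row in conn.cursor().execute("SELECT CURRENT_VERSION()"):
    print(row)
conn.close()
```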
5. Connectors and APIs
The most relevant connectors and APIs for Snowflake are:
- ODBC / JDBC access via drivers
- Access via the Spark plugin (spark-snowflake), including Spark v3
- Access via Kafka
- Python / Node.js / Go / .NET / PHP PDO drivers
- Native connectors (e.g. Python, Spark)
- Third-party connectors for connecting applications such as ETL tools (e.g. Informatica) and BI tools (e.g. ThoughtSpot) to Snowflake
- The SQL API (RESTful API) and Snowpark for programmatic access
As for Redshift, it offers the best integration with Amazon's suite of cloud services such as Kinesis Data Firehose, SageMaker, EMR, Glue, DynamoDB, Athena, Database Migration Service (DMS), Schema Conversion Tool (SCT) and CloudWatch. The most frequently used connectors and APIs are:
- ODBC / JDBC via AWS-provided drivers
- The Redshift user interface in the AWS console (for some node types)
- The new Data API
- psql (the Postgres command-line tool)
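As an example of the newer Data API route (cluster, database, user and table names are hypothetical), a query can be submitted over HTTPS without managing a JDBC/ODBC connection:

```python
# Sketch: run a query through the Amazon Redshift Data API. All identifiers
# are placeholders.
import time
import boto3

data_api = boto3.client("redshift-data", region_name="eu-central-1")

resp = data_api.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="sales",
    DbUser="analyst",
    Sql="SELECT region, SUM(revenue) FROM orders GROUP BY region",
)

# The Data API is asynchronous: poll until the statement has finished.
while data_api.describe_statement(Id=resp["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

for record in data_api.get_statement_result(Id=resp["Id"])["Records"]:
    print(record)
```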
Snowflake can be hosted on Amazon Web Services (AWS), Google Cloud Platform (GCP) and Microsoft Azure. Even when hosted on AWS, it requires a bit more effort to integrate with typical Amazon services such as Kinesis, Glue and S3. Nevertheless, both DWHs have a rich portfolio of integrations.
For customers using various tools from different environments, Snowflake will be a good choice. If an enterprise's data stack consists mainly of AWS services, Redshift is more suitable from an integration point of view.
Conclusion
Data warehousing is an aspect of data management whose importance should not be underestimated. Between Redshift and Snowflake, you cannot go wrong with either: both are strong DWHs with well-developed features. The decisive factor is your organization's needs. From our experience, it is essential to start with an examination of the status quo: What is the level of expertise? What do the workload patterns look like? How are queries designed? How are tasks and work structured and distributed within the team? The next step is to define your aim and scope: What do you want to achieve in the short and long term? What is realistic? What can you invest (time, money)?
After setting the goal, you need to clarify the resources required to achieve it: Which functions and which expertise are necessary? Then select a tool according to the identified needs and start with a small pilot/PoC. The final step is the strategic planning of the implementation or migration. Our tool provides great value in this initial process and in all other phases of working with cloud DWHs.