Monitor Amazon Redshift with Sumo Logic
Proactively monitor, identify, and resolve issues faster
August 13, 2019
In this blog series, we will cover how Amazon Redshift and Sumo Logic deliver best-in-class data storage, processing, analytics, and monitoring. In this first post, we will discuss how Amazon Redshift works and why it is the fastest-growing cloud data warehouse in the market, used by over 15,000 customers around the world.
When an organization gains traction, the volume of data that needs to be stored, monitored, and analyzed grows rapidly. On traditional data warehouses, queries start taking longer, and the data becomes difficult to manage.
With the rise of cloud computing, the need for warehousing solutions that can scale up for the increasing demands of data storage and analysis has been apparent, resulting in organizations looking for alternatives to traditional on-premise warehousing.
AWS’s Amazon Redshift is a direct response to this demand.
Amazon Redshift (also known as AWS Redshift) is a fully managed, petabyte-scale, cloud-based data warehouse product designed for storing and analyzing large data sets. It is also used to perform large-scale database migrations.
Redshift’s column-oriented database is designed to connect to SQL-based clients and business intelligence tools, making data available to users in real time. Based on PostgreSQL 8.0.2, Redshift delivers fast performance and efficient querying that help teams make sound business analyses and decisions.
AWS Redshift is a data warehouse product built by Amazon Web Services. It's used for large scale data storage and analysis, and is frequently used to perform large database migrations.
Each Amazon Redshift data warehouse contains a collection of computing resources (nodes) organized in a cluster. Each Redshift cluster runs its own Redshift engine and contains at least one database.
Redshift is Amazon’s analytics database, designed to crunch large amounts of data as a data warehouse. Those interested in Redshift should know that it consists of clusters of databases with dense storage nodes, and that it even lets you run traditional relational databases in the cloud.
Redshift is a fully managed cloud data warehouse. It has the capacity to scale to petabytes, but lets you start with just a few gigabytes of data. Leveraging Redshift, you can use your data to acquire new business insights.
There is a clear distinction between Amazon S3 and AWS Redshift. While both are Amazon Web Services products, S3 is used for object storage, and AWS Redshift is a data warehouse.
AWS Redshift was designed for online analytical processing (OLAP) and business intelligence (BI) tools. This means any workload that involves complex queries over large datasets is a good fit for Amazon Redshift.
Amazon Redshift is a direct alternative to traditional on-premise data warehouses. Let’s look at how Redshift stacks up to traditional warehousing in the following areas:
Amazon Redshift is best known for its speed. Redshift delivers fast query speeds on large data sets, handling data sizes of a petabyte or more. Processing data at this scale with comparable speed is very difficult to achieve in traditional data warehousing, which makes Redshift a strong choice for applications that run large numbers of queries on demand.
The ability to deliver this level of performance comes with the use of two architectural elements: columnar data storage and massively parallel processing design (MPP). We will delve deeper into these two later.
Amazon Redshift is markedly faster than traditional warehousing, but when it comes to choosing tech solutions, organizations are arguably most concerned about cost.
As a cloud-based solution, Amazon Redshift is able to provide high-level performance affordably. IT executives know that traditional warehousing is extremely costly from the beginning, with the initial outlay for hardware possibly running into the millions of dollars. By contrast, there are no substantial upfront costs to getting set up and started with Redshift. Being a fully managed solution, Redshift has no recurrent hardware and maintenance costs. Database admins can set up data warehouses that handle massive amounts of data without going through the lengthy process of procurement and strategic buy-in from leadership that multi-million-dollar on-premise hardware requires.
Traditional on-premise data warehousing poses a challenge when your data needs grow or shrink.
With traditional warehousing, when an organization's data needs change, it is forced to make another round of costly investments in purchasing and implementing new hardware.
Redshift allows for more flexibility and elastic scale. As your requirements change, Redshift can scale up or down instantly to match your capacity and performance needs with a few clicks in the management console.
Cost-wise, on-demand pricing ensures you only pay for what you use. Not being tied down to expensive hardware and lengthy maintenance contracts means organizations are free to change course without absorbing sunk costs. From a single 160 GB dc1.large node all the way up to multiple 16 TB ds2.8xlarge nodes for a petabyte or more of data, you have access to processing power on demand.
Although Amazon Redshift compares well with traditional warehousing in the areas above, security remains the sticking point for many enterprises, though not because of known security vulnerabilities. The reality is that some organizations are still uneasy about not having their data physically on site.
Amazon treats security as a top priority, knowing that it weighs heavily in decisions about warehousing solutions.
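To make the node sizes above concrete, here is a rough back-of-the-envelope calculation of how many nodes a given data volume would need. It is only a sketch: it assumes decimal units (1 PB = 1000 TB) and counts raw capacity only, ignoring compression, replication, and any space the service reserves. The function name `nodes_needed` is made up for this illustration.

```python
import math

TB_PER_PB = 1000  # assumption: decimal units, 1 PB = 1000 TB


def nodes_needed(data_tb: float, node_capacity_tb: float) -> int:
    """Smallest whole number of nodes whose raw capacity covers data_tb."""
    if node_capacity_tb <= 0:
        raise ValueError("node capacity must be positive")
    # Always provision at least one node, and round partial nodes up.
    return max(1, math.ceil(data_tb / node_capacity_tb))


# One petabyte spread across the 16 TB nodes mentioned in the text.
petabyte_nodes = nodes_needed(1 * TB_PER_PB, 16)

# A small 100 GB data set on a single 160 GB node.
small_nodes = nodes_needed(0.1, 0.16)
```

Under these assumptions a petabyte needs 63 of the 16 TB nodes, while a 100 GB data set fits on one 160 GB node. Real sizing also depends on compression ratios and query workload.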
Amazon follows the shared responsibility model of security where Amazon is responsible for the security of the cloud, and the organization is responsible for security in the cloud.
Amazon Redshift also inherits most of the security features of the wider Amazon Web Services platform. Credentials and access are granted and managed at the AWS level through Identity and Access Management (IAM). Cluster security groups are created and associated with clusters to control inbound access. For organizations that use a private cloud, access through a Virtual Private Cloud (VPC) environment is available as well. Data encryption can be enabled when a cluster is created, and an encrypted cluster cannot be switched directly to unencrypted.
For data in transit, Redshift uses SSL encryption to communicate with S3 or Amazon DynamoDB for COPY, UNLOAD, backup, and restore operations.
As mentioned above, Amazon Redshift is able to deliver performance with best-in-class speed due to the use of two main architectural elements: Massively Parallel Processing (MPP) design and columnar data storage. Let’s look at each one and see how they enable fast processing in Redshift.
Redshift’s Massively Parallel Processing (MPP) design automatically distributes workload evenly across multiple nodes in each cluster, enabling speedy processing of even the most complex queries operating on massive amounts of data. Multiple nodes share the processing of all SQL operations in parallel, leading up to final result aggregation. Users can optimize the distribution of data by locating the data where it needs to be before the query is executed. This is done by choosing the appropriate distribution style, minimizing the impact of the redistribution step.
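The idea of distributing rows by key, aggregating on each node, and then merging the partial results can be sketched in a few lines of plain Python. This is a toy simulation, not Redshift's implementation: the `crc32` hash, the slice count, and the `orders` data are all assumptions chosen for illustration, and the "parallel" step runs sequentially here.

```python
import zlib
from collections import defaultdict


def distribute_by_key(rows, key, num_slices):
    """Assign each row to a slice by hashing its distribution key."""
    slices = defaultdict(list)
    for row in rows:
        # crc32 is a deterministic stand-in, not the hash Redshift uses.
        slice_id = zlib.crc32(str(row[key]).encode("utf-8")) % num_slices
        slices[slice_id].append(row)
    return slices


def parallel_sum(slices, group_key, value_key):
    """Compute partial sums per slice, then merge them into a final result."""
    partials = []
    for slice_rows in slices.values():  # a real cluster runs these concurrently
        partial = defaultdict(int)
        for row in slice_rows:
            partial[row[group_key]] += row[value_key]
        partials.append(partial)

    final = defaultdict(int)
    for partial in partials:  # final aggregation step
        for group, subtotal in partial.items():
            final[group] += subtotal
    return dict(final)


orders = [
    {"customer_id": 1, "amount": 20},
    {"customer_id": 2, "amount": 35},
    {"customer_id": 1, "amount": 5},
    {"customer_id": 3, "amount": 10},
]
slices = distribute_by_key(orders, key="customer_id", num_slices=2)
totals = parallel_sum(slices, group_key="customer_id", value_key="amount")
```

Because the rows are distributed on the same column the query groups by, every row for a given customer already sits on one slice, so no rows need to move between slices before aggregation. That is the effect a well-chosen distribution style aims for.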
By using columnar storage for database tables, Amazon Redshift reduces the disk I/O requirements, contributing to the optimization of analytic query performance. When database table information is stored in a columnar fashion, the number of disk I/O requests and the amount of data needed to be loaded from disk are reduced. When less data is loaded into memory, Redshift can perform more in-memory processing for executed queries. The amount of time needed to perform a query is reduced using this method compared to when data is stored by row.
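A small Python sketch can illustrate why a columnar layout reads less data for an analytic query. It counts "values touched" as a crude proxy for disk I/O; the table contents and function names are invented for this example, and real storage engines work in blocks rather than individual values.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_read_row_store(rows):
    """A row store reads every field of every row, even for one column."""
    return sum(len(row) for row in rows)


def values_read_column_store(columns, wanted):
    """A column store reads only the columns the query asks for."""
    return sum(len(columns[name]) for name in wanted)


rows = [
    {"order_id": 1, "customer": "alice", "region": "EU", "amount": 20},
    {"order_id": 2, "customer": "bob", "region": "US", "amount": 35},
    {"order_id": 3, "customer": "carol", "region": "EU", "amount": 5},
]
columns = to_columns(rows)

# Query: total of the "amount" column only.
total_amount = sum(columns["amount"])
row_store_reads = values_read_row_store(rows)
column_store_reads = values_read_column_store(columns, ["amount"])
```

For this three-row, four-column table, the row layout touches 12 values to answer the query while the column layout touches 3. The gap widens as tables gain more columns that a given query does not need.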
To get started with Amazon Redshift, you need an AWS account. You may start with a free trial if you don’t already have an account.
You also need to ensure that you have an open port that Redshift can use. By default, Redshift uses port 5439, but the connection will not work if that port is not open in your firewall. Either make sure that port is open, or identify another open port in your firewall and enter that port number when you create the cluster. The port number cannot be changed once the cluster has been created.
To access resources in another AWS service such as Amazon S3, the Redshift cluster you’re about to create needs the necessary access permissions. Those permissions can only be provided in two ways:
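One quick way to see whether a firewall is blocking the port is to attempt a plain TCP connection from the machine that will run your SQL client. The following is a minimal sketch using only the Python standard library; the helper name `port_is_reachable` and the endpoint in the usage comment are placeholders, not real values.

```python
import socket

REDSHIFT_DEFAULT_PORT = 5439  # default port mentioned in the text


def port_is_reachable(host: str, port: int = REDSHIFT_DEFAULT_PORT,
                      timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


# Example usage (hypothetical endpoint, replace with your cluster's):
# port_is_reachable("examplecluster.example.us-west-2.redshift.amazonaws.com")
```

A `False` result only says that the TCP connection failed from where the script ran. The cause could be a local firewall, a security group rule, or an incorrect hostname, so treat it as a starting point for troubleshooting.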
You can create an IAM role by following these instructions from AWS.
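Once a role exists and is attached to the cluster, its ARN is what you reference when loading data from S3 with a COPY command. The sketch below only builds the SQL text. The table name, bucket, account ID, and role name are hypothetical, and the simple string formatting is for illustration only and should not be used with untrusted input.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement that loads CSV data from S3 via an IAM role."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with 's3://'")
    if not iam_role_arn.startswith("arn:aws:iam::"):
        raise ValueError("iam_role_arn must be an IAM role ARN")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS CSV;"
    )


# All values below are placeholders for illustration.
statement = build_copy_statement(
    table="sales",
    s3_uri="s3://example-bucket/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
```

The resulting statement can be pasted into a SQL client connected to the cluster. The role must grant read access to the bucket for the load to succeed.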
After completing the prerequisites, you’re ready to launch a Redshift cluster.
Once you have followed the launch steps, the Redshift cluster is up and running. To connect to the cluster, you need to configure a security group to authorize access. If the cluster is launched in the EC2-VPC platform, follow these instructions from AWS.
Now that you have launched a cluster, you may connect to it and start running queries. Running queries can be done in two ways:
At this point, your Redshift cluster is ready to use. You can create tables in the database, load data into them, and try running queries. These activities can be done through the AWS Query Editor or through a SQL client tool of your choice.
Now you know how Amazon Redshift works and why it’s fast and efficient. Still, the best way to be sure is to monitor its performance yourself. In the next blog posts in this series, we will take a deep dive into how to analyze Redshift queries and how to monitor Amazon Redshift performance with Sumo Logic. Stay tuned.