Outshift | Setup AQUA (Advanced Query Accelerator) for Amazon Redshift

AQUA (Advanced Query Accelerator) is a new distributed and hardware-accelerated cache that enables Amazon Redshift to run up to ten times faster than other enterprise cloud data warehouses by automatically boosting certain types of queries. AQUA is included with the Redshift RA3 ra3.4xl and ra3.16xl node types at no additional cost. This post will describe how to set up AQUA for Amazon Redshift. Note that AQUA executes on data in “regular” Redshift tables — i.e. data that is ingested into Redshift. It doesn’t support federated queries run from Redshift against databases like maria DB or aurora DB. AQUA is designed to deliver up to 10 times the performance on queries that perform large scans, aggregates, and filtering, which is limited to LIKE and SIMILAR_TO predicates. Queries that perform writes, such as INSERT, UPDATE, DELETE, CREATE TABLE AS, COPY, and UNLOAD and SELECT (without a predicate) are not supported currently. Over time, AWS will be adding support for additional queries.

Launch Redshift Cluster

Go to Redshift Management Console and click on the “Create cluster” button. Select Production under “What are you planning to use this cluster for?” Then choose the RA3 ra3.4xl node type and turn on AQUA.

Keep default database configurations, using a master username and master user password.

You’ll now see the cluster you’ve created.

Comparing queries run in Redshift, Aurora DB with and without AQUA cluster.

To assess the improvement in performance from AQUA, let’s consider this largetest_2 table, which has 500,000,000 rows consisting of num1(int), num2 and num3 (double precision).

We ran the following two queries for selecting num1, average of num3 and sum of num3, on the data shown in the table above: Query #1

SELECT num1, avg(num3) as num3_avg, sum(num2) as num2_sum
FROM dev.public.largetest_2
WHERE num2 LIKE ‘0.2%’
GROUP BY num1;

Query #2

SELECT num1, avg(num3) as num3_avg, sum(num2) as num2_sum
FROM dev.public.largetest_2
WHERE num3 LIKE ‘0.2%’
GROUP BY num1;

The Query #1’s response time is 29s.

The Query #2’s response time is 1m 1s.

We ran the same Query #1 in a non-AQUA (see above table) and got a response time of 57s.

Query #2’s response time is 2m 8s

Next, we ran the same queries on the above table against Aurora DB in the Redshift AQUA cluster. Query #1's response time was 47m 51s.

We then ran a query on the above table against Aurora DB in the Redshift non-AQUA cluster. Query #1's response time was 47m 52s.

A Query explanation:

SELECT num1, sum(num2) as num2_sum, avg(num3) as num3_avg FROM largetest_3 GROUP BY num1;

The above query will select rows (highlighted in green in Table 1 above), group by num1 column to calculate sum for num2 and the average for num3 columns. In other words, when grouping by value ‘3’ (from num1), the query will calculate sum for 0+1=1(from num2) and the average for (77+85)/2=81(from num3). We will get the following response for this query:

SELECT num1, sum(num2) as num2_sum, avg(num3) as num3_avg
FROM largetest_3
WHERE num2 LIKE ‘1%’
GROUP BY num1;

The above query will select rows (highlighted in orange in Table 1 above), only where num2 has ‘1%’ — i.e. the query will only return values that start with ‘1’ — and group by the num1 column to calculate the sum for num2 and the average for num3 columns. In other words, when grouping by value ‘4’ (from num1), the query will calculate the sum for 1+1=2(from num2) and the average for (114+112)/2=113 (from num3). We will then get the following response for above query.

Response Time Summary for different configurations

The table below shows the comparison of queries run in Redshift (Local) and Aurora DB (Remote) with and without the AQUA cluster. These queries were run on a database table with 500,000,000 rows mix of integer and double precision data types.

The above table clearly shows the performance improvements in terms of response time with Redshift AQUA cluster as compared to non-AQUA cluster.

RedShift Cost Analysis

The cost of using Redshift can be broken into the following high-level categories:

Redshift managed storage associated with nodes in a cluster — Pay for data stored in managed storage at a fixed GB-month rate for your region.
Data scanned while running queries — With Redshift Spectrum, you are billed per terabyte of data scanned, rounded up to the next megabyte, with a 10-megabyte minimum per query.
Redshift Compute Node Usage — hourly rate based on the type and number of nodes in your cluster.

Click here for AWS link for Redshift pricing: https://aws.amazon.com/redshift/pricing/