Access S3 Data in Amazon Redshift using Redshift Spectrum
As the cloud data warehouse in AWS, Amazon Redshift provides seamless integration with other storages, such as Amazon S3. It enables a very cost-effective data warehouse solution, where the user can keep warm data in Amazon Redshift storage and cold data in S3 storage. The user can access the S3 data from Redshift in the same way, retrieving data from the Redshift storage itself. This blog post explains the process for doing just that.
Step 1: Pre-Requisite
You need to have an AWS account with administrative access.
Step 2: Create S3 Bucket and Upload Sample .csv File
In this example, I uploaded a .csv file with data about specific countries. The countrydata.csv file looks like this:
Step 3: Create IAM Role
Your IAM Role for the Redshift cluster will be used to provide access to the data in the S3 bucket. Under Create Role in the IAM console, select AWS service. Under Choose a use case, select Redshift.
Next, under Select your use case, choose Redshift — Customizable.
Under Attach permissions policies select PowerUserAccess.
Next, create a role name and description.
At the end of all four steps, the summary page will show the details of the role created.
Step 4: Launch Redshift Cluster
In this step, you’ll launch a Redshift cluster, which will be used to query the S3-based data.
Go to Redshift Management Console and click on the Create cluster button.
Select an identifier for the cluster and select Free trial.
Note: Free trial (dc2.large) doesn’t allow cross-dBs queries. For initiating cross-dBs queries, the minimum requirement is to use paid instances of the ra3.xlplus node type.
Keep default database configurations, using a master username and master user password.
You’ll now see the cluster you’ve created.
Associate cluster with the role created in step 3.
You’ll see a summary of what you’ve created in the view shown below.
On the left-hand menu, select Editor using the options in the screenshot below to connect your cluster to the database you just created.!(/assets/blog/access-s3-data-in-amazon-redshift-using-redshift-spectrum/1-N0mE6uwIyjQFDdC29JwTfg.png)
Step 5: Insert Data in Redshift Database
In this step, you’ll create a new schema in the Redshift cluster database and then create a table in the schema using the S3-based data. This table will be used to access data from the S3 bucket.!(/assets/blog/access-s3-data-in-amazon-redshift-using-redshift-spectrum/1-4fQ2z-Tzc6XRRWXntHLfRw.png)
In the Editor view, enter the following code in the query editor:
create external schema trellisdataschema from data catalog database ‘dev’ iam_role ‘arn:aws:iam::YYYYYYYYYYYY:role/trellis-yyyyyy-role’ create external database if not exists;
Next, enter code to create a table for that schema using the data from your .csv file stored in the S3 bucket:
create external table trellisdataschema.countrycapitals( country nvarchar(100), capital nvarchar(100) ) row format delimited fields terminated by ‘,’ stored as textfile location ‘s3://trellis-redshift-s3’;
The newly-created schema and table will now show up in the Resources view:
Now whenever you run a SQL query, you’ll fetch data from the table that you created, which fetches from the S3 bucket:
Running Cross-Database Queries
In order to execute a cross-database query, you need to upgrade your node type to ra3.xlplus:
Next, create the database dev1 :
Now create the table — population:
Create a respective schema for db ‘dev1’:
Insert data into this table:
Run the query to display your inserted data.
Create a second database and table, following the steps above.
Create the database dev2:
Create the table currency:
Create respective schema for db ‘dev2’:
Insert your data:
Run your query to display your data:
Now that we have three DBs created (dev, dev1, and dev2) with their respective schemas in Redshift, we can run cross-database queries between these three databases: