×

Access S3 Data in Amazon Redshift using Redshift Spectrum

Shiv Deshmukh
Shiv Deshmukh

Monday, June 21st, 2021

Read Time
5 min read

As the cloud data warehouse in AWS, Amazon Redshift provides seamless integration with other storages, such as Amazon S3. It enables a very cost-effective data warehouse solution, where the user can keep warm data in Amazon Redshift storage and cold data in S3 storage. The user can access the S3 data from Redshift in the same way, retrieving data from the Redshift storage itself. This blog post explains the process for doing just that.

Step 1: Pre-Requisite

You need to have an AWS account with administrative access.

Step 2: Create S3 Bucket and Upload Sample .csv File

Create S3 Bucket and Upload Sample .csv File

In this example, I uploaded a .csv file with data about specific countries. The countrydata.csv file looks like this:

countrydata.csv

Step 3: Create IAM Role

Your IAM Role for the Redshift cluster will be used to provide access to the data in the S3 bucket. Under Create Role in the IAM console, select AWS service. Under Choose a use case, select Redshift.

Create IAM Role

Next, under Select your use case, choose Redshift — Customizable.

Select your use case, choose Redshift — Customizable.

Under Attach permissions policies select PowerUserAccess.

Attach permissions policies

Next, create a role name and description.

create a role

At the end of all four steps, the summary page will show the details of the role created.

summary page

Step 4: Launch Redshift Cluster

In this step, you’ll launch a Redshift cluster, which will be used to query the S3-based data.

Go to Redshift Management Console and click on the Create cluster button.

Launch Redshift Cluster

Select an identifier for the cluster and select Free trial.

cluster and select Free trial

Note: Free trial (dc2.large) doesn’t allow cross-dBs queries. For initiating cross-dBs queries, the minimum requirement is to use paid instances of the ra3.xlplus node type.

Keep default database configurations, using a master username and master user password.

database configurations

You’ll now see the cluster you’ve created.

cluster

Associate cluster with the role created in step 3.

role created in step 3

You’ll see a summary of what you’ve created in the view shown below.

summary of what you’ve created

On the left-hand menu, select Editor using the options in the screenshot below to connect your cluster to the database you just created.

Editor using the options![](/assets/blog/access-s3-data-in-amazon-redshift-using-redshift-spectrum/1-N0mE6uwIyjQFDdC29JwTfg.png)

Step 5: Insert Data in Redshift Database

In this step, you’ll create a new schema in the Redshift cluster database and then create a table in the schema using the S3-based data. This table will be used to access data from the S3 bucket.

Insert Data in Redshift Database![](/assets/blog/access-s3-data-in-amazon-redshift-using-redshift-spectrum/1-4fQ2z-Tzc6XRRWXntHLfRw.png)

In the Editor view, enter the following code in the query editor:

create external schema trellisdataschema
from data catalog
database ‘dev’
iam_role ‘arn:aws:iam::YYYYYYYYYYYY:role/trellis-yyyyyy-role’
create external database if not exists;

Insert Data in Redshift Database_2

Next, enter code to create a table for that schema using the data from your .csv file stored in the S3 bucket:

create external table trellisdataschema.countrycapitals(

country nvarchar(100),

capital nvarchar(100)

)

row format delimited

fields terminated by ‘,’

stored as textfile

location ‘s3://trellis-redshift-s3’;

Amazon Redshift_ Query editor

The newly-created schema and table will now show up in the Resources view:

Amazon Redshift_Query editor

Now whenever you run a SQL query, you’ll fetch data from the table that you created, which fetches from the S3 bucket:

Amazon Redshift_Query editorRows returned

Running Cross-Database Queries

In order to execute a cross-database query, you need to upgrade your node type to ra3.xlplus:

Running Cross-Database Queries

Next, create the database dev1 :

create the database dev1

Now create the table — population:

Now create the table — population

Create a respective schema for db ‘dev1’:

Create a respective schema for db ‘dev1’

Insert data into this table:

Insert data into this table

Run the query to display your inserted data.

Run the query to display your inserted data

Create a second database and table, following the steps above.

Create the database dev2:

Create the database dev2

Create the table currency:

Create the table currency:

Create respective schema for db ‘dev2’:

Create respective schema for db ‘dev2’

Insert your data:

Insert your data

Run your query to display your data:

Run your query to display your data

Now that we have three DBs created (dev, dev1, and dev2) with their respective schemas in Redshift, we can run cross-database queries between these three databases:

respective schemas in Redshift

respective schemas in Redshift_2