
Perform ETL operations using Amazon Redshift RSQL


Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

There are many ways to interact with Amazon Redshift. You can programmatically access data in your Amazon Redshift cluster by using the Amazon Redshift Data API, or you can use a web-based interface such as Amazon Redshift Query Editor V2 to author SQL queries. You can also interact with Amazon Redshift in interactive or batch mode through Amazon Redshift RSQL.

Previously, you had to use the PostgreSQL psql command line tool to interact with Amazon Redshift for simple use cases such as importing and exporting data to and from Amazon Redshift or submitting SQL in batch or interactive mode, or for advanced use cases such as performing ETL operations without writing complex ETL jobs.

Now you can use Amazon Redshift RSQL, a new command line tool to connect to an Amazon Redshift cluster and perform operations such as describing database objects, querying data, viewing and exporting query results in various output file formats, running scripts that include both SQL and complex business logic, performing data cleansing and transformation operations using familiar SQL, and writing ETL and ELT scripts using enhanced control flow and advanced error handling. Moreover, if you're migrating from self-managed data warehousing engines such as Teradata to Amazon Redshift and using Teradata BTEQ (Basic Teradata Query) scripting for data automation, ETL, or other tasks, you can now migrate to Amazon Redshift RSQL.

This post explains how to use Amazon Redshift RSQL for ETL, data cleansing and preparation, enhanced control flow, and exception and error handling.

Solution overview

Amazon Redshift RSQL lets you connect to your Amazon Redshift cluster through a command line. It supports the capabilities of the PostgreSQL psql command line tool with an additional set of Amazon Redshift-specific capabilities:

  • Describe properties or attributes of external tables in an AWS Glue catalog or Apache Hive metastore, external databases in Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL, and tables shared using Amazon Redshift data sharing
  • Use enhanced control flow commands such as IF, ELSEIF, ELSE, ENDIF, GOTO, and LABEL
  • Use single sign-on (SSO) authentication using ADFS, PingIdentity, Okta, Azure AD, or other SAML/JWT-based identity providers (IdPs), as well as browser-based SAML IdPs with multi-factor authentication (MFA)

The following are some commonly used commands in Amazon Redshift RSQL. We use these commands in this post to demonstrate different possible use cases using Amazon Redshift RSQL scripts; a minimal example follows the list.

  • \echo – Prints the specified string to the output.
  • \remark – An extension of the \echo command that has the ability to break the output over separate lines.
  • \goto – Skips all intervening commands and SQL statements and resumes processing at the specified LABEL in sequence. The LABEL must be a forward reference; you can't jump to a LABEL that lexically precedes the GOTO.
  • \label – Establishes an entry point for running the program, as the target for a \goto command.
  • \exit – Stops running Amazon Redshift RSQL. You can also specify an optional exit code.
  • \q – Logs off database sessions and exits Amazon Redshift RSQL.
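
The following is a minimal sketch showing these commands working together (the CLEANUP label name and the messages are illustrative assumptions, not taken from a specific AWS example):

\echo 'Script started'
\remark '***Demonstrating goto and label***'
\goto CLEANUP
select 'this statement is skipped';
\label CLEANUP
\echo 'Resumed processing at the CLEANUP label'
\exit 0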

Prerequisites

The following are the prerequisites for using Amazon Redshift RSQL and performing the steps in this post:

  • An AWS account
  • A Linux, Windows, or macOS operating system (Amazon Redshift RSQL is available for these operating systems)
  • An Amazon Redshift cluster
  • SQL knowledge

Additionally, complete the following prerequisite steps:

  1. Install Amazon Redshift RSQL on your local machine. For instructions, refer to Getting started with Amazon Redshift RSQL.
  2. Connect to the Amazon Redshift cluster (see the connection sketch after this list).
  3. Create the orders and orders_summary tables using the following DDL statements:
    create table orders (
      O_ORDERKEY bigint NOT NULL,
      O_CUSTKEY bigint,
      O_ORDERSTATUS varchar(1),
      O_TOTALPRICE decimal(18,4),
      O_ORDERDATE Date,
      O_ORDERPRIORITY varchar(15),
      O_CLERK varchar(15),
      O_SHIPPRIORITY Integer,
      O_COMMENT varchar(79))
    distkey (O_ORDERKEY)
    sortkey (O_ORDERDATE);
    
    CREATE TABLE orders_summary 
    ( o_orderkey bigint, 
     o_custkey bigint, 
     o_orderstatus character varying(1),
     o_totalprice integer,
     target_information character varying(14),
     rank character varying(15),
     description character varying(15)
    ) DISTSTYLE AUTO;
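
The following is a minimal connection sketch for step 2, assuming you configured an ODBC data source name (DSN) of awsuser during the RSQL installation (the same DSN is used by the automation example later in this post); without the -f option, RSQL starts an interactive session:

rsql -D awsuser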

Import data into the Amazon Redshift cluster

There are several ways to load data into Amazon Redshift tables, including using Amazon Redshift RSQL. In this section, we review the syntax and an example of an Amazon Redshift RSQL script that loads data into an Amazon Redshift table using the COPY command.

We use the following syntax:

COPY <TABLE> from <location> 
iam_role <arn>
region <region>;

We provide the following parameters:

  • <location> – The location of the source data to be loaded into the target table
  • <arn> – The AWS Identity and Access Management (IAM) role for accessing the data
  • <region> – The AWS Region of the source data

In the following example Amazon Redshift RSQL script, we load data from an Amazon Simple Storage Service (Amazon S3) bucket location into the orders table:

\echo 'Job started'
  copy orders from 's3://redshift-immersionday-labs/data/orders/orders.tbl.'
  iam_role default
  region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;
\echo 'Job Ended'
\exit 0

Enhanced control flow

You can use Amazon Redshift RSQL to define programmatic enhanced control flow and conditional blocks in your ETL script. We use the following syntax:

\if <condition>
  <code_block1>
\else
  <code_block2>
\endif

The syntax includes the following components:

  • <condition> – The conditional statement
  • <code_block1> – The code block to run when the condition is met
  • <code_block2> – The code block to run when the condition isn't met

In the following example script, we perform conditional checks using the \if, \elseif, and \else commands based on the count of records in the orders table, and we display messages based on the record count value:

\echo 'Job started'
Select count(*) from orders \gset
select :count as count;
\if :count < 76000000
  \echo 'Orders are less than the target'
\elseif :count = 76000000
  \echo 'Orders met the target'
\else
  \echo 'Orders exceeded the target'
\endif
\echo 'Job Ended'
\exit 0

Error handling

You can use Amazon Redshift RSQL to define exception handling blocks in your ETL script to handle various user-defined and system-generated error conditions that you might encounter while running the ETL process.

The following are some of the error handling options supported in Amazon Redshift RSQL:

  • :ACTIVITYCOUNT – This variable is similar to the psql variable ROW_COUNT, which returns the number of rows affected by the last SQL statement. However, ACTIVITYCOUNT reports the affected row count for SELECT, COPY, and UNLOAD statements, which ROW_COUNT doesn't. For SQL statements returning data, ACTIVITYCOUNT specifies the number of rows returned to Amazon Redshift RSQL from the database.
  • :ERRORCODE – This variable contains the return code for the last SQL statement submitted to the database. Error code 0 specifies that the SQL statement completed without any errors.
  • :ERRORLEVEL – This variable is used to assign severity levels to errors. You can use these severity levels to determine a course of action. The default value is ON.
  • :MAXERROR – This variable designates a maximum error severity level beyond which Amazon Redshift RSQL halts job processing. If SQL statements in Amazon Redshift RSQL scripts produce an error severity level greater than the designated maxerror value, Amazon Redshift RSQL immediately exits.
  • :LAST_ERROR_MESSAGE – This variable contains the error message of the most recent failed query.

We use the following syntax:

\if :ERROR <> 0
  \echo :<ERRORCODE>
  \echo :<LAST_ERROR_MESSAGE>
  \goto <code_block1>
\else
  \goto <code_block2>
\endif

The syntax includes the following information:

  • <ERRORCODE> – The error code number
  • <LAST_ERROR_MESSAGE> – The error message of the most recent failed query
  • <code_block1> – The code block to run when the error condition is met
  • <code_block2> – The code block to run when the error condition isn't met

In the following example script, we create the orders_staging table and copy records into the table from an Amazon S3 location. The script also contains an exception handling section for both the table creation and the copy process to handle possible errors encountered along the way.

\echo `date`
\echo 'Job started'
DROP TABLE IF EXISTS orders_staging;

create table orders_staging (
O_ORDERKEY bigint NOT NULL,
O_CUSTKEY bigint,
O_ORDERSTATUS varchar(1),
O_TOTALPRICE decimal(18,4),
O_ORDERDATE Date,
O_ORDERPRIORITY varchar(15),
O_CLERK varchar(15),
O_SHIPPRIORITY Integer,
O_COMMENT varchar(79))
distkey (O_ORDERKEY)
sortkey (O_ORDERDATE);

\if :ERROR <> 0
  \echo :ERRORCODE
  \remark :LAST_ERROR_MESSAGE
  \goto QUIT_JOB
\else
  \remark '***Orders_Staging Table Created Successfully***'
  \goto COPY_DATA
\endif

\label COPY_DATA
 copy orders_staging from 's3://redshift-immersionday-labs/data/orders/orders.tbl.'
 iam_role default
 region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

\if :ERROR <> 0
  \echo :ERRORCODE
  \remark :LAST_ERROR_MESSAGE
  \goto QUIT_JOB
\else
  \remark '***Data Copied Successfully***'
\endif

\echo `date`
\echo 'Job Ended'
\exit 0

\label QUIT_JOB
 \echo `date`
 \echo 'Job Failed'
 \exit 1

Data transformation and preparation

You can perform some common data preparation and transformation operations on your dataset using SQL statements in the Amazon Redshift RSQL ETL script. In this section, we demonstrate data transformation and preparation operations such as casting, new data column creation, and splitting an input column into multiple columns.

We use industry-standard SQL statements for transforming and preparing data for downstream consumption.

In the following example script, we use a SQL statement to transform the data from the orders_staging table and insert it into the orders_summary table:

\echo `date`
\echo 'Job started'

insert into orders_summary 
(o_orderkey,
o_custkey,
o_orderstatus,
o_totalprice,
target_information,
rank,
description)
select
o_orderkey,
o_custkey,
o_orderstatus,
o_totalprice::int,
case
when o_totalprice < 200000
then 'Target Missed'
when o_totalprice = 200000
then 'Target Met'
when o_totalprice > 200000
then 'Target Exceeds'
end as "Target_Information",
split_part (o_orderpriority,'-',1) as RANK, 
split_part (o_orderpriority,'-',2) as DESCRIPTION
from orders_staging; 

\if :ERROR <> 0 or :ACTIVITYCOUNT = 0
 \echo :ERRORCODE
 \remark :LAST_ERROR_MESSAGE
 \goto QUIT_JOB
\else
 \remark 'Data Inserted into Summary Orders Table'
\endif

\echo `date`
\echo 'Job Ended'
\exit 0

\label QUIT_JOB
 \echo `date`
 \echo 'Job Failed'
 \exit 1

Export data from an Amazon Redshift cluster and output file formatting options

You can use Amazon Redshift RSQL to extract data from one or multiple Amazon Redshift tables and write it to your disk for consumption by downstream applications. Amazon Redshift RSQL uses the \export command to export the result of a query to an output file.

The following are some of the useful output formatting options supported in RSQL:

  • \rset rformat on – This command is required for all the formatting commands to take effect.
  • \pset format – Formats can include aligned, AsciiDoc, CSV, HTML, LaTeX, LaTeX longtable, troff-ms, unaligned, and wrapped.
  • \pset border – This option specifies border information for output data. Value 0 means no border, 1 means internal dividing lines, and 2 means table frame.
  • \rset heading – This command adds the specified heading to the output result.
  • \rset rtitle – This command adds the specified heading as well as the current system date of the client computer.
  • \rset titledashes on/off – This command specifies whether to print a line of dash characters between the column names and column data returned for the SQL query.
  • \rset width – This command specifies the target width for each line in a report.
  • \t – This command turns off printing column names as well as the result row count at the end of the output (footers).

We use the following syntax:

\export report file=<'FilePath/Filename'>
\rset rformat on
\pset format wrapped
\pset border 2
\rset heading 'This is Heading'
\rset width 50
\rset titledashes on
<SQL Query>
\export reset

We provide the following information:

  • <'FilePath/Filename'> – The file name and path for the output file
  • <SQL Query> – The SQL query to run

In the following example script, we export the data from the orders_summary table using a SQL query and write it to the orders.txt text file on the local machine:

\echo `date`
\echo 'Job started'

\export report file="/<FilePath>/orders.txt"
\rset rformat on
\pset format wrapped
\rset width 50
\rset titledashes on
select * from orders_summary limit 100;

\export reset
\echo 'Job Ended'
\echo `date`
\exit 0

Automate the Amazon Redshift RSQL ETL script

One of the options to automate Amazon Redshift RSQL scripts to run on a specific schedule is through shell scripting. You can schedule the shell script through cron, the command line job scheduler.

We use the following syntax:

#!/bin/sh
rsql -D awsuser -f <RSQL_SCRIPT> > <LOG_FILE>

We provide the following information:

  • <RSQL_SCRIPT> – The SQL script to run
  • <LOG_FILE> – The output log file

In the following example shell script, we run the Amazon Redshift RSQL script that we created and write the output log to a log file on the local machine. You can schedule the shell script through a cron job; a sample crontab entry follows the script.

#!/bin/sh
SCRIPTS_DIR="<SCRIPTS_DIR>"
LOG_DIR="<LOG_DIR>"

RSQL_SCRIPT="${SCRIPTS_DIR}/<RSQL_SCRIPT>.sql"
LOG_FILE="${LOG_DIR}/test_rsql.log"

touch $LOG_FILE

rsql -D awsuser -f ${RSQL_SCRIPT} > ${LOG_FILE}
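
As a sketch of the scheduling step, assuming the shell script above is saved at the hypothetical path /home/ec2-user/scripts/run_rsql.sh and made executable, the following crontab entry runs it daily at 2:00 AM:

# Hypothetical crontab entry: run the RSQL ETL shell script daily at 2:00 AM
0 2 * * * /home/ec2-user/scripts/run_rsql.sh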

Clean up

To avoid incurring future charges, stop the Amazon Redshift cluster created for the purpose of this post.

Conclusion

In this post, we explained how to use Amazon Redshift RSQL to perform ETL operations. We also demonstrated how to implement advanced error handling and enhanced control flow in your Amazon Redshift RSQL ETL script.

If you're using scripts through the psql command line client on Amazon Redshift, you can run them on Amazon Redshift RSQL with no modification. If you're migrating your Teradata BTEQ scripts to Amazon Redshift RSQL, you can use the AWS Schema Conversion Tool (AWS SCT) to automatically convert BTEQ scripts to Amazon Redshift RSQL scripts.

To learn more, refer to Amazon Redshift RSQL.


About the Authors

Saman Irfan is a Specialist Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build scalable and high-performant analytics solutions. Outside of work, she enjoys spending time with her family, watching TV series, and learning new technologies.

Sudipta Bagchi is a Specialist Solutions Architect at Amazon Web Services. He has over 12 years of experience in data and analytics, and helps customers design and build scalable and high-performant analytics solutions. Outside of work, he loves running, traveling, and playing cricket.

