
Ingest data from Snowflake to Amazon S3 using AWS Glue Marketplace Connectors


In today's complex business landscape, organizations are challenged to consume data from a variety of sources and keep up with data that pours in throughout the day. There is a demand to design applications that make data portable across cloud platforms and give organizations the ability to derive insights from multiple data sources to remain competitive. In this post, we demonstrate how the AWS Glue integration with Snowflake simplifies the process of connecting to Snowflake and applying data transformations without writing a single line of code. With AWS Glue Studio, you can now use a simple visual interface to compose jobs that move and integrate data. It lets you subscribe to a Snowflake connector in AWS Marketplace, query Snowflake tables, and save the data to Amazon Simple Storage Service (Amazon S3) in Parquet format.

If you choose to bring your own custom connector or prefer a different connector from AWS Marketplace, follow the steps in the blog post Performing data transformations using Snowflake and AWS Glue. In this post, we use the new AWS Glue Connector for Snowflake to seamlessly connect to Snowflake without the need to install JDBC drivers. To validate the ingested data, we use Amazon Redshift Spectrum to create an external table and query the data in Amazon S3. With Amazon Redshift Spectrum, you can efficiently query and retrieve data from files in Amazon S3 without having to load the data into Amazon Redshift tables.

Solution overview

Let's take a look at the architecture diagram showing how AWS Glue connects to Snowflake for data ingestion.

Prerequisites

Before you begin, make sure you have the following:

  1. A Snowflake account, specifically a service account with permissions on the tables to be queried.
  2. AWS Identity and Access Management (IAM) permissions to create the AWS Glue and Amazon Redshift service roles and policies. To configure these, follow the instructions in Setting up IAM Permissions for AWS Glue and Create an IAM role for Amazon Redshift.
  3. An Amazon Redshift Serverless endpoint. If you don't have one configured, follow the instructions in Amazon Redshift Serverless Analytics.

Configure the Amazon S3 VPC endpoint

As a first step, we configure an Amazon S3 VPC endpoint so that AWS Glue can use a private IP address to access Amazon S3 without exposure to the public internet. Complete the following steps (or create the endpoint programmatically, as shown in the sketch after these steps).

  1. Open the Amazon VPC console.
  2. In the left navigation pane, choose Endpoints.
  3. Choose Create Endpoint, and follow the steps to create an Amazon S3 VPC endpoint of type Gateway.
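
If you prefer to create the endpoint programmatically, a minimal boto3 sketch is shown below; the Region, VPC ID, and route table ID are placeholders you would replace with your own values.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a Gateway endpoint for Amazon S3 so that Glue traffic to S3 stays on the AWS network.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # placeholder VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder route table ID
)
print(response["VpcEndpoint"]["VpcEndpointId"])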

Next, we create a secret using AWS Secrets Manager (a boto3 sketch follows these steps).

  1. On the AWS Secrets Manager console, choose Store a new secret.
  2. For Secret type, select Other type of secret.
  3. Enter a key sfUser with your Snowflake user name as the value.
  4. Enter a key sfPassword with your Snowflake password as the value.
  5. Choose Next.
  6. Name the secret snowflake_credentials and follow the remaining steps to store the secret.
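
If you prefer to script this step, the following minimal boto3 sketch stores the same secret; the credential values are placeholders.

import json
import boto3

secrets = boto3.client("secretsmanager", region_name="us-east-1")

# Store the Snowflake credentials under the name the Glue connection expects.
secrets.create_secret(
    Name="snowflake_credentials",
    SecretString=json.dumps(
        {"sfUser": "YOUR_SNOWFLAKE_USER", "sfPassword": "YOUR_SNOWFLAKE_PASSWORD"}  # placeholders
    ),
)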

Subscribe to the AWS Marketplace Snowflake Connector

To subscribe to the connector, follow these steps to activate the Snowflake Connector for AWS Glue. This native connector simplifies the process of connecting AWS Glue jobs to Snowflake to extract data.

  1. Navigate to the Snowflake Connector for AWS Glue in AWS Marketplace.
  2. Choose Continue to Subscribe.
  3. Review the terms and conditions, pricing, and other details.
  4. Choose Continue to Configuration.
  5. For Delivery Method, choose your delivery method.
  6. For Software version, choose your software version.
  7. Choose Continue to Launch.
  8. Under Usage instructions, choose Activate the Glue connector in AWS Glue Studio. You're redirected to AWS Glue Studio.
  9. For Name, enter a name for your connection (for example, snowflake_s3_glue_connection).
  10. Optionally, choose a VPC, subnet, and security group.
  11. For AWS Secret, choose snowflake_credentials.
  12. Choose Create connection.

A message appears confirming that the connection was successfully created, and the connection is now visible on the AWS Glue Studio console.

Configure AWS Glue for Snowflake JDBC connectivity

Next, we configure an AWS Glue job by following the steps below to extract the data.

  1. On the AWS Glue console, choose AWS Glue Studio in the left navigation pane.
  2. On the AWS Glue Studio console, choose Jobs in the left navigation pane.
  3. Create a job with "Visual with a source and target" and choose the Snowflake Connector for AWS Glue 3.0 as the source and Amazon S3 as the target.
  4. Enter a name for the job.
  5. Under Job details, select an IAM role.
  6. Create a new IAM role if you don't already have one with the required AWS Glue and AWS Secrets Manager policies.
  7. Under Visual, choose the Data source – Connection node and choose the connection you created.
  8. In Connection options, create key-value pairs with the query as shown below. Note that the CUSTOMER table in the SNOWFLAKE_SAMPLE_DATA database is used for this migration. This table comes preloaded (1.5M rows) when you set up the Snowflake sample data.

    Key           Value
    query         SELECT C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT
                  FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
    sfUrl         MXA94638.us-east-1.snowflakecomputing.com
    sfDatabase    SNOWFLAKE_SAMPLE_DATA
    sfWarehouse   COMPUTE_WH
  9. In the Output schema section, specify the source schema as key-value pairs as shown below.
  10. Choose the Transform-ApplyMapping node to view the following transform details.
  11. Choose the Data target properties – S3 node and enter the S3 bucket details as shown below.
  12. Choose Save.

After you save the job, the following script is generated. It assumes the account information and credentials are stored in AWS Secrets Manager as described earlier.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node Snowflake Connector for AWS Glue 3.0
SnowflakeConnectorforAWSGlue30_node1 = glueContext.create_dynamic_frame.from_options(
    connection_type="market.spark",
    connection_options={
        "question": "SELECT C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_MKTSEGMENT,C_COMMENT FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER",
        "sfUrl": "MXA94638.us-east-1.snowflakecomputing.com",
        "sfDatabase": "SNOWFLAKE_SAMPLE_DATA",
        "sfWarehouse": "COMPUTE_WH",
        "connectionName": "snowflake_s3_glue_connection",
    },
    transformation_ctx="SnowflakeConnectorforAWSGlue30_node1",
)

# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=SnowflakeConnectorforAWSGlue30_node1,
    mappings=[
        ("C_CUSTKEY", "decimal", "C_CUSTKEY", "decimal"),
        ("C_NAME", "string", "C_NAME", "string"),
        ("C_ADDRESS", "string", "C_ADDRESS", "string"),
        ("C_NATIONKEY", "decimal", "C_NATIONKEY", "decimal"),
        ("C_PHONE", "string", "C_PHONE", "string"),
        ("C_ACCTBAL", "decimal", "C_ACCTBAL", "decimal"),
        ("C_MKTSEGMENT", "string", "C_MKTSEGMENT", "string"),
        ("C_COMMENT", "string", "C_COMMENT", "string"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": "s3://sf-redshift-po/check/", "partitionKeys": []},
    format_options={"compression": "snappy"},
    transformation_ctx="S3bucket_node3",
)

job.commit()

  1. Choose Run to run the job.

After the job completes successfully, the run status changes to Succeeded.
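
If you'd rather start and monitor the job programmatically, a minimal boto3 sketch might look like the following; the job name is a placeholder for the name you entered when creating the job.

import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Start the job and poll its status until it reaches a terminal state.
job_name = "snowflake-to-s3-job"  # placeholder job name
run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
while True:
    state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
print(f"Job run {run_id} finished with state {state}")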

The following screenshot shows that the data was written to Amazon S3.

Query Amazon S3 data using Amazon Redshift Spectrum

Let's query the data with Amazon Redshift Spectrum. The DDL below assumes an external schema named spectrum_schema, pointing to your AWS Glue Data Catalog, has already been created on the Redshift Serverless endpoint.

  1. On the Amazon Redshift console, choose your AWS Region.
  2. In the left navigation pane, choose Query Editor.
  3. Run the CREATE EXTERNAL TABLE DDL shown below.
    CREATE EXTERNAL TABLE "spectrum_schema"."buyer"
    (c_custkey   decimal(10,2),
    c_name       varchar(256),
    c_address    varchar(256),
    c_nationkey  decimal(10,2),
    c_phone      varchar(256),
    c_acctbal    decimal(10,2),
    c_mktsegment varchar(256),
    c_comment    varchar(256))
    saved as parquet
    location 's3://sf-redshift-po/check'; 

  1. Run the SELECT query on the CUSTOMER table.
SELECT count(c_custkey), c_nationkey FROM "dev"."spectrum_schema"."customer" GROUP BY c_nationkey
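
If you want to run the same query outside the query editor, a minimal sketch using the Redshift Data API follows; the workgroup name is a placeholder for your Redshift Serverless workgroup.

import time
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Submit the Spectrum query to the Redshift Serverless workgroup and fetch the results.
stmt = rsd.execute_statement(
    WorkgroupName="default",  # placeholder workgroup name
    Database="dev",
    Sql='SELECT count(c_custkey), c_nationkey FROM "spectrum_schema"."customer" GROUP BY c_nationkey',
)
while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(2)
for record in rsd.get_statement_result(Id=stmt["Id"])["Records"]:
    print(record)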

Considerations

The AWS Glue crawler doesn't work directly with Snowflake. Crawling is a native capability that you can use for other AWS data sources that are joined or referenced in the AWS Glue ETL job. Instead, you can define connections in the script as shown earlier in this post.

The Snowflake source tables covered in this post focus only on structured data types, so semi-structured and unstructured data types in Snowflake (binary, varbinary, and variant) are out of scope. However, you could use AWS Glue functions such as Relationalize to flatten nested schema data into semi-normalized structures, or you could use Amazon Redshift Spectrum to support these data types.
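
As a rough illustration of the Relationalize approach (not part of the generated job above), the following sketch reuses the glueContext from the generated script to flatten a nested JSON dataset; the S3 paths are placeholders.

from awsglue.transforms import Relationalize

# Read a semi-structured dataset (JSON here) and flatten its nested fields
# into a collection of flat DynamicFrames.
nested_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-bucket/nested-json/"]},  # placeholder path
    format="json",
)
flattened = Relationalize.apply(
    frame=nested_frame,
    staging_path="s3://your-bucket/glue-staging/",  # placeholder staging path
    name="root",
    transformation_ctx="Relationalize_node",
)
root_frame = flattened.select("root")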

Conclusion

In this post, we learned how to define Snowflake connection parameters in AWS Glue, connect to Snowflake from AWS Glue using the native AWS connector for Snowflake, migrate the data to Amazon S3, and use Amazon Redshift Spectrum to query the data in Amazon S3 to meet your business needs.

We welcome any thoughts or questions in the comments section below.


About the Authors

Sindhu Achuthan is a Data Architect with Global Financial Services at Amazon Web Services. She works with customers to provide architectural guidance for analytics solutions on AWS Glue, Amazon Redshift, AWS Lambda, and other services. Outside of work, she is a DIYer, likes to go on long trail hikes, and practices yoga.

Shayon Sanyal is a Sr. Solutions Architect specializing in databases at AWS. His day job allows him to help AWS customers design scalable, secure, performant, and robust database architectures in the cloud. Outside of work, you can find him hiking, traveling, or training for his next half-marathon.
