
Develop and test AWS Glue version 3.0 jobs locally using a Docker container


AWS Glue is a fully managed serverless service that allows you to process data coming from different data sources at scale. You can use AWS Glue jobs for various use cases such as data ingestion, preprocessing, enrichment, and data integration from different data sources. AWS Glue version 3.0, the latest version of AWS Glue Spark jobs, provides a performance-optimized Apache Spark 3.1 runtime experience for batch and stream processing.

You can author AWS Glue jobs in different ways. If you prefer coding, AWS Glue allows you to write Python/Scala source code with the AWS Glue ETL library. If you prefer interactive scripting, AWS Glue interactive sessions and AWS Glue Studio notebooks allow you to write scripts in notebooks by inspecting and visualizing the data. If you prefer a graphical interface rather than coding, AWS Glue Studio helps you author data integration jobs visually without writing code.

For a production-ready data platform, a development process and CI/CD pipeline for AWS Glue jobs is key. We understand the significant demand for developing and testing AWS Glue jobs where you prefer to have flexibility: a local laptop, a Docker container on Amazon Elastic Compute Cloud (Amazon EC2), and so on. You can achieve that by using AWS Glue Docker images hosted on Docker Hub or the Amazon Elastic Container Registry (Amazon ECR) Public Gallery. The Docker images help you set up your development environment with additional utilities. You can use your preferred IDE, notebook, or REPL with the AWS Glue ETL library.

This post is a continuation of the blog post "Developing AWS Glue ETL jobs locally using a container". Whereas the earlier post introduced the pattern of development for AWS Glue ETL jobs on a Docker container using a Docker image, this post focuses on how to develop and test AWS Glue version 3.0 jobs using the same approach.

Solution overview

The following Docker images are available for AWS Glue on Docker Hub:

  • AWS Glue version 3.0: amazon/aws-glue-libs:glue_libs_3.0.0_image_01
  • AWS Glue version 2.0: amazon/aws-glue-libs:glue_libs_2.0.0_image_01

You can also obtain the images from the Amazon ECR Public Gallery:

  • AWS Glue version 3.0: public.ecr.aws/glue/aws-glue-libs:glue_libs_3.0.0_image_01
  • AWS Glue version 2.0: public.ecr.aws/glue/aws-glue-libs:glue_libs_2.0.0_image_01

Note: AWS Glue Docker images are x86_64 compatible, and arm64 hosts are currently not supported.

In this post, we use amazon/aws-glue-libs:glue_libs_3.0.0_image_01 and run the container on a local machine (Mac, Windows, or Linux). This container image has been tested for AWS Glue version 3.0 Spark jobs. The image contains the following:

  • Amazon Linux
  • AWS Glue ETL library (aws-glue-libs)
  • Apache Spark 3.1.1
  • Spark history server
  • JupyterLab
  • Livy
  • Other library dependencies (the same ones as those of the AWS Glue job system)

To set up your container, you pull the image from Docker Hub and then run the container. We demonstrate how to run your container with the following methods, depending on your requirements:

  • spark-submit
  • REPL shell (pyspark)
  • pytest
  • JupyterLab
  • Visual Studio Code

Prerequisites

Before you start, make sure that Docker is installed and the Docker daemon is running. For installation instructions, see the Docker documentation for Mac, Windows, or Linux. Also make sure that you have at least 7 GB of disk space for the image on the host running Docker.
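
One quick way to confirm that the daemon is reachable is to run docker info, which prints server details on success and an error if the daemon isn't running:

$ docker info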

For more information about restrictions when developing AWS Glue code locally, see Local Development Restrictions.

Configure AWS credentials

To enable AWS API calls from the container, set up your AWS credentials with the following steps:

  1. Create an AWS named profile (an example follows this list).
  2. Open cmd on Windows or a terminal on Mac/Linux, and run the following command:
    PROFILE_NAME="profile_name"
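
For illustration, assuming the AWS CLI is installed on the host, you can create the named profile with the following command (profile_name is a placeholder; the command prompts for your access key ID, secret access key, and default Region):

    $ aws configure --profile profile_name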

In the following sections, we use this AWS named profile.

Pull the image from Docker Hub

If you're running Docker on Windows, choose the Docker icon (right-click) and choose Switch to Linux containers… before pulling the image.

Run the following command to pull the image from Docker Hub:

docker pull amazon/aws-glue-libs:glue_libs_3.0.0_image_01

Run the container

Now you can run a container using this image. You can choose any of the following methods based on your requirements.

spark-submit

You can run an AWS Glue job script by running the spark-submit command on the container.

Write your ETL script (sample.py in the example below) and save it under the /local_path_to_workspace/src/ directory using the following commands:

$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/src
$ vim ${WORKSPACE_LOCATION}/src/${SCRIPT_FILE_NAME}

These variables are used in the docker run command below. The sample code (sample.py) used in the spark-submit command below is included in the appendix at the end of this post.

Run the following command to run the spark-submit command on the container to submit a new Spark application:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_spark_submit amazon/aws-glue-libs:glue_libs_3.0.0_image_01 spark-submit /home/glue_user/workspace/src/$SCRIPT_FILE_NAME
...
22/01/26 09:08:55 INFO DAGScheduler: Job 0 finished: fromRDD at DynamicFrame.scala:305, took 3.639886 s
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string

...

REPL shell (pyspark)

You can run a REPL (read-eval-print loop) shell for interactive development. Run the following command to run the pyspark command on the container to start the REPL shell:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark
...
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1-amzn-0
      /_/

Using Python version 3.7.10 (default, Jun 3 2021 00:02:01)
Spark context Web UI available at http://56e99d000c99:4040
Spark context available as 'sc' (master = local[*], app id = local-1643011860812).
SparkSession available as 'spark'.
>>> 
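
With the shell running, you can try the AWS Glue ETL library interactively. The following is a minimal sketch that assumes the same public us-legislators dataset used in the appendix and S3 read permissions for it:

>>> from awsglue.context import GlueContext
>>> glue_context = GlueContext(sc)   # 'sc' is the SparkContext created by the shell
>>> dyf = glue_context.create_dynamic_frame.from_options(
...     connection_type="s3",
...     connection_options={"paths": ["s3://awsglue-datasets/examples/us-legislators/all/persons.json"], "recurse": True},
...     format="json")
>>> dyf.count()   # record count; the unit test in the appendix expects 1961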

pytest

For unit testing, you can use pytest for AWS Glue Spark job scripts.

Run the following commands for preparation:

$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ UNIT_TEST_FILE_NAME=test_sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/tests
$ vim ${WORKSPACE_LOCATION}/tests/${UNIT_TEST_FILE_NAME}

Run the following command to run pytest on the test suite:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pytest amazon/aws-glue-libs:glue_libs_3.0.0_image_01 -c "python3 -m pytest"
starting org.apache.spark.deploy.history.HistoryServer, logging to /home/glue_user/spark/logs/spark-glue_user-org.apache.spark.deploy.history.HistoryServer-1-5168f209bd78.out
============================================================= test session starts =============================================================
platform linux -- Python 3.7.10, pytest-6.2.3, py-1.11.0, pluggy-0.13.1
rootdir: /home/glue_user/workspace
plugins: anyio-3.4.0
collected 1 item

tests/test_sample.py . [100%]

============================================================== warnings summary ===============================================================
tests/test_sample.py::test_counts
 /home/glue_user/spark/python/pyspark/sql/context.py:79: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
 DeprecationWarning)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
======================================================== 1 passed, 1 warning in 21.07s ========================================================

JupyterLab

You can start Jupyter for interactive development and ad hoc queries on notebooks. Complete the following steps:

  1. Run the following command to start JupyterLab:
    $ JUPYTER_WORKSPACE_LOCATION=/local_path_to_workspace/jupyter_workspace/
    $ docker run -it -v ~/.aws:/home/glue_user/.aws -v $JUPYTER_WORKSPACE_LOCATION:/home/glue_user/workspace/jupyter_workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 -p 8998:8998 -p 8888:8888 --name glue_jupyter_lab amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/glue_user/jupyter/jupyter_start.sh
    ...
    [I 2022-01-24 08:19:21.368 ServerApp] Serving notebooks from local directory: /home/glue_user/workspace/jupyter_workspace
    [I 2022-01-24 08:19:21.368 ServerApp] Jupyter Server 1.13.1 is running at:
    [I 2022-01-24 08:19:21.368 ServerApp] http://faa541f8f99f:8888/lab
    [I 2022-01-24 08:19:21.368 ServerApp] or http://127.0.0.1:8888/lab
    [I 2022-01-24 08:19:21.368 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

  2. Open http://127.0.0.1:8888/lab in your web browser on your local machine to access the JupyterLab UI.
  3. Choose Glue Spark Local (PySpark) under Notebook.

You can now start developing code in the interactive Jupyter notebook UI.
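
For example, a first notebook cell might look like the following. This is a minimal sketch that assumes the same public us-legislators dataset used in the appendix and S3 read permissions for it:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the sample dataset into a DynamicFrame
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://awsglue-datasets/examples/us-legislators/all/persons.json"], "recurse": True},
    format="json",
)

# Ad hoc query: convert to a Spark DataFrame and aggregate by gender
dyf.toDF().groupBy("gender").count().show()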

Visual Studio Code

To set up the container with Visual Studio Code, complete the following steps:

  1. Install Visual Studio Code.
  2. Install Python.
  3. Install Visual Studio Code Remote - Containers.
  4. Open the workspace folder in Visual Studio Code.
  5. Choose Settings.
  6. Choose Workspace.
  7. Choose Open Settings (JSON).
  8. Enter the following JSON and save it:
    {
        "python.defaultInterpreterPath": "/usr/bin/python3",
        "python.evaluation.extraPaths": [
            "/home/glue_user/aws-glue-libs/PyGlue.zip:/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip:/home/glue_user/spark/python/",
        ]
    }

Now you're ready to set up the container.

  1. Run the Docker container:
    $ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark

  2. Start Visual Studio Code.
  3. Choose Remote Explorer in the navigation pane, and choose the container amazon/aws-glue-libs:glue_libs_3.0.0_image_01.
  4. Right-click and choose Attach to Container.
  5. If the following dialog appears, choose Got it.
  6. Open /home/glue_user/workspace/.
  7. Create an AWS Glue PySpark script (for example, sample.py from the appendix) and choose Run.

You should see the successful run of the AWS Glue PySpark script.

Conclusion

In this post, we learned how to get started with AWS Glue Docker images. AWS Glue Docker images help you develop and test your AWS Glue job scripts wherever you prefer. They are available on Docker Hub and the Amazon ECR Public Gallery. Check them out; we look forward to your feedback.

Appendix: AWS Glue job sample codes for testing

This appendix introduces the following scripts as AWS Glue job sample codes for testing purposes. You can use any of them in the tutorial.

The following sample.py code (shown after the policy sketch below) uses the AWS Glue ETL library with an Amazon Simple Storage Service (Amazon S3) API call. The code requires Amazon S3 permissions in AWS Identity and Access Management (IAM): grant the IAM managed policy arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess or an IAM custom policy that allows the ListBucket and GetObject API calls for the S3 path.
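
For illustration only, a custom policy scoped to the public sample dataset bucket might look like the following sketch (adjust the resource ARNs if you point the script at your own S3 path):

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": "arn:aws:s3:::awsglue-datasets"
            },
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::awsglue-datasets/*"
            }
        ]
    }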

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions


class GluePythonSampleTest:
    def __init__(self):
        params = []
        if '--JOB_NAME' in sys.argv:
            params.append('JOB_NAME')
        args = getResolvedOptions(sys.argv, params)

        self.context = GlueContext(SparkContext.getOrCreate())
        self.job = Job(self.context)

        if 'JOB_NAME' in args:
            jobname = args['JOB_NAME']
        else:
            jobname = "take a look at"
        self.job.init(jobname, args)

    def run(self):
        dyf = read_json(self.context, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
        dyf.printSchema()

        self.job.commit()


def read_json(glue_context, path):
    dynamicframe = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={
            'paths': [path],
            'recurse': True
        },
        format="json"
    )
    return dynamicframe


if __name__ == '__main__':
    GluePythonSampleTest().run()

The following test_sample.py code is a sample unit test for sample.py:

import pytest
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import sys
from src import sample


@pytest.fixture(scope="module", autouse=True)
def glue_context():
    sys.argv.append('--JOB_NAME')
    sys.argv.append('test_count')

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    context = GlueContext(SparkContext.getOrCreate())
    job = Job(context)
    job.init(args['JOB_NAME'], args)

    yield(context)

    job.commit()


def test_counts(glue_context):
    dyf = sample.read_json(glue_context, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
    assert dyf.toDF().depend() == 1961


About the Authors

Subramanya Vajiraya is a Cloud Engineer (ETL) at AWS Sydney specialized in AWS Glue. He is passionate about helping customers solve issues related to their ETL workload and implement scalable data processing and analytics pipelines on AWS. Outside of work, he enjoys going on bike rides and taking long walks with his dog Ollie, a 1-year-old Corgi.

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey at AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He enjoys learning different use cases from customers and sharing knowledge about big data technologies with the wider community.
