
How to Speed Up Data Flow Between Databricks and SAS


This is a collaborative post between Databricks and T1A. We thank Oleg Mikhov, Solutions Architect at T1A, for his contributions.

 
This is the first post in a series of blogs on best practices for bringing together the Databricks Lakehouse Platform and SAS. A previous Databricks blog post introduced Databricks and PySpark to SAS developers. In this post, we discuss ways to exchange data between SAS and the Databricks Lakehouse Platform and how to speed up the data flow. In future posts, we will explore building efficient data and analytics pipelines involving both technologies.

Data-driven organizations are rapidly adopting the lakehouse platform to keep up with constantly growing business demands. The lakehouse platform has become a new norm for organizations that want to build data platforms and architecture. Modernization involves moving data, applications, or other business elements to the cloud. However, the transition to the cloud is a gradual process, and it is business-critical to continue leveraging legacy investments for as long as possible. With that in mind, many companies tend to have multiple data and analytics platforms, where the platforms coexist and complement each other.

One of the combinations we see is the use of SAS with the Databricks Lakehouse. There are many benefits of enabling the two platforms to work together efficiently, such as:

  • Greater and more scalable data storage capabilities of cloud platforms
  • Greater computing capacity using technologies such as Apache Spark™, natively built with parallel processing capabilities
  • Better compliance with data governance and management using Delta Lake
  • Lower cost of data analytics infrastructure with simplified architectures

Some common data science and data analysis use cases and reasons observed are:

  1. SAS practitioners leverage SAS for its core statistical packages to develop advanced analytics output that meets regulatory requirements, while they use the Databricks Lakehouse for data management, ELT types of processing, and data governance
  2. Machine learning models developed in SAS are scored on large amounts of data using the parallel processing architecture of the Apache Spark engine in the lakehouse platform
  3. SAS data analysts gain faster access to large amounts of data in the Lakehouse Platform for ad-hoc analysis and reporting using Databricks SQL endpoints and high-bandwidth connectors
  4. Ease the cloud modernization and migration journey by establishing a hybrid workstream involving both the cloud architecture and the on-prem SAS platform

However, a key challenge of this coexistence is how data is shared performantly between the two platforms. In this blog, we share best practices implemented by T1A for their customers and benchmark results comparing different methods of moving data between Databricks and SAS.

Scenarios

The most common use case is a SAS developer trying to access data in the lakehouse. However, analytics pipelines involving both technologies require data flow in both directions: data moved from Databricks to SAS and data moved from SAS to Databricks.

  1. Access Delta Lake from SAS: A SAS user wants to access big data in Delta Lake using the SAS programming language.
  2. Access SAS datasets from Databricks: A Databricks user wants to access SAS datasets, usually sas7bdat datasets, as a DataFrame to process in Databricks pipelines or store in Delta Lake for enterprise-wide access.

In our benchmark tests, we used the following environment setup:

  1. Microsoft Azure as the cloud platform
  2. SAS 9.4M7 on Azure (single node Standard D8s v3 VM)
  3. Databricks Runtime 9.0, Apache Spark 3.1.2 (2-node Standard DS4v2 cluster)

Figure 1 shows the conceptual architecture diagram with the components discussed. The Databricks Lakehouse sits on Azure Data Lake Storage with a Delta Lake medallion architecture. SAS 9.4, installed on an Azure VM, connects to the Databricks Lakehouse to read/write data using the connection options discussed in the following sections.

Figure 1: SAS and Databricks conceptual architecture diagram on Azure

The diagram above shows a conceptual architecture of Databricks deployed on Azure. The architecture will be similar on other cloud platforms. In this blog, we only discuss the integration with the SAS 9.4 platform. In a later blog post, we will extend this discussion to accessing lakehouse data from SAS Viya.

Access Delta Lake from SAS

Imagine that we have a Delta Lake table that needs to be processed in a SAS program. We want the best performance when accessing this table, while also avoiding any possible issues with data integrity or data type compatibility. There are different ways to achieve data integrity and compatibility. Below we discuss a few methods and compare them on ease of use and performance.

In our testing, we used the eCommerce behavior dataset (5.67 GB, 9 columns, ~42 million records) from Kaggle.
Data source credit: eCommerce behavior data from multi category store and REES46 Marketing Platform.

Tested methods

1. Using SAS/ACCESS Interface connectors
Traditionally, SAS users leverage SAS/ACCESS software to connect to external data sources. You can either use a SAS LIBNAME statement pointing to the Databricks cluster or use the SQL pass-through facility. Currently, for SAS 9.4, there are three connection options available:

  1. SAS/ACCESS Interface to ODBC
  2. SAS/ACCESS Interface to JDBC
  3. SAS/ACCESS Interface to Spark

SAS/ACCESS Interface to Spark has recently been loaded with capabilities, including exclusive support for Databricks clusters. See this video for a short demonstration. The video mentions SAS Viya, but the same is applicable to SAS 9.4.

Code samples on how to use these connectors can be found in this git repository: T1A Git – SAS Libraries Examples.

2. Using the saspy package
The open-source library saspy, from SAS Institute, allows Databricks Notebook users to run SAS statements from a Python cell in the notebook to execute code on the SAS server, as well as to import and export data between SAS datasets and pandas DataFrames.

Since the focus of this section is accessing lakehouse data by a SAS programmer using SAS programming, this method was wrapped in a SAS macro program, similar to the purpose-built integration method discussed next.

To achieve better performance with this package, we tested the configuration with a defined char_lengths option (details available here). With this option, we can define lengths for character fields in the dataset. In our tests, using this option brought an additional 15% increase in performance. For the transport layer between environments, we used the saspy configuration with an SSH connection to the SAS server.
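
As an illustration of this approach, here is a minimal sketch, assuming saspy is installed on the Databricks cluster and an SSH configuration (named "ssh" here) points at the SAS 9.4 server; the table name, libref, and character lengths are placeholders, not values from our benchmark.

```python
import saspy

# SSH transport to the SAS server; "ssh" is a placeholder saspy configuration name.
sas = saspy.SASsession(cfgname="ssh")

# Read the Delta table with Spark and collect it to pandas for saspy.
# Note: this pulls the entire DataFrame through the Spark driver's memory.
pdf = spark.table("ecommerce_events").toPandas()

# Supplying known maximum lengths for character columns avoids saspy's default
# handling; in our tests this option gave roughly a 15% speedup.
char_lengths = {"event_type": 16, "category_code": 80, "brand": 32}  # illustrative

# Land the data in a SAS library (WORK here) as a SAS dataset.
sas.df2sd(pdf, table="ecommerce_events", libref="work", char_lengths=char_lengths)
sas.endsas()
```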

3. Using a purpose-built integration
Although the two methods mentioned above have their upsides, performance can be improved further by addressing some of their shortcomings, discussed in the next section (Test results). With that in mind, we developed a SAS macro-based integration utility with a primary focus on performance and usability for SAS users. The SAS macro can be easily integrated into existing SAS code without any knowledge of the Databricks platform, Apache Spark, or Python.

The macro orchestrates a multistep process using the Databricks API (an illustrative sketch follows the list):

  1. Instruct the Databricks cluster to query and extract data per the provided SQL query and cache the results in DBFS, relying on its Spark SQL distributed processing capabilities.
  2. Compress and securely transfer the dataset to the SAS server (CSV in GZIP) over SSH
  3. Unpack and import the data into SAS to make it available to the user in the SAS library. At this step, leverage column metadata from the Databricks data catalog (column types, lengths, and formats) for consistent, correct, and efficient data presentation in SAS
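
The sketch below is a simplified Python rendition of the same orchestration pattern, not the T1A macro itself (which is written in SAS). It assumes the databricks-sql-connector and paramiko packages, uses placeholder host names, tokens, paths, and query, and fetches the result set directly to the client rather than caching it in DBFS.

```python
import csv
import gzip
import paramiko
from databricks import sql

QUERY = "SELECT * FROM sales.events WHERE event_date >= '2020-10-01'"  # placeholder

# Step 1: run the query on the Databricks cluster and fetch the result set.
conn = sql.connect(server_hostname="adb-<workspace>.azuredatabricks.net",
                   http_path="<cluster-or-endpoint-http-path>",
                   access_token="<personal-access-token>")
cur = conn.cursor()
cur.execute(QUERY)
columns = [desc[0] for desc in cur.description]
rows = cur.fetchall()
cur.close()
conn.close()

# Step 2: stage the extract as gzipped CSV and push it to the SAS server over SSH.
with gzip.open("/tmp/extract.csv.gz", "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    writer.writerows(rows)

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("sas-server.example.com", username="sasuser")
ssh.open_sftp().put("/tmp/extract.csv.gz", "/sasdata/extract.csv.gz")
ssh.close()

# Step 3 (importing the CSV into a SAS library using the column metadata from the
# Databricks catalog) runs as SAS code on the SAS server and is omitted here.
```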

Note that for variable-length data types, the integration supports different configuration options, depending on what best fits the user requirements, such as:

  • using a configurable default value
  • profiling the first 10,000 rows (plus added headroom) to determine the largest value
  • profiling the entire column in the dataset to determine the largest value

A simplified version of the code is available here: T1A Git – SAS DBR Custom Integration.

The end-user invocation of this SAS macro takes three inputs:

  1. SQL query, based on which data will be extracted from Databricks
  2. SAS libref where the data should land
  3. Name to be given to the SAS dataset

Test results

Figure 2: Databricks to SAS data access methods performance

As shown in the plot above, for the test dataset, SAS/ACCESS Interface to JDBC and SAS/ACCESS Interface to Apache Spark showed similar performance and performed worse than the other methods. The main reason is that these methods do not profile character columns in the dataset in order to set proper column lengths in the SAS dataset. Instead, they define the default length for all character column types (String and Varchar) as 765 symbols. That causes performance issues not only during the initial data retrieval but for all further processing, and it consumes significantly more storage: in our tests, a 5.6 GB source dataset ended up as a 216 GB file in the WORK library. With the SAS/ACCESS Interface to ODBC, however, the default length was 255 symbols, which resulted in a significant performance boost.

Using the SAS/ACCESS Interface methods is the most convenient option for existing SAS users. There are some important considerations when you use these methods:

  1. Both options support implicit query pass-through, but with some limitations:
    • SAS/ACCESS Interface to JDBC/ODBC supports pass-through only for PROC SQL statements
    • In addition to PROC SQL pass-through, SAS/ACCESS Interface to Apache Spark supports pass-through for most SQL functions. This method also allows pushing common SAS procedures to Databricks clusters.
  2. The issue with setting the length of character columns described earlier. As a workaround, we suggest using the DBSASTYPE option to explicitly set column lengths for SAS tables. This helps with further processing of the dataset but won't affect the initial retrieval of the data from Databricks.
  3. SAS/ACCESS Interface to Spark/JDBC/ODBC does not allow combining tables from different Databricks databases (schemas) assigned as different libnames in the same query (joining them) with the pass-through facility. Instead, entire tables are exported to SAS and processed in SAS. As a workaround, we suggest creating a dedicated schema in Databricks that contains views based on tables from the different databases (schemas).

Using the saspy method showed slightly better performance compared to the SAS/ACCESS Interface to JDBC/Spark methods; however, its main downside is that the saspy library only works with pandas DataFrames, which puts a significant load on the Apache Spark driver program and requires the entire DataFrame to be pulled into memory.

The purpose-built integration method showed the best performance among the tested methods. Figure 3 shows a flow chart with high-level guidance for choosing among the methods discussed.

Figure 3: Databricks to SAS data access – method selection

Access SAS datasets from Databricks

This section addresses the need of Databricks developers to ingest a SAS dataset into Delta Lake and make it available in Databricks for business intelligence, visual analytics, and other advanced analytics use cases. While some of the previously described methods are applicable here, some additional methods are discussed.

In the test, we start with a SAS dataset (in sas7bdat format) on the SAS server, and in the end we have this dataset available as a Spark DataFrame in Databricks (where lazy evaluation applies, we force the data to load into a DataFrame and measure the overall time).

We used the same environment and the same dataset for this scenario as in the previous scenario. The tests do not consider the use case where a SAS user writes a dataset into Delta Lake using SAS programming; that involves taking cloud provider tools and capabilities into account, which will be discussed in a later blog post.

Tested methods

1. Using the saspy package from SAS
The sd2df method in the saspy library converts a SAS dataset to a pandas DataFrame, using SSH for data transfer. It offers several options for staging storage (MEMORY, CSV, DISK) during the transfer. In our test, the CSV option, which uses PROC EXPORT to a csv file and the pandas read_csv() method and is the recommended option for large data sets, showed the best performance.
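
For reference, a minimal sketch of this method is shown below, assuming a saspy SSH configuration named "ssh" and an illustrative dataset mylib.ecommerce on the SAS server; the Delta table name is also a placeholder.

```python
import saspy

sas = saspy.SASsession(cfgname="ssh")  # "ssh" is a placeholder configuration name

# sd2df_CSV stages the transfer through PROC EXPORT and pandas.read_csv(), the
# recommended route for large datasets.
pdf = sas.sd2df_CSV(table="ecommerce", libref="mylib")
sas.endsas()

# An additional step is needed to get from pandas to a distributed Spark DataFrame.
sdf = spark.createDataFrame(pdf)
sdf.write.format("delta").mode("overwrite").saveAsTable("bronze.ecommerce")
```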

2. Using the pandas method
Since its early releases, pandas has allowed users to read sas7bdat files using the pandas.read_sas API. The SAS file needs to be accessible to the Python program. Commonly used methods are FTP, HTTP, or moving the file to cloud object storage such as S3. We instead used a simpler approach: moving the SAS file from the remote SAS server to the Databricks cluster using SCP.
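
A minimal sketch, assuming the sas7bdat file has already been copied to the driver node over SCP (the path, encoding, and table names are illustrative):

```python
import pandas as pd

# pandas reads the file on the driver only; no Spark parallelism is involved here.
pdf = pd.read_sas("/tmp/ecommerce.sas7bdat", format="sas7bdat", encoding="latin-1")

# An extra conversion step makes the data available as a distributed Spark DataFrame.
sdf = spark.createDataFrame(pdf)
```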

3. Using spark-sas7bdat
spark-sas7bdat is an open-source package developed specifically for Apache Spark. As with the pandas.read_sas() method, the SAS file must be available on the filesystem. We downloaded the sas7bdat file from the remote SAS server using SCP.
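
A sketch of the read, assuming the spark-sas7bdat Spark package is attached to the cluster and the file has been copied to DBFS; the paths and table names are illustrative.

```python
# The data source registered by spark-sas7bdat reads blocks of the sas7bdat file
# in parallel across the worker nodes.
sdf = (spark.read
       .format("com.github.saurfang.sas.spark")
       .load("dbfs:/tmp/ecommerce.sas7bdat"))

sdf.write.format("delta").mode("overwrite").saveAsTable("bronze.ecommerce")
```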

4. Using a purpose-built integration
Another method that was explored uses conventional techniques with a focus on balancing convenience and performance. This method abstracts away the core integration and is made available to the user as a Python library that is executed from a Databricks notebook (see the sketch after the steps below).

  1. Use the saspy package to execute SAS macro code (on the SAS server) which does the following:
    • Export the sas7bdat to a CSV file using SAS code
    • Compress the CSV file to GZIP
  2. Move the compressed file to the Databricks cluster driver node using SCP
  3. Decompress the CSV file
  4. Read the CSV file into an Apache Spark DataFrame
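
The sketch below is an illustrative Python rendition of the notebook side of these steps, not the packaged library itself; it assumes a saspy SSH configuration, paramiko for the file transfer, and placeholder paths and dataset names.

```python
import gzip
import shutil
import paramiko
import saspy

# Step 1: run SAS code on the SAS server to export the dataset to CSV and gzip it
# (a stand-in for the full SAS macro that the library submits).
sas = saspy.SASsession(cfgname="ssh")
sas.submit("""
    proc export data=mylib.ecommerce outfile='/sasdata/ecommerce.csv' dbms=csv replace;
    run;
    x 'gzip -f /sasdata/ecommerce.csv';
""")
sas.endsas()

# Step 2: move the compressed file to the Databricks cluster driver node over SSH.
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("sas-server.example.com", username="sasuser")
ssh.open_sftp().get("/sasdata/ecommerce.csv.gz", "/tmp/ecommerce.csv.gz")
ssh.close()

# Steps 3-4: decompress, copy into DBFS so the workers can see it, and read with Spark.
with gzip.open("/tmp/ecommerce.csv.gz", "rb") as src, open("/tmp/ecommerce.csv", "wb") as dst:
    shutil.copyfileobj(src, dst)

dbutils.fs.cp("file:/tmp/ecommerce.csv", "dbfs:/tmp/ecommerce.csv")
sdf = spark.read.csv("dbfs:/tmp/ecommerce.csv", header=True, inferSchema=True)
```
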
Test results

Figure 4: SAS to Databricks data access methods performance

The spark-sas7bdat package showed the best performance among all of the methods. It takes full advantage of parallel processing in Apache Spark by distributing blocks of the sas7bdat file across worker nodes. The major downside of this method is that sas7bdat is a proprietary binary format, and the library was built by reverse engineering that format, so it does not support all types of sas7bdat files and is not officially (commercially) vendor-supported.

The saspy and pandas methods are similar in that they are both built for a single-node environment and both read data into a pandas DataFrame, requiring an additional step before the data is available as a Spark DataFrame.

The purpose-built integration showed better performance compared to saspy and pandas because it reads the data from CSV through the Apache Spark APIs. However, it does not beat the performance of the spark-sas7bdat package. The purpose-built method can still be convenient in some cases, as it allows adding intermediate data transformations on the SAS server.

Conclusion

More and more enterprises are gravitating towards building a Databricks Lakehouse, and there are multiple ways of accessing data in the Lakehouse from other technologies. This blog discusses how SAS developers, data scientists, and other business users can leverage the data in the Lakehouse and write the results back to the cloud. In our experiment, we tested several different methods of reading and writing data between Databricks and SAS. The methods vary not only in performance but also in convenience and the additional capabilities they provide.

For this test, we used the SAS 9.4M7 platform. SAS Viya supports most of the discussed approaches and also provides additional options. If you would like to learn more about these methods or other specialized integration approaches not covered here, feel free to reach out to us at Databricks or [email protected].

In upcoming posts in this blog series, we will discuss best practices for implementing integrated data pipelines and end-to-end workflows using SAS and Databricks, and how to leverage SAS In-Database technologies for scoring SAS models on Databricks clusters.

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Get started

Try the course Databricks for SAS Users on Databricks Academy to get a basic hands-on experience with PySpark programming for SAS programming language constructs, and contact us to learn more about how we can assist your SAS team in onboarding their ETL workloads to Databricks and enabling best practices.


