SAP Datasphere Data Flow Series – Script Operator Part 1

Hey Champs,

Hope you enjoyed the last blog, SAP Datasphere Data Flow Series – Operators (Joins, Projection, Aggregation, Union); well, that was just the trailer. Today let's jump into the Script Operator of Datasphere in detail.

The Script Operator seamlessly integrates the functionalities of popular Python libraries Pandas and NumPy with SAP Data Warehouse Cloud, enabling the creation of Data Flows and tailored information views. This versatile operator caters to a diverse range of tasks, including data cleansing, data transformation, and more.

Syntax for script operator:

———————————————————————–

def transform(data):
    # Fill in your scripts here with data as a source
    # and save the values into data as output
    return data

————————————————————————
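
To make the template concrete, here is a minimal sketch. The column name 'Region' is purely hypothetical and only shows where custom logic goes; data arrives as a Pandas DataFrame holding one batch of rows:

def transform(data):
    # Hypothetical example: add an upper-case copy of a text column
    # called 'Region' to the incoming batch and return the result.
    data['Region Upper'] = data['Region'].str.upper()
    return data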

The currently included libraries are restricted to Pandas, NumPy, and several built-in modules. Even within these, not all functions are supported; check the official documentation for what is available.

SAP Official Documentation

Limitations of the SAP Datasphere Script Operator:

  • We have a dedicated Python editor where we can write our code, but it currently has no syntax validator. That hurts a lot in practice; hopefully one will be added in the future.
  • Because data is processed in batches with a maximum size of 100,000 rows, Pandas cannot reliably remove duplicates: it only sees one batch at a time, so duplicates split across batches survive (see the short sketch after the screenshot below). SQL script is the recommended approach for accurate duplicate detection across batches.


Limitation due to batch size
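
To see the limitation in action, here is a minimal sketch of a dedup script (the use of drop_duplicates() here is just an illustration, not necessarily the code in the screenshot):

def transform(data):
    # drop_duplicates() only sees the rows of the current batch
    # (at most 100,000 per call), so a duplicate whose copies land in
    # different batches is never detected and survives in the target.
    return data.drop_duplicates()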

Importance of the Datasphere Script Operator in the real world:

The Script Operator in SAP Datasphere offers several valuable use cases and is crucial for real-world data manipulation and analysis tasks. Here are some of its key uses and advantages:

  • Data Cleaning and Preprocessing: The Script Operator works very well for cleaning and preparing raw data for analysis. It helps users handle tasks like removing invalid characters, imputing missing values, standardizing data formats, and detecting outliers, ensuring data quality and reliability for analytical processes (a small sketch of a typical cleaning transform follows this list).
  • Complex Data Transformations: The Script Operator extends the capabilities of SAP Datasphere by enabling complex data transformations which are not feasible with the built-in operators. It gives users the flexibility to perform advanced data manipulation techniques like feature extraction from text data, data enrichment through external sources, and custom aggregation functions.
  • Integration with Python Libraries: The Script Operator seamlessly integrates with popular Python libraries like Pandas, NumPy, and built-in modules. This allows users to leverage the extensive functionalities of these libraries for data manipulation, analysis, and exploration.
  • Handling Unstructured Data: The Script Operator effectively handles unstructured data formats like text, JSON, and XML. Users can write Python code to parse, extract, and transform unstructured data, making it usable for downstream analysis.
  • Custom Data Validation and Checks: The Script Operator facilitates custom data validation and checks. Users can write Python code to implement custom business rules, data quality checks, and anomaly detection mechanisms, ensuring data integrity and consistency.
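
Here is the cleaning sketch referenced above. The columns 'Customer Name' and 'Order Amount' are hypothetical, the thresholds are arbitrary, and every statistic is computed per batch; the import only keeps the sketch self-contained outside Datasphere:

import numpy as np

def transform(data):
    # Standardize text: trim whitespace and drop non-alphanumeric characters.
    data['Customer Name'] = (
        data['Customer Name']
        .str.strip()
        .str.replace(r'[^A-Za-z0-9 ]', '', regex=True)
    )
    # Impute missing amounts with the median of the current batch.
    data['Order Amount'] = data['Order Amount'].fillna(data['Order Amount'].median())
    # Flag outliers more than 3 standard deviations from the batch mean.
    mean, std = data['Order Amount'].mean(), data['Order Amount'].std()
    data['Is Outlier'] = np.abs(data['Order Amount'] - mean) > 3 * std
    return data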

Note: Through the use cases we will learn how to use the Script Operator and get familiar with its syntax. Not every use case will be directly useful on its own; focus on how the operator is used and how the code is written.

Use Case 1:

We often face scenarios where we want to count the number of employees in each department or the number of items in an order. So let's take a similar example, where I will use the Script Operator to count the orders per payment mode. Along the way we will see how efficient the Script Operator is and what flexibility writing Python code gives us.

Let’s have a look at our data:

This is our order table data and using this I want to count the payment modes.


Order Table Sample Data

So we will expect some output like this.


Expected Output Script Operator

Let’s have a look at the code and understand it step by step:


Python Script to group Data

This code defines a function called transform that takes a Pandas DataFrame as input and returns a new DataFrame with two columns: ‘Payment Mode’ and ‘Count of Payment mode’. The function performs the following steps (a sketch of the script follows the list below):

  1. Convert ‘Payment Mode’ column to categorical data type: The Payment Mode column is converted to a categorical data type using the pd.Categorical() function. This is useful for tasks like data analysis and machine learning, as it allows for more efficient data manipulation and representation.
  2. Group the data by the ‘Payment Mode’ column: The data is grouped by the ‘Payment Mode’ column using the DataFrame.groupby() method. This creates a grouping object that allows us to iterate over the data grouped by the payment mode.
  3. Create a new DataFrame to store the desired output: An empty DataFrame is created with two columns: ‘Payment Mode’ and ‘Count of Payment mode’. This DataFrame will store the final output of the function.
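
Since the script itself appears only as a screenshot, here is a minimal sketch of a transform following the three steps above; the exact code in the screenshot may differ, and the output column names are taken from the expected output shown earlier. The import only keeps the sketch self-contained outside Datasphere:

import pandas as pd

def transform(data):
    # Step 1: convert 'Payment Mode' to a categorical data type.
    data['Payment Mode'] = pd.Categorical(data['Payment Mode'])
    # Step 2: group by payment mode and count the rows in each group.
    counts = data.groupby('Payment Mode').size()
    # Step 3: build the output DataFrame with the two target columns.
    # As with any script, this runs per batch of up to 100,000 rows.
    result = pd.DataFrame({
        'Payment Mode': counts.index,
        'Count of Payment mode': counts.values
    })
    return result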

Now let's get our hands dirty by going into Datasphere and building the data flow.

  1. Add the table as a source into the play area.
  2. Add a projection and keep only the Payment Mode column, removing the others.
  3. Add a Script Operator, write the above script in it, and connect everything.
  4. Add a target table and connect the Script Operator to it. Now save and deploy the data flow.


Dataflow Script Operator

Conclusion:

This blog introduced the Script Operator available in the Datasphere data flow, which we will use throughout this blog series, and explained each part of it in detail. The Script Operator is a very handy tool for a lot of data manipulation and analysis. The things we can do using SQL script can also be done using the Script Operator, but since Python is the language, it gives us the flexibility to use many built-in functions for further operations on data. In the second part of this series I will show what can be done easily with the Script Operator that we can't do using SQL script.

Thanks for reading! I hope you find this post helpful. For any questions or feedback, just leave a comment below this post. Feel free to also check out the other blog posts in the series and follow me to learn and master SAP Analytics. Let me know if you find something that can be improved or added.

Best wishes,

Kunal Mohanty

