Unlocking Data Value #3: Machine Learning with SAP

2023-11-23 18:53:52 Author: blogs.sap.com

This blog post is part of a blog post series where we explore how to leverage SAP BTP Data and Analytics capabilities to unlock the full potential of your data. In this series, we will take you on an end-to-end journey through a real-world scenario, demonstrating how SAP Datasphere, SAP Analytics Cloud, and SAP Data Intelligence Cloud can empower your data-driven solutions.

The blog post series consists of four blog posts:

Unlocking Data Value #1: SAP BTP Data and Analytics overview: Learn about our objectives, SAP products, and real-world use case.
Unlocking Data Value #2: Data Integration and Modeling with SAP Datasphere: Explore integration options and data modeling using Data Builder and Business Builder.
Unlocking Data Value #3: Machine Learning with SAP: Discover SAP HANA ML and Data Intelligence for building and operationalizing machine learning models.
Unlocking Data Value #4: Data visualization with SAP Analytics Cloud: Dive into Business Intelligence and Data Connectivity.

The full content of this series was delivered in live sessions to SAP Partners. It was built in cooperation with Thiago de Castro Mendes, Amine MABROUK, Cesare Calabria, Alice Magnani, Mostafa SHARAF, Dayanand Karalkar, Ashish Aitiwadekar and Andrii Rzhaksynskyi. You can find the complete session recordings here.

Before we proceed further, let’s recap the storyline on which this blog post series is based: we have imagined that an SAP partner developed a bookshop management solution for a bookshop that has been in business for many years. This solution has been developed on CAP with SAP HANA Cloud as the persistence layer and is integrated with S/4HANA Cloud. We want to improve the existing solution and take advantage of the customer’s data assets, turning them into business value with actionable insights and intelligence. The goal is to transform the bookshop into an intelligent bookshop that offers best-selling book analysis, ABC classification, sales trends and forecasts, a book recommendation engine based on sales data, and book inventory optimization based on book genre clustering.

We have seen how we can prepare and model all the data we need in the second blog post of this series, so let’s focus now on how we can infuse intelligence into this bookshop solution. But before we do that, let me introduce some basic notions about Artificial Intelligence.

Introduction to Artificial Intelligence

We constantly hear terms such as data science, artificial intelligence, machine learning and even deep learning. These fields are gaining more and more popularity and are a very active area of research.

The landscape of available algorithms and applications is constantly evolving, but in the end all these techniques deal with the same basic concept: having a machine learn something, such as patterns or trends, from real data. So, if applied properly, these techniques can be really valuable in the process of extracting value from data. Recently we have seen a lot of hype around generative AI and large language models (LLMs). Apart from that, many other AI techniques have already been applied in industry for years. Below you can find some examples of machine learning applications that are often seen in industry:

a very common scenario is churn analysis. In the telecommunications industry, for instance, you might want to classify your customers based on how likely they are to leave the company for another provider. This is an example of classification;
another very common use case is predicting market trends, for instance forecasting the price of certain goods in advance. This is an example of regression;
you might want to group your customers based on their behavior, age and preferences, so that marketers can target each group with the most appropriate marketing strategy. This is an example of clustering;
you can provide personalized product recommendations by analyzing the sales history within your enterprise.

There are many other scenarios where machine learning can be useful, and it is often applied in industry use cases. But the real question is: how is it done?

Figure 1: Cross-industry standard process for data mining (CRISP-DM).

Let’s refer to the cross-industry standard process for data mining (Fig. 1), which provides a guideline for how machine learning projects are developed in industry.

Ideally it all starts with a very good business understanding combined with a very good data understanding. Then there is typically a data preparation phase where you might need to bring the different data sources together, transform them, enrich them and so on. This is what we do in Datasphere in our storyline. Then you can start with the modeling: this is where your data scientists typically spend time trying different kinds of models to see what works best, and tuning them to find the optimal parameters.

Next comes a very important evaluation phase, where you test how your model performs against a holdout dataset in order to assess the accuracy of the model and the level of confidence of its predictions. Another very important phase is the deployment of the machine learning model. This is the phase where you start to profit from the whole cycle: you apply your model to score new, fresh data, and this creates value for your company.

But how does SAP infuse intelligence into its business applications and solutions?
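
To make the evaluation phase more concrete, here is a minimal sketch of a holdout evaluation in plain Python. Everything in it is an assumption made for illustration: the toy churn data is invented, and the one-feature threshold rule stands in for whatever model a real project would train. The point is only that the model is fitted on one part of the data and scored on rows it has never seen.

```python
import random


def split_holdout(rows, holdout_fraction=0.25, seed=42):
    """Shuffle reproducibly, then set aside a fraction of rows as the holdout set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def accuracy(threshold, rows):
    """Share of rows where the rule 'churns if complaints >= threshold' is right."""
    hits = sum((complaints >= threshold) == churned for complaints, churned in rows)
    return hits / len(rows)


def fit_threshold(train_rows):
    """Pick the threshold that maximizes accuracy on the training rows only."""
    candidates = sorted({complaints for complaints, _ in train_rows})
    return max(candidates, key=lambda t: accuracy(t, train_rows))


# Invented toy data: (complaints in the last quarter, churned?)
customers = [
    (0, False), (0, False), (1, False), (1, False), (1, True),
    (2, False), (2, False), (2, True), (3, False), (3, True),
    (3, True), (4, True), (4, False), (4, True), (5, True),
    (5, True), (6, True), (6, False), (7, True), (8, True),
]

train_rows, holdout_rows = split_holdout(customers)
threshold = fit_threshold(train_rows)
print("threshold:", threshold)
print("training accuracy:", accuracy(threshold, train_rows))
print("holdout accuracy:", accuracy(threshold, holdout_rows))
```

The holdout accuracy is the number to trust here, because the training accuracy is measured on the same rows that were used to choose the threshold.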

Figure 2: SAP AI solution portfolio.

In SAP, there are different layers of AI integration into business processes and solutions (see Fig. 2). When using SAP solutions like S/4HANA, Ariba, and others, AI functionality is already available and can be easily utilized. In most cases, there is no need for additional actions as AI is readily accessible.

However, if the existing AI capabilities are not fully aligned with specific business requirements, organizations can enhance their experience by leveraging AI Business Services. These services provide complex functionalities such as extracting information from scanned documents, which can be seamlessly integrated into existing business processes through AI API calls and code development. Additionally, SAP’s Build Suite enables the embedding of these functionalities into business processes using process automation products.

Apart from the predefined AI Business Services, SAP offers other solutions within the SAP Business Technology Platform (BTP) that incorporate AI capabilities. These include HANA Cloud, SAP Analytics Cloud and Data Intelligence that will be covered in this blog post and in the next one.

Today, customizing and extending AI to align with unique business processes is fundamental. To this end, SAP introduced AI Core and AI Launchpad, where organizations can incorporate their own AI models, developed by data scientists and AI engineers, and further train and embed them into SAP solutions. This is a more advanced approach that is not covered here, but it was extensively covered in a past openSAP course entitled Building AI and Sustainability Solutions on SAP BTP, which you can check out if interested.

SAP HANA Machine Learning

In this blog post and in our storyline we will cover two SAP products with AI capabilities: HANA ML and Data Intelligence. Let’s start with the first one, HANA Machine Learning. What exactly is SAP HANA Machine Learning?

HANA ML is a library of machine learning algorithms that run embedded in the HANA database. It’s a very comprehensive set of tools and algorithms (see Fig. 3) that can help you cover basically all the different kinds of machine learning scenarios, including classification, regression, text analytics, streaming analytics and time series forecasting.

Figure 3: HANA Machine Learning overview.

It is made of two main building blocks: the Predictive Analysis Library (PAL) and the Automated Predictive Library (APL).

The Predictive Analysis Library (PAL) is a collection of machine learning algorithms. It’s basically the counterpart of scikit-learn in Python, and it can address all the key machine learning scenarios in industry: classification, regression, time series forecasting and so on.

On the other hand, we have the Automated Predictive Library (APL). This is a little bit different: it is not a collection of algorithms, but rather a collection of methods that allow you to build machine learning workflows end to end, and it covers most of the key scenarios. Basically, you just call the method related to the task you want to perform (classification, for instance) and specify the input datasets you want to use for the analysis, and the method takes care of building a machine learning model end to end. It handles the variable selection within the input dataset and the data preparation. It builds and trains different models for the same task, compares their performance and gives you the one that performs best.

So, it is more suitable for personas who don’t need to control every detail of the machine learning scenario, for instance business users. Please note that the Automated Predictive Library is also what runs under the hood in SAP Analytics Cloud to power the smart features (Smart Discovery, Smart Predict, Predictive Planning).

On top of those algorithms, HANA ML also offers the possibility of integrating external machine learning scripts into the HANA DB. In particular, the R integration and the TensorFlow integration are achieved through two kinds of external servers: R servers and TensorFlow Serving servers.

Last but not least, it can be used in different ways and by developers with different backgrounds. Since it runs embedded in the HANA DB, SAP HANA ML methods can be called with SQL. But there are also open-source client APIs that allow data scientists to use HANA ML from RStudio or from a Jupyter notebook. Let’s see an example of how HANA ML can help in our storyline.
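
The "train several models, compare them, keep the best" idea behind an automated library can be sketched in a few lines of plain Python. This is not APL’s API or its internal logic, only an illustration of the general principle; the two candidate models, the toy sales figures and all function names are assumptions made for this example.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the average of the training targets."""
    average = mean(ys)
    return lambda x: average


def fit_line(xs, ys):
    """Second candidate: ordinary least-squares straight line."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def mean_absolute_error(model, xs, ys):
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))


def auto_select(candidates, train, valid):
    """Fit every candidate on the training data and rank them on validation data."""
    scores = {
        name: mean_absolute_error(fit(*train), *valid)
        for name, fit in candidates.items()
    }
    return min(scores, key=scores.get), scores


# Invented toy data: month index -> units sold
train = ([1, 2, 3, 4, 5, 6], [10, 12, 15, 15, 18, 21])
valid = ([7, 8], [22, 25])

best, scores = auto_select({"mean": fit_mean, "line": fit_line}, train, valid)
print("validation errors:", scores)
print("selected model:", best)
```

On this toy data the straight line wins because sales grow steadily, so a constant average lags behind. A real automated library would additionally handle variable selection and data preparation, as described above.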

ML Scenario 1: Book Recommendation using HANA ML

Let’s suppose that our bookshop owner wants to develop a book recommendation engine for his online shop, so that when a customer browsing the shop selects “Marlon Bundo”, for instance, and puts it into the basket, a window pops up suggesting the other books that customers have bought together with “Marlon Bundo”.

How do we help the bookshop owner achieve his goal? We will use the history of book sales stored in Datasphere as the data source and a HANA ML algorithm to build the book recommendations. In particular, we can use the Apriori algorithm, which is part of PAL.

You can see how to do it in the video linked below.
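
For readers who want a feel for what an association-rule algorithm computes, here is a simplified sketch in plain Python. It is not the PAL implementation and does not call HANA: it only covers the first two levels of Apriori (frequent single books, then frequent pairs) and derives "antecedent → consequent" rules with their support and confidence. The baskets, the placeholder titles "Book B" to "Book D" and the threshold values are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) rules for book pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    # Apriori pruning: a pair can only be frequent if both of its books are frequent.
    frequent = {item for item, count in item_counts.items() if count / n >= min_support}

    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent)
        pair_counts.update(combinations(kept, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = count / item_counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    # Highest confidence first; titles break ties so the order is stable.
    return sorted(rules, key=lambda rule: (-rule[3], rule[0], rule[1]))


def recommend(rules, title):
    """Books to suggest when `title` is put into the basket."""
    return [consequent for antecedent, consequent, _, _ in rules if antecedent == title]


# Invented sales history: one list of titles per order
baskets = [
    ["Marlon Bundo", "Book B"],
    ["Marlon Bundo", "Book B", "Book C"],
    ["Marlon Bundo", "Book C"],
    ["Book B", "Book D"],
    ["Marlon Bundo", "Book B"],
]

rules = pair_rules(baskets)
for rule in rules:
    print(rule)
print(recommend(rules, "Marlon Bundo"))
```

The resulting list of rules has the same shape as an association table of antecedent and consequent book pairs, which is all the online shop needs to look up at recommendation time.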

Data and ML orchestration with SAP Data Intelligence

Now let’s have a look at the AI capabilities of SAP Data Intelligence. As we have seen in the previous blog posts, Data Intelligence is the data orchestration pillar of BTP. It has different functionalities to support all the stages of the machine learning life cycle. So, it provides tools to help you discover, browse and manage your data.

In particular, within the metadata catalog you have tools to aid you in the data preprocessing stages, and you can develop ETL flows if needed. There are also tools to help you in the model creation and in the operationalization and deployment of your models.

As a reminder, Data Intelligence is not just a tool to build or operationalize SAP models, for instance models taken from HANA ML. It is a very powerful orchestrator that is also suitable for open-source and heterogeneous data processing and data transformation engines. So you can decide to build machine learning models with tools that have nothing to do with HANA ML, for instance Python, TensorFlow, R, OpenCV and so on.

Figure 4: Data science artifact management in a ML Scenario.

Let’s now have a look at some of the DI features, starting with the Machine Learning Scenario Manager: it is the place where you can really start to build your machine learning model within Data Intelligence. It helps you organize your data science artifacts and keep all the resources related to the same machine learning scenario, that is, to the same business use case, in one single place. So, it’s like having a folder with all the tools that you need to build your machine learning scenario.

When you open a particular machine learning scenario, you will find different sections (see Fig. 4). There is a section where you can link the datasets relevant to the scenario you are dealing with, and you also have direct links to the pipelines that help you automate your machine learning tasks. But most interestingly, there is a section where you can create Jupyter notebooks.

So you have direct access to a JupyterLab environment within Data Intelligence (see Fig. 5). Why is the Jupyter notebook environment embedded in Data Intelligence? It is there so that you can experiment with ML in Python.

Figure 5: An example of Jupyter notebook in Data Intelligence.

As you most probably know, Python is the most widely used language for data science and it’s very important to enable data scientists to code in their own preferred language.

Having Python directly embedded into Data Intelligence is very convenient because if you code in Python directly within Data Intelligence, it’s very easy to have access to all the data sources that are connected to Data Intelligence.

Moreover, in Python you also have access to our HANA ML Python API, so you can run HANA ML from the Jupyter notebook. And last but not least, you can also leverage fully open-source models. In a machine learning context that is constantly evolving, it’s very important to stay open to the data science community. You might want to use the latest model to solve a very specific scenario, or you might want to leverage a pretrained model that is already available rather than spending time training your own model, perhaps on a huge dataset. You have the possibility of doing that within Data Intelligence. Let’s see how it works with an example: the second machine learning scenario, which deals with book clustering through text analysis with open-source Python.

ML Scenario 2: Text Analysis and Book Clustering with Python

Suppose John, our bookshop owner, wants to put some order in his book inventory by grouping together books that cover the same topics or have similar content. He doesn’t have any predefined classification that he wants to follow. He wants to be driven by the content of the books, that is, by the real data.

How can we do that? We can simply leverage the description of the book.

Within the inventory, each book comes with a brief description of its content, and we can try to extract the semantics from that short text. You can see how to do it in the video linked below.
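
As a rough idea of what clustering books by their descriptions can look like, here is a self-contained sketch in plain Python. It turns each description into a TF-IDF vector and groups the vectors with a small k-means loop. The choice of TF-IDF, the stop-word list, the deterministic "farthest-first" initialization and the four descriptions are all assumptions made for this example; the approach shown in the video may differ, and a real project would normally rely on an established library rather than hand-written code.

```python
import math
from collections import Counter

STOPWORDS = {"a", "an", "the", "in", "at", "and", "with", "of"}


def tfidf_vectors(texts):
    """Turn each text into a unit-length TF-IDF vector over a shared vocabulary."""
    docs = [Counter(w for w in t.lower().split() if w not in STOPWORDS) for t in texts]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    idf = {w: math.log(n / sum(1 for d in docs if w in d)) + 1.0 for w in vocab}
    vectors = []
    for d in docs:
        vec = [d[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def farthest_first(vectors, k):
    """Deterministic start: first vector, then repeatedly the point farthest away."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    return [c[:] for c in centroids]


def kmeans(vectors, k, iterations=10):
    """Plain k-means: assign each vector to its nearest centroid, then re-average."""
    centroids = farthest_first(vectors, k)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented book descriptions
descriptions = [
    "a detective investigates a murder in london",
    "a murder mystery with a clever detective",
    "a young wizard learns magic at school",
    "dragons and magic in a wizard kingdom",
]

labels = kmeans(tfidf_vectors(descriptions), k=2)
for label, text in zip(labels, descriptions):
    print(label, text)
```

The two crime descriptions share the words "detective" and "murder", while the two fantasy descriptions share "wizard" and "magic", so they end up in separate groups without any predefined genre labels.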

ML pipelines in SAP Data Intelligence

SAP Data Intelligence can be used to automate ML model operations by creating pipelines, and the Modeler is the key tool to do that.

The Modeler is a graphical drag-and-drop tool where you can build different kinds of data transformation pipelines. These pipelines can cover a very broad range of use cases, and they are also very suitable for machine learning scenarios.

As a matter of fact, for machine learning you usually need to go through several stages of data acquisition and preparation, train and deploy machine learning models, and then store the machine learning output in some target system for further analysis and consumption. All these steps can be covered with great flexibility with the Data Intelligence Modeler.

Figure 6: The Modeler and an example of template pipeline in SAP Data Intelligence.

In the Modeler you will find several templates and examples. They are very useful resources that help you get started with machine learning applications using different tools, for instance HANA ML, basic Python, TensorFlow, R, etc.

Each one of these templates is very well documented. You will find the documentation directly within the Modeler.

In terms of operators, that is, the basic nodes that you can put in your pipeline, there are quite a few choices that help you deal with machine learning. For instance, you have prebuilt nodes to train and score HANA ML models. You also have prebuilt operator templates for Python models.

You have the artifact producer and artifact consumer operators, which help you create and read machine learning models once they are trained, as well as operators that help you keep track of your performance metrics. Moreover, you also have a set of machine learning functional services already packaged into operators. For instance, if you need OCR or image classification, you can find out-of-the-box operators that are already trained and ready to use.

If you don’t find the operators that you specifically need for your pipeline, you are always free to wrap any kind of script into an operator or you can customize the already existing operators. You can see one particular scenario where we might want to develop a custom operator and how we can do it in the video below.

Now let’s see how to use all the DI capabilities and tools to really build an end-to-end machine learning solution.

Most machine learning scenarios consist of two phases. There is an offline phase where you build your model and train it against historical data. Once you have finished building your model, you can save it as an artifact that can be used later in the production environment to make real-time predictions. This artifact is very often a model saved in a binary format. For instance, in the book clustering scenario we saved the trained KMeans model in a binary file (a pickle file). But artifacts can also come in different formats. For instance, in the book recommendation scenario we simply built an association table with the antecedent and consequent book pairs, and that was just a plain table in Datasphere.

Once you have built the machine learning model and you have your artifact ready, you typically want to expose this model through a serving pipeline so that it can make predictions on demand. This can be done via a REST API (see Fig. 7).
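
Saving a trained model as a pickle artifact and loading it back can be sketched as follows. The "model" here is deliberately just a dictionary of cluster centroids standing in for a trained clustering model; the file name, the dictionary layout and the helper names are assumptions made for this example and are not how Data Intelligence stores its artifacts.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifact(model, path):
    """Serialize the trained model to a binary file."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_artifact(path):
    """Read the model back. Only unpickle files that come from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def predict(model, point):
    """Return the index of the centroid closest to `point`."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(point, centroid))
        for centroid in model["centroids"]
    ]
    return distances.index(min(distances))


# Offline phase: pretend training produced these two centroids (invented values).
trained_model = {"algorithm": "kmeans", "centroids": [[0.0, 0.0], [10.0, 10.0]]}
artifact_path = Path(tempfile.mkdtemp()) / "book_clusters.pkl"
save_artifact(trained_model, artifact_path)

# Online phase: a different process would load the artifact and score new data.
restored_model = load_artifact(artifact_path)
print(predict(restored_model, [9.0, 8.0]))
```

Separating the two phases this way means the serving side never needs the training data or the training code, only the artifact and a matching predict function.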

Figure 7: Architecture for wrapping trained models as deployable services (source).
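
To illustrate what "exposing a model through a REST API" means in its simplest form, here is a sketch that uses only the Python standard library. It is not a Data Intelligence serving pipeline: the `/predict` path, the JSON payload shape, the port and the hard-coded centroids (standing in for a loaded artifact) are all assumptions chosen for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a model artifact that would normally be loaded at start-up.
CENTROIDS = [[0.0, 0.0], [10.0, 10.0]]


def predict(point):
    """Return the index of the closest centroid."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in CENTROIDS]
    return distances.index(min(distances))


def handle_predict(raw_body):
    """Turn a raw JSON request body into (HTTP status, JSON response bytes)."""
    try:
        payload = json.loads(raw_body)
        point = [float(value) for value in payload["features"]]
    except (ValueError, KeyError, TypeError):
        error = {"error": 'expected JSON like {"features": [1.0, 2.0]}'}
        return 400, json.dumps(error).encode()
    return 200, json.dumps({"cluster": predict(point)}).encode()


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run_server(host="127.0.0.1", port=8080):
    """Blocks and serves requests until interrupted; call it explicitly to start."""
    HTTPServer((host, port), PredictHandler).serve_forever()


# The request logic can be exercised without starting the server:
status, body = handle_predict(b'{"features": [9, 8]}')
print(status, body.decode())
```

Calling `run_server()` starts the endpoint locally, and a client can then POST a JSON body with a `features` list to `/predict`. Keeping the request handling in a plain function makes it easy to test independently of the network layer.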

We will see how to automate both the training phase and the prediction phase.

Let’s start with the training phase. Why is it important to automate the training phase? Essentially because we want our model to be reproducible.

For instance, we might want to refresh our model periodically, say every month or every six months, retraining it to include the most up-to-date data we have collected in the meantime. And we want to do this simply by pushing a button. You can see how to automate the training phase of the book clustering model within Data Intelligence in the video linked below.

And what about the deployment? In the video below you can see how we can expose the book clustering artifact to an external client in Data Intelligence.

Key takeaways

Let’s recap what we learnt in this blog post with the help of our real-life use case.

Firstly, we have learnt that SAP HANA embedded Machine Learning provides a rich set of functions with in-memory performance for most classic ML scenarios, and that the Python and R APIs allow data scientists to directly leverage SAP HANA’s ML capabilities and bring their productivity to where the ML scenario will be operationalized.

On the other hand, we have learnt that Data Intelligence is a versatile tool for integrating many data sources, and that it can be used to create custom pipelines to orchestrate data preparation, to perform ETL or to operationalize the main stages of an ML solution, namely the training and deployment of an ML model, regardless of whether it comes from HANA ML or any other open-source library.

Now, if you would like to learn more about the consumption layer in our story and how to present data effectively and efficiently providing valuable insights and help in making informed decisions, move on to the next and final blog post in the series.