""

Managing ETL dependencies with BusinessObjects Data Services (Part 1)

Are you satisfied with the way you currently manage the dependencies in your ETL? Dependencies between jobs (or parts of jobs) are an important aspect of ETL management. They raise questions like: do you want to execute job B if job A failed? Imagine that you have a job C with sub-job 1 (usual runtime: 3 hours) and sub-job 2 (usual runtime: 2 minutes). If sub-job 1 was successful and sub-job 2 failed, can you gracefully restart job C without running sub-job 1 again?

As soon as you have more than one simple job, you have to manage your dependencies. In this article (part 1 of a series about ETL dependency management) I’ll first list some of the characteristics I’m looking for in an ideal dependency management system. I will then look at some of the possibilities offered by SAP Data Services 4. In part 2 (my next post), I will propose the architecture of a possible dependency management system. In part 3, I will go into the details of the implementation in Data Services. I’ll finish with part 4 by telling you how the implementation went and whether some improvements are possible.

The ideal dependency management system

In this post I will use the word “process” to designate a series of ETL operations that belong together. Examples: extracting a source table, creating a dimension, or updating a fact table. The objective here is to manage the dependencies between the processes: updating a fact table should probably only be allowed if updating the corresponding dimensions was successful.

A dependency management system should ideally have at least the following characteristics:

  • Run a process only if its prerequisites ran correctly
  • After a failure, offer the option to re-run all the processes or only the processes which failed
  • Trace the outcome of each process (ran successfully, failed, did not run)
  • Run dependent processes dynamically (rather than statically, i.e. based on date/time)
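To make this concrete, the bookkeeping behind these four characteristics boils down to recording, for every process, its prerequisites and its outcome. Here is a minimal sketch of my own in plain Java (all names are hypothetical, this is not a Data Services construct):

import java.time.Instant;
import java.util.List;
import java.util.Map;

enum ProcessOutcome { SUCCEEDED, FAILED, DID_NOT_RUN }

class ProcessRun {
    final String processName;         // e.g. "UPDATE_FACT_SALES" (hypothetical)
    final List<String> prerequisites; // processes that must have SUCCEEDED first
    ProcessOutcome outcome = ProcessOutcome.DID_NOT_RUN;
    Instant startedAt;
    Instant endedAt;

    ProcessRun(String processName, List<String> prerequisites) {
        this.processName = processName;
        this.prerequisites = prerequisites;
    }

    // Characteristic 1: a process may run only if every prerequisite succeeded.
    // Recording the outcome (characteristic 3) is also what makes a selective
    // re-run after failure (characteristic 2) possible.
    boolean mayRun(Map<String, ProcessRun> allRuns) {
        return prerequisites.stream().allMatch(name -> {
            ProcessRun p = allRuns.get(name);
            return p != null && p.outcome == ProcessOutcome.SUCCEEDED;
        });
    }
}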

The possibilities

Let’s enumerate some of the possibilities offered by Data Services, with their respective pros and cons.

1) One job with all processes inside. This is very easy to implement and dynamic in terms of run times, but it doesn’t allow concurrent runs. Most importantly, failures have to be managed so that the failure of one process does not stop the whole job.

2) One process per job, with jobs scheduled at specific times. This is very easy to implement and allows concurrent runs, but it is not dynamic: if process durations increase over the months or years, jobs may start to overlap.

3) One main job calling other jobs (for example with execution commands or Web Services).

4) One process per job, all jobs scheduled at specific times, but each job first checks in a control table whether its prerequisites ran fine; if not, it sleeps for some time before checking again.

5) Use the BOE scheduler to manage jobs based on events (the how-to is well described on SCN). I haven’t tested it yet, but I like this approach.

By default, the first two possibilities only manage the “flow” side of dependency management (after A, do B). They do not manage its conditional side (do B only if A was successful). In both cases, a control table updated by SQL scripts would allow the ETL to check whether the prerequisite processes ran correctly.
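To show what that conditional check amounts to, here is a minimal sketch in plain Java/JDBC. The control table CTRL_PROCESS_STATUS and its columns PROCESS_NAME and STATUS are names I made up for the illustration, not anything prescribed by Data Services:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PrerequisiteCheck {

    // Hypothetical control table: CTRL_PROCESS_STATUS(PROCESS_NAME, STATUS).
    // Returns true only if every prerequisite process is marked 'SUCCESS'.
    static boolean prerequisitesOk(Connection conn, String... prerequisites)
            throws SQLException {
        String sql = "SELECT STATUS FROM CTRL_PROCESS_STATUS WHERE PROCESS_NAME = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String name : prerequisites) {
                ps.setString(1, name);
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next() || !"SUCCESS".equals(rs.getString("STATUS"))) {
                        return false; // no row yet, or the process did not succeed
                    }
                }
            }
        }
        return true;
    }
}

For solution 4, the same check would simply be wrapped in a loop that sleeps and retries until it returns true; inside Data Services the equivalent logic would live in a script rather than in Java.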

What I don’t really like in solutions 2 to 5 is that it is difficult to get an overview of what’s going on; you cannot easily navigate the whole ETL. Solution 1 gives you this overview, but at the cost of a potentially huge job (and no processes running concurrently).

Also note that the solutions with multiple jobs will need to manage the initialization of global variables.

What I miss in all these solutions is an optimal restart of the ETL. If 10 of my 50 processes failed and I want to restart only those 10, do I really have to start them manually?

In my next blog post I’ll propose an architecture that addresses this optimal restart.

Until then, please let me know your thoughts about how you manage your ETL dependencies. One of the 5 solutions mentioned above? A mix? Something else? And how well does it work for you?

Use Data Services SDK libraries to construct an AWTableMetadata in a Java application

If you have a Java application that returns a table and you plan to use it as a source of information for SAP Data Services, the best approach is to return a table with the same data type as the Data Services template table AWTableMetadata. In this article I will explain how to do that.

First, go to the libraries folder inside your SAP BusinessObjects installation (…\SAP BusinessObjects\Data Services\lib). From this folder, import the following libraries into your Eclipse Java project:

  • Acta_Adapter_sdk.jar
  • Acta_broker_client.jar
  • Acta_Tool.jar

The easiest way is to put these libraries inside your Java ext folder, so your application will import them automatically. If you plan to deploy the application on a server, you also need to place the libraries in the server’s library folder:

  • …\Java\jdk1.7.0\jre\lib\ext
  • …\Java\jre7\lib\ext
  • …\SAP BusinessObjects\Tomcat6\lib

Then import the classes in your project:

import com.acta.metadata.AWAttribute;
import com.acta.metadata.AWColumn;
import com.acta.metadata.AWTableMetadata;

Once we have our libraries imported inside our Java project, we have to declare the function in charge of constructing the table with AWTableMetadata as its return type:

public static AWTableMetadata createAWTable() throws Exception { … }

Then we are ready to construct our table. To do so we have to:

  1. Declare the table:
    1. AWTableMetadata awTable = new AWTableMetadata();
    2. awTable.setTableName("……");
  2. Allocate the attribute and column arrays:
    1. AWAttribute[] attributes = new AWAttribute[2000];
    2. AWColumn[] columns = new AWColumn[2000];
  3. Assign the attributes and columns to our table:
    1. awTable.setColumns(columns);
    2. awTable.setAttributes(attributes);

Finally, we add the return statement: return awTable;
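Putting the steps together, a minimal version of the whole function could look as follows. The class name AWTableBuilder and the table name "MY_TABLE" are placeholders of mine; the SDK calls themselves are exactly the ones shown in the steps above:

import com.acta.metadata.AWAttribute;
import com.acta.metadata.AWColumn;
import com.acta.metadata.AWTableMetadata;

public class AWTableBuilder {

    // Builds an AWTableMetadata skeleton following the steps above.
    // "MY_TABLE" is a placeholder name, not something prescribed by the SDK.
    public static AWTableMetadata createAWTable() throws Exception {
        // 1. Declare the table and give it a name
        AWTableMetadata awTable = new AWTableMetadata();
        awTable.setTableName("MY_TABLE");

        // 2. Allocate the attribute and column arrays
        AWAttribute[] attributes = new AWAttribute[2000];
        AWColumn[] columns = new AWColumn[2000];

        // 3. Assign the attributes and columns to the table
        awTable.setColumns(columns);
        awTable.setAttributes(attributes);

        return awTable;
    }
}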

In conclusion, once the function is done, our application will be able to communicate and exchange data with Data Services through this table, and we can use the application as a data source.

If you have any doubts or suggestions, please leave a comment below.