An overview of DataStage components

The Business Intelligence (BI) market depends heavily on ETL architecture, and Extract, Transform and Load (ETL) products have become far more important in the data-driven age. DataStage is one of the leading ETL tools and effectively integrates data across various systems. In DataStage you design jobs that manage the collection, transformation, validation and loading of data from different source systems into data warehouses. DataStage supports business analysis through its user-friendly interface and by delivering quality data that helps organisations gain business intelligence. After IBM acquired DataStage (as part of Ascential Software) in 2005, the product was renamed IBM WebSphere DataStage and later IBM InfoSphere DataStage.

DataStage has four client components, namely Administrator, Manager, Designer and Director. It is available in various editions, such as Server Edition, Enterprise Edition, MVS Edition and DataStage for PeopleSoft.

Administrator

This component provides a user interface for administering projects. It manages global settings and interactions with other systems. The Administrator's role ranges from setting up users and project properties to adding, moving and deleting projects. It specifies general server defaults and log-purging criteria, and it provides a command interface to the DataStage repository. It also plays a crucial role in managing job scheduling options and user privileges, setting parallel job defaults and specifying job monitoring limits.

Manager

The DataStage Manager is the main interface for viewing and editing the contents of the DataStage repository. Whether you want to browse the repository or store and manage reusable metadata, the Manager provides these services. It displays the table and file layouts, jobs, transforms and routines defined in the project, and it handles all tasks related to the repository.

Designer

The Designer provides a graphical design interface for creating DataStage jobs or applications. These jobs are then compiled into executable programs. Each job explicitly specifies the source of the data, the required transforms and the destination of the data. The DataStage Director schedules the executables produced by compiling these jobs, and the server runs them. The Designer is used mainly by developers: extraction, cleansing, transformation, integration and loading of data are expressed as a visual data flow.
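To make the source-transform-target idea concrete, here is a minimal, hypothetical sketch in Python of what a DataStage job conceptually does. The file names, column names and logic are invented for illustration; a real job is built graphically in the Designer and compiled and run by the DataStage engine.

    import csv

    def extract(path):
        # Source stage: read rows from a delimited file
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def transform(rows):
        # Transformer stage: cleanse and derive columns
        for row in rows:
            row["name"] = row["name"].strip().title()
            row["amount"] = float(row["amount"])
            yield row

    def load(rows, path):
        # Target stage: write the transformed rows to the warehouse load file
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["name", "amount"], extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)

    load(transform(extract("customers.csv")), "warehouse_load.csv")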

Director

As mentioned earlier, the DataStage Director provides the interface for scheduling the executable programs produced by compiling jobs. It validates, runs, schedules and monitors server jobs and parallel jobs, and it is where job logs are viewed. The main users of this interface are testers and operators.
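Besides the graphical Director, DataStage installations also include a dsjob command-line client, so the same run-and-monitor operations can be scripted. The Python sketch below shows how such a sequence might look; the project and job names are hypothetical, and the exact dsjob options available depend on your DataStage version and environment setup.

    import subprocess

    project, job = "SalesProject", "LoadCustomers"   # hypothetical project and job names

    # Run the compiled job and wait for its completion status
    subprocess.run(["dsjob", "-run", "-jobstatus", project, job], check=True)

    # Print a summary of the job's log entries
    subprocess.run(["dsjob", "-logsum", project, job], check=True)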

 

DataStage is designed to work with large volumes of data: it can collect, integrate and transform large data volumes with widely varying data structures. It also supports Big Data and Hadoop, letting you access Big Data directly on distributed file systems, and it provides seamless connectivity between different data sources and applications. In addition, it helps optimize hardware utilization and can prioritize mission-critical tasks.

 

About the Author:

 

Vaishnavi Agrawal loves pursuing excellence through writing and has a passion for technology. She has successfully managed and run personal technology magazines and websites. She currently writes for intellipaat.com, a global training company that provides e-learning and professional certification training.

The courses offered by Intellipaat address the unique needs of working professionals. She is based out of Bangalore and has five years of experience in content writing and blogging. Her work has been published on various sites related to Qlikview Training, Big Data, Business Intelligence, Project Management, Cloud Computing, IT, SAP and more.


Message Handlers

When you run a parallel job, any error messages and warnings are written to an error log and can be viewed from the Director. You can choose to handle specified errors differently by creating one or more message handlers. A message handler defines rules about how to handle messages generated while a parallel job is running. For example, you can use one to specify that certain types of message should not be written to the log.

Project level: defined from the DataStage Administrator; the handler applies to all parallel jobs within the specified project.

Job level: from the Designer or Manager you can specify that an existing handler should apply to a specific job. When you compile the job, the handler is included in the job executable as a local handler.

You can view, edit or delete message handlers from the Message Handler Manager. You can also define new handlers if you are familiar with the message IDs (although note that DataStage will not know whether such messages are warnings or informational). The preferred way of defining new handlers is to use the add rule to message handler feature.
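Conceptually, a message handler is just a set of rules keyed by message ID that say what to do with a matching message. The hypothetical Python sketch below (with invented message IDs and a simplified rule format) illustrates the idea of suppressing some messages and demoting others from warning to informational:

    # Hypothetical rule set: message ID -> action
    RULES = {
        "IIS-DSEE-TFCN-00001": "suppress",   # do not write this message to the log at all
        "IIS-DSEE-TOSH-00002": "demote",     # log as informational instead of warning
    }

    def handle(message_id, severity, text, log):
        action = RULES.get(message_id)
        if action == "suppress":
            return                           # drop the message
        if action == "demote" and severity == "warning":
            severity = "info"
        log.append((severity, message_id, text))

    log = []
    handle("IIS-DSEE-TFCN-00001", "warning", "ignored message", log)
    handle("IIS-DSEE-TOSH-00002", "warning", "demoted message", log)
    print(log)   # [('info', 'IIS-DSEE-TOSH-00002', 'demoted message')]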

What is the Permissions Page?

This section describes DataStage user categories and how to change the assignment of these categories to operating system user groups. The categories exist to prevent unauthorized access to DataStage projects.

There are four categories of DataStage user:

DataStage Operator: has permission to run and manage DataStage jobs.

DataStage Developer: has full access to all areas of a DataStage project.

DataStage Production Manager: has full access to all areas of a DataStage project and can also create and manipulate protected projects. Currently, on UNIX systems, the Production Manager must be root or the administrative user in order to protect or unprotect projects.

<None>: does not have permission to log on to DataStage.

Auto Purging of Job Logs

Maintaining job log files: every DataStage job has a log file, and every time you run a job new entries are added to it. To prevent the log files from becoming too large, they must be purged from time to time. You can set project-wide defaults for automatically purging job logs, or purge them manually. Default settings are applied to newly created jobs, not existing ones. Automatic purging is set per project from the DataStage Administrator.
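As a rough illustration of what auto-purging amounts to, the hypothetical Python sketch below keeps only the log entries from the most recent N job runs (DataStage itself lets you purge by number of previous runs or by age in days; the data structure here is invented):

    # Each log entry is (run_number, severity, text); purging keeps the last N runs
    def purge_log(entries, keep_runs=3):
        run_numbers = sorted({run for run, _, _ in entries})
        keep = set(run_numbers[-keep_runs:])
        return [e for e in entries if e[0] in keep]

    log = [(1, "info", "job started"), (1, "info", "job finished"),
           (2, "warning", "short read"), (3, "info", "job started"),
           (4, "info", "job started"), (5, "fatal", "job aborted")]

    print(purge_log(log, keep_runs=2))   # only entries from runs 4 and 5 remain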



DataStage features and usage via web interface

DataStage is a server application that acts as an ETL tool: it extracts, transforms and loads data. DataStage connects to data sources and transforms the data as it moves through the application. It integrates data across multiple systems using a high-performance parallel framework, and it supports extended metadata and enterprise connectivity. Its scalable platform allows flexible integration of all types of data, whether the data is at rest (Hadoop-based) or in motion (stream-based). DataStage uses a client-server design and offers three levels of parallelism: pipeline parallelism, data parallelism and component parallelism.
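To give a feel for two of these levels, the hypothetical Python sketch below contrasts pipeline parallelism (downstream stages start processing rows before upstream stages have finished, modelled here with generators streaming rows through the stages) with data parallelism (the same stage applied to separate partitions of the data, modelled with a process pool). This is only an analogy; the real parallel engine distributes work across the nodes defined in its configuration file.

    from multiprocessing import Pool

    def clean(row):
        # The same transformation applied in both examples
        return {"name": row["name"].strip().title(), "amount": float(row["amount"])}

    rows = [{"name": " alice ", "amount": "10.5"}, {"name": "BOB", "amount": "3"}]

    # Pipeline parallelism (analogy): rows stream one at a time through the
    # stages, so the "load" step starts before all rows have been cleaned.
    for out in (clean(r) for r in rows):
        print("loaded", out)

    # Data parallelism (analogy): the data is partitioned and the same stage
    # runs against each partition in a separate worker process.
    if __name__ == "__main__":
        with Pool(2) as pool:
            print(pool.map(clean, rows))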

DataStage provides a number of high-end features and benefits, listed below.

1 Powerful

DataStage is a scalable ETL platform that facilitates the collection, integration and transformation of large volumes of data, with data structures ranging from simple to complex.

2 Big Data & Hadoop support

It enables users to access big data directly on a distributed file system. It provides JSON support and a JDBC connector, which help clients take advantage of new data sources.

3 Real time data integration

It provides near real-time data integration as well as connectivity between data sources and applications.

4 Work load management

It helps users prioritize mission-critical tasks and optimize hardware utilization.

5 Ease of use

It simplifies the management of data integration infrastructure by improving the speed, flexibility and effectiveness with which you build, deploy and update jobs.

6 Security Controls

It allows researchers to have a private area that is accessible only to them and the group leader. There can also be shared and collaborative areas whose files can be accessed by the whole research group.

7 Web Interface

Users can access data from outside their personal computer and can annotate their files as well.

 

8 Data Repository

There is an option to send data for permanent storage in a repository.

Using DataStage via web interface

The following example shows how to use the web interface.

The web interface cannot deal with filenames that contain spaces; files with spaces in their names will be removed. The better practice is therefore to remove spaces from file and directory names before uploading. The steps are as follows:

  1. Navigate to your target folder through the “Browse data” option.
  2. Click “upload file”, located at the top of the screen.
  3. A pop-up box then lets you browse for the file you want to upload from your system.
  4. When you upload a single file, it appears on the screen you are currently viewing in the web interface. You can edit the “title” and add more information in the “description” metadata fields. Only the owner who originally uploaded the file can edit the title and description; even the group leader cannot edit these fields on someone else’s files when working collaboratively.
  5. To upload a whole directory that contains many folders, you first have to compress it: convert the target directory into a .zip file on your local computer and then upload the .zip file. DataStage automatically unzips the file and unpacks its contents, and it recreates the same file tree, so you do not have to worry about nested folders inside the .zip (see the zip-preparation sketch after this list). The only problem arises when the archive mixes directories and standalone files; that does not work in DataStage. It works fine if the zipped folder contains only folders or only standalone files.
  6. If you are using subdirectories, for example in the “Collab” area, only the owner of the top-most directory can use the web interface to upload further files within that branch of the file structure. Other users can upload files via the mapped drive interface.
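Here is a minimal sketch of the zip-then-upload preparation from step 5, assuming a hypothetical local directory name. It shows only the client-side packaging, since the upload itself happens in the web interface.

    import shutil

    # Compress the local "survey_data" directory (hypothetical name) into
    # survey_data.zip; the archive keeps the nested folder structure, which
    # DataStage recreates when it unpacks the uploaded file.
    shutil.make_archive("survey_data", "zip", root_dir="survey_data")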

 

  1. The “permissions” file holds information about which permissions apply to a particular user and folder, and acts as a useful reference for you and other users.
  2. To delete a file, press “delete”. Only the owner of a file can delete it in the web interface. Note that deleting via the web interface has a known issue: the file name remains in the list even though the file is no longer accessible.


Mapped Drive

A mapped drive lets you map a connection straight to the data server, which gives direct access to files. It uses the SMB protocol, served by Samba. This works well on wired networks where everyone is connected with Ethernet cables. SMB is not considered very secure, and for this reason most internet service providers block the ports it requires; a VPN can help you get around this problem.

 


Runtime Column Propagation

  • Runtime Column Propagation (RCP)
    • allows you to define only part of your table definition (schema);
    • extra columns propagate through the rest of the job.
  • RCP must first be enabled at the project level (it is off by default).
    • It can be enabled or disabled at the job level.
    • It can be enabled or disabled at the stage level (output columns).
  • RCP allows maximum reuse of parallel shared containers.
    • Input and output table definitions only need the columns required by the container stages; different schemas are fine as long as the core columns exist.
    • RCP must be enabled in every stage of the shared container.
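As an analogy for what RCP does, the hypothetical Python sketch below shows a transform that declares only the core columns it needs; any extra columns present in the input rows pass through untouched to the output, just as undeclared columns propagate through a parallel job when RCP is on. The column names and data are invented.

    # The stage declares only the core columns it works on
    CORE_COLUMNS = {"customer_id", "amount"}

    def transform(row):
        assert CORE_COLUMNS.issubset(row)      # the declared columns must be present
        out = dict(row)                        # undeclared columns propagate untouched
        out["amount"] = round(float(row["amount"]) * 1.2, 2)
        return out

    row = {"customer_id": 7, "amount": "10", "region": "EMEA", "loaded_at": "2015-01-01"}
    print(transform(row))
    # {'customer_id': 7, 'amount': 12.0, 'region': 'EMEA', 'loaded_at': '2015-01-01'}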

Automatically handle Job runs that fail

Select this option to have DataStage automatically handle failing jobs within a sequence (this means that you do not have to add a specific trigger for job failure). When you select this option, the following happens during job sequence compilation: for each job activity that does not have a specific trigger for error handling, code is inserted that branches to an error-handling point. (If an activity has either a specific failure trigger, or both an OK trigger and an otherwise trigger, it is judged to be handling its own aborts, so no code is inserted.) If the compiler has inserted error-handling code, that branch is taken whenever a job within the sequence fails.
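A rough way to picture the inserted error handling: activities without their own failure trigger get wrapped so that a failure branches to a shared error-handling point. The Python sketch below models this with invented activity names; it is only an analogy for what the sequence compiler generates.

    def run_activity(name):
        # Stand-in for one job activity; raising simulates an aborted job
        if name == "load_facts":
            raise RuntimeError(name + " aborted")
        print(name, "finished OK")

    def run_sequence(activities, handles_own_failure=()):
        for name in activities:
            try:
                run_activity(name)
            except RuntimeError as err:
                if name in handles_own_failure:
                    print(name, "handles its own failure trigger")
                    continue
                # Compiler-inserted branch: jump to the shared error-handling point
                print("sequence error handler invoked:", err)
                return

    run_sequence(["extract_sales", "load_facts", "build_cube"])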

 

Types of Trigger

Activities can have only one input trigger but multiple output triggers. Trigger names must be unique for each activity; for example, you could have several triggers called “success” in a job sequence, but each activity can have only one trigger called “success”.

Conditional: a conditional trigger fires the target activity if the source activity fulfils the specified condition. The condition is defined by an expression and can be one of the following types:

OK: the activity succeeded.

Failed: the activity failed.

Warnings: the activity produced warnings.

Return Value: a routine or command has returned a value.

Custom: allows you to define a custom expression.

User Status: allows you to define a custom status message to write to the log.
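To illustrate how these conditional trigger types route control through a sequence, here is a hypothetical Python sketch that inspects the result of a source activity and fires whichever downstream activities have a matching trigger. The activity names and the result structure are invented for illustration.

    # Output triggers of one activity: (trigger_type, target_activity)
    TRIGGERS = [
        ("OK", "notify_success"),
        ("Warnings", "review_warnings"),
        ("Failed", "cleanup_and_alert"),
    ]

    def fire_triggers(result, triggers):
        # result: {"status": "ok" or "failed", "warnings": count}
        fired = []
        for trigger_type, target in triggers:
            if trigger_type == "OK" and result["status"] == "ok" and result["warnings"] == 0:
                fired.append(target)
            elif trigger_type == "Warnings" and result["status"] == "ok" and result["warnings"] > 0:
                fired.append(target)
            elif trigger_type == "Failed" and result["status"] == "failed":
                fired.append(target)
        return fired

    print(fire_triggers({"status": "ok", "warnings": 2}, TRIGGERS))   # ['review_warnings']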