employee_id, salary, cumulative_sum
10, 1000, 1000
20, 2000, 3000
30, 3000, 6000
40, 5000, 11000
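The running total in the table above can be reproduced in a few lines of code. This is a plain-Python sketch of the calculation, not DataStage job logic (in DataStage this would typically be done with a Transformer or Aggregator stage):

```python
# Running total (cumulative sum) of salaries, matching the sample table.
from itertools import accumulate

rows = [
    (10, 1000),
    (20, 2000),
    (30, 3000),
    (40, 5000),
]

salaries = [salary for _, salary in rows]
cumulative = list(accumulate(salaries))  # 1000, 3000, 6000, 11000

for (employee_id, salary), running in zip(rows, cumulative):
    print(employee_id, salary, running)
```

Each output row carries the sum of its own salary plus all earlier salaries, exactly as in the table.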
The Business Intelligence (BI) market depends heavily on ETL architecture, and Extract, Transform and Load products have become far more important in the data-driven age. DataStage is one of the most important ETL tools, effectively integrating data across various systems. DataStage jobs manage the collection, transformation, validation and loading of data from different systems into data warehouses. Through its user-friendly interface, DataStage supplies quality data that supports business analysis and business intelligence. After IBM acquired the product (through its purchase of Ascential Software in 2005), it was renamed IBM WebSphere DataStage and later IBM InfoSphere DataStage.
DataStage has four components: Administrator, Manager, Designer and Director. It is available in several editions, such as Server Edition, Enterprise Edition, MVS Edition and DataStage for PeopleSoft.
This component of DataStage provides a user interface for administering projects. It manages global settings and interactions with other systems. The Administrator's role ranges from setting up users and project properties to adding, moving and deleting projects, and it specifies general server defaults and log-purging criteria. The Administrator also provides a command interface to the DataStage Repository, and it plays a crucial role in managing job scheduling options and user privileges, setting parallel-job defaults and specifying job monitoring limits.
The DataStage Manager is the main interface to the DataStage Repository: it lets you view and edit the repository's contents, browse it, and store and manage reusable metadata. It displays the table and file layouts, jobs, transforms and routines defined in the project, and it manages all tasks related to the repository.
The Designer provides a design interface for creating DataStage jobs or applications, which are then compiled into executable programs. Each job explicitly specifies the source of the data, the required transforms and the destination of the data. The compiled executables are scheduled by the DataStage Director and run by the server. The Designer offers a user-friendly graphical interface and is used mainly by developers: the extraction, cleansing, transformation, integration and loading of data are specified through a visual data-flow method.
As mentioned earlier, the DataStage Director provides an interface for scheduling the executable programs produced by compiling jobs. It runs, validates, schedules and monitors server jobs and parallel jobs, and it plays a vital role in parallel processing. Its main users are testers and operators.
DataStage is designed to work with large volumes of data: it can collect, integrate and transform large data sets with widely differing structures. It also supports Big Data and Hadoop, letting you access Big Data directly on distributed file systems. It provides seamless connectivity between different data sources and applications, helps optimize hardware utilization and can prioritize mission-critical tasks.
About the Author:
Vaishnavi Agrawal loves pursuing excellence through writing and has a passion for technology. She has successfully managed and run personal technology magazines and websites. She currently writes for intellipaat.com, a global training company that provides e-learning and professional certification training.
The courses offered by Intellipaat address the unique needs of working professionals. She is based out of Bangalore and has five years of experience in content writing and blogging. Her work has been published on various sites related to Qlikview Training, Big Data, Business Intelligence, Project Management, Cloud Computing, IT, SAP and more.
When you run a parallel job, any error messages and warnings are written to an error log and can be viewed from the Director. You can choose to handle specified errors in a different way by creating one or more message handlers. A message handler defines rules about how to handle messages generated while a parallel job is running. You can, for example, use one to specify that certain types of message should not be written to the log.
Project level: set in the DataStage Administrator; the handler applies to all parallel jobs within the specified project.
Job level: from the Designer or the Manager you can specify that an existing handler should apply to a specific job. When you compile the job, the handler is included in the job executable as a local handler.
You can view, edit or delete message handlers from the Message Handler Manager. You can also define new handlers if you are familiar with the message IDs (although note that DataStage will not know whether such messages are warnings or informational). The preferred way to define new handlers is the "add rule to message handler" feature.
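The effect of a message handler can be pictured as a small rule table keyed by message ID. The sketch below is purely illustrative: the message IDs, rule actions and log structure are invented for the example and are not DataStage internals.

```python
# Hypothetical message-handler rules: each message ID maps to an action.
SUPPRESS, DEMOTE = "suppress", "demote"

rules = {
    "MSG-0001": SUPPRESS,  # never write this message to the log
    "MSG-0002": DEMOTE,    # log it as informational instead of a warning
}

def handle(message_id, severity, text, log):
    """Apply the handler rules before a message reaches the job log."""
    action = rules.get(message_id)
    if action == SUPPRESS:
        return                      # dropped entirely
    if action == DEMOTE:
        severity = "INFO"           # demoted from warning to informational
    log.append((message_id, severity, text))

log = []
handle("MSG-0001", "WARNING", "benign conversion note", log)
handle("MSG-0002", "WARNING", "operator note", log)
handle("MSG-0003", "WARNING", "real warning", log)
```

After these three calls the log holds only two entries: the first message is suppressed and the second is demoted to informational, mirroring the behaviour the handler rules describe.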
This section describes DataStage user categories and how to change the assignment of these categories to operating-system user groups. User categories exist to prevent unauthorized access to DataStage projects. There are four categories of DataStage user:
DataStage Operator: has permission to run and manage DataStage jobs.
DataStage Developer: has full access to all areas of a DataStage project.
DataStage Production Manager: has full access to all areas of a DataStage project and can also create and manipulate protected projects. Currently, on UNIX systems, the Production Manager must be root or the administrative user in order to protect or unprotect projects.
<None>: does not have permission to log on to DataStage.
Maintaining job log files: every DataStage job has a log file, and every time you run a job new entries are added to it. To prevent the files from becoming too large, they must be purged from time to time. You can set project-wide defaults for automatically purging job logs, or purge them manually. Default settings are applied to newly created jobs, not existing ones. Automatic purging for a project is set in the DataStage Administrator.
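A common purge policy is "drop entries older than N days". The sketch below models that policy in plain Python; the entry structure is invented for illustration, since DataStage stores job logs internally rather than as Python lists.

```python
# Sketch of an age-based log purge, as a project-wide default might apply it.
from datetime import datetime, timedelta

def purge_older_than(entries, days, now=None):
    """Keep only log entries newer than the cutoff."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    return [e for e in entries if e["timestamp"] >= cutoff]

now = datetime(2024, 1, 31)
job_log = [
    {"timestamp": datetime(2024, 1, 1), "msg": "old run"},
    {"timestamp": datetime(2024, 1, 30), "msg": "recent run"},
]
kept = purge_older_than(job_log, days=7, now=now)
```

With a 7-day window, only the entry from 30 January survives; the 1 January entry is purged.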
DataStage is a server application that acts as an ETL tool: it extracts, transforms and loads data. It connects to data sources and transforms the data as it moves through the application. It integrates data across multiple systems using a high-performance parallel framework and supports extended metadata and enterprise connectivity. Through its scalable platform, DataStage enables flexible integration of all types of data, whether at rest (Hadoop-based) or in motion (stream-based). DataStage has three levels of parallelism: pipeline parallelism, data parallelism and component parallelism. It uses a client-server design.
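Pipeline parallelism means that the extract, transform and load stages are connected so that records stream through all stages at once, rather than each stage finishing before the next starts. The sketch below models that streaming with Python generators; it is a conceptual illustration only, since the real engine runs stages as concurrent operating-system processes, and the record fields and the flat-raise transform are invented for the example.

```python
# Conceptual model of pipeline parallelism: three chained stages that
# pass records one at a time, never materializing the whole data set.
def extract(source):
    for record in source:              # stage 1: read records
        yield record

def transform(records):
    for r in records:                  # stage 2: per-record transform
        yield {**r, "salary": r["salary"] + 100}

def load(records, target):
    for r in records:                  # stage 3: write downstream
        target.append(r)

source = [{"employee_id": 10, "salary": 1000}]
target = []
load(transform(extract(source)), target)
```

Because the stages are chained generators, each record flows through the whole pipeline before the next one is read, which is the shape of pipeline parallelism even though this sketch is single-threaded.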
DataStage provides some high-end features and benefits, as listed below.
1 Scalability
DataStage is a scalable ETL platform which facilitates the collection, integration and transformation of large volumes of data, with data structures ranging from simple to complex.
2 Big Data & Hadoop support
It enables users to directly access big data on a distributed file system, and provides JSON support and a new JDBC connector that help clients leverage new data sources.
3 Real time data integration
It provides near real-time integration of data as well as connectivity between data sources and applications.
4 Workload management
It helps users prioritize mission-critical tasks and optimize hardware utilization.
5 Ease of use
It simplifies the management of data integration infrastructure by improving the speed, flexibility and effectiveness with which solutions are built, deployed and updated.
6 Security Controls
It allows researchers to have a private area accessible only to them and the group leader. There can also be shared and collaborative areas where files are accessible to the whole research group.
7 Web Interface
Users can access data from outside their personal computer and can annotate their files as well.
8 Data Repository
There is an option to send data for permanent storage in a repository.
Using DataStage via the web interface
Following is an example of using the web interface. The web interface cannot deal with file names that contain spaces; such files may be dropped. Best practice is to remove spaces from file and directory names before uploading.
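Removing spaces from file names can be scripted before an upload. This is a generic cleanup sketch, not part of DataStage; the directory and file names are placeholders, and you should try it on a scratch directory first.

```python
# Replace spaces in file names with underscores within one directory.
import os
import tempfile

def despace(directory):
    """Rename every file in `directory` whose name contains a space."""
    renamed = []
    for name in os.listdir(directory):
        if " " in name:
            new_name = name.replace(" ", "_")
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, new_name))
            renamed.append((name, new_name))
    return renamed

# Demonstrate on a throwaway directory with one offending file name.
workdir = tempfile.mkdtemp()
open(os.path.join(workdir, "sales report.csv"), "w").close()
changes = despace(workdir)
```

After the call, "sales report.csv" has become "sales_report.csv" and is safe to upload through an interface that rejects spaces.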
It also allows you to map a connection straight to a data server, which gives direct access to files; this uses a protocol called Samba (SMB). It is a great tool for wired networks where everyone is connected using Ethernet cables. Samba is not considered very secure, however, and for this reason most internet service providers block the ports it requires; a VPN can help you get around this problem.
Select this option to have DataStage automatically handle failing jobs within a sequence (this means you do not have to add a specific trigger for job failure). When you select it, the following happens during job-sequence compilation: for each job activity that does not have a specific trigger for error handling, code is inserted that branches to an error-handling point. (If an activity has either a specific failure trigger, or both an OK trigger and an otherwise trigger, it is judged to be handling its own aborts, so no code is inserted.) If the compiler has inserted error-handling code, that code runs whenever a job within the sequence fails.
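The compiler's rule can be sketched as follows: each activity without its own failure trigger gets an implicit branch to a shared error-handling point. The activity names, the failing job and the error handler below are stand-ins invented for the illustration, not DataStage internals.

```python
# Sketch of automatic failure handling in a job sequence.
def run_sequence(activities, on_error):
    """Run activities in order; route unhandled failures to on_error."""
    for name, job, has_own_failure_trigger in activities:
        try:
            job()
        except RuntimeError as exc:
            if has_own_failure_trigger:
                continue            # the activity handles its own abort
            on_error(name, exc)     # inserted branch to the error handler
            return "aborted"
    return "finished"

def failing_job():
    raise RuntimeError("db down")

errors = []
sequence = [
    ("load_dims",  lambda: None, False),
    ("load_facts", failing_job,  False),
    ("report",     lambda: None, False),
]
status = run_sequence(sequence, lambda name, exc: errors.append((name, str(exc))))
```

Here load_facts fails without its own failure trigger, so control branches to the error handler and the sequence aborts before the report activity runs.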
Activities can have only one input trigger but multiple output triggers. Trigger names must be unique for each activity. For example, you could have several triggers called "success" in a job sequence, but each activity can have only one trigger called "success".
Conditional: a conditional trigger fires the target activity if the source activity fulfils the specified condition. The condition is defined by an expression and can be one of the following types:
OK: activity succeeds.
Failed: activity fails.
Warnings: activity produced warnings.
Return Value: a routine or command has returned a value.
Custom: allows you to define a custom expression.
User Status: allows you to define a custom status message to write to the log.
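The first three condition types above can be pictured as a simple dispatch on an activity's result. The sketch below is an invented model for illustration only; the trigger names and dispatch logic are not how the engine is implemented.

```python
# Model of conditional trigger dispatch for OK / Failed / Warnings.
def fire_triggers(triggers, finished_ok, warning_count):
    """Return the target activities whose condition matches the result."""
    fired = []
    for target, condition in triggers:
        if condition == "OK" and finished_ok and warning_count == 0:
            fired.append(target)
        elif condition == "Failed" and not finished_ok:
            fired.append(target)
        elif condition == "Warnings" and finished_ok and warning_count > 0:
            fired.append(target)
    return fired

triggers = [
    ("notify_success", "OK"),
    ("retry_job",      "Failed"),
    ("email_support",  "Warnings"),
]
```

A clean run fires only notify_success; a failed run fires retry_job; a run that finishes with warnings fires email_support.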