Thursday, April 4, 2013

Aggregator Transformation

The Aggregator Transformation is an Active and Connected transformation.

The Aggregator transformation is used to performs aggregate calculations, such as averages, count first, last, max, median, min, percentile, stddev, sums and variance.

The Integration Service performs aggregate calculations as it reads and stores data group and row data in an aggregate cache. The Aggregator transformation is unlike the Expression transformation but is use to Aggregator transformation to perform calculations on groups.

The aggregate is also you use conditional clauses to filter rows, providing more flexibility than SQL language. 

Components of the Aggregator Transformation
The Aggregator is an active transformation that changes the number of rows in the pipeline. The Aggregator transformation has the following components and options: 
  • Aggregate cache. The Integration Service stores data in the aggregate cache until it completes aggregate calculations. The Integration Service stores group values in an index cache and it stores row data in the data cache.
  • Aggregate expression. Enter an expression in an output port. The expression can include non-aggregate expressions and conditional clauses.
  • Group by port. Indicate how to create groups. You can configure an input, input/output, output, or variable port for the group. When grouping data, the Aggregator transformation outputs the last row of each group unless otherwise specified.
  • Sorted input. Select this option to improve session performance. To use sorted input, you must pass data to the Aggregator transformation sorted by group by port, in ascending or descending order.
Note: The Integration Service uses memory to process an Aggregator transformation with sorted ports. The Integration Service does not use cache memory. You do not need to configure cache memory for Aggregator transformations that use sorted ports.

1. Aggregate Caches:
When you run a session that having an Aggregator transformation, the Integration Service creates the index and the data caches in memory to process the transformation. If the Integration Service requires more space, it stores overflow values in cache files.

You can configure the index and the data caches in the Aggregator transformation or in the session properties. Or, you can configure the Integration Service to determine the cache size at run time.

a. Index Cache:
          The index cache holds group information from the group by ports. If we are using Group By on Port like 'DEPTNO', then this Index cache will stores values of Grouped ports like 10, 20, 30 etc.

b. Data Cache:
          The DATA CACHE size is generally larger than the INDEX CACHE.
It will store information like
  1. Variable port (if any created)
  2. Non-Grouped Input/Output Port
  3. Port for which aggregate functions are used
  4. Non group by input ports used in non-aggregate output expression 
2. Aggregate Expressions
An aggregate expression can include conditional clauses and non-aggregate functions. The expression can also include one aggregate function within another aggregate function, such as: MAX( COUNT( ITEM ))

The result of an aggregate expression varies based on the group by ports in the transformation. If you're not defined any group by port(s) then the Integration service will return only one row as out. It finds the total quantity of items sold:
SUM( QUANTITY )

If you use the same expression, and you group by the ITEM port, the Integration Service returns the total quantity of items sold, by item. You can create an aggregate expression in any output port and use multiple aggregate ports in a transformation.

You can nest one aggregate function within another aggregate function. When you use any of these functions, you must use them in an expression within an Aggregator transformation.

Aggregate Function are averages, count first, last, max, median, min, percentile, stddev, sums and variance.

Normal Aggregate Functions:
max(salary) or min(salary), avg(salary), sum(salary), count(employee)...etc.

Nested Aggregate Functions:
You can include multiple single-level or multiple nested functions in different output ports in an Aggregator transformation. You cannot include both single-level and nested functions in an Aggregator transformation.
If an Aggregator transformation contains a single-level function in any output port, you cannot use a nested function in any other port in that transformation. When you include single-level and nested functions in the same Aggregator transformation, then the mapping or mapplet which be invalid.

Example: 
  • max(salary) is valid
  • min(max(salary)) is in-valid.
  • max(count(product)) is valid
  • min(max(count(product))) is in-valid.
If you want both single-level and nested functions, create separate Aggregator transformations.

Conditional Clauses
We use conditional clauses in the aggregate expression to reduce the number of rows used in the aggregation. The conditional clause can be any clause that evaluates to TRUE or FALSE.

Example: The following expression is to calculate the total commissions of employees who exceeded their quarterly quota:
          SUM( COMMISSION, COMMISSION > QUOTA )

Non-Aggregate Functions
We can also use non-aggregate functions in the aggregate expression.

Example: The following expression returns the highest number of items sold for each item (grouped by item). If no items were sold, the expression returns 0.
          IIF( MAX( QUANTITY ) > 0, MAX( QUANTITY ), 0))

Null Values in Aggregate Functions
To handle the Null values in Aggregate functions you have to configure the Integration Service. You can choose to treat null values in aggregate functions as NULL or zero. By default, the Integration Service treats null values as NULL in aggregate functions.

The Aggregator transformation lets you define groups for aggregations, rather than performing the aggregation across all input data. The Group by port is the one which indicate the Integration Service how to create group.

Example: Finding the total company sales, you can find the total sales grouped by region.

To define a group for the aggregate expression, go to port tab and select the appropriate input, input/output, output, and variable ports and select it as group by in the Aggregator transformation.

The Integration Service produces one row for each group. If you do not group values, the Integration Service returns one row for all input rows. The Integration Service typically returns the last row of each group (or the last row received).

4. Using Sorted Input
We use Sorted Input to improve session performance, If you want to use sorted input we must sort the input data by group by port in ascending or descending order before we are passing it to aggregator transformation.

By using this sorted input option we are telling to aggregator transformation that we have already sorted incoming data by group by port.

To use this option "Go the Port Tab" and select "Sorted Input".

In-case you have checked sorted input option and the input data is not sorted before passing to it. Then the session will get failed.

Do not use sorted input if either of the following conditions are true:
  • The aggregate expression uses nested aggregate functions.
  • The session uses incremental aggregation.
If you use sorted input and do not sort data correctly, the session fails.

Note:
In case Flat File sources, use the Sorter transformation to sort data in the mapping before passing it to the Aggregator transformation.

In case relational source you can include the order by clause in sql override in source qualifier or you can also use the number of sorted port in source qualifier.

Sample Mapping:
We will be create a mapping using EMP table from scott to get SUM_OF_SALARY, MAX_SALARY, MIN_SALARY, COUNT_OF_EMPLOYEE in each department.

1. Connect and Open the folder if not already opened.

2. Select Tools --> Mapping Designer
 
3. Select Mappings --> Create

It will pop-up "Mapping Name". Enter the mapping name of your choice" and Click on 'OK'. Example: " m_emp_salary_aggr".
4. Drag the Source and Target definitions into workspace if they are already exist. If not click here to know how to create or import Table definitions.
Source: Emp
Target: EMP_SALAY_AGGR 
5. To create 'Aggregator Transformation' Go to 'Transformation' from Menu --> Create
a) That will appear you 'Select the transformation type to create:'
b) Select 'Aggregator' from drop down and 'Enter a new name for this transformation:' as "aggr_emp_salary"
c) Click 'Create' and 'Done'
 
or


a)    Click on Aggregator Transformation icon marked below in below snapshot.
b) Click in the workspace in Mapping Designer.
 
 c) Select AGGTRANS in workspace and Right Click --> Edit.
d) In Transformation tab --> Click on 'Rename' highlighted above which will pop-up 'Rename Transformation'. Enter the Transformation Name: "aggr_emp_salary"
e) Click on 'OK'
f) Click on 'Apply' and 'OK'.
6). Drag required ports from 'SQ_EMP' Source Qualifier to 'aggr_emp_salary' Aggregator Transformation.
 7) Select 'aggr_emp_salary' Aggregator Transformation and Right Click --> Edit --> Ports Tab.
a) Create 4 output ports SUM_OF_SALARY, MAX_SALARY, MIN_SALARY and COUNT_OF_EMPLOYEE
 b) Open Expression Edit for "SUM_OF_SALARY"
Click 'Ok'

c) Open Expression Edit for " MAX_SALARY"
Click 'Ok'

d) Open Expression Edit for "MIN_SALARY"
Click 'Ok'

e) Open Expression Edit for "COUNT_OF_EMPLOYEE"
Click 'Ok' 

f) Select 'Group By' for Deptno
 Click 'Apply' and 'Ok'.

8). Drag required port from 'aggr_emp_salary' to EMP_SALAY_AGGR' Target port as below:
 9) Select 'Mapping' from Menu --> Validate.
10) Select 'Repository' from Menu --> Save. 

Note:
The Integration Service uses memory to process an Aggregator transformation with sorted ports. The Integration Service does not use cache memory. You do not need to configure cache memory for Aggregator transformations that use sorted ports.

Tips:
  1. Sorted input reduces the amount of data cached during the session and improves session performance. Use this option with the Sorter transformation to pass sorted data to the Aggregator transformation.
  2. Limit the number of connected input/output or output ports to reduce the amount of data the Aggregator transformation stores in the data cache.
If you use a Filter transformation in the mapping, place the transformation before the Aggregator transformation to reduce unnecessary aggregation.

16 comments:

  1. very informative...

    ReplyDelete
  2. Thank u so much for the detailed and nice explanation...

    ReplyDelete
  3. Good post! Thanks for sharing this information I appreciate it. God bless!

    aggregate spend

    ReplyDelete
  4. All the major points have been covered....Many thanks..nice one

    ReplyDelete
  5. Thanks a lot .. Articles are all really good.. Appreciate your help.

    ReplyDelete
  6. GOWTHAM ITS VERY NICE YAAR

    ReplyDelete
  7. Hi GOWTHAM
    im new to informatica could help me out how to use first ,last in aggregator with small example

    regard's
    Reavan

    ReplyDelete
  8. Hi Reavan,

    Generally we use this first and last functions based on the requirement. By default informatica while return last value from the group.
    By using the First function is used to get the first value from the group and the Last function is used to get the Last value from the group. If you want you can use filter condition for that group. This is an optional.

    Thanks,
    Gowtham

    ReplyDelete
  9. Hi Gowtham,

    Please share the document with above details so that its easy to read it offline.

    Also , we are unable to open the images above.

    Thanks
    Avinash

    ReplyDelete
    Replies
    1. Hi Avyankush,

      If your using chrome then you save all this pages into pdf.

      Delete
  10. Hi Gowtham Sir,

    If u disable the cache type and write any query what happen at that time

    if u not select any group by port then aggregatore what return to the another transformation

    ReplyDelete
    Replies
    1. Hi Siva Kumar,

      Your 1st question is not clear to me.

      If you have not selected any group by ported then the aggregator will return the last row from the input records.

      Delete
  11. Hi ,
    max(count(product)) is valid -- How is this valid ?

    ReplyDelete