Sunday, March 31, 2013

Expression Transformation

The Expression transformation is a passive and connected transformation.

The Expression transformation calculates values within a single row before the data is written to the target.

For example, you might need to adjust employee salaries, concatenate first and last names, or convert strings to numbers. You can also use the Expression transformation to test conditional statements before you pass the results to a target or other transformations.

The Expression transformation is used to perform non-aggregate calculations.

Tasks that can be performed using the Expression transformation:
The transformation language provides the following types of functions:
  • Aggregate - only row-level addition, subtraction, multiplication, and division (full aggregate functions such as SUM and AVG belong in the Aggregator transformation)
  • Character
  • Conversion
  • Data Cleansing
  • Date
  • Encoding
  • Financial
  • Numerical
  • Scientific
  • Special
  • String
  • Test
  • Variable
Generally, the Expression transformation is used for test, character, and conversion functions such as IIF (if/then/else), DECODE, ISNULL, IS_DATE, IS_SPACES, IS_NUMBER, etc.
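For instance, a few hedged examples of such expressions (the column names are illustrative, not from any specific source):

    -- Treat a NULL commission as zero before doing arithmetic
    IIF(ISNULL(COMM), 0, COMM)

    -- Test whether a string column holds a valid date in the given format
    IIF(IS_DATE(HIRE_DATE_STR, 'MM/DD/YYYY'), 'VALID', 'INVALID')

    -- Translate department numbers into names, with a default
    DECODE(DEPTNO, 10, 'ACCOUNTING', 20, 'RESEARCH', 30, 'SALES', 'OTHER')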

The Expression transformation has three types of ports:
  • Input: receives data directly from the preceding transformation.
  • Output: returns the value of an expression or a hard-coded value.
  • Variable: holds a temporary value for the current row or a hard-coded value.
Variable and output ports are used to evaluate an expression for each record passing through the transformation. A single Expression transformation can contain multiple expressions and any number ('N') of variable and output ports, as sketched below.
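A hedged illustration of such a port layout (the port and column names are examples only, not taken from the mapping below):

    Port          Port Type    Expression
    FIRST_NAME    Input        (value received from the upstream transformation)
    LAST_NAME     Input        (value received from the upstream transformation)
    v_FULL_NAME   Variable     FIRST_NAME || ' ' || LAST_NAME
    o_FULL_NAME   Output       v_FULL_NAME

For each row, input ports are evaluated first, then variable ports, then output ports, so the full name is computed once in v_FULL_NAME and simply returned by o_FULL_NAME.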

Example: Create a mapping to sum salary and commission for each employee. (Row level calculation)

Sample Mapping:
1. Connect and Open the folder if not already opened.
2. Select Tools --> Mapping Designer
3. Select Mappings --> Create

This will pop up a 'Mapping Name' dialog. Enter a mapping name of your choice and click 'OK'. Example: m_emp_salary

4. Drag the Source and Target definitions into the workspace if they already exist. If not, click here to learn how to create or import table definitions.

5. Select 'Transformation' from Menu --> Create
a)    This will display the 'Select the transformation type to create:' dialog.

b)    Select 'Expression' from the drop-down and enter a new name for this transformation: "exp_employee_salary"
c)    Click 'Create' and then 'Done'.

or
  
a)    Click on the Expression transformation icon marked in the snapshot below.

b)    Click in the workspace in the Mapping Designer.

c)    Select EXPTRANS in the workspace and right-click --> Edit.

d)    On the Transformation tab, click 'Rename' (highlighted above), which pops up the 'Rename Transformation' dialog. Enter the Transformation Name: "exp_employee_salary"
e)    Click on 'OK'.
f)    Click on 'Apply' and then 'OK'.

6) Drag the required ports from the 'SQ_EMP' Source Qualifier to the 'exp_employee_salary' Expression transformation.

7) Select the 'exp_employee_salary' Expression transformation and right-click --> Edit --> Ports tab.

a) Uncheck the Output checkbox for the SAL and COMM ports.

Before: See the port highlighted in red.
After: See the ports highlighted in red.

b) Add a variable port to the Expression transformation.
Note: Use the icon shown below to add the port.



  • Add a new port to the transformation, name it 'v_Salary', and check the 'Variable' checkbox as shown above.
  • Click on 'Open Browser' and write an expression in the 'Formula:' box, such as SAL + COMM (or, for strings, FIRST_NAME || LAST_NAME).
  • Click on 'Validate' to validate the expression, then click 'OK'.
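Note that SAL + COMM evaluates to NULL for any row where COMM is NULL, because arithmetic on a NULL operand returns NULL. If NULL commissions should be treated as zero, a hedged variant of the formula is:

    -- Treat a NULL commission as zero when summing
    SAL + IIF(ISNULL(COMM), 0, COMM)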


c) Add Output port to expression transformation.

  • Add a new port to the transformation, name it 'o_Salary', and check the 'Output' checkbox as shown above.
  • Click on 'Open Browser' and write the expression v_Salary in the 'Formula:' box; the output port simply returns the variable's value.
  • Click on 'Validate' to validate the expression, then click 'OK'.
d) Click on 'Apply' and then 'OK'.

8) Drag the required ports from the 'exp_employee_salary' Expression transformation to the 'EMP_DEPT_FILE' target as shown below:
 
9) Select 'Mapping' from Menu --> Validate.

10) Select 'Repository' from Menu --> Save.


Note:
If you make the transformation reusable, you cannot copy ports from the Source Qualifier or other transformations; you must create the ports manually within the transformation.

Monday, March 25, 2013

Mapping Optimization Techniques

1. Reduce the number of transformations. There is always overhead involved in moving data between transformations.

2. Consider allocating more shared memory for sessions with a large number of transformations. Session shared memory between 12 MB and 40 MB should suffice.

3. Calculate once, use many times.
  • Avoid calculating or testing the same value over and over.
  • Calculate it once in an expression, and set a True/False flag.
  • Within an expression, use variable ports to calculate a value that can be used multiple times within that transformation (see the sketch after this list).
  • Delete unnecessary links between transformations to minimize the amount of data moved, particularly in the Source Qualifier.
  • This is also helpful for maintenance. If a transformation needs to be reconnected, it is best to only have necessary ports set as input and output to reconnect.
  • In lookup transformations, change unused ports to be neither input nor output. This makes the transformations cleaner looking. It also makes the generated SQL override as small as possible, which cuts down on the amount of cache necessary and thereby improves performance.
  • The engine automatically converts compatible types.
  • Sometimes data conversion is excessive. Data types are automatically converted when types are different between connected ports. Minimize data type changes between transformations by planning data flow prior to developing the mapping.
  • Plan for reusable transformations upfront.
  • Use variables. Use both mapping variables as well as ports that are variables. Variable ports are especially beneficial when they can be used to calculate a complex expression or perform a disconnected lookup call only once instead of multiple times
  • Use mapplets to encapsulate multiple reusable transformations.
  • Use mapplets to leverage the work of critical developers and minimize mistakes when performing similar functions.
  • Reduce the number of non-essential records that are passed through the entire mapping.
  • Use active transformations that reduce the number of records as early in the mapping as possible (i.e., place filters and aggregators as close to the source as possible).
  • Select appropriate driving/master table while using joins. The table with the lesser number of rows should be the driving/master table for a faster join.
  • Redesign mappings to utilize one Source Qualifier to populate multiple targets. This way the server reads this source only once. If you have different Source Qualifiers for the same source (e.g., one for delete and one for update/insert), the server reads the source for each Source Qualifier.
  • Remove or reduce field-level stored procedures.
  • If you use field-level stored procedures, the PowerCenter server has to make a call to that stored procedure for every row, slowing performance.
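As a hedged sketch of the "calculate once, set a flag" and variable-port points above (the port and column names are illustrative):

    Port         Port Type    Expression
    v_IS_VALID   Variable     IIF(IS_DATE(ORDER_DATE_STR, 'YYYYMMDD') AND NOT ISNULL(CUST_ID), 1, 0)
    o_IS_VALID   Output       v_IS_VALID

Downstream transformations can then test the o_IS_VALID flag instead of repeating the date and NULL checks for every rule.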

4. Only connect what is used.

5. Watch the data types.

6. Facilitate reuse.

7. Only manipulate data that needs to be moved and transformed.

8. Utilize single-pass reads.

9. Sort the input data before passing it to the Joiner and Aggregator transformations.
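For a relational source, one hedged way to do this is to add an ORDER BY on the join or group-by keys to the Source Qualifier query and then enable the Sorted Input option on the Joiner or Aggregator (table and column names are illustrative):

    SELECT EMPNO, ENAME, SAL, DEPTNO
    FROM EMP
    ORDER BY DEPTNO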

10. In Lookups, use a customized query instead of the default query. (Use '--' to comment out the ORDER BY clause the lookup generates by default.)
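For example, a hedged sketch of a lookup SQL override (table and columns are illustrative). The Integration Service normally appends its own ORDER BY over all lookup ports to the override, so you supply a cheaper ORDER BY on the condition column yourself and end the statement with '--' to comment the generated clause out:

    SELECT EMPNO, ENAME, DEPTNO
    FROM EMP
    ORDER BY EMPNO --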

11. Avoid selecting unnecessary columns/ports in the SQL query.

12. Filter unnecessary data as close to the source qualifier as possible. (For relational sources, include the filter condition in the SQL query.)
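For instance, instead of a Filter transformation deep in the mapping, a hedged example of pushing the condition into the Source Qualifier query (names illustrative):

    SELECT EMPNO, ENAME, SAL
    FROM EMP
    WHERE DEPTNO = 10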

13. In a Joiner, use the source with the smaller volume of data as the master table.

14. When using mapping partitioning, place the Aggregator transformation before the partition point.

15. Use Router instead of having multiple Filter transformations.


Tools included in Informatica 9.0.1

Informatica 9.0.1 includes:
  • Informatica Data Services
  • Informatica Data Quality [includes Informatica Analyst]
  • Informatica Data Explorer
  • Informatica Data Explorer Advanced Edition
  • PowerCenter [Standard Edition, Real Time Edition, Advanced Edition]
  • PowerExchange [including Neoview and Netezza]
  • Metadata Manager
The following are the highlights of this release:
DI Tools and Platform
  • Platform – Theme: Production Readiness
    • Upgrade support from PowerCenter (8.1.x, 8.5.x, 8.6.x, 9.0.0)
    • I18N Support – Simultaneous release in Brazilian Portuguese
    • Additional connectivity – VSAM, IMS, SAP Table Reader – for IDS & IDQ/IDE
    • Parameterization (only for deployed objects)
    • Data source simplification
    • Metadata Reporting
    • License Compliance
  • IDS – Theme: Competitive Parity
    • Pass-through Security
    • Column-level Security
    • Unstructured Data Support (UDO)
  • PowerExchange
    • CDC Support for Oracle Transparent Data Encryption and Oracle Materialized Views
    • Performance improvements
  • Metadata Manager & Business Glossary – Ready for prime time
    • BG Change notifications through email
    • Impact Summary (fka where-used analysis)
    • Lineage diagram enhancements
    • Export and import of XML files for resource configuration and content
    • Command line for all relevant UI actions
    • Improved Performance, scalability, longevity
Data Quality
  • IDQ – Theme: Production readiness
    • IDQ Upgrade Utility – support for migrating plans/mappings from 8.6.2 or 9.0.0 to IDQ 9.0.1
    • GAV SERP & CASS Certified (limited platform support)
    • GAV Performance improvements: Multi-threaded AV engine
    • Decision Transformation (if/then/else)
    • Object Import/Export improvements (UI based, RTM support)
    • Parameterization of Reference Tables (only for deployed objects)
  • IDE – Theme: Production readiness for IDE AE
    • Profiling performance improvements
      • ‘Live’ Drilldown of Rules, Scorecards
      • Auto purge of profiling results
      • ‘Incremental’ Profiling
    • Profiling of SAP, VSAM and IMS sources
    • Grouping of scorecards
    • Upgrade support for profiles/scorecards from 9.0.0 to 9.0.1
 

FEATURED: Informatica 9 - How does it change the dynamics of IT Projects?

Here are a few of the new features in PowerCenter 9:

PowerCenter
Effective in version 9.0, PowerCenter contains new features and enhancements.

Session log file rollover: You can limit the size of session logs for real-time sessions. You can limit the size by time or by file size. You can also limit the number of log files for a session.

Licensing: Enforcement of licensing restrictions. PowerCenter will enforce the licensing restrictions on the number of CPUs and repositories.

Lookup Transformation: Cache updates. You can update the lookup cache based on the results of an expression. When an expression is true, you can add to or update the lookup cache. You can update the dynamic lookup cache with the results of an expression.

Multiple rows return: You can configure the Lookup transformation to return all rows that match a lookup condition. A Lookup transformation is an active transformation when it can return more than one row for any given input row.
Database deadlock resilience: In previous releases, when the Integration Service encountered a database deadlock during a lookup, the session failed. Effective in 9.0, the session will not fail. When a deadlock occurs, the Integration Service attempts to run the last statement in a lookup. You can configure the number of retry attempts and time period between attempts.

SQL overrides for uncached lookups: In previous versions you could create a SQL override for cached lookups only. Now you can create an SQL override for uncached lookups as well, and you can include lookup ports in the SQL query.
 
Passive transformation: We can configure the SQL transformation to run in passive mode instead of active mode. When the SQL transformation runs in passive mode, the SQL transformation returns one output row for each input row.

Connection management: Database connections are centralized in the domain. We can create and view database connections in Informatica Administrator, Informatica Developer, or Informatica Analyst. Create, view, edit, and grant permissions on database connections in Informatica Administrator.

Monitoring: We can monitor profile jobs, scorecard jobs, preview jobs, mapping jobs, and SQL Data Services for each Data Integration Service. View the status of each monitored object on the Monitoring tab of Informatica Administrator.

PowerCenter Integration Service Architecture

The PowerCenter Integration Service moves data from sources to targets based on PowerCenter workflow, session and mapping related metadata stored in a PowerCenter repository. When a workflow starts, the PowerCenter Integration Service retrieves mapping, session and workflow related metadata from the repository. 

It extracts data from the mapping sources and stores the data in memory while it applies the transformation rules configured in the mapping. The PowerCenter Integration Service loads the transformed data into one or more targets.

To move data from sources to targets, the PowerCenter Integration Service uses the following components: the PowerCenter Integration Service process, the Load Balancer, and the Data Transformation Manager (DTM) process:

  • PowerCenter Integration Service process: The PowerCenter Integration Service starts one or more PowerCenter Integration Service processes to run and monitor workflows. When a workflow runs, the PowerCenter Integration Service process starts and locks the workflow, runs the workflow tasks, and starts the process to run sessions.
  • Load Balancer: The PowerCenter Integration Service uses the Load Balancer to dispatch tasks. The Load Balancer dispatches tasks to achieve optimal performance. It may dispatch tasks to a single node or across the nodes in a grid.
  • Data Transformation Manager (DTM) process: The PowerCenter Integration Service starts a DTM process to run each Session and Command task within a workflow. The DTM process performs session validations, creates threads to initialize the session, read, write, and transform data, and handles pre- and post-session operations.


 PowerCenter Integration Service Connectivity:
The PowerCenter Integration Service is a repository client. It connects to the PowerCenter Repository Service to retrieve workflow and mapping metadata from the repository database. When the PowerCenter Integration Service process requests a repository connection, the request is routed through the master gateway, which sends back PowerCenter Repository Service information to the PowerCenter Integration Service process. The PowerCenter Integration Service process connects to the PowerCenter Repository Service. The PowerCenter Repository Service connects to the repository and performs repository metadata transactions for the client application.  

The PowerCenter Workflow Manager communicates with the PowerCenter Integration Service process over a TCP/IP connection. The PowerCenter Workflow Manager communicates with the PowerCenter Integration Service process each time you schedule or edit a workflow, display workflow details, and request workflow and session logs. Use the connection information defined for the domain to access the PowerCenter Integration Service from the PowerCenter Workflow Manager.  

The PowerCenter Integration Service process connects to the source or target database using ODBC or native drivers. The PowerCenter Integration Service process maintains a database connection pool for stored procedures or lookup databases in a workflow. The PowerCenter Integration Service process allows an unlimited number of connections to lookup or stored procedure databases. If a database user does not have permission for the number of connections a session requires, the session fails. You can optionally set a parameter to limit the database connections. For a session, the PowerCenter Integration Service process holds the connection as long as it needs to read data from source tables or write data to target tables.

PowerCenter Integration Service:
The PowerCenter Integration Service is an application service that runs sessions and workflows.

Integration Service Process:
The PowerCenter Integration Service starts a PowerCenter Integration Service process to run and monitor workflows. The PowerCenter Integration Service process is also known as the pmserver process. The PowerCenter Integration Service process accepts requests from the PowerCenter Client and from pmcmd. 
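For example, a hedged sketch of starting a workflow through pmcmd (the service, domain, credentials, folder, and workflow names are placeholders):

    pmcmd startworkflow -sv IS_DEV -d Domain_Dev -u Administrator -p password -f MY_FOLDER wf_m_emp_salary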

It performs the following tasks:
  • Manage workflow scheduling.
  • Lock and read the workflow.
  • Read the parameter file.
  • Create the workflow log.
  • Run workflow tasks and evaluate the conditional links connecting tasks.
  • Start the DTM process or processes to run the session.
  • Write historical run information to the repository.
  • Send post-session email in the event of a DTM failure.

Load Balancer:
The Load Balancer is a component of the PowerCenter Integration Service that dispatches tasks to achieve optimal performance and scalability. When you run a workflow, the Load Balancer dispatches the Session, Command, and predefined Event-Wait tasks within the workflow. The Load Balancer matches task requirements with resource availability to identify the best node to run a task. It dispatches the task to a PowerCenter Integration Service process running on the node. It may dispatch tasks to a single node or across nodes.

The Load Balancer dispatches tasks in the order it receives them. When the Load Balancer needs to dispatch more Session and Command tasks than the PowerCenter Integration Service can run, it places the tasks it cannot run in a queue. When nodes become available, the Load Balancer dispatches tasks from the queue in the order determined by the workflow service level.

The Load Balancer provides the following functionality:
  • Dispatch process: The Load Balancer performs several steps to dispatch tasks.
  • Resources: The Load Balancer can use PowerCenter resources to determine if it can dispatch a task to a node.
  • Resource provision thresholds: The Load Balancer uses resource provision thresholds to determine whether it can start additional tasks on a node.
  • Dispatch mode: The dispatch mode determines how the Load Balancer selects nodes for dispatch.
  • Service levels: When multiple tasks are waiting in the dispatch queue, the Load Balancer uses service levels to determine the order in which to dispatch tasks from the queue.

Data Transformation Manager (DTM) Process
The PowerCenter Integration Service process starts the DTM process to run a session. The DTM process is also known as the pmdtm process. The DTM is the process associated with the session task.

Read the Session Information: The PowerCenter Integration Service process provides the DTM with session instance information when it starts the DTM. The DTM retrieves the mapping and session metadata from the repository and validates it.
Perform Pushdown Optimization: If the session is configured for pushdown optimization, the DTM runs an SQL statement to push transformation logic to the source or target database (see the sketch after this list).
Create Dynamic Partitions: The DTM adds partitions to the session if you configure the session for dynamic partitioning. The DTM scales the number of session partitions based on factors such as source database partitions or the number of nodes in a grid.
Form Partition Groups: If you run a session on a grid, the DTM forms partition groups. A partition group is a group of reader, writer, and transformation threads that runs in a single DTM process. The DTM process forms partition groups and distributes them to worker DTM processes running on nodes in the grid.
Expand Variables and Parameters: If the workflow uses a parameter file, the PowerCenter Integration Service process sends the parameter file to the DTM when it starts the DTM. The DTM creates and expands session-level, service-level, and mapping-level variables and parameters.
Create the Session Log: The DTM creates logs for the session. The session log contains a complete history of the session run, including initialization, transformation, status, and error messages. You can use information in the session log in conjunction with the PowerCenter Integration Service log and the workflow log to troubleshoot system or session problems.
Validate Code Pages: The PowerCenter Integration Service processes data internally using the UCS-2 character set. When you disable data code page validation, the PowerCenter Integration Service verifies that the source query, target query, lookup database query, and stored procedure call text convert from the source, target, lookup, or stored procedure data code page to the UCS-2 character set without loss of data in conversion. If the PowerCenter Integration Service encounters an error when converting data, it writes an error message to the session log.
Verify Connection Object Permissions: After validating the session code pages, the DTM verifies permissions for connection objects used in the session. The DTM verifies that the user who started or scheduled the workflow has execute permissions for connection objects associated with the session.
Start Worker DTM Processes: The DTM sends a request to the PowerCenter Integration Service process to start worker DTM processes on other nodes when the session is configured to run on a grid.
Run Pre-Session Operations: After verifying connection object permissions, the DTM runs pre-session shell commands. The DTM then runs pre-session stored procedures and SQL commands.
Run the Processing Threads: After initializing the session, the DTM uses reader, transformation, and writer threads to extract, transform, and load data. The number of threads the DTM uses to run the session depends on the number of partitions configured for the session.
Run Post-Session Operations: After the DTM runs the processing threads, it runs post-session SQL commands and stored procedures. The DTM then runs post-session shell commands.
Send Post-Session Email: When the session finishes, the DTM composes and sends email that reports session completion or failure. If the DTM terminates abnormally, the PowerCenter Integration Service process sends post-session email.
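As a hedged illustration of the pushdown optimization step above: a Filter transformation with the condition SAL > 1000 feeding an Expression that concatenates two columns might, under full pushdown, be collapsed into a single statement such as (names illustrative):

    SELECT EMPNO, ENAME || ' ' || JOB AS EMP_LABEL
    FROM EMP
    WHERE SAL > 1000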

Processing Threads
The DTM allocates process memory for the session and divides it into buffers. This is also known as buffer memory. The DTM uses multiple threads to process data in a session. The main DTM thread is called the master thread.

The master thread creates the following types of threads for a session:
  • Mapping threads
The master thread creates one mapping thread for each session. The mapping thread fetches session and mapping information, compiles the mapping, and cleans up after session execution. 

  • Pre- and post-session threads
The master thread creates one pre-session and one post-session thread to perform pre- and post-session operations. 

  • Reader threads
The master thread creates reader threads to extract source data. The number of reader threads depends on the partitioning information for each pipeline. The number of reader threads equals the number of partitions. Relational sources use relational reader threads, and file sources use file reader threads.

The PowerCenter Integration Service creates an SQL statement for each reader thread to extract data from a relational source. For file sources, the PowerCenter Integration Service can create multiple threads to read a single source. 

  • Transformation threads
The master thread creates one or more transformation threads for each partition. Transformation threads process data according to the transformation logic in the mapping. 

The master thread creates transformation threads to transform data received in buffers by the reader thread, move the data from transformation to transformation, and create memory caches when necessary. The number of transformation threads depends on the partitioning information for each pipeline. 

Transformation threads store transformed data in a buffer drawn from the memory pool for subsequent access by the writer thread. 

If the pipeline contains a Rank, Joiner, Aggregator, Sorter, or a cached Lookup transformation, the transformation thread uses cache memory until it reaches the configured cache size limits. If the transformation thread requires more space, it pages to local cache files to hold additional data. 

When the PowerCenter Integration Service runs in ASCII mode, the transformation threads pass character data in single bytes. When the PowerCenter Integration Service runs in Unicode mode, the transformation threads use double bytes to move character data.

  • Writer threads
The master thread creates writer threads to load target data. The number of writer threads depends on the partitioning information for each pipeline. If the pipeline contains one partition, the master thread creates one writer thread. If it contains multiple partitions, the master thread creates multiple writer threads.

Each writer thread creates connections to the target databases to load data. If the target is a file, each writer thread creates a separate file. You can configure the session to merge these files.

If the target is relational, the writer thread takes data from buffers and commits it to session targets. When loading targets, the writer commits data based on the commit interval in the session properties. You can configure a session to commit data based on the number of source rows read, the number of rows written to the target, or the number of rows that pass through a transformation that generates transactions, such as a Transaction Control transformation.

Grids
When you run a PowerCenter Integration Service on a grid, a master service process runs on one node and worker service processes run on the remaining nodes in the grid. The master service process runs the workflow and workflow tasks, and it distributes the Session, Command, and predefined Event-Wait tasks to itself and other nodes. A DTM process runs on each node where a session runs. If a session runs on a grid, a worker service process can run multiple DTM processes on different nodes to distribute session threads.

Code Pages and Data Movement Modes

The PowerCenter Integration Service can move data in either ASCII or Unicode data movement mode. These modes determine how the PowerCenter Integration Service handles character data. You choose the data movement mode in the PowerCenter Integration Service configuration settings. If you want to move multibyte data, choose Unicode data movement mode. To ensure that characters are not lost during conversion from one code page to another, you must also choose the appropriate code pages for your connections.

ASCII Data Movement Mode
Use ASCII data movement mode when all sources and targets are 7-bit ASCII or EBCDIC character sets. In ASCII mode, the PowerCenter Integration Service recognizes 7-bit ASCII and EBCDIC characters and stores each character in a single byte.

Unicode Data Movement Mode

Use Unicode data movement mode when sources or targets use 8-bit or multibyte character sets and contain character data. In Unicode mode, the PowerCenter Integration Service recognizes multibyte character sets as defined by supported code pages.