Monday, October 22, 2012

Informatica PowerCenter Partitioning

The Informatica PowerCenter Partitioning Option increases PowerCenter performance through parallel data processing. It provides a thread-based architecture and automatic data partitioning that optimize parallel processing on multiprocessor and grid-based hardware environments.

Introduction:
With the Partitioning Option, you can run optimal parallel sessions by dividing data processing into subsets that run in parallel, spread across the available CPUs in a multiprocessor system. When different processors share the computational load, large data volumes can be processed faster.
When sourcing from and targeting relational databases, the Partitioning Option enables PowerCenter to automatically align its partitions with database table partitions, improving performance. Unlike approaches that require manual data partitioning, data integrity is guaranteed automatically because PowerCenter's parallel engine dynamically realigns data partitions for set-oriented transformations (e.g., Aggregators or Sorters).

Detailed description:
Every mapping contains one or more source pipelines. A source pipeline consists of a source qualifier and all the transformations and targets that receive data from that source qualifier.
What is a partition?
A partition is a pipeline stage that executes in a single reader, transformation, or writer thread.
You specify partitioning for a pipeline by setting the following attributes:

1. Location of partition points (where to use them?)
Partition points mark the thread boundaries in a pipeline and divide the pipeline into stages. A stage is a section of a pipeline between any two partition points. When you set a partition point at a transformation, the new pipeline stage includes that transformation.
By default, the Integration Service sets partition points at several transformations in a pipeline.
When you add partition points, you increase the number of transformation threads, which can improve session performance.
The Integration Service can also redistribute rows of data at partition points, which can further improve session performance.
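To picture what partition points and stages mean, the Python sketch below models a single-partition pipeline as three stages (reader, transformation, writer) separated by two boundaries, with each stage running in its own thread and passing rows across a queue. The sample rows and the tax calculation are invented purely for illustration; PowerCenter manages these threads internally, and nothing like this is written by hand in the tool.

import threading
import queue

SENTINEL = object()  # signals end of data

def reader(out_q):
    """Reader stage: produces rows (a hard-coded sample here)."""
    for row in [{"item": "chair", "price": 40},
                {"item": "desk", "price": 120},
                {"item": "lamp", "price": 15}]:
        out_q.put(row)
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    """Transformation stage: applies a simple calculation to each row."""
    while (row := in_q.get()) is not SENTINEL:
        row["price_with_tax"] = round(row["price"] * 1.2, 2)
        out_q.put(row)
    out_q.put(SENTINEL)

def writer(in_q, results):
    """Writer stage: collects rows (stands in for the target load)."""
    while (row := in_q.get()) is not SENTINEL:
        results.append(row)

if __name__ == "__main__":
    q1, q2, results = queue.Queue(), queue.Queue(), []
    threads = [threading.Thread(target=reader, args=(q1,)),
               threading.Thread(target=transform, args=(q1, q2)),
               threading.Thread(target=writer, args=(q2, results))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results)

Each queue in the sketch plays the role of a partition point: it is the boundary where one stage's thread hands rows to the next.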

2. Number of partitions (how many?)
You can set the number of partitions at any partition point.
By default, the Integration Service defines a single partition in the source pipeline.
When you add partitions, each stage runs one thread per partition, so the number of processing threads increases, which can improve session performance.

3. Partition types (what type to go for?)
The Integration Service sets a default partition type at each partition point.
The partition type controls how the Integration Service redistributes data among partitions at partition points.
When you configure the partitioning information for a pipeline, you must specify a partition type at each partition point in the pipeline.

The Workflow Manager allows you to specify the following partition types:

Round-robin partitioning: The Integration Service distributes data evenly among all partitions. Use round-robin partitioning where you want each partition to process approximately the same number of rows.

Hash partitioning: The Integration Service applies a hash function to a partition key to group data among partitions. If you select hash auto-keys, the Integration Service uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify the ports that form the partition key. Use hash partitioning where you want to ensure that the Integration Service processes groups of rows with the same partition key in the same partition.

Key range partitioning: You specify one or more ports to form a compound partition key. The Integration Service passes data to each partition depending on the ranges you specify for each port. Use key range partitioning where the sources or targets in the pipeline are partitioned by key range.

Pass-through partitioning: The Integration Service passes all rows at one partition point to the next partition point without redistributing them. Choose pass-through partitioning where you want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.

You can specify different partition types at different points in the pipeline.
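As a rough illustration of how each type redistributes rows, the Python sketch below routes an invented row set under round-robin, hash, key-range, and pass-through rules. The row values, key columns, and range boundaries are assumptions made up for this example, not PowerCenter APIs; the engine performs this routing internally at partition points.

# Illustrative sketch only: how rows could be routed under each partition type.

def round_robin(rows, n_partitions):
    """Distribute rows evenly, one partition after another."""
    return [rows[i::n_partitions] for i in range(n_partitions)]

def hash_partition(rows, key, n_partitions):
    """Group rows by hashing the partition key, so equal keys land together."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        parts[hash(row[key]) % n_partitions].append(row)
    return parts

def key_range(rows, key, boundaries):
    """Assign rows to partitions by comparing the key against sorted range boundaries."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        idx = sum(row[key] >= b for b in boundaries)
        parts[idx].append(row)
    return parts

def pass_through(partitions):
    """Keep rows in the partition they are already in; no redistribution."""
    return partitions

rows = [{"item": "chair", "price": 40}, {"item": "desk", "price": 120},
        {"item": "lamp", "price": 15}, {"item": "chair", "price": 45}]
print(round_robin(rows, 2))
print(hash_partition(rows, "item", 2))
print(key_range(rows, "price", [50, 100]))   # three ranges: <50, 50-99, >=100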

Example:
Consider a mapping that reads data about items and calculates average wholesale costs and prices. The mapping must read item information from three flat files of various sizes and then filter out discontinued items. It sorts the active items by description, calculates the average prices and wholesale costs, and writes the results to a relational database in which the target tables are partitioned by key range.
When you use this mapping in a session, you can increase session performance by specifying different partition types at the following partition points in the pipeline:
• Source qualifier.
To read data from the three flat files concurrently, you must specify three partitions at the source qualifier. Accept the default partition type, pass-through.
• Filter transformation.
Since the source files vary in size, each partition processes a different amount of data. Set a partition point at the Filter transformation, and choose round-robin partitioning to balance the load going into the Filter transformation.
• Sorter transformation.
To eliminate overlapping groups in the Sorter and Aggregator transformations, use hash auto-keys partitioning at the Sorter transformation. This causes the Integration Service to group all items with the same description into the same partition before the Sorter and Aggregator transformations process the rows. You can delete the default partition point at the Aggregator transformation.
• Target.
Since the target tables are partitioned by key range, specify key range partitioning at the target to optimize writing data to the target.
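
Putting the pieces together, here is a hedged Python sketch that simulates the session described above: three file partitions at the source qualifier, round-robin rebalancing before the Filter, hash auto-keys grouping before the Sorter and Aggregator, and key range routing into the target. All file names, item data, and price ranges are invented for illustration; in practice this is configured in the Workflow Manager, not written as code.

from statistics import mean

# Stage 1 - source qualifier: three partitions, one per flat file (pass-through).
files = {
    "items_small.txt": [("chair", "active", 40.0)],
    "items_medium.txt": [("desk", "active", 120.0), ("lamp", "discontinued", 15.0)],
    "items_large.txt": [("desk", "active", 130.0), ("chair", "active", 45.0),
                        ("sofa", "active", 300.0)],
}
partitions = list(files.values())

# Stage 2 - filter: round-robin first so each partition gets a similar row count.
all_rows = [r for p in partitions for r in p]
balanced = [all_rows[i::3] for i in range(3)]
filtered = [[r for r in p if r[1] == "active"] for p in balanced]

# Stage 3 - sorter/aggregator: hash auto-keys on description, so every row with
# the same description lands in the same partition before averaging.
hashed = [[] for _ in range(3)]
for p in filtered:
    for r in p:
        hashed[hash(r[0]) % 3].append(r)
averages = []
for p in hashed:
    for desc in sorted({r[0] for r in p}):
        averages.append((desc, mean(r[2] for r in p if r[0] == desc)))

# Stage 4 - target: key range on average price to match the partitioned tables.
boundaries = [50.0, 200.0]   # invented ranges: <50, 50-199.99, >=200
target = [[] for _ in range(3)]
for desc, avg in averages:
    target[sum(avg >= b for b in boundaries)].append((desc, round(avg, 2)))
print(target)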
