This workflow helps you set up and configure the DI development and test environments, then build, test, and tune your Pentaho DI solution prototype. This process is similar to the Trial Download Evaluation experience, except that you will completely configure the Pentaho Server for data integration and work with your own ETL developers. If you need extra help, Pentaho Professional Services is available. The end result is that you learn DI implementation best practices and deploy your DI solution to a production server. Most development and testing for DI occurs in Spoon.
Before you begin developing your DI solution, we recommend that you attend Pentaho training classes to learn how to install and configure the Pentaho Server, as well as how to develop data models.
This section is grouped into parts that will guide you during the development of your DI solution. These parts are iterative, and you might bounce between them during development. For example, as you tune a job, you might find that although you have built a solution that produces the right results, it takes a long time to run. So, you might need to rebuild and test a transformation to improve efficiency, and then retest it.
Design DI Solution
Design helps you think critically about the problem you want to solve and possible solutions. Consider these questions as you gather your requirements and design the solution.
Output: What does the overall solution look like? What questions are you posing, and how do you want the answers formatted?
Data Sources: What type(s) of data sources are you querying? Where are they located? How much data do you need to process? Are you using Big Data? Are you using relational or non-relational data sources? Will you have a target data source? If so, where is it located?
Content/Processing: What data quality issues do you have? How is the input data mapped to the output data? Where do you want to process the content, in PDI or in the data source? What hardware will you include in your development environment? Will you need one or more quality assurance test environments or production environments?
Also, consider templates or standards, naming conventions, and other requirements of your end users if you have them. Consider how you will back up your data as well.
Set Up Development Environment
Setting up the environment includes installing and configuring PDI on development computers, configuring clustering if needed, and connecting to data sources. If you have one or more quality assurance environments, you will need to set those up also.
- Verify System Requirements
- Obtain Software and Install PDI
- Install Licenses for the Pentaho Server
- Connect to the Pentaho Repository
- Apply Advanced Security (if needed)
Build and Test Solution
During this step, you develop transformations, jobs, and models, then test what you have developed. You will tune the transformations, jobs, and models for optimal performance.
Development occurs in the Spoon design tool. Spoon's streamlined design tightly couples the build and test activities so that you can easily perform them iteratively. Spoon has perspectives that help you perform ETL and visualize data. Spoon also provides a scheduling perspective that you can use to automate testing. Testing encompasses verifying the quality of transformations and jobs, reviewing visualizations, and debugging issues. One common method of testing is to include steps in a transformation or job that calculate hash totals, checksums, record counts, and so forth to determine whether data is being processed properly. You can also visualize your data in Analyzer and Report Designer and review the results as you develop. This not only helps you find errors and processing issues, but can also give you a head start on user acceptance testing if you show these reports to your customers or business analysts for early feedback.
One basic question is how to determine the number of transformations and jobs needed, as well as the order in which they should be executed. A good rule of thumb is to create one transformation for each combination of source system and target table. You can often identify these combinations in your mapping documents. Once you've identified the number of transformations that you need, you can use the same process to determine the number of jobs that you need. When considering the order of execution for transformations and jobs, consider how referential integrity is enforced. Run target table transformations that have no dependencies first, then run transformations that depend on those tables, and so forth.
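The dependency-driven ordering described above is a topological sort of the target tables. As a sketch, with hypothetical table names, the foreign-key dependencies can be mapped out and sorted so that tables with no dependencies load first:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each target table lists the tables it
# references via foreign keys. Table names are illustrative only.
deps = {
    "dim_customer": set(),
    "dim_product": set(),
    "fact_sales": {"dim_customer", "dim_product"},
}

# static_order() yields each table only after all of its dependencies,
# giving a valid execution order for the corresponding transformations.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Here the two dimension tables appear before `fact_sales`, so their load transformations would run first.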
- Understand the Basics
- Review Most Often Used Steps and Entries
- Create and Run Transformations
- Create and Run a Job
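Beyond running transformations and jobs interactively in Spoon, PDI ships command-line runners: Pan for transformations (.ktr files) and Kitchen for jobs (.kjb files). The sketch below only builds the command strings; the install path and .ktr/.kjb file paths are assumptions, so adjust them to your environment before actually invoking the scripts:

```python
import shlex

PDI_HOME = "/opt/pentaho/data-integration"  # assumed install location

def pan_command(ktr_path, log_level="Basic"):
    """Build a Pan command line for running a transformation file."""
    return f"{PDI_HOME}/pan.sh -file={shlex.quote(ktr_path)} -level={log_level}"

def kitchen_command(kjb_path, log_level="Basic"):
    """Build a Kitchen command line for running a job file."""
    return f"{PDI_HOME}/kitchen.sh -file={shlex.quote(kjb_path)} -level={log_level}"

# Hypothetical file paths for illustration.
print(pan_command("/etl/load_customers.ktr"))
print(kitchen_command("/etl/nightly_load.kjb"))
```

Wrapping these commands in a script is one simple way to automate repeated test runs while you iterate on a transformation or job.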
Fine-tune transformations and jobs to optimize performance. This involves using tools such as the DI Operations Mart to determine where bottlenecks or other performance issues occur, and then addressing them.
- Review the Performance Tuning Checklist and Make Changes to Transformations and Jobs
- Consider Other Performance Tuning Options
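A common first pass when hunting bottlenecks is to compare per-step throughput: the step with the lowest rows-per-second rate is usually the one worth tuning first. The sketch below uses invented step names and metrics to show the calculation; in practice you would pull these numbers from Spoon's step metrics view or the DI Operations Mart:

```python
# Illustrative per-step metrics: (rows processed, elapsed seconds).
# Step names and numbers are invented for the example.
step_metrics = {
    "Table input": (1_000_000, 12.5),
    "Lookup customer": (1_000_000, 480.0),
    "Table output": (1_000_000, 35.0),
}

# Throughput in rows/sec; the slowest step is the likely bottleneck.
throughput = {step: rows / secs for step, (rows, secs) in step_metrics.items()}
bottleneck = min(throughput, key=throughput.get)
print(f"slowest step: {bottleneck} at {throughput[bottleneck]:.0f} rows/sec")
```

In this invented example, the lookup step processes roughly 2,000 rows/sec while the input step handles 80,000 rows/sec, so the lookup would be the first candidate for tuning (for example, by enabling caching or pushing the lookup into the database).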