
Using a Job Entry to Load Data into Hadoop's Distributed File System (HDFS)

To follow along with this tutorial, you will need:
  • Hadoop
  • Pentaho Data Integration

You can use PDI jobs to put files into HDFS from many different sources. This tutorial describes how to create a PDI job to move a sample file into HDFS.

If they are not already running, start Hadoop and PDI. Unzip the sample data file weblogs_rebuild.txt.zip and put the extracted weblogs_rebuild.txt in a convenient location.
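
For reference, this preparation can also be done from a terminal. A minimal sketch, assuming the zip file was downloaded to the current directory and that /home/user/pdi-samples is only an example extraction path:

    # Extract the sample file into a working directory (example path only)
    unzip weblogs_rebuild.txt.zip -d /home/user/pdi-samples

    # Confirm that HDFS is up and reachable before building the job
    hadoop fs -ls /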

  1. Create a new Job by selecting File > New > Job.
  2. Add a Start job entry to the canvas. From the Design palette on the left, under the General folder, drag a Start job entry onto the canvas.
  3. Add a Hadoop Copy Files job entry to the canvas. From the Design palette, under the Big Data folder, drag a Hadoop Copy Files job entry onto the canvas.
  4. Connect the two job entries by hovering over the Start entry, selecting the output connector, and dragging the connector arrow to the Hadoop Copy Files entry.
  5. Enter the source and destination information within the properties of the Hadoop Copy Files entry by double-clicking it.
    1. For File/Folder source(s), click Browse and navigate to the folder containing the downloaded sample file weblogs_rebuild.txt.
    2. For File/Folder destination(s), enter hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/raw, where NAMENODE and PORT reflect your Hadoop destination.
    3. For Wildcard (RegExp), enter ^.*\.txt.
    4. Click Add to include the entries in the list of files to copy.
    5. Check the Create destination folder option to ensure that the weblogs folder is created in HDFS the first time this job is executed.

    When you are done, your window should look like this (your file paths may be different).

    [Screenshot: Hadoop Copy Files entry properties with source, destination, and wildcard configured]

    Click OK to close the window.
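
    To sanity-check the configuration, note that this entry performs roughly the same work as the manual HDFS commands below. This is only an illustrative sketch; substitute your own NAMENODE and PORT, and run it from the folder containing the sample file.

      # Create the destination folder, as the Create destination folder option does
      # (older Hadoop versions may not need, or may not accept, the -p flag)
      hadoop fs -mkdir -p hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/raw

      # Copy the local file matched by the ^.*\.txt wildcard into HDFS
      hadoop fs -put weblogs_rebuild.txt hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/raw/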

  6. Save the job by selecting Save as from the File menu. Enter load_hdfs.kjb as the file name within a folder of your choice.
  7. Run the job by clicking the green Run button on the job toolbar, or by selecting Action > Run from the menu. The Execute a job window opens. Click Launch.

    An Execution Results panel opens at the bottom of the Spoon interface and displays the progress of the job as it runs. After a few seconds the job finishes successfully.

    [Screenshot: Execution Results panel showing the completed job]

    If any errors occur, the job entry that failed is highlighted in red, and you can use the Logging tab to view error messages.
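
    The saved job can also be run without the Spoon GUI by using PDI's Kitchen command-line tool. A minimal sketch, assuming a Linux installation, run from the PDI installation directory, with /home/user/load_hdfs.kjb standing in for wherever you saved the job:

      # Run the job headlessly with Kitchen (use Kitchen.bat on Windows)
      ./kitchen.sh -file=/home/user/load_hdfs.kjb -level=Basic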

  8. Verify the data was loaded by querying Hadoop.
    1. From the command line, query Hadoop by entering this command.
      hadoop fs -ls /user/pdi/weblogs/raw
    A listing like the following is returned:
      -rwxrwxrwx 3 demo demo 77908174 2011-12-28 07:16 /user/pdi/weblogs/raw/weblog_raw.txt
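
    If you want to inspect the copied file further, optional follow-up checks might look like this (standard HDFS shell commands; the file name is taken from the listing above):

      # Report the size of the copied file
      hadoop fs -du /user/pdi/weblogs/raw

      # Print the first few lines to spot-check the contents
      hadoop fs -cat /user/pdi/weblogs/raw/weblog_raw.txt | head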