
Using a Job Entry to Load Data into Hive

To follow along with this tutorial, you will need:
  • Hadoop
  • Pentaho Data Integration
  • Hive

PDI jobs can be used to put files into Hive from many different sources. This tutorial instructs you how to use a PDI job to load a sample data file into a Hive table.

Note: Hive tables can also be defined over external data. Using the external option, you could define a Hive table that points to the HDFS directory containing the parsed file. For this tutorial, we chose not to use the external option in order to demonstrate how easily files can be added to non-external Hive tables.
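For illustration only, such an external table might be defined roughly as shown below. This is a sketch, not a tutorial step: the table name weblogs_ext is hypothetical and only two of the columns are listed.

# Sketch only: define an external table over the parsed file (weblogs_ext is hypothetical)
hive -e "create external table weblogs_ext (
  client_ip string,
  full_request_date string)
row format delimited
fields terminated by '\t'
location '/user/pdi/weblogs/parse';"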

If they are not already running, start Hadoop, PDI, and the Hive server. Unzip the sample data file, weblogs_parse.txt.zip, to a convenient location.

Place the unzipped weblogs_parse.txt file in the /user/pdi/weblogs/parse directory of HDFS using these three commands:

hadoop fs -mkdir /user/pdi/weblogs
hadoop fs -mkdir /user/pdi/weblogs/parse
hadoop fs -put weblogs_parse.txt /user/pdi/weblogs/parse/part-00000

If you previously completed the Using Pentaho MapReduce to Parse Weblog Data tutorial, the necessary files will already be in the proper directory.
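Either way, you can optionally confirm the file is in place before continuing. This check assumes the hadoop command is on your path:

# Optional check: the listing should include part-00000
hadoop fs -ls /user/pdi/weblogs/parse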

  1. Create a Hive Table.
    1. Open the Hive shell by entering 'hive' at the command line.
    2. Create a table in Hive for the sample data by entering
      create table weblogs (
        client_ip           string,
        full_request_date   string,
        day                 string,
        month               string,
        month_num           int,
        year                string,
        hour                string,
        minute              string,
        second              string,
        timezone            string,
        http_verb           string,
        uri                 string,
        http_status_code    string,
        bytes_returned      string,
        referrer            string,
        user_agent          string)
      row format delimited
      fields terminated by '\t';
    3. Close the Hive shell by entering 'quit'.
  2. Create a new Job to load the sample data into a Hive table by selecting File > New > Job.
  3. Add a Start job entry to the canvas: from the Design palette on the left, under the General folder, drag a Start job entry onto the canvas.
  4. Add a Hadoop Copy Files job entry to the canvas: from the Design palette, under the Big Data folder, drag a Hadoop Copy Files job entry onto the canvas.
  5. Connect the two job entries by hovering over the Start entry, selecting the output connector, and dragging the connector arrow to the Hadoop Copy Files entry.
  6. Enter the source and destination information within the properties of the Hadoop Copy Files entry by double-clicking it.
    1. For File/Folder source(s), enter hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/parse, where NAMENODE and PORT reflect your Hadoop destination.
    2. For File/Folder destination(s), enter hdfs://<NAMENODE>:<PORT>/user/hive/warehouse/weblogs.
    3. For Wildcard (RegExp), enter part-.*.
    4. Click the Add button to add the entries to the list of files to copy.

    When you are done, the Hadoop Copy Files window should list the source hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/parse, the destination hdfs://<NAMENODE>:<PORT>/user/hive/warehouse/weblogs, and the wildcard part-.* (your file paths may be different).

    Click OK to close the window.

  7. Save the job by selecting Save as from the File menu. Enter load_hive.kjb as the file name within a folder of your choice.
  8. Run the job by clicking the green Run button on the job toolbar, or by selecting Action > Run from the menu. The Execute a job window opens. Click Launch. (A command-line alternative using Kitchen is sketched after this procedure.)

    An Execution Results panel opens at the bottom of the Spoon interface and displays the progress of the job as it runs. After a few seconds the job finishes successfully.

    If any errors occurred, the job entry that failed is highlighted in red, and you can use the Logging tab to view error messages.

  9. Verify the data was loaded by querying Hive.
    1. Open the Hive shell from the command line by entering hive.
    2. Enter this query to verify that the data was loaded correctly into Hive.
      select * from weblogs limit 10;

      Ten rows of data are returned.
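If you want an additional check from the OS command line, you can run a row count without opening the interactive Hive shell. This is an optional sketch; the exact count depends on the sample file:

# Optional: count the rows loaded into the weblogs table
hive -e "select count(*) from weblogs;"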
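As mentioned in step 8, a saved job can also be run outside of Spoon with PDI's Kitchen script. The sketch below assumes a Linux install and that load_hive.kjb was saved to /home/pdi/jobs; adjust both paths for your environment:

# Run the saved job with Kitchen from the PDI installation directory (paths are assumptions)
cd /path/to/data-integration
./kitchen.sh -file=/home/pdi/jobs/load_hive.kjb -level=Basic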