Package Deployable Archive to Run Against Hadoop with Hadoop Compiler App

This example shows how to create a deployable archive that calculates the maximum arrival delay in an airline data set and runs against Hadoop®. You build the archive with the Hadoop Compiler app, which is accessible from deploytool. The archive that you create contains all the MATLAB® content associated with the component. The Hadoop Compiler app also generates the corresponding mcc commands, which you can customize to your specification.

This example uses the MaxMapReduceExample.m example file and the airline data set airlinesmall.csv, both available in the toolbox/matlab/demos folder. Copy the example code to a new working folder for deployment, and make sure that the working folder is on the MATLAB path so that MATLAB Compiler™ can access the files.

    Note:   Deployable archives that run against Hadoop and are built with the Hadoop Compiler app are supported only on Linux®.

  1. Set environment variables and cluster properties for your Hadoop configuration. These properties are necessary for submitting jobs to your Hadoop cluster.

    1. Set the HADOOP_HOME environment variable to point to your Hadoop installation folder. Modify the system path to include $HADOOP_HOME/bin.

    2. Install the MATLAB Runtime in a folder that is accessible by every worker node in the Hadoop cluster.

      The following example uses /hd-shared/MCR/v84.

      For information on installing the MATLAB Runtime, see Install the MATLAB Runtime.

    3. Copy airlinesmall.csv into the Hadoop Distributed File System (HDFS™) folder /datasets/airlinemod.

    4. Copy the map function maxArrivalDelayMapper.m from the toolbox/matlab/demos folder to the working folder.

      function maxArrivalDelayMapper(data, info, intermKVStore)
      % Find the maximum arrival delay in the current chunk of data
      partMax = max(data.ArrDelay);
      add(intermKVStore,'PartialMaxArrivalDelay',partMax);

      For more information, see Write a Map Function.

    5. Copy the reduce function maxArrivalDelayReducer.m from the toolbox/matlab/demos folder to the working folder.

      function maxArrivalDelayReducer(intermKey, intermValIter, outKVStore)
      % Combine the partial maxima from the mappers into one overall maximum
      maxVal = -inf;
      while hasnext(intermValIter)
         maxVal = max(getnext(intermValIter), maxVal);
      end
      add(outKVStore,'MaxArrivalDelay',maxVal);

      For more information, see Write a Reduce Function.

  2. Create a datastore object, as in MaxMapReduceExample.m, and save the datastore to a .mat file.

    ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
         'SelectedVariableNames','ArrDelay','ReadSize',1000);
    
    save('airlinesmall.mat','ds')

    For more information, see Getting Started with Datastore.

  3. Launch the Hadoop Compiler app from the MATLAB command line or from the apps gallery. At the MATLAB command line, type the following command:

    hadoopCompiler

  4. In the Map Function section of the toolstrip, click the plus button to add the file that contains the map function. Browse to and select maxArrivalDelayMapper.m.

  5. In the Reduce Function section of the toolstrip, click the plus button to add the file that contains the reduce function. Browse to and select maxArrivalDelayReducer.m.

  6. In the Input Types section, select tabulartext as the input type. By default, the input type is tabulartext.

  7. In the Output Types section, select tabulartext as the output type. By default, the output type is binary.

  8. Rename the application to maxArrivalDelay.

  9. In the Data store file field, click Browse and select the airlinesmall.mat file, which contains the saved datastore object.

  10. Click Package to build a deployable archive.

    The Hadoop Compiler app creates a log file, PackagingLog.txt, and two folders, for_redistribution and for_testing. The for_redistribution folder contains a readme file, the shell script run_maxarrivaldelay.sh, and the deployable archive maxarrivaldelay.ctf. The for_testing folder contains the same three files and a log file, mccExcludedfiles.log.

  11. At the MATLAB command prompt, run the deployable archive against Hadoop using the generated shell script. The arguments to the script are the MATLAB Runtime root (MCRRoot), Hadoop properties defined using the -D flag, the data file, and the new results folder. Enter the command that executes the script as a single line.

    cd maxArrivalDelay/for_testing
    !./run_maxarrivaldelay.sh /hd-shared/MCR/v84 -D mw.mcrroot=/hd-shared/MCR/v84 /datasets/airlinemod/airlinesmall.csv myresults

  12. Examine the results using the Hadoop command.

    !hadoop fs -cat myresults/*
      'MaxArrivalDelay' 	[1014]
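
The shell-side parts of steps 1, 11, and 12 can also be collected into one script and run from a Linux terminal instead of through the MATLAB `!` escape. The following is a minimal sketch, not the script that the app generates. It assumes a default Hadoop install folder of /usr/lib/hadoop (hypothetical; adjust it to your cluster), assumes that airlinesmall.csv is present in the current folder, and assumes that the -D property is written without spaces around the equal sign, as Hadoop's property=value form expects. All other paths are the ones used in this example.

```shell
#!/bin/sh
# Sketch of the shell-side steps. Every path is taken from the example except
# the HADOOP_HOME default, which is an assumption.

# Step 1.1: point HADOOP_HOME at the Hadoop install folder and add its bin
# folder to the system path.
export HADOOP_HOME="${HADOOP_HOME:-/usr/lib/hadoop}"
export PATH="$HADOOP_HOME/bin:$PATH"

MCRROOT=/hd-shared/MCR/v84        # MATLAB Runtime folder (step 1.2)
HDFS_DIR=/datasets/airlinemod     # HDFS input folder (step 1.3)
INPUT="$HDFS_DIR/airlinesmall.csv"
OUTPUT=myresults                  # new results folder (step 11)
SCRIPT=./run_maxarrivaldelay.sh   # script generated by the app (step 10)

# Step 11: the whole command on a single line.
RUN_CMD="$SCRIPT $MCRROOT -D mw.mcrroot=$MCRROOT $INPUT $OUTPUT"
echo "Run command: $RUN_CMD"

if command -v hadoop >/dev/null 2>&1 && [ -x "$SCRIPT" ]; then
    # Step 1.3: copy the data set into HDFS.
    hadoop fs -mkdir -p "$HDFS_DIR"
    hadoop fs -put -f airlinesmall.csv "$HDFS_DIR/"

    # Step 11: run the deployable archive against Hadoop.
    "$SCRIPT" "$MCRROOT" -D "mw.mcrroot=$MCRROOT" "$INPUT" "$OUTPUT"

    # Step 12: print the results. The glob is quoted so that Hadoop,
    # not the local shell, expands it.
    hadoop fs -cat "$OUTPUT/*"
else
    echo "Skipping: hadoop or $SCRIPT is not available in this shell." >&2
fi
```

Run the script from the for_testing folder so that the relative path to run_maxarrivaldelay.sh resolves. The guard at the end only prints the assembled command when Hadoop or the generated script is missing, which makes it easy to check the argument order before submitting a job.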
    

Other examples of map and reduce functions are available at toolbox/matlab/demos folder. You can use other examples to prototype similar deployable archives that run against Hadoop. For more information, see Build Effective Algorithms with MapReduce.
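
When prototyping, it can help to check the map and reduce logic on a few values before packaging anything. The sketch below mimics the pattern used in this example: the map stage emits one partial maximum per chunk of rows, and the reduce stage takes the maximum of those partial values. It is an awk-based illustration only, not MATLAB mapreduce. The sample values are made up, and CHUNK is a hypothetical stand-in for the datastore 'ReadSize' setting.

```shell
#!/bin/sh
# Made-up sample values standing in for the ArrDelay column. "NA" marks
# missing entries, as in the datastore 'TreatAsMissing' setting.
SAMPLE="8
NA
-3
21
14
NA
5"

CHUNK=3   # rows per chunk (hypothetical stand-in for 'ReadSize')

# "Map": emit one partial maximum per chunk of rows, skipping NA values.
PARTIALS=$(printf '%s\n' "$SAMPLE" | awk -v n="$CHUNK" '
    function emit() { if (seen) print max; seen = 0 }
    $1 != "NA" { if (!seen || $1 + 0 > max) max = $1 + 0; seen = 1 }
    NR % n == 0 { emit() }
    END { emit() }')

# "Reduce": take the maximum of the partial maxima.
RESULT=$(printf '%s\n' "$PARTIALS" | awk '
    NR == 1 || $1 + 0 > max { max = $1 + 0 }
    END { print max }')

echo "partial maxima: $(echo $PARTIALS)"
echo "overall maximum: $RESULT"
```

Because the maximum of partial maxima equals the maximum of the whole column, the result does not depend on how the rows are split into chunks, which is what makes this calculation a good fit for map and reduce functions.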
