Create Deployable Archive to Run Against Hadoop from Command Line

This example shows how to use the mcc command to create a deployable archive that calculates the maximum arrival delay in an airline dataset. The archive that you create contains all the MATLAB® content associated with the component. The mcc command also creates a shell script to run the deployable archive against Hadoop®. You can use the shell script to customize the execution of the deployable archive within your particular Hadoop environment.

This example uses the MaxMapReduceExample.m example file and the airline dataset, airlinesmall.csv, both available in the toolbox/matlab/demos folder. Copy the example code to a new working folder for deployment. Placing the new working folder on the path ensures that the files are accessible by MATLAB Compiler™.

    Note:   Creating a deployable archive that runs against Hadoop using the Hadoop Compiler app is supported only on Linux®.

  1. Set environment variables and cluster properties for your Hadoop configuration. These properties are necessary for submitting jobs to your Hadoop cluster.

    1. Set the environment variable HADOOP_HOME to point to your Hadoop installation folder. Modify the system path to include $HADOOP_HOME/bin.

    2. Install the MATLAB Runtime in a folder that is accessible by every worker node in the Hadoop cluster. The following example uses /hd-shared/MCR/v84.

      Download the MATLAB Runtime from the website at http://www.mathworks.com/products/compiler/mcr.

    3. Copy airlinesmall.csv into the Hadoop Distributed File System (HDFS™) folder /datasets/airlinemod.

    4. Copy the map function maxArrivalDelayMapper.m from the toolbox/matlab/demos folder to the working folder.

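      function maxArrivalDelayMapper(data, info, intermKVStore)
      % Find the maximum arrival delay in the current block of data.
      partMax = max(data.ArrDelay);
      % Store the partial maximum under a single intermediate key.
      add(intermKVStore,'PartialMaxArrivalDelay',partMax);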

      For more information, see Write a Map Function.

    5. Copy the reduce function maxArrivalDelayReducer.m from the toolbox/matlab/demos folder to the working folder.

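      function maxArrivalDelayReducer(intermKey, intermValIter, outKVStore)
      % Start below any possible delay so the first value always replaces it.
      maxVal = -inf;
      % Fold the partial maxima from all mappers into one overall maximum.
      while hasnext(intermValIter)
         maxVal = max(getnext(intermValIter), maxVal);
      end
      add(outKVStore,'MaxArrivalDelay',maxVal);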

      For more information, see Write a Reduce Function.
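
    The environment setup and the HDFS copy in step 1 can be sketched as the following shell commands. This is a minimal sketch under assumptions, not a required procedure: the install location /usr/local/hadoop is a hypothetical default used only when HADOOP_HOME is not already set, and the copy is skipped when the hadoop command or the local airlinesmall.csv file is not available.

```shell
# Hypothetical install location (an assumption); replace it with your Hadoop folder.
export HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"
# Put the Hadoop command-line tools on the system path.
export PATH="$HADOOP_HOME/bin:$PATH"

# HDFS target folder and local dataset named in this example.
HDFS_DIR=/datasets/airlinemod
LOCAL_CSV=airlinesmall.csv

if command -v hadoop >/dev/null 2>&1 && [ -f "$LOCAL_CSV" ]; then
    # Create the HDFS folder if needed, copy the dataset, and list the result.
    hadoop fs -mkdir -p "$HDFS_DIR"
    hadoop fs -put "$LOCAL_CSV" "$HDFS_DIR/"
    hadoop fs -ls "$HDFS_DIR"
else
    echo "Skipping HDFS copy: hadoop or $LOCAL_CSV not found"
fi
```

    Setting these variables in a shell startup file is one way to make them available in every session that submits jobs to the cluster.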

  2. Create a datastore object for the airline dataset, as in MaxMapReduceExample.m, and save the datastore to a .mat file.

    ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
        'SelectedVariableNames','ArrDelay','ReadSize',1000);
    save('airlinesmall.mat','ds')

    For more information, see Getting Started with Datastore.

  3. Create a Hadoop settings file named MWHadoopSetting.txt that specifies the input type tabulartext, the output type binary, the map function, the reduce function, and the previously created datastore.

    mw.ds.in.type = tabulartext
    mw.ds.in.format = airlinesmall.mat
    mw.ds.out.type = binary
    mw.mapper = maxArrivalDelayMapper
    mw.reducer = maxArrivalDelayReducer

    For more information, see Hadoop Settings File.
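
    One way to create the settings file from the command line is sketched below. The file name MWHadoopSetting.txt is taken from the CONFIG option of the mcc command used in this example, and the five settings are the ones listed in this step. The check at the end is only an illustrative sanity test, not part of the documented workflow.

```shell
# Write the settings from this step to the file named in the mcc CONFIG option.
cat > MWHadoopSetting.txt <<'EOF'
mw.ds.in.type = tabulartext
mw.ds.in.format = airlinesmall.mat
mw.ds.out.type = binary
mw.mapper = maxArrivalDelayMapper
mw.reducer = maxArrivalDelayReducer
EOF

# Illustrative check: report how many lines define each expected key.
for key in mw.ds.in.type mw.ds.in.format mw.ds.out.type mw.mapper mw.reducer; do
    count=$(grep -c "^$key " MWHadoopSetting.txt)
    echo "$key: $count"
done
```

    Create the file in the working folder so that the mcc command can find it by name.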

  4. Use the mcc command with the -H and -W options to create a deployable archive that runs against Hadoop. In the -W option, hadoop:airlinesmall names the archive and CONFIG:MWHadoopSetting.txt identifies the Hadoop settings file. The -a option adds the saved datastore file to the archive. The mcc command cannot package the results in an installer. The command must be entered as a single line.

    mcc -H -W 'hadoop:airlinesmall,CONFIG:MWHadoopSetting.txt'
      maxArrivalDelayMapper.m maxArrivalDelayReducer.m
       -a airlinesmall.mat

    For more information, see mcc.

    MATLAB Compiler creates a shell script run_airlinesmall.sh, a deployable archive airlinesmall.ctf, and a log file mccExcludedFiles.log.

  5. Deploy the archive as a Hadoop job by pointing the job to the CSV files in the airline dataset. The arguments in the command are the MATLAB Runtime root folder (MCRRoot), Hadoop properties defined using the -D flag, the input data files, and the new results folder. The command must be entered as a single line.

     !./run_airlinesmall.sh /hd-shared/MCR/v84
       -D mw.mcrroot=/hd-shared/MCR/v84 "/datasets/airlinemod/*.csv"
       myresults
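
    Because the run command must be entered as a single line, it can help to assemble it from variables in a shell session. The following is a minimal sketch, assuming the MATLAB Runtime folder and results folder used in this example and the HDFS folder /datasets/airlinemod from step 1. The variable names are illustrative only, and the job is submitted only if the generated script exists in the current folder.

```shell
# Values taken from this example; adjust them for your cluster.
MCRROOT=/hd-shared/MCR/v84
INPUT='/datasets/airlinemod/*.csv'
OUTPUT=myresults

# Show the full single-line command. Note that there are no spaces around "="
# in the -D property, and the input pattern is quoted so that the local shell
# does not expand the wildcard.
CMD="./run_airlinesmall.sh $MCRROOT -D mw.mcrroot=$MCRROOT \"$INPUT\" $OUTPUT"
echo "$CMD"

# Submit the job only when the generated script is present and executable.
if [ -x ./run_airlinesmall.sh ]; then
    ./run_airlinesmall.sh "$MCRROOT" -D "mw.mcrroot=$MCRROOT" "$INPUT" "$OUTPUT"
fi
```

    When you run the command from a system shell rather than from the MATLAB prompt, omit the leading ! character.
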

  6. Read the results into MATLAB and display them.

    ds = datastore('hdfs://hadoop01/user/username/myresults/part*',...
         'Type', 'keyvalue')
    airlinesmallResult = readall(ds)
               Key             Value
        __________________    ________
    
         'MaxArrivalDelay'     [1014]
    

Other examples of map and reduce functions are available in the toolbox/matlab/demos folder. You can use these examples to prototype similar deployable archives that run against Hadoop. For more information, see Build Effective Algorithms with MapReduce.
