Create Standalone Application to Run Against Hadoop from Command Line

This example shows how to modify a MATLAB® example that calculates the maximum arrival delay in an airline data set and package it as a standalone application. The standalone application is a MATLAB program, compiled with the mcc command, that runs against Hadoop®. A mapreducer object defines the Hadoop execution environment.

This example uses the MaxMapReduceExample.m example file and the airline data set, airlinesmall.csv, both available in the toolbox/matlab/demos folder. Copy the example code to a new working folder for deployment, and make sure that the working folder is on the MATLAB path so that the files are accessible to MATLAB Compiler™.

    Note:   Standalone applications that are created with mcc and run against Hadoop are supported only on Linux®.

  1. Set environment variables and cluster properties for your Hadoop configuration. These properties are necessary for submitting jobs to your Hadoop cluster.

    • Set the environment variable HADOOP_HOME to point to your Hadoop installation folder, and modify the system path to include $HADOOP_HOME/bin.

      setenv('HADOOP_HOME','/share/hadoop/a1.2.1')

    • Install the MATLAB Runtime in a folder that is accessible to every worker node in the Hadoop cluster. The following steps use /hd-shared/hadoop-2.2.0/MCR/v84.

      Download the MATLAB Runtime from the website at http://www.mathworks.com/products/compiler/mcr.

    • Copy airlinesmall.csv into the Hadoop Distributed File System (HDFS™) folder /datasets/airlinemod.

    • Copy the map function maxArrivalDelayMapper.m from the toolbox/matlab/demos folder to the working folder.

      function maxArrivalDelayMapper(data, info, intermKVStore)
      % Record the largest arrival delay found in this block of data.
      partMax = max(data.ArrDelay);
      add(intermKVStore,'PartialMaxArrivalDelay',partMax);

      For more information, see Write a Map Function.

    • Copy the reduce function maxArrivalDelayReducer.m from the toolbox/matlab/demos folder to the working folder.

      function maxArrivalDelayReducer(intermKey, intermValIter, outKVStore)
      maxVal = -inf;
      while hasnext(intermValIter)
         maxVal = max(getnext(intermValIter), maxVal);
      end
      add(outKVStore,'MaxArrivalDelay',maxVal);

      For more information, see Write a Reduce Function.
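
The environment setup in this step can be sketched from within MATLAB as follows. This is a minimal sketch, not part of the original example. The install path is the one shown above, the hadoop fs -put call assumes that the target HDFS folder already exists, and the copy is skipped when the hadoop command cannot be found.

```matlab
% Assumed Hadoop install folder (taken from the example above); adjust as needed.
hadoopHome = '/share/hadoop/a1.2.1';
setenv('HADOOP_HOME', hadoopHome);

% Append $HADOOP_HOME/bin to the system path of this MATLAB session,
% unless it is already present.
hadoopBin = [hadoopHome '/bin'];
systemPath = getenv('PATH');
if isempty(strfind(systemPath, hadoopBin))
    setenv('PATH', [systemPath ':' hadoopBin]);
end

% Copy the data set into HDFS only when the hadoop command is reachable.
% Assumption: the HDFS folder /datasets/airlinemod already exists.
[status, ~] = system('hadoop version');
if status == 0
    localCsv = which('airlinesmall.csv');
    system(['hadoop fs -put "' localCsv '" /datasets/airlinemod/']);
else
    disp('hadoop command not found; skipping the copy to HDFS.');
end
```

Changes made with setenv apply only to the current MATLAB session and to processes started from it, so a permanent setup would typically belong in the shell profile instead.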

  2. Create a datastore that points to the airline data in Hadoop Distributed File System (HDFS).

    ds = datastore(...
         'hdfs://hadoop01/datasets/airlinemod/airlinesmall.csv',...
         'TreatAsMissing','NA')
    % The map function reads data.ArrDelay, so select that variable.
    ds.SelectedVariableNames = 'ArrDelay';

    If the files are located in HDFS, then the datastore should point to HDFS. For more information, see Read from HDFS.
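
Before pointing the datastore at HDFS, it can help to check the map and reduce functions against the local copy of the data. The following is a sketch under assumptions, not a step of the original workflow. It assumes that airlinesmall.csv and the two functions are on the MATLAB path, selects the ArrDelay variable that the mapper reads, and uses mapreducer(0) to run serially in the local MATLAB session.

```matlab
% Local datastore on the copy of airlinesmall.csv found on the MATLAB path.
localDs = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');

% The mapper reads data.ArrDelay, so that variable must be selected.
localDs.SelectedVariableNames = 'ArrDelay';

% Run serially in the local session instead of on Hadoop.
localResult = mapreduce(localDs, @maxArrivalDelayMapper, ...
    @maxArrivalDelayReducer, mapreducer(0));

% readall returns a table with the variables Key and Value.
localTable = readall(localResult)
```

If this local run fails, the problem lies in the functions or the variable selection rather than in the Hadoop configuration.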

  3. Create a mapreducer object to set the properties of Hadoop in deployed mode. The mapreducer passes information about the execution environment to standalone applications that run against Hadoop. The mapreducer must point to the location of the MATLAB Runtime that is accessible from all the Hadoop worker nodes.

    mr = mapreducer(matlab.mapreduce.DeployHadoopMapReducer('MCRRoot',...
        '/hd-shared/hadoop-2.2.0/MCR/v84'))

    For more information, see matlab.mapreduce.DeployHadoopMapReducer.
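
One possible way to keep a single source file usable both while prototyping in MATLAB and after compilation is to choose the mapreducer at run time. This is an illustrative design choice rather than something the original example does. It relies on isdeployed and on mapreducer(0), which selects serial execution in the local session. The variable names are hypothetical.

```matlab
% Assumed MATLAB Runtime location, as used in the example above.
mcrRoot = '/hd-shared/hadoop-2.2.0/MCR/v84';

if isdeployed
    % Compiled application: describe the Hadoop execution environment.
    config = matlab.mapreduce.DeployHadoopMapReducer('MCRRoot', mcrRoot);
    mr = mapreducer(config);
else
    % Interactive MATLAB session: run mapreduce serially for prototyping.
    mr = mapreducer(0);
end
```

With this pattern the datastore location would also need to differ between the two cases, for example a local file while prototyping and an hdfs:// location when deployed.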

  4. Assemble the new application, maxmapreduceapp.m. It consists of a datastore, a mapreducer object that specifies the deployed environment, a mapreduce command, and a command to view the results of mapreduce:

    ds = datastore(...
        'hdfs://hadoop01/datasets/airlinemod/airlinesmall.csv',...
        'TreatAsMissing','NA')
    % The map function reads data.ArrDelay, so select that variable.
    ds.SelectedVariableNames = 'ArrDelay';
    mr = mapreducer(matlab.mapreduce.DeployHadoopMapReducer('MCRRoot',...
        '/hd-shared/hadoop-2.2.0/MCR/v84'))
    result = mapreduce(ds,@maxArrivalDelayMapper,@maxArrivalDelayReducer,...
        mr,'OutputType','Binary', ...
        'OutputFolder','hdfs://hadoop01/user/username/myresults');
    maxMapreduceappResult = readall(result)

  5. Use the mcc command with the -m flag to create a standalone application. The -m flag creates a standard executable that can be run from a command line. However, the mcc command cannot package the results in an installer.

    mcc -m maxmapreduceapp.m

    For more information, see mcc.

    MATLAB Compiler creates the executable maxmapreduceapp, a shell script run_maxmapreduceapp.sh, and a log file mccExcludedFiles.log.
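
A quick check that the build produced its files can be sketched as follows. The file names are assumptions derived from the main file name passed to mcc (the executable, a run_ script, and the excluded-files log). Adjust them if your build reports different names.

```matlab
% Assumed output names, derived from the main file name passed to mcc.
appName = 'maxmapreduceapp';
expectedFiles = {appName, ['run_' appName '.sh'], 'mccExcludedFiles.log'};

% Report which of the expected files exist in the current folder.
isPresent = false(size(expectedFiles));
for k = 1:numel(expectedFiles)
    isPresent(k) = exist(fullfile(pwd, expectedFiles{k}), 'file') == 2;
    if isPresent(k)
        fprintf('found:   %s\n', expectedFiles{k});
    else
        fprintf('missing: %s\n', expectedFiles{k});
    end
end
```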

  6. Run the standalone application from the MATLAB command prompt using the following command:

    !./maxmapreduceapp
            Key           Value
        ____________   _____________
    
            'AA'       [92X1 double]
            'AS'       [92X1 double]
            'CO'       [92X1 double]
            'DL'       [92X1 double]
            'EA'       [92X1 double]

    Results display in MATLAB.
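
As an alternative to the ! operator, the application can be started with system so that its exit status can be checked. This is a sketch, assuming the executable is named maxmapreduceapp and sits in the current folder. Treating a nonzero status as a failure is a convention assumed here, not something the original example states.

```matlab
appPath = fullfile(pwd, 'maxmapreduceapp');   % assumed executable name

if exist(appPath, 'file') == 2
    % With one output, system echoes the application output to the
    % Command Window and returns the exit status.
    status = system(['"' appPath '"']);
    if status ~= 0
        warning('Application exited with status %d.', status);
    end
else
    status = NaN;   % nothing was run
    disp('Executable not found; build it with mcc first.');
end
```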

Other examples of map and reduce functions are available in the toolbox/matlab/demos folder. You can use them to prototype similar standalone applications that run against Hadoop. For more information, see Build Effective Algorithms with MapReduce.
