These are the scripts I use to download, configure, and deploy several big data frameworks (YARN/MapReduce, Spark) and related systems (HDFS, ZooKeeper, InfluxDB).
Warning: I have not extensively tested these scripts for different users. The scripts assume ownership of the `/local/$USER/` directory on every node used in a deployment. In particular, the scripts will wipe the `/local/$USER/{hadoop,spark,zookeeper,influxdb}` directories before the respective application is deployed.
- Git clone this repository to your home directory on DAS-5. The resulting directory will be referred to as `$DEPLOYER_HOME` throughout this manual.
Optional, if space in your home directory is limited (deployments are likely to generate gigabytes of logs over time):

- Create a directory in your scratch folder for the big data frameworks and configuration files, e.g., `/var/scratch/$USER/big-data-frameworks`.
- Create a symlink in `$DEPLOYER_HOME` called `frameworks` pointing at the directory you created in the previous step.
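The optional scratch setup can be sketched as follows. This demo uses temporary directories so it runs anywhere; on DAS-5 you would substitute your actual clone location for `DEPLOYER_HOME` and `/var/scratch/$USER` for `SCRATCH_BASE`.

```shell
# Stand-ins for the real paths (assumptions, not the actual DAS-5 locations):
DEPLOYER_HOME="${DEPLOYER_HOME:-$(mktemp -d)}"   # would be the cloned repo
SCRATCH_BASE="${SCRATCH_BASE:-$(mktemp -d)}"     # would be /var/scratch/$USER

FRAMEWORKS_DIR="$SCRATCH_BASE/big-data-frameworks"
mkdir -p "$FRAMEWORKS_DIR"
# -sfn replaces any existing "frameworks" symlink instead of nesting inside it
ln -sfn "$FRAMEWORKS_DIR" "$DEPLOYER_HOME/frameworks"
ls -ld "$DEPLOYER_HOME/frameworks"
```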
The deployment scripts can now be used to deploy any of the included frameworks.
The deployer can create a new reservation via `preserve`, or you may use existing reservations. To create a reservation, run:
```
$DEPLOYER_HOME/deployer preserve create-reservation -q -t "$TIMEOUT" $MACHINES
```

where `$TIMEOUT` should be the duration of the reservation in `hh:mm:ss` format and `$MACHINES` should be the number of nodes to reserve. The output includes the ID of your reservation.
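For example, to reserve a small cluster for a short experiment (the values below are illustrative, not defaults):

```shell
# Reserve 4 nodes for 15 minutes; note the reservation ID printed on success
TIMEOUT="00:15:00"
MACHINES=4
$DEPLOYER_HOME/deployer preserve create-reservation -q -t "$TIMEOUT" $MACHINES
```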
Use the following (substituting your reservation ID) to check the status of your reservation:

```
$DEPLOYER_HOME/deployer preserve fetch-reservation $RESERVATION_ID
```

To get a list of supported frameworks and versions, run:
```
$DEPLOYER_HOME/deployer list-frameworks --versions
```

Before a framework can be deployed, it must be "installed". This only needs to be done once. After installing, the framework can be repeatedly deployed. In the following command, substitute a framework name and version as output by the `deployer list-frameworks` command.
```
$DEPLOYER_HOME/deployer install $FRAMEWORK $VERSION
```

To deploy a framework, use the `deployer deploy -h` command for help, or use one of the following standard deployments.
To deploy Hadoop (HDFS and YARN) with sensible defaults, run the following command (substituting your reservation ID):
```
./deployer deploy --preserve-id $RESERVATION_ID -s env/das5-hadoop.settings hadoop 2.6.0
```

If you do not need HDFS or YARN, append the `hdfs_enable=false` or `yarn_enable=false` option, respectively, to the above command.
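For instance, an HDFS-only deployment (no YARN) would append the option like this; `$RESERVATION_ID` is your reservation ID from the previous step:

```shell
# Deploy Hadoop 2.6.0 with YARN disabled, leaving only HDFS
./deployer deploy --preserve-id $RESERVATION_ID -s env/das5-hadoop.settings hadoop 2.6.0 yarn_enable=false
```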
Note: the deployer launches master processes on the first machine in the reservation (as indicated in the output of the deploy command). To connect to HDFS or YARN, first connect to that machine via SSH and then use Hadoop from the `$DEPLOYER_HOME/frameworks/hadoop-2.6.0` directory.
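A typical session might look like the following; `$MASTER_NODE` is a placeholder for the hostname of the first machine in the reservation, as shown in the deploy output:

```shell
# Connect to the machine running the master processes (hostname from deploy output)
ssh "$MASTER_NODE"
cd $DEPLOYER_HOME/frameworks/hadoop-2.6.0
# List the root of the freshly deployed HDFS as a quick sanity check
bin/hdfs dfs -ls /
```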
To deploy Spark with sensible defaults, run the following command (substituting your reservation ID):
```
./deployer deploy --preserve-id $RESERVATION_ID -s env/das5-spark.settings spark 2.4.0
```

To connect to Spark using a shell, first connect to the application master via SSH, then run `$DEPLOYER_HOME/frameworks/spark-2.4.0/bin/spark-shell` to open a Spark session connected to the cluster.
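Putting that together, a quick end-to-end check might look as follows; `$MASTER_NODE` is a placeholder for the first machine in the reservation:

```shell
# Connect to the application master (hostname from the deploy output)
ssh "$MASTER_NODE"
# Open a Spark shell connected to the cluster
$DEPLOYER_HOME/frameworks/spark-2.4.0/bin/spark-shell
# Inside the shell, a trivial job verifies the cluster is working:
#   scala> spark.range(1000).count()   // should return 1000
```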