Data Processing Jobs
The following topics describe job attributes that work with Data Processing platforms and services:
Alteryx Trifacta Job
Alteryx Trifacta is a data-wrangling platform that allows you to discover, organize, edit, and publish data in different formats and to multiple cloud platforms and services, including AWS, Azure, Google Cloud, Snowflake, and Databricks.
The following table describes the Trifacta job type attributes.
| Attribute | Description |
|---|---|
| Connection Profile | Determines the authorization credentials that are used to connect Control-M to Alteryx Trifacta. Rules: |
| Flow Name | Determines which Trifacta flow the job runs. |
| Rerun with New Idempotency Token | Determines whether to allow a rerun of the job in Trifacta with a new idempotency token (for example, when the job run times out). |
| Idempotent Token | Defines the unique ID (idempotency token) that guarantees the job run is executed only once. After successful execution, this ID cannot be used again. To allow a rerun of the job with a new token, replace the default value with a unique ID that has not been used before. Use the RUN_ID, which can be retrieved from the job output. Default: Control-M-Idem_%%ORDERID, which means that the job run cannot be executed again. |
| Retrack Job Status | Determines whether to track the job run status as the job run progresses and the status changes (for example, from in progress to failed or to completed). |
| Run ID | Defines the RUN_ID number for the job run to be tracked. The RUN_ID is unique to each job run and can be found in the job output. |
| Status Polling Frequency | Determines the number of seconds to wait before checking the status of the Trifacta job. Default: 10 |
AWS Batch Job
AWS Batch enables you to manage and run batch computing workloads in AWS.
The following table describes the AWS Batch job attributes.
| Attribute | Description |
|---|---|
| Connection Profile | Determines the authorization credentials that are used to connect Control-M to AWS Batch. Rules: |
| Use Advanced JSON Format | Determines whether you supply your own JSON parameters. |
| JSON Format | Defines the parameters for the batch job, in JSON format, that enable you to control how the job runs. For a description of this JSON syntax, see the description of SubmitJob in the AWS Batch API Reference. See the example after this table. |
| Job Name | Defines the name of the batch job. |
| Job Definition and Revision | Determines which predefined job definition and version number (revision) is applied to the job, depending upon how you complete the field (for example, ctm-batch-job-definition:3). |
| Job Queue | Determines the job queue, which stores your batch job. |
| Container Overrides Command | (Optional) Defines a command, in JSON format, that overrides the command specified in the job definition. |
| Job Attempts | (Optional) Determines the number of times to retry a job run, which overrides the retry attempts determined in the job definition. Valid numbers: 1–10 |
| Execution Timeout | (Optional) Determines the number of seconds to wait before a timeout occurs in a batch job, which overrides the timeouts specified in the job definition. |
| Status Polling Frequency | Determines the number of seconds to wait before checking the job status. Default: 20 |
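A minimal sketch of the JSON that you might supply in the JSON Format attribute is shown below. It follows the SubmitJob request syntax from the AWS Batch API Reference; the job name, queue, definition, and override values are placeholders, and you only need to include the fields that your job actually requires:

```json
{
  "jobName": "my-batch-job",
  "jobQueue": "my-job-queue",
  "jobDefinition": "ctm-batch-job-definition:3",
  "containerOverrides": {
    "command": ["echo", "hello from Control-M"],
    "environment": [
      { "name": "ENV_NAME", "value": "dev" }
    ]
  },
  "retryStrategy": { "attempts": 2 },
  "timeout": { "attemptDurationSeconds": 300 }
}
```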
AWS Data Pipeline Job
AWS Data Pipeline is a cloud-based ETL service that enables you to automate the transfer, processing, and storage of your data.
The following table describes the AWS Data Pipeline job attributes.
| Attribute | Action | Description |
|---|---|---|
| Connection Profile | N/A | Determines the authorization credentials that are used to connect Control-M to AWS Data Pipeline. Rules: |
| Action | N/A | Determines one of the following AWS Data Pipeline actions: |
| Pipeline Name | Create Pipeline | Defines the name of the new AWS Data Pipeline. |
| Pipeline Unique ID | Create Pipeline | Defines the unique ID (idempotency key) that guarantees the pipeline is created only once. After successful execution, this ID cannot be used again. Valid characters: Any alphanumeric characters. |
| Parameters | Create Pipeline | Defines the parameter objects, which define the variables, for your AWS Data Pipeline in JSON format. For more information about the available parameter objects, see the descriptions of the PutPipelineDefinition and GetPipelineDefinition actions in the AWS Data Pipeline API Reference. See the example after this table. |
| Trigger Created Pipeline | Create Pipeline | Determines whether to run, or trigger, the newly created AWS Data Pipeline. |
| Pipeline ID | Trigger Pipeline | Determines which pipeline to run, or trigger. |
| Status Polling Frequency | All actions | Determines the number of seconds to wait before checking the status of the Data Pipeline job. Default: 20 |
| Failure Tolerance | All actions | Determines the number of times the job tries to run before ending Not OK. Default: 2 |
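A minimal sketch of the parameter objects JSON is shown below, based on the PutPipelineDefinition request syntax; the IDs and values are placeholders only, and a complete pipeline definition typically also includes pipeline objects:

```json
{
  "parameterObjects": [
    {
      "id": "myS3InputLocation",
      "attributes": [
        { "key": "type", "stringValue": "AWS::S3::ObjectKey" },
        { "key": "description", "stringValue": "S3 input folder" }
      ]
    }
  ],
  "parameterValues": [
    { "id": "myS3InputLocation", "stringValue": "s3://my-bucket/input" }
  ]
}
```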
AWS EMR Job
AWS EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
The following table describes AWS EMR job attributes.
| Attribute | Description |
|---|---|
| Connection Profile | Determines the authorization credentials that are used to connect Control-M to AWS EMR. Rules: |
| Cluster ID | Defines the ID of the AWS EMR cluster to connect to the Notebook. In the EMR API, this field is called the Execution Engine ID. |
| Notebook ID | Determines the ID of the Notebook that executes the script. In the EMR API, this field is called the Editor ID. |
| Relative Path | Defines the full path and name of the script file in the Notebook. |
| Notebook Execution Name | Defines the job execution name. |
| Service Role | Defines the service role to connect to the Notebook. |
| Use Advanced JSON Format | Determines whether to provide Notebook execution information in JSON format. The JSON Body attribute replaces the values of the Cluster ID, Notebook ID, Relative Path, Notebook Execution Name, and Service Role attributes. |
| JSON Body | Defines the Notebook execution settings in JSON format. For a description of the syntax of this JSON, see the description of StartNotebookExecution in the Amazon EMR API Reference. See the example after this table. |
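A minimal sketch of a JSON Body is shown below, based on the StartNotebookExecution request syntax from the Amazon EMR API Reference; the IDs, path, and role are placeholders for values from your own environment:

```json
{
  "EditorId": "e-ABC123DEF456",
  "RelativePath": "notebooks/my_notebook.ipynb",
  "NotebookExecutionName": "my-notebook-execution",
  "ExecutionEngine": {
    "Id": "j-1234567890ABC",
    "Type": "EMR"
  },
  "ServiceRole": "EMR_Notebooks_DefaultRole"
}
```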
GCP Dataflow Job
The following table describes parameters for a Google Dataflow job, which performs cloud-based data processing for batch and real-time data streaming applications.
| Parameter | Description |
|---|---|
| Connection profile | Determines the authorization credentials that are used to connect Control-M to GCP Dataflow. |
| Project ID | Defines the project ID for your Google Cloud project. |
| Location | Defines the Google Compute Engine region to create the job. |
| Template Type | Defines one of the following types of Google Dataflow templates: |
| Template Location (gs://) | Defines the path for temporary files. This must be a valid Google Cloud Storage URL that begins with gs://. The pipeline option tempLocation is used as the default value, if it has been set. |
| Parameters (JSON Format) | Defines input parameters to be passed on to job execution, in JSON format (name:value pairs). This JSON must include the jobname and parameters elements, as in the example after this table. |
| Verification Poll Interval (in seconds) | (Optional) Defines the number of seconds to wait before checking the status of the job. Default: 10 |
| Log Level | Determines one of the following levels of details to retrieve from the GCP logs in the case of job failure: |
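A minimal sketch of the Parameters (JSON Format) value is shown below, assuming the structure of a Dataflow template launch request; the element casing expected by the plug-in (jobname versus jobName) may differ, and the job name, parameter names, and bucket paths are placeholders:

```json
{
  "jobName": "my-dataflow-job",
  "parameters": {
    "inputFile": "gs://my-bucket/input/events.csv",
    "output": "gs://my-bucket/output/results"
  }
}
```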
GCP Dataproc Job
The following table describes parameters for a Google Dataproc job, which performs cloud-based big data processing and machine learning.
| Parameter | Description |
|---|---|
| Connection profile | Determines the authorization credentials that are used to connect Control-M to GCP Dataproc. |
| Project ID | Defines the project ID for your Google Cloud project. |
| Account Region | Defines the Google Compute Engine region to create the job. |
| Dataproc task type | Defines one of the following Dataproc task types to execute: |
| Workflow Template | (For a Workflow Template task type) Defines the ID of a Workflow Template. |
| Parameters (JSON Format) | (For a Job task type) Defines input parameters to be passed on to job execution, in JSON format, as in the example after this table. You retrieve this JSON content from the GCP Dataproc UI, using the EQUIVALENT REST option in job settings. |
| Verification Poll Interval (in seconds) | (Optional) Defines the number of seconds to wait before checking the status of the job. Default: 20 |
| Tolerance | Defines the number of call retries during the status check phase. Default: 2 times |
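A minimal sketch of the Parameters (JSON Format) value for a Job task type is shown below, assuming a PySpark job in the layout produced by the Dataproc jobs.submit EQUIVALENT REST output; the project, cluster, and file URIs are placeholders, and the exact wrapper expected by the plug-in may differ from what your GCP Dataproc UI generates:

```json
{
  "job": {
    "reference": { "projectId": "my-project", "jobId": "my-dataproc-job" },
    "placement": { "clusterName": "my-cluster" },
    "pysparkJob": {
      "mainPythonFileUri": "gs://my-bucket/scripts/job.py",
      "args": ["--date", "2023-01-01"]
    }
  }
}
```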
Hadoop Job
The Hadoop job connects to the Hadoop framework, which enables the distributed processing of large data sets across clusters of commodity servers. From Control-M, you can expand your enterprise business workflows to include tasks that run in your Big Data Hadoop cluster, using the various Hadoop-supported tools, including Pig, Hive, HDFS File Watcher, Map Reduce, and Sqoop.
The following table describes the Hadoop job attributes.
| Attribute | Description |
|---|---|
| Connection Profile | Determines the authorization credentials that are used to connect Control-M to Hadoop. Rules: Variable Name: %%HDP-ACCOUNT |
| Execution Type | Determines the execution type for Hadoop job execution, as follows: Variable Name: %%HDP-EXEC_TYPE |
| Pre Commands | Defines the Pre commands that are performed before job execution (not for HDFS Commands jobs and Oozie Extractor jobs), and the argument for each command. |
| Fail the job if the command fails | Determines whether the entire job fails if any of the Pre commands fail (not for HDFS Commands jobs and Oozie Extractor jobs). |
| Post Commands | Defines the Post commands that are performed after job execution (not for HDFS Commands jobs and Oozie Extractor jobs), and the argument for each command. |
| Fail the job if the command fails | Determines whether the entire job fails if any of the Post commands fail (not for HDFS Commands jobs and Oozie Extractor jobs). |
DistCp Job Attributes
The following table describes the DistCp job attributes.
| Attribute | Description |
|---|---|
| Target Path | Defines the absolute destination path. Variable Name: %%HDP-DISTCP_TARGET_PATH |
| Source Path | Defines the source paths. Variable Name: %%HDP-DISTCP_SOURCE_PATH-Nxxx_ARG |
| Command Line Options | Defines the sets of attributes and values that are added to the command line. Variable Names: |
| Append Yarn aggregated logs to output | Determines whether to add Yarn aggregated logs to the job output. |
Distributed Shell Job Attributes
The following table describes the Distributed Shell job attributes.
| Attribute | Description |
|---|---|
| Shell Type | Determines what the Distributed Shell job runs, as follows: Variable Name: %%HDP-SHELL_TYPE |
| Command | Defines the shell command to run for the job execution. Variable Name: %%HDP-SHELL_COMMAND |
| Script Full Path | Defines the full path to the script file that is executed. The script file is located in the HDFS. Variable Name: %%HDP-SHELL_SCRIPT_FULL_PATH |
| Shell Script Arguments | Defines the shell script arguments. Variable Name: %%HDP-SHELL-Nxxx-ARG |
| More Options | Opens more attributes. |
| Files/Archives | Defines the full path to the file or archive to upload as a dependency to the HDFS working directory. Variable Names: |
| Options | Defines additional options (Name and Value) to set when executing the job. Variable Names: |
| Environment Variables | Defines the environment variables for the shell script or command. Variable Name: %%HDP-SHELL_ENV_VARIABLE-Nxxx-ARG |
| Append Yarn aggregated logs to output | Determines whether to add Yarn aggregated logs to the job output. |
HDFS Commands Job Attributes
The following table describes the HDFS Commands job attributes.
| Attribute | Description |
|---|---|
| Command | Defines the command that is performed with job execution. Variable Name: %%HDP-HDFS_CMD_ACTION-Nxxx-CMD |
| Arguments | Defines the arguments used by the command. Variable Name: %%HDP-HDFS_CMD_ACTION-Nxxx-ARG |
HDFS File Watcher Job Attributes
The following table describes the HDFS File Watcher job attributes.
| Attribute | Description |
|---|---|
| File name full path | Defines the full path of the file being watched. Variable Name: %%HDP-HDFS_FILE_PATH |
| Min detected size | Determines the minimum file size, in bytes, that meets the criteria and finishes the job as OK. If the file arrives but the minimum size is not met, the job continues to watch the file. Variable Name: %%HDP-MIN_DETECTED_SIZE |
| Max time to wait | Determines the maximum number of minutes to wait for the file to meet the watching criteria. If the criteria are not met (the file did not arrive, or the minimum size was not reached), the job fails after this maximum number of minutes. Variable Name: %%HDP-MAX_WAIT_TIME |
| File Name Variable | Defines the variable name that is used in succeeding jobs. Variable Name: %%HDP-FW_DETECTED_FILE_NAME_VAR |
Impala Job Attributes
The following table describes the Impala job attributes.
| Attribute | Description |
|---|---|
| Source | Determines the source type to run the queries, as follows: Variable Name: %%HDP-IMPALA_QUERY_SOURCE |
| Query File Full Path | Defines the location of the file used to run the queries. Variable Name: %%HDP-IMPALA_QUERY_FILE_PATH |
| Query | Defines the query command used to run the queries. Variable Name: %%HDP-IMPALA_OPEN_QUERY |
| Command Line Options | Defines the sets of attributes and values that are added to the command line. Variable Name: %%HDP-IMPALA_CMD_OPTION-Nxxx-ARG |
Hive Job Attributes
The following table describes the Hive job attributes.
| Attribute | Description |
|---|---|
| Full path to Hive script | Defines the full path to the Hive script on the Hadoop host. Variable Name: %%HDP-HIVE_SCRIPT_NAME |
| Script Parameters | Defines the list of parameters for the script. Variable Names: |
| Append Yarn aggregated logs to output | Determines whether to add Yarn aggregated logs to the job output. |
Java-Map-Reduce Job Attributes
The following table describes the Java Map-Reduce job attributes.
| Attribute | Description |
|---|---|
| Full path to Jar | Defines the full path to the jar that contains the Map Reduce Java program on the Hadoop host. Variable Name: %%HDP-JAVA_JAR_NAME |
| Main Class | Defines the class, included in the jar, that contains the main function and the map reduce implementation. Variable Name: %%HDP-JAVA_MAIN_CLASS |
| Arguments | Defines the arguments used by the command. Variable Name: %%HDP-JAVA_Nxxx_ARG |
| Append Yarn aggregated logs to output | Determines whether to add Yarn aggregated logs to the job output. |
Oozie Job Attributes
The following table describes the Oozie job attributes.
| Attribute | Description |
|---|---|
| Job Properties File | Defines the job properties file path. Variable Name: %%HDP-OOZIE_JOB_PROPERTIES_FILE |
| Job Properties (Add/Overwrite) | Defines the Oozie job properties. A set of properties consists of the following: You can add new properties or override property values defined in the Job Properties File. |
| Rerun from point of failure | Determines whether to rerun an Oozie job from the point of its failure. |
Pig Job Attributes
The following table describes the Pig job attributes.
| Attribute | Description |
|---|---|
| Full Path to Pig Program | Defines the full path to the Pig program on the Hadoop host. Variable Name: %%HDP-PIG_PROG_NAME |
| Pig Program Parameters | Defines the list of program parameters. |
| Append Yarn aggregated logs to output | Determines whether to add Yarn aggregated logs to the job output. |
| Properties | Defines a list of properties (Name and Value) that are applied to the job. These properties override the Hadoop defaults. |
| Archives | Defines the location of the Hadoop archives. |
| Files | Defines the location of the Hadoop files. |
Spark Job Attributes
The following table describes the Spark job attributes.
| Attribute | Description |
|---|---|
| Program Type | Determines the Spark program type, as follows: Variable Name: %%HDP-SPARK_PROG_TYPE |
| Full Path to Script | Defines the full path to the Python script to execute. Variable Name: %%HDP-SPARK_FULL_PATH_TO_PYTHON_SCRIPT |
| Application Jar File | Defines the path to the jar that includes your application and all its dependencies. Variable Name: %%HDP-SPARK_APP_JAR_FULL_PATH |
| Main Class to Run | Defines the main class of the application. Variable Name: %%HDP-SPARK_MAIN_CLASS_TO_RUN |
| Application Arguments | Defines the arguments that are added at the end of the Spark command line, either after the main class (for Java or Scala applications) or after the script (for Python scripts). Variable Name: %%HDP-SPARK_Nxxx_ARG |
| Command Line Options | Defines the sets of attributes and values that are added to the command line. Variable Names: |
| Append Yarn aggregated logs to output | Determines whether to add Yarn aggregated logs to the job output. |
Sqoop Job Attributes
The following table describes the Sqoop job attributes.
| Attribute | Description |
|---|---|
| Command Editor | Defines any valid Sqoop command necessary for job execution. Sqoop can be used for job execution only if it is defined in the Sqoop connection attributes. Variable Name: %%HDP-SQOOP_COMMAND |
| Append Yarn aggregated logs to output | Determines whether to add Yarn aggregated logs to the job output. |
| Properties | Defines a list of properties (Name and Value) that are applied to the job. These properties override the Hadoop defaults. |
| Archives | Defines the location of the Hadoop archives. |
| Files | Defines the location of the Hadoop files. |
Streaming Job Attributes
The following table describes the Streaming job attributes.
| Attribute | Description |
|---|---|
| Input Path | Defines the input file for the Mapper step. Variable Name: %%HDP-INPUT_PATH |
| Output Path | Defines the HDFS output path for the Reducer step. Variable Name: %%HDP-OUTPUT_PATH |
| Mapper Command | Defines the command that runs as a mapper. Variable Name: %%HDP-MAPPER_COMMAND |
| Reducer Command | Defines the command that runs as a reducer. Variable Name: %%HDP-REDUCER_COMMAND |
| Streaming Options | Defines the sets of attributes (Name and Value) that are added to the end of the Streaming command line. Variable Names: |
| Generic Options | Defines the sets of attributes (Name and Value) that are added to the Streaming command line. Variable Names: |
| Append Yarn aggregated logs to output | Determines whether to add Yarn aggregated logs to the job output. |
Tajo Job Attributes
The following table describes the Tajo job attributes.
| Attribute | Description |
|---|---|
| Command Source | Determines the source of the Tajo command, as follows: |
| Full File Path | Defines the file path of the input file that runs the Tajo command. |
| Open Query | Defines the query. Variable Name: %%HDP-TAJO_OPEN_QUERY |