Data Processing and Analytics Jobs
The following topics describe job types for data processing and analytics platforms and services:
Job:AWS Athena
AWS Athena enables you to process, analyze, and store your data in the cloud.
To deploy and run an AWS Athena job, ensure that you have installed the AWS Athena plug-in with the provision image command or the provision agent::update command.
The following example shows how to define an AWS Athena job. This JSON-based job executes a SQL-based query:
"AWS Athena_Job_2":
{
"Type": "Job:AWS Athena",
"ConnectionProfile": "AWSATHENA",
"Athena Client Request Token": "aws-athena-client-request-token-%%ORDERID-%%TIME",
"DB Catalog Name": "DB_Catalog_Athena",
"Database Name": "DB_Athena",
"Action": "Query",
"Query": "Select * from Athena_Table",
"Output Location": "s3://{BucketPath}",
"Workgroup": "Primary",
"Add Configurations": "checked",
"S3 ACL Option": "BUCKET_OWNER_FULL_CONTROL",
"Encryption Options": "SSE_KMS",
"KMS Key": "arn:aws:kms:us-west-2:123456789012:key/abcd1234-5678-9012-efgh-ijklmnopqrst",
"Bucket Owner": "Account_ID",
"Show JSON Output": "unchecked",
"Status Polling Frequency": "10",
"Tolerance": "2"
}
The following table describes the AWS Athena job parameters.
Parameter |
Description |
---|---|
Connection Profile |
Defines the ConnectionProfile:AWS Athena name that connects Control-M to AWS Athena. |
Athena Client Request Token |
Defines a unique ID (idempotency token), which guarantees that the job executes only once. Default: aws-athena-client-request-token-%%ORDERID-%%TIME |
DB Catalog Name |
Defines the name of the group of databases (catalog) that the query references. |
Database Name |
Defines the name of the database that the query references. |
Action |
Determines which of the following queries executes:
|
Query |
Defines the SQL-based query that executes. |
Prepared Query Name |
Defines the name of the predefined query that is stored in the AWS Athena platform. |
Table Name |
Defines the name of the table that is created, which is populated by the results of a query in AWS Athena. |
Unload File Type |
Determines the file format that the query results are saved in, as follows:
|
Output Location |
Defines the AWS S3 bucket path where the file is saved, as follows: s3://<path>. AWS Athena automatically generates a filename that incorporates the Query Execution ID, which is a unique ID applied to each query that is executed. |
Workgroup |
Defines the workgroup for this job. Workgroups can consist of users, teams, applications, or workloads, and can set limits on the data that each query or group processes. |
Add Configurations |
Determines whether to add additional job definitions. Valid Values: checked | unchecked. Default: unchecked |
S3 ACL Option |
Defines the Amazon S3 canned access control list (ACL), which is a predefined set of grantees and permissions assigned to your stored query results. BUCKET_OWNER_FULL_CONTROL is the only canned ACL that is currently supported in AWS Athena. This setting gives you and the bucket owner full control of the query results. |
Encryption Options |
Determines one of the following ways to encrypt the query results:
|
KMS Key |
(SSE_KMS and CSE_KMS only) Defines the Amazon Resource Name (ARN) of the KMS key. An ARN is a standardized AWS resource address. arn:aws:kms:us-west-2:123456789012:key/abcd1234-5678-9012-efgh-ijklmnopqrst |
Bucket Owner |
Defines the AWS account ID of the Amazon S3 bucket owner. |
Show JSON Output |
Determines whether to show the full JSON API response in the job output. Valid Values: checked | unchecked. Default: unchecked |
Status Polling Frequency |
Determines the number of seconds to wait before checking the status of the job. Default: 10 |
Tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
Job:AWS Data Pipeline
AWS Data Pipeline is a cloud-based extract, transform, load (ETL) service that enables you to automate the transfer, processing, and storage of your data.
To deploy and run an AWS Data Pipeline job, ensure that you have installed the AWS Data Pipeline plug-in with the provision image command or the provision agent::update command.
The following examples show how to define an AWS Data Pipeline job.
- This JSON-based job creates a pipeline:
"AWS Data Pipeline_Job":
{
"Type": "Job:AWS Data Pipeline",
"ConnectionProfile": "AWSDATAPIPELINE",
"Action": "Create Pipeline",
"Pipeline Name": "demo-pipeline",
"Pipeline Unique Id": "235136145",
"Parameters":
{
"parameterObjects": [
{
"attributes": [
{
"key": "description",
"stringValue": "S3outputfolder"
} ],
"id": "myS3OutputLoc"
} ],
"parameterValues": [
{
"id": "myShellCmd",
"stringValue": "grep -rc \"GET\" ${INPUT1_STAGING_DIR}/* > ${OUTPUT1_STAGING_DIR}/output.txt"
} ],
"pipelineObjects": [
{
"fields": [
{
"key":"input",
"refValue":"S3InputLocation"
},
{
"key":"stage",
"stringValue":"true"
} ],
"id": "ShellCommandActivityObj",
"name": "ShellCommandActivityObj"
} ]
},
"Trigger Created Pipeline": "checked",
"Status Polling Frequency": "20",
"Failure Tolerance": "3"
}
- This JSON-based job triggers an existing pipeline:
"AWS Data Pipeline_Job":
{
"Type": "Job:AWS Data Pipeline",
"ConnectionProfile": "AWSDATAPIPELINE",
"Action": "Trigger Pipeline",
"Pipeline ID": "df-020488024DNBVFN1S2U",
"Trigger Created Pipeline": "unchecked",
"Status Polling Frequency": "20",
"Failure Tolerance": "3"
}
The following table describes the AWS Data Pipeline job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:AWS Data Pipeline name that connects Control-M to AWS Data Pipeline. |
Action |
Determines one of the following AWS Data Pipeline actions: Create Pipeline | Trigger Pipeline |
Pipeline Name |
(Create Pipeline) Defines the name of the new AWS Data Pipeline. |
Pipeline Unique ID |
(Create Pipeline) Defines the unique ID (idempotency key) that guarantees the pipeline is created only once. After successful execution, this ID cannot be used again. Valid Values: Any alphanumeric characters. |
Parameters |
(Create Pipeline) Defines the parameter objects, which define the variables, for your AWS Data Pipeline in JSON format. For more information about the available parameter objects, see the descriptions of the PutPipelineDefinition and GetPipelineDefinition actions in the AWS Data Pipeline API Reference. |
Trigger Created Pipeline |
(Create Pipeline) Determines whether to run (trigger) the newly created AWS Data Pipeline. Valid Values: checked | unchecked. This parameter is relevant only for a creation action. For a trigger action, set it to unchecked. |
Pipeline ID |
(Trigger Pipeline) Determines which pipeline to run (trigger). |
Status Polling Frequency |
Determines the number of seconds to wait before checking the status of the Data Pipeline job. Default: 20 |
Failure Tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
Job:AWS DynamoDB
AWS DynamoDB is a NoSQL database service that enables you to create database tables, execute statements and transactions, export and import data to and from the Amazon S3 storage service.
To deploy and run an AWS DynamoDB job, ensure that you have installed the AWS DynamoDB plug-in with the provision image command or the provision agent::update command.
The following examples show how to define an AWS DynamoDB job.
- This JSON-based job executes a statement:
"AWS DynamoDB_Execute_Statement":
{
"Type": "Job:AWS DynamoDB",
"ConnectionProfile": "ADY",
"Action": "Execute Statement",
"Run Statement with Parameter": "checked",
"Statement": "Select * From IFteam where Id=? OR Name=?",
"Statement Parameters": "[{\"N\": \"20\"},{\"S\":\"Stas30\"}]"
}
- This JSON-based job executes a transaction:
"AWS DynamoDB_Transaction":
{
"Type": "Job:AWS DynamoDB",
"ConnectionProfile": "ADY",
"Action": "Execute Transaction",
"Transaction Statments": "[%4E{%4E \"Parameters\": [{\"N\": \"20\"},{\"S\":\"Stas30\"}],%4E\"Statement\": \"Select * From IFteam where Id=17\"%4E },%4E{%4E \"Parameters\": [{\"N\": \"20\"},{\"S\":\"Stas30\"}],%4E\"Statement\": \"Select * From IFteam where Id=18\"%4E]",
"Host": "dba-tlv-wcpg35",
"CreatedBy": "emuser",
"RunAs": "ADY",
"When":
{
"WeekDays": ["NONE"],
"MonthDays": ["ALL"],
"DaysRelation": "OR"
},
"eventsToWaitFor":
{
"Type": "WaitForEvents",
"Events": [
{
"Event": "AWS_DynamoDB_Execute_Statement-TO-AWS_DynamoDB_Transaction"
}]
}
}
- This JSON-based job exports a table to S3:
"AWS DynamoDB_Export":
{
"Type": "Job:AWS DynamoDB",
"ConnectionProfile": "ADY",
"Action": "Export Table To S3",
"Idempotency Token": "5364@#gert423",
"Export Format": "DynamoDB JSON",
"S3 Bucket Name": "stasbucket1",
"S3 Path Prefix": "TestDynmoExport",
"S3 Bucket Owner ID": "122343283363",
"Table ARN": "arn:aws:dynamodb:us-east-1:122343283363:table/IFteam",
"Host": "dba-tlv-wcpg35",
"CreatedBy": "emuser",
"RunAs": "ADY",
"When":
{
"WeekDays": ["NONE"],
"MonthDays": ["ALL"],
"DaysRelation": "OR"
},
"eventsToWaitFor":
{
"Type": "WaitForEvents",
"Events": [
{
"Event": "AWS_DynamoDB_Transaction-TO-AWS_DynamoDB_Export"
}]
}
}
- This JSON-based job imports a table from S3:
"AWS DynamoDB_Import":
{
"Type": "Job:AWS DynamoDB",
"ConnectionProfile": "ADY",
"Action": "Import Table from S3",
"Idempotency Token": "5364@#gert423",
"Import Format": "DynamoDB JSON",
"S3 Bucket Name": "stasbucket1",
"S3 Path Prefix": "AWSDynamoDB/01690368915115be3974ee/data/vejljoqgiqyexew2cxgetylg6u.json.gz",
"S3 Bucket Owner ID": "122343283363",
"Table Creation Parameters": "\"AttributeDefinitions\": [%4E {%4E\"AttributeName\": \"Id\",%4E\"AttributeType\": \"N\"%4E}%4E ],%4E\"KeySchema\": [%4E{%4E\"AttributeName\": \"Id\",%4E\"KeyType\": \"HASH\"%4E}%4E],%4E \"BillingMode\": \"PROVISIONED\",%4E\"ProvisionedThroughput\": {%4E\"ReadCapacityUnits\": 1,%4E \"WriteCapacityUnits\": 1%4E}",
"Table Name": "NewTAB",
"Host": "dba-tlv-wcpg35",
"CreatedBy": "emuser",
"RunAs": "ADY",
"When":
{
"WeekDays": ["NONE"],
"MonthDays": ["ALL"],
"DaysRelation": "OR"
},
"eventsToWaitFor":
{
"Type": "WaitForEvents",
"Events": [
{
"Event": "AWS_DynamoDB_Export-TO-AWS_DynamoDB_Import"
}]
}
}
The following table describes the AWS DynamoDB job type attributes.
Attribute |
Action |
Description |
---|---|---|
ConnectionProfile |
All Actions |
Defines the ConnectionProfile:AWS DynamoDB name that connects Control-M to AWS DynamoDB. |
Action |
All Actions |
Determines one of the following AWS DynamoDB actions to perform:
|
Run Statement with Parameter |
Execute Statement |
Determines whether to execute the statement with your own JSON parameters. Valid Values: checked | unchecked. Default: unchecked |
Statement |
Execute Statement |
Defines one or more PartiQL statements that are supported by AWS DynamoDB. |
Statement Parameters |
Execute Statement |
Defines the parameters for the AWS DynamoDB job, in JSON format, that enable you to control how the job executes. For example: [{"N": "20"},{"S":"Stas30"}] |
Transaction Statements |
Execute Transaction |
Defines one or more PartiQL transaction statements, each with its own Parameters and Statement values. |
Idempotency Token |
|
Defines the unique ID (idempotency token) that guarantees the job is executed only once. After successful execution, this ID cannot be used again. |
Export Format |
Export Job to S3 Bucket |
Determines one of the following formats to export data:
|
Import Format |
Import Job from S3 Bucket |
Determines one of the following formats of the source data:
|
S3 Bucket Name |
|
Defines the name of the Amazon S3 bucket that the table is exported to or imported from. |
S3 Path Prefix |
|
Defines the Amazon S3 bucket prefix to use as the filename and path of the table. AWSDynamoDB/01654668915125-be3574ee/data/vejljoqgiqyexew2cxgetylg6u.json.gz |
S3 Bucket Owner ID |
|
Defines the ID of the AWS account that owns the bucket. |
Table ARN |
|
Defines the Amazon Resource Name (ARN) associated with the table to export. |
Import Compression Type |
Import Job from S3 Bucket |
Determines one of the following compression types to compress the data from the imported table:
|
Table Creation Parameters |
Import Job from S3 Bucket |
Defines the creation parameters of the new table where the data is imported, such as attribute definitions, key schema, billing mode, and provisioned throughput, in JSON format. |
Table Name |
Import Job from S3 Bucket |
Defines the name of the new table where the data is imported. |
Status Polling Frequency |
All Actions |
Determines the number of seconds to wait before checking the status of the job. Default: 20 |
Failure Tolerance |
|
Determines the number of times to check the job status before ending Not OK. Default: 0 |
Job:AWS EMR
Amazon Web Services (AWS) EMR is a managed cluster platform that enables you to execute big data frameworks, such as Apache Hadoop and Apache Spark, to process and analyze vast amounts of data.
To deploy and run an AWS EMR job, ensure that you have installed the AWS EMR plug-in with the provision image command or the provision agent::update command.
The following example shows how to define an AWS EMR job:
"AWS EMR_Job_2":
{
"Type": "Job:AWS EMR",
"ConnectionProfile": "AWS_EMR",
"Cluster ID": "j-21PO60WBW77GX",
"Notebook ID": "e-DJJ0HFJKU71I9DWX8GJAOH734",
"Relative Path": "ShowWaitingAndRunningClusters.ipynb",
"Notebook Execution Name": "TestExec",
"Service Role": "EMR_Notebooks_DefaultRole",
"Use Advanced JSON Format": "unchecked",
}
The following table describes the AWS EMR job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:AWS EMR name that connects Control-M to AWS EMR. |
Cluster ID |
Defines the name of the AWS EMR cluster that connects to the Notebook. In the EMR API, the cluster ID is also known as the Execution Engine ID. |
Notebook ID |
Determines which Notebook ID executes the script. In the EMR API, the Notebook ID is also known as the Editor ID. |
Relative Path |
Defines the full directory path and filename of the script in the Notebook. |
Notebook Execution Name |
Defines the job execution name. |
Service Role |
Defines the service role that connects to the Notebook. |
Use Advanced JSON Format |
Determines whether to provide Notebook execution information through JSON code. Valid Values: checked | unchecked. Default: unchecked. If you set this parameter to checked, the JSON Body parameter replaces several other parameters discussed above (Cluster ID, Notebook ID, Relative Path, Notebook Execution Name, and Service Role). |
JSON Body |
Defines Notebook execution settings in JSON format. For a description of the syntax of this JSON, see the description of StartNotebookExecution in the Amazon EMR API Reference. JSON Body is relevant only if you set Use Advanced JSON Format to checked. |
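The following example is a sketch of an AWS EMR job that uses the advanced JSON format. The field names inside JSON Body (NotebookExecutionName, EditorId, RelativePath, ExecutionEngine, ServiceRole) are assumptions based on the StartNotebookExecution request in the Amazon EMR API Reference, and the IDs are placeholders reused from the previous example:
"AWS EMR_Job_Advanced":
{
"Type": "Job:AWS EMR",
"ConnectionProfile": "AWS_EMR",
"Use Advanced JSON Format": "checked",
"JSON Body": "{\"NotebookExecutionName\":\"TestExec\",\"EditorId\":\"e-DJJ0HFJKU71I9DWX8GJAOH734\",\"RelativePath\":\"ShowWaitingAndRunningClusters.ipynb\",\"ExecutionEngine\":{\"Id\":\"j-21PO60WBW77GX\"},\"ServiceRole\":\"EMR_Notebooks_DefaultRole\"}"
}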
Job:Azure Databricks
Azure Databricks is a cloud-based data analytics platform that enables you to process and analyze large workloads of data.
To deploy and run an Azure Databricks job, ensure that you have installed the Azure Databricks plug-in with the provision image command or the provision agent::update command.
The following example shows how to define an Azure Databricks job:
"Azure Databricks notebook":
{
"Type": "Job:Azure Databricks",
"ConnectionProfile": "AZURE_DATABRICKS",
"Databricks Job ID: "65",
"Parameters": "\"notebook_params\":{\"param1\":\"val1\", \"param2\":\"val2\"}",
"Idempotency Token": "Control-M-Idem_%%ORDERID",
"Status Polling Frequency": "30"
}
The following table describes the Azure Databricks job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Azure Databricks name that connects Control-M to Azure Databricks. |
Databricks Job ID |
Defines the job ID created in your Databricks workspace. |
Parameters |
Defines task parameters to override when the job runs, according to the Databricks convention. The list of parameters must begin with the name of the parameter type, for example: "notebook_params":{"param1":"val1", "param2":"val2"} or "jar_params": ["param1", "param2"]. For more information about the parameter types, review the properties of RunParameters in the OpenAPI specification provided through the Azure Databricks documentation. For no parameters, specify a value of "params": {}, as follows: "Parameters": "params": {} |
Idempotency Token |
(Optional) Defines a token to use to rerun job runs that timed out in Databricks. Valid Values:
|
Status Polling Frequency |
(Optional) Defines the number of seconds to wait before checking the status of the job. Default: 30 |
Job:Azure HDInsight
Azure HDInsight enables you to execute an Apache Spark batch job and perform big data analytics.
To deploy and run an Azure HDInsight job, ensure that you have installed the Azure HDInsight plug-in with the provision image command or the provision agent::update command.
The following example shows how to define an Azure HDInsight job:
"Azure HDInsight_Job":
{
"Type": "Job:Azure HDInsight",
"ConnectionProfile": "AZUREHDINSIGHT",
"Parameters": "
{
"file" : "wasb://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/sample.jar",
"args" : ["arg0", "arg1"],
"className" : "com.sample.Job1",
"driverMemory" : "1G",
"driverCores" : 2,
"executorMemory" : "1G",
"executorCores" : 10,
"numExecutors" : 10
},
"Status Polling Interval": "10",
"Bring job logs to output": "checked"
}
The following table describes the Azure HDInsight job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Azure HDInsight name that connects Control-M to Azure HDInsight. |
Parameters |
Defines parameters to be passed on the Apache Spark application during job execution, in JSON format (name:value pairs). This JSON must include the file and className elements. For more information about common parameters, see Batch Job in the Azure HDInsight documentation. |
Status Polling Interval |
Defines the number of seconds to wait before verification of the Apache Spark batch job. Default: 10 |
Bring job logs to output |
Determines whether logs from Apache Spark are shown in the job output. Valid Values: checked | unchecked. Default: unchecked |
Job:Azure Synapse
Azure Synapse Analytics enables you to perform data integration and big data analytics.
To deploy and run an Azure Synapse job, ensure that you have installed the Azure Synapse plug-in with the provision image command or the provision agent::update command.
The following example shows how to define an Azure Synapse job:
"Azure Synapse_Job":
{
"Type": "Job:Azure Synapse",
"ConnectionProfile": "AZURE_SYNAPSE",
"Pipeline Name": "ncu_synapse_pipeline",
"Parameters": "{\"periodinseconds\":\"40\", \"param2\":\"val2\"}",
"Status Polling Interval": "20"
}
The following table describes the Azure Synapse job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Azure Synapse name that connects Control-M to Azure Synapse. |
Pipeline Name |
Defines the name of a pipeline that you defined in your Azure Synapse workspace. |
Parameters |
Defines pipeline parameters to override when the job runs, defined in JSON format as pairs of name and value, as follows: {\"param1\":\"val1\", \"param2\":\"val2\"}. For no parameters, specify {}. |
Status Polling Interval |
(Optional) Determines the number of seconds to wait before checking the status of the job. Default: 20 |
Job:Databricks
Databricks enables you to integrate jobs created in the Databricks environment with your existing Control-M workflows.
To deploy and run a Databricks job, ensure that you have installed the Databricks plug-in with the provision image command or the provision agent::update command.
The following example shows how to define a Databricks job:
"Databricks_Job":
{
"Type": "Job:Databricks",
"ConnectionProfile": "DATABRICKS",
"Databricks Job ID": "91",
"Parameters": "\"notebook_params\":{\"param1\":\"val1\", \"param2\":\"val2\"}",
"Idempotency Token": "Control-M-Idem_%%ORDERID",
"Status Polling Frequency": "30"
}
The following table describes the Databricks job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Databricks name that connects Control-M to Databricks. |
Databricks Job ID |
Determines the job ID created in your Databricks workspace. |
Parameters |
Defines task parameters to override when the job runs, according to the Databricks convention. The list of parameters must begin with the name of the parameter type. "notebook_params":{"param1":"val1", "param2":"val2"} "jar_params": ["param1", "param2"] For more information about the parameter types, review RunParameters properties in the OpenAPI specification provided through the Azure Databricks documentation. For no parameters, specify a value of "params": {}. "Parameters": "params": {} |
Idempotency Token |
(Optional) Defines a token to use to rerun job runs that timed out in Databricks. Valid Values:
Default: Control-M-Idem_%%ORDERID |
Status Polling Frequency |
(Optional) Determines the number of seconds to wait before checking the status of the job. Default: 30 |
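Based on the jar_params format described in the table above, the following sketch shows a Databricks job that overrides JAR task parameters; the job ID and parameter values are placeholders:
"Databricks_Jar_Job":
{
"Type": "Job:Databricks",
"ConnectionProfile": "DATABRICKS",
"Databricks Job ID": "92",
"Parameters": "\"jar_params\": [\"param1\", \"param2\"]",
"Idempotency Token": "Control-M-Idem_%%ORDERID",
"Status Polling Frequency": "30"
}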
Job:DBT
Data Build Tool (dbt) is a cloud-based computing platform that enables you to develop, test, schedule, document, and analyze data models.
To deploy and run a dbt job, ensure that you have installed the dbt plug-in with the provision image command or the provision agent::update command.
The following example shows how to define a dbt job:
"DBT_Job_2":
{
"Type": "Job:DBT",
"ConnectionProfile": "DBT_CP",
"DBT Job Id": "12345",
"Run Comment": "A DBT job",
"Override Job Commands": "checked",
"Variables": [
{
"UCM-DefineCommands-N001-element": "dbt test"
},
{
"UCM-DefineCommands-N002-element": "dbt run"
} ],
"Status Polling Frequency": "10",
"Failure Tolerance": "2"
}
The following table describes the dbt job parameters.
Parameter |
Description |
---|---|
Connection Profile |
Defines the ConnectionProfile:DBT name that connects Control-M to dbt. |
DBT Job ID |
Defines the ID of the preexisting job in the dbt platform that you want to run. |
Run Comment |
Defines a free-text description of the job. |
Override Job Commands |
Determines whether to override the predefined dbt job commands. Valid Values: checked | unchecked. Default: unchecked |
Variables |
Defines the new dbt job commands as variable pairs, as follows: "UCM-DefineCommands-Nnnn-element": "command string" where nnn is a counter for the sequential position of each command. |
Status Polling Frequency |
Determines the number of seconds to wait before checking the status of the job. Default: 10 |
Failure Tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
Job:GCP BigQuery
Google Cloud Platform (GCP) BigQuery is a cloud-computing platform that enables you to process, analyze, and store your data.
To deploy and run a GCP BigQuery job, ensure that you have installed the GCP BigQuery plug-in with the provision image command or the provision agent::update command.
The following example shows how to define a GCP BigQuery job for a Query action in GCP BigQuery:
"GCP BigQuery_query":
{
"Type": "Job:GCP BigQuery",
"ConnectionProfile": "BIGQSA",
"Action": "Query",
"Project Name": "proj",
"Dataset Name": "Test",
"Run Select Query and Copy to Table": "checked",
"Table Name": "IFTEAM",
"SQL Statement": "select user from IFTEAM2",
"Query Parameters":
{
"name": "IFteam",
"paramterType":
{
"type": "STRING"
},
"parameterValue":
{
"value": "BMC"
}
},
"Job Timeout": "30000",
"Connection Timeout": "10",
"Status Polling Frequency": "5"
}
The following table describes the GCP BigQuery job parameters.
Parameter |
Action |
Description |
---|---|---|
ConnectionProfile |
All Actions |
Defines the ConnectionProfile:GCP BigQuery name that connects Control-M to GCP BigQuery. |
Action |
N/A |
Determines one of the following GCP BigQuery actions to perform: Query | Copy | Load | Extract | Routine |
Project Name |
All Actions |
Determines the project that the job uses. |
Dataset Name |
|
Determines the database that the job uses. |
Run Select Query and Copy to Table |
Query |
(Optional) Determines whether to paste the results of a SELECT statement into a new table. |
Table Name |
|
Defines the new table name. |
SQL Statement |
Query |
Defines one or more SQL statements supported by GCP BigQuery. Rule: It must be written in a single line, with character strings separated by one space only. |
Query Parameters |
Query |
Defines the query parameters, which enable you to control the presentation of the data. |
Copy Operation Type |
Copy |
Determines one of the following copy operations:
|
Source Table Properties |
Copy |
Defines the properties of the table that is cloned, backed up, or copied, in JSON format. You can copy or back up one or more tables at a time. |
Destination Table Properties |
|
Defines the properties of a new table, in JSON format. |
Destination/Source Bucket URIs |
|
Defines the source or destination data URI for the table that you are loading or extracting. You can load or extract multiple tables. Rule: Separate multiple URIs with commas. For example: "gs://source1_site1/source1.json" |
Show Load Options |
Load |
Determines whether to add more fields to a table that you are loading. |
Load Options |
Load |
Defines additional fields for the table that you are loading. |
Extract As |
Extract |
Determines one of the following file formats to export the data to:
|
Routine |
Routine |
Defines a routine and the values that it must run. For example: Call new_r('value1') |
Job Timeout |
All Actions |
Determines the maximum number of milliseconds to run the GCP BigQuery job. |
Connection Timeout |
All Actions |
Determines the number of seconds to wait before the job ends Not OK. Default: 10 |
Status Polling Frequency |
All Actions |
Determines the number of seconds to wait before checking the status of the job. Default: 5 |
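The following sketch shows a possible GCP BigQuery job for a Routine action, using only the parameters documented above; the Action value, project, dataset, and routine call are illustrative placeholders:
"GCP BigQuery_routine":
{
"Type": "Job:GCP BigQuery",
"ConnectionProfile": "BIGQSA",
"Action": "Routine",
"Project Name": "proj",
"Dataset Name": "Test",
"Routine": "Call new_r('value1')",
"Job Timeout": "30000",
"Connection Timeout": "10",
"Status Polling Frequency": "5"
}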
Job:GCP DataFlow
Google Cloud Platform (GCP) Dataflow enables you to perform cloud-based data processing for batch and real-time data streaming applications.
To deploy and run a GCP Dataflow job, ensure that you have installed the GCP Dataflow plug-in with the provision image command or the provision agent::update command.
The following example shows how to define a GCP Dataflow job:
"Google DataFlow_Job_1":
{
"Type": "Job:GCP DataFlow",
"ConnectionProfile": "GCPDATAFLOW",
"Project ID": "applied-lattice-11111",
"Location": "us-central1",
"Template Type": "Classic Template",
"Template Location (gs://)": "gs://dataflow-templates-us-central1/latest/Word_Count",
"Parameters (JSON Format)":
{
"jobName": "wordcount",
"parameters":
{
"inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
"output": "gs://controlmbucket/counts"
}
},
"Verification Poll Interval (in seconds)": "10",
"output Level": "INFO",
"Host": "host1"
}
The following table describes the GCP Dataflow job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:GCP DataFlow name that connects Control-M to GCP DataFlow. |
Project ID |
Defines the project ID for your Google Cloud project. |
Location |
Defines the Google Compute Engine region to create the job. |
Template Type |
Defines one of the following types of GCP Dataflow templates:
|
Template Location (gs://) |
Defines the path for temporary files. This must be a valid Google Cloud Storage URL that begins with gs://. The default pipeline option tempLocation is used if it has been set in the GCP Dataflow service. |
Parameters (JSON Format) |
Defines input parameters to be passed on to job execution, in JSON format (name:value pairs). This JSON must include the jobName and parameters elements. |
Verification Poll Interval (in seconds) |
(Optional) Determines the number of seconds to wait before checking the status of the job. Default: 10 |
Output Level |
Determines one of the following levels of details to retrieve from the GCP outputs in the case of job failure:
|
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
Job:GCP Dataproc
Google Cloud Platform (GCP) Dataproc enables you to perform cloud-based big data processing and machine learning.
To deploy and run a GCP Dataproc job, ensure that you have installed the GCP Dataproc plug-in with the provision image command or the provision agent::update command.
The following examples show how to define a GCP Dataproc job.
- This JSON defines a job for a GCP Dataproc task of type Workflow Template:
"Google Dataproc_Job":
{
"Type": "Job:GCP Dataproc",
"ConnectionProfile": "GCPDATAPROC",
"Project ID": "gcp_projectID",
"Account Region": "us-central1",
"Dataproc task type": "Workflow Template",
"Workflow Template": "Template2",
"Verification Poll Interval (in seconds)": "20",
"Tolerance": "2"
}
- This JSON defines a job for a Dataproc task of type Job:
"Google Dataproc_Job":
{
"Type": "Job:GCP Dataproc",
"ConnectionProfile": "GCPDATAPROC",
"Project ID": "gcp_projectID",
"Account Region": "us-central1",
"Dataproc task type": "Job",
"Parameters (JSON Format)":
{
"job":
{
"placement": {},
"statusHistory": [],
"reference":
{
"jobId": "job-e241f6be",
"projectId": "gcp_projectID"
},
"labels":
{
"goog-dataproc-workflow-instance-id": "44f2b59b-a303-4e57-82e5-e1838019a812",
"goog-dataproc-workflow-template-id": "template-d0a7c"
},
"sparkJob":
{
"mainClass": "org.apache.spark.examples.SparkPi",
"properties": {},
"jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
"args": ["1000"]
}
}
},
"Verification Poll Interval (in seconds)": "20",
"Tolerance": "2"
}
The following table describes the GCP Dataproc job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:GCP Dataproc name that connects Control-M to GCP Dataproc. |
Project ID |
Defines the project ID for your Google Cloud project. |
Account Region |
Defines the Google Compute Engine region to create the job. |
Dataproc task type |
Defines one of the following Dataproc task types to execute: Workflow Template | Job |
Workflow Template |
(Workflow Template) Defines the ID of a Workflow Template. |
Parameters |
(Job) Defines input parameters to be passed on to job execution, in JSON format. You retrieve this JSON content from the GCP Dataproc UI, using the EQUIVALENT REST option in job settings. |
Verification Poll Interval |
(Optional) Determines the number of seconds to wait before checking the status of the job. Default: 20 |
Tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
Job:Hadoop
The Hadoop job connects to the Hadoop framework, and it enables the distributed processing of large data sets across clusters of commodity servers. You can expand your enterprise business workflows to include tasks that execute in your Big Data Hadoop cluster from Control-M with the different Hadoop-supported tools, including Pig, Hive, HDFS File Watcher, Map Reduce Jobs, and Sqoop.
To deploy and run Hadoop jobs, ensure that you have done the following:
- Installed the Application Pack, which includes the Control-M for Hadoop plug-in.
- Created the appropriate type of Hadoop connection profile, as described in ConnectionProfile:Hadoop.
Various types of Hadoop jobs are available for you to define using the Job:Hadoop objects, as described in the following sections:
- Job:Hadoop:Spark:Python
- Job:Hadoop:Spark:ScalaJava
- Job:Hadoop:Pig
- Job:Hadoop:Sqoop
- Job:Hadoop:Hive
- Job:Hadoop:DistCp (distributed copy)
- Job:Hadoop:HDFSCommands
- Job:Hadoop:HDFSFileWatcher
- Job:Hadoop:Oozie
- Job:Hadoop:MapReduce
- Job:Hadoop:MapredStreaming
- Job:Hadoop:Tajo
Job:Hadoop:Spark:Python
The following example shows how to use Job:Hadoop:Spark:Python to run a Spark Python program:
"ProcessData":
{
"Type": "Job:Hadoop:Spark:Python",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"SparkScript": "/home/user/processData.py"
}
The following table describes the Hadoop Spark Python job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Spark. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
This JSON defines the Hadoop Spark Python job optional parameters:
"ProcessData1":
{
"Type": "Job:Hadoop:Spark:Python",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"SparkScript": "/home/user/processData.py",
"Arguments": ["1000", "120" ],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [{"put" : "localfile hdfs://nn.example.com/user/hadoop/file"}]
},
"SparkOptions": [
{
"--master": "yarn"
},
{
"--num":"-executors 50"
} ]
}
The following table describes the Hadoop Spark Python job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
Job:Hadoop:Spark:ScalaJava
The following example shows how to use a Hadoop Spark ScalaJava job to run a Spark Java or Scala program:
"ProcessData":
{
"Type": "Job:Hadoop:Spark:ScalaJava",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"ProgramJar": "/home/user/ScalaProgram.jar",
"MainClass" : "com.mycomp.sparkScalaProgramName.mainClassName"
}
The following table describes the Hadoop Scala Java job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Spark. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
This JSON defines the Hadoop Scala Java job optional parameters:
"ProcessData1":
{
"Type": "Job:Hadoop:Spark:ScalaJava",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"ProgramJar": "/home/user/ScalaProgram.jar"
"MainClass" : "com.mycomp.sparkScalaProgramName.mainClassName",
"Arguments": ["1000", "120" ],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [{"put" : "localfile hdfs://nn.example.com/user/hadoop/file"}]
},
"SparkOptions": [
{
"--master": "yarn"
},
{
"--num":"-executors 50"
} ]
}
The following table describes the Hadoop Scala Java job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
Job:Hadoop:Pig
The following example shows how to use Hadoop Pig to run a Pig script:
"ProcessDataPig":
{
"Type" : "Job:Hadoop:Pig",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"PigScript" : "/home/user/script.pig"
}
The following table describes the Hadoop Pig job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Pig. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
This JSON defines the Hadoop Pig job optional parameters:
"ProcessDataPig1":
{
"Type" : "Job:Hadoop:Pig",
"ConnectionProfile": "DEV_CLUSTER",
"PigScript" : "/home/user/script.pig",
"Host" : "edgenode",
"Parameters" : [
{
"amount":"1000"
},
{
"volume":"120"
} ],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [
{
"put" : "localfile hdfs://nn.example.com/user/hadoop/file"
} ]
}
}
The following table describes the Hadoop Pig job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
Job:Hadoop:Sqoop
The following example shows how to define a Hadoop Sqoop job:
"LoadDataSqoop":
{
"Type" : "Job:Hadoop:Sqoop",
"Host" : "edgenode",
"ConnectionProfile" : "SQOOP_CONNECTION_PROFILE",
"SqoopCommand" : "import --table foo --target-dir /dest_dir"
}
The following table describes the Hadoop Sqoop job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Sqoop. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
This JSON defines the Hadoop Sqoop job optional parameters:
"LoadDataSqoop1" :
{
"Type" : "Job:Hadoop:Sqoop",
"Host" : "edgenode",
"ConnectionProfile" : "SQOOP_CONNECTION_PROFILE",
"SqoopCommand" : "import --table foo",
"SqoopOptions" : [
{
"--warehouse-dir":"/shared"
},
{
"--default-character-set":"latin1"
} ],
"SqoopArchives" : "",
"SqoopFiles": "",
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" :[
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [
{
"put" : "localfile hdfs://nn.example.com/user/hadoop/file"
} ]
}
}
The following table describes the Hadoop Sqoop job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
SqoopOptions |
Defines the parameters to pass as arguments to the specific Sqoop tool. |
SqoopArchives |
Determines the location of the Hadoop archives. |
SqoopFiles |
Determines the location of the Sqoop files. |
Job:Hadoop:Hive
The following example shows how to use Hadoop Hive to run a Hive beeline job:
"ProcessHive":
{
"Type" : "Job:Hadoop:Hive",
"Host" : "edgenode",
"ConnectionProfile" : "HIVE_CONNECTION_PROFILE",
"HiveScript" : "/home/user1/hive.script"
}
The following table describes the Hadoop Hive job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Hive. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
This JSON defines the Hadoop Hive job optional parameters:
"ProcessHive1" :
{
"Type" : "Job:Hadoop:Hive",
"Host" : "edgenode",
"ConnectionProfile" : "HIVE_CONNECTION_PROFILE",
"HiveScript" : "/home/user1/hive.script",
"Parameters" : [
{
"ammount": "1000"
},
{
"topic": "food"
} ],
"HiveArchives" : "",
"HiveFiles": "",
"HiveOptions" : [
{
"hive.root.logger": "INFO,console"
} ],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [
{
"put" : "localfile hdfs://nn.example.com/user/hadoop/file"
} ]
}
}
The following table describes the Hadoop Hive job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
HiveSciptParameters |
Defines the additional Hadoop command options to pass to beeline as hivevar "name"="value". |
HiveProperties |
Defines the additional Hadoop command options to pass to beeline as hiveconf "key"="value". |
HiveArchives |
Defines the additional Hadoop command options to pass to beeline as hiveconf mapred.cache.archives="value". |
HiveFiles |
Defines the additional Hadoop command options to pass to beeline as hiveconf mapred.cache.files="value". |
Job:Hadoop:DistCp
The Hadoop Distributed Copy (DistCp) job is used for large inter/intra-cluster copying.
The following example shows how to define a Hadoop DistCp job:
"DistCpJob" :
{
"Type" : "Job:Hadoop:DistCp",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"TargetPath" : "hdfs://nns2:8020/foo/bar",
"SourcePaths" : ["hdfs://nn1:8020/foo/a"]
}
The following table describes the Hadoop DistCp job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Distributed Copy. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine . |
This JSON defines the Hadoop DistCp job optional parameters:
"DistCpJob" :
{
"Type" : "Job:Hadoop:DistCp",
"Host" : "edgenode",
"ConnectionProfile" : "HADOOP_CONNECTION_PROFILE",
"TargetPath" : "hdfs://nns2:8020/foo/bar",
"SourcePaths" : ["hdfs://nn1:8020/foo/a", "hdfs://nn1:8020/foo/b" ],
"DistcpOptions" : [
{
"-m":"3"
},
{
"-filelimit ":"100"
} ]
}
The following table describes the Hadoop DistCp job optional parameters.
Parameter |
Description |
---|---|
TargetPath, SourcePaths, and DistcpOptions |
Defines the additional Hadoop command options to pass to the distcp tool, as follows: distcp <Options> <TargetPath> <SourcePaths>. |
Job:Hadoop:HDFSCommands
The following example shows how to define the Hadoop HDFS job that executes one or more HDFS commands:
"HdfsJob":
{
"Type" : "Job:Hadoop:HDFSCommands",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"Commands": [
{
"get": "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm": "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
}
The following table describes the Hadoop HDFS Commands job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache HDFS Commands. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
Job:Hadoop:HDFSFileWatcher
Hadoop HDFS File Watcher runs a job that waits for HDFS file arrival.
The following example shows how to define a Hadoop HDFS File Watcher to run a job that waits for HDFS file arrival:
"HdfsFileWatcherJob" :
{
"Type" : "Job:Hadoop:HDFSFileWatcher",
"Host" : "edgenode",
"ConnectionProfile" : "DEV_CLUSTER",
"HdfsFilePath" : "/inputs/filename",
"MinDetecedSize" : "1",
"MaxWaitTime" : "2"
}
The following table describes the Hadoop HDFS File Watcher job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache HDFS FileWatcher. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
HdfsFilePath |
Defines the full path of the file being watched. |
MinDetecedSize |
Defines the minimum file size in bytes to meet the criteria and finish the job as OK. If the file arrives, but the size is not met, the job continues to watch the file. |
MaxWaitTime |
Defines the maximum number of minutes to wait for the file to meet the watching criteria. If criteria are not met (file did not arrive, or minimum size was not reached) the job fails after this maximum number of minutes. |
Job:Hadoop:Oozie
The following example shows how to define a Hadoop Oozie job that submits an Oozie workflow:
"OozieJob":
{
"Type" : "Job:Hadoop:Oozie",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"JobPropertiesFile" : "/home/user/job.properties",
"OozieOptions" : [
{
"inputDir":"/usr/tucu/inputdir"
},
{
"outputDir":"/usr/tucu/outputdir"
} ]
}
The following table describes the Hadoop Oozie job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Oozie. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
JobPropertiesFile |
Defines the path to the job properties file. |
The following table describes the Hadoop Oozie job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
OozieOptions |
Defines values to set or override for the given job properties. |
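The following sketch shows how these optional parameters might be combined in an Oozie job definition, following the same PreCommands and PostCommands pattern used by the other Hadoop job types in this section:
"OozieJob1":
{
"Type" : "Job:Hadoop:Oozie",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"JobPropertiesFile" : "/home/user/job.properties",
"OozieOptions" : [
{
"inputDir":"/usr/tucu/inputdir"
} ],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [{"put" : "localfile hdfs://nn.example.com/user/hadoop/file"}]
}
}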
Job:Hadoop:MapReduce
The following example shows how to define a Hadoop MapReduce job:
"MapReduceJob" :
{
"Type" : "Job:Hadoop:MapReduce",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"ProgramJar" : "/home/user1/hadoop-jobs/hadoop-mapreduce-examples.jar",
"MainClass" : "pi",
"Arguments" :["1","2"]
}
The following table describes the Hadoop MapReduce job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache MapReduce. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
This JSON defines the Hadoop MapReduce job optional parameters:
"MapReduceJob1" :
{
"Type" : "Job:Hadoop:MapReduce",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"ProgramJar" : "/home/user1/hadoop-jobs/hadoop-mapreduce-examples.jar",
"MainClass" : "pi",
"Arguments" :["1","2"],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [
{
"put" : "localfile hdfs://nn.example.com/user/hadoop/file"
} ]
}
}
The following table describes the Hadoop MapReduce job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
Job:Hadoop:MapredStreaming
The following example shows how to define a Hadoop Mapred Streaming job:
"MapredStreamingJob1":
{
"Type": "Job:Hadoop:MapredStreaming",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"InputPath": "/user/robot/input/*",
"OutputPath": "/tmp/output",
"MapperCommand": "mapper.py",
"ReducerCommand": "reducer.py",
"GeneralOptions": [
{
"-D": "fs.permissions.umask-mode=000"
},
{
"-files": "/home/user/hadoop-streaming/mapper.py,/home/user/hadoop-streaming/reducer.py"
} ]
}
The following table describes the Hadoop Mapred Streaming job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache MapReduce Streaming. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
The following table describes the Hadoop Mapred Streaming job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
GeneralOptions |
Defines the additional Hadoop command options to pass to the hadoop-streaming.jar, including generic options and streaming options. |
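The following sketch shows how PreCommands and PostCommands might be added to a Mapred Streaming job, mirroring the optional-parameter pattern used by the other Hadoop job types in this section; the paths are placeholders:
"MapredStreamingJob2":
{
"Type": "Job:Hadoop:MapredStreaming",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"InputPath": "/user/robot/input/*",
"OutputPath": "/tmp/output",
"MapperCommand": "mapper.py",
"ReducerCommand": "reducer.py",
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"rm" : "hdfs://nn.example.com/tmp/output"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [{"get" : "hdfs://nn.example.com/tmp/output/part-00000 localresult"}]
}
}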
Job:Hadoop:Tajo
The following table describes the Hadoop Tajo query job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Tajo. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host. |
OpenQuery |
Defines an ad hoc query to the Apache Tajo warehouse system. |
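The following sketch shows how a Tajo open-query job might be defined, assuming the Job:Hadoop:Tajo type name implied by this section's heading convention; the query itself is a placeholder:
"TajoQueryJob":
{
"Type" : "Job:Hadoop:Tajo",
"Host" : "edgenode",
"ConnectionProfile" : "DEV_CLUSTER",
"OpenQuery" : "SELECT * FROM table1"
}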
Job:OCI Data Flow
Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed Apache Spark service that performs processing tasks on extremely large datasets.
To deploy and run an OCI Data Flow job, ensure that you have installed the OCI Data Flow plug-in with the provision image command or the provision agent::update command.
The following example shows how to define an OCI Data Flow job:
"OCI Data Flow":
{
"Type": "Job:OCI Data Flow",
"ConnectionProfile": "OCI_DATAFLOW",
"Run Name": "CM test run",
"Compartment OCID": "ocid1.compartment.oc1..aaaaaaaahjoxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"Application OCID": "ocid1.dataflowapplication.oc1.phx.anyhqljrtxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"Additional Run Details": "Yes",
"Run Details Configuration": "
{
\"displayName\":\"run_name\",
\"applicationId\":\"application_ocid\",
\"compartmentId\":\"compartment_ocid\"
}",
"Status Polling Frequency":"60",
"Failure Tolerance":"2"
}
The following table describes the OCI Data Flow job attributes.
Attribute |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:OCI Data Flow name that connects Control-M to OCI Data Flow. |
Run Name |
Defines the name of a new Run. |
Compartment OCID |
Defines the compartment Oracle Cloud Identifier (OCID) which is a unique identifier assigned to each compartment that is created within the Oracle Data Flow Infrastructure. |
Application OCID |
Defines the application Oracle Cloud Identifier (OCID) which is a unique identifier assigned to each application that is created within the Oracle Data Flow Infrastructure. |
Additional Run Details |
(Optional) Determines whether to add more parameters to the new job run. Valid Values: Yes | No. Default: No |
Run Details Configuration |
(Optional) Defines specific parameters, in JSON format, that are passed when you create a new Run. For more information about the run parameters, see CreateRunDetails Reference 20200129 in the Oracle Cloud Infrastructure Documentation. |
Status Polling Frequency |
Determines the number of seconds to wait before checking the job status. Default: 60 |
Failure tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
Job:Snowflake
Snowflake is a cloud-computing platform that enables you to process, analyze, and store your data.
To deploy and run a Snowflake job, ensure that you have installed the Snowflake plug-in with the provision image command or the provision agent::update command.
The following example shows how to define a Job:Snowflake job for a SQL Statement action in Snowflake:
"Snowflake_Job":
{
"Type": "Job:Snowflake",
"ConnectionProfile": "SNOWFLAKE_CONNECTION_PROFILE",
"Database": "FactoryDB",
"Schema": "Public",
"Action": "SQL Statement",
"Snowflake SQL Statement": "Select * From Table1",
"Statement Timeout": "60",
"Show More Options": "unchecked",
"Show Output": "unchecked",
"Polling Interval": "20"
}
The following table describes the Job:Snowflake job parameters.
Parameter |
Action |
Description |
---|---|---|
Connection Profile |
All Actions |
Defines one of the following connection profile types that connects Control-M to Snowflake: |
Database |
All Actions |
Determines the database that the job uses. |
Schema |
All Actions |
Determines the schema that the job uses. A schema is an organizational model that describes layout and definition of the fields and tables, and their relationships to each other, in a database. |
Action |
N/A |
Determines one of the following Snowflake actions to perform: SQL Statement | Copy from Query | Copy from Table | Create Table and Query | Copy into Table | Start or Pause Snowpipe | Stored Procedure | Snowpipe Load Status | Run SQL File |
Snowflake SQL Statement |
SQL Statement |
Determines one or more Snowflake-supported SQL commands. Rule: Must be written in a single line, with strings separated by one space only. |
Query to Location |
Copy from Query |
Defines the cloud storage location. |
Query Input |
Copy from Query |
Defines the query used for copying the data. |
Storage Integration |
|
Defines the storage integration object, which stores an Identity and Access Management (IAM) entity and an optional set of blocked cloud storage locations. |
Overwrite |
|
Determines whether to overwrite an existing file in the cloud storage, as follows:
|
File Format |
|
Determines one of the following file formats for the saved file: CSV | JSON |
Copy Destination |
Copy from Table |
Determines where the JSON or CSV file is saved. You can save to Amazon Web Services, Google Cloud Platform, or Microsoft Azure. s3://<bucket name>/ |
From Table |
Copy from Table |
Defines the name of the copied table. |
Create Table Name |
Create Table and Query |
Defines the name of the new or existing table where the data is queried. |
Query |
Create Table and Query |
Defines the query used for the copied data. |
Snowpipe Name |
|
Defines the name of the Snowpipe. A Snowpipe loads data from files when they are ready or staged. |
Table Name |
Copy into Table |
Defines the name of the table that the data is copied into. |
From Location |
Copy into Table |
Defines the cloud storage location from where the data is copied, in CSV or JSON format. s3://location-path/FileName.csv |
Start or Pause Snowpipe |
Start or Pause Snowpipe |
Determines whether to start or pause the Snowpipe, as follows: Start | Pause |
Stored Procedure Name |
Stored Procedure |
Defines the name of the stored procedure. |
Procedure Argument |
Stored Procedure |
Defines the value of the argument in the stored procedure. |
Table Name |
Snowpipe Load Status |
Defines the table that is monitored when loaded by the Snowpipe. |
Stage Location |
Snowpipe Load Status |
Defines the cloud storage location. A stage is a pointer that indicates where data is stored, or staged. s3://CloudStorageLocation/ |
Days Back |
Snowpipe Load Status |
Determines the number of days to monitor the Snowpipe load status. |
Status File Cloud Location Path |
Snowpipe Load Status |
Defines the cloud storage location where a CSV file log is created. The CSV file log details the load status for each Snowpipe. |
Storage Integration |
Snowpipe Load Status |
Defines the Snowflake configuration for the cloud storage location (as defined in the previous parameter, Status File Cloud Location Path). S3_INT |
Load SQL File |
Run SQL File |
Defines the full path to the file that contains Snowflake-supported SQL commands. |
Statement Timeout |
All Actions |
Determines the maximum number of seconds to run the job in Snowflake. |
Show More Options |
All Actions |
Determines whether the following job-defining attributes are displayed:
|
Parameters |
All Actions |
Defines Snowflake-provided parameters that let you control how data is presented, as follows: {"param1":"value1", "param2":"value2"} |
Role |
All Actions |
Determines the Snowflake role used for this Snowflake job. A role is an entity that can be assigned privileges on secure objects. You can be assigned one or more roles from a limited selection. |
Bindings |
All Actions |
Defines the values to bind to the variables used in the Snowflake job, in JSON format. For more information about bindings, see the Snowflake documentation. |
Warehouse |
All Actions |
Determines the warehouse used in the Snowflake job. A warehouse is a cluster of virtual machines that processes a Snowflake job. |
Show Output |
All Actions |
Determines whether to show a full JSON response in the log output. Valid Values: checked | unchecked. Default: unchecked |
Status Polling Frequency |
All Actions |
Determines the number of seconds to wait before checking the status of the job. Default: 20 |
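The following sketch shows a possible Job:Snowflake definition for a Stored Procedure action, using only the parameters documented above; the procedure name and argument are illustrative placeholders:
"Snowflake_Stored_Procedure_Job":
{
"Type": "Job:Snowflake",
"ConnectionProfile": "SNOWFLAKE_CONNECTION_PROFILE",
"Database": "FactoryDB",
"Schema": "Public",
"Action": "Stored Procedure",
"Stored Procedure Name": "MyProcedure",
"Procedure Argument": "'2023-01-01'",
"Statement Timeout": "60",
"Show More Options": "unchecked",
"Show Output": "unchecked",
"Polling Interval": "20"
}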