Data Processing and Analytics Jobs

The following topics describe job types for data processing and analytics platforms and services:

Job:AWS Athena

AWS Athena is an interactive query service that enables you to process and analyze your data in the cloud by using standard SQL.

To deploy and run an AWS Athena job, ensure that you have done the following:

The following example shows how to define an AWS Athena job. This JSON-based job executes a SQL-based query:

Copy
"AWS Athena_Job_2":
{
   "Type": "Job:AWS Athena",
   "ConnectionProfile": "AWSATHENA",
   "Athena Client Request Token": "aws-athena-client-request-token-%%ORDERID-%%TIME",
   "DB Catalog Name": "DB_Catalog_Athena",
   "Database Name": "DB_Athena",
   "Action": "Query",
   "Query": "Select * from Athena_Table",
   "Output Location": "s3://{BucketPath}",
   "Workgroup": "Primary",
   "Add Configurations": "checked",
   "S3 ACL Option": "BUCKET_OWNER_FULL_CONTROL",
   "Encryption Options": "SSE_KMS",
   "KMS Key": "arn:aws:kms:us-west-2:123456789012:key/abcd1234-5678-9012-efgh-ijklmnopqrst",
   "Bucket Owner": "Account_ID",
   "Show JSON Output": "unchecked",
   "Status Polling Frequency": "10",
   "Tolerance": "2"
}
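
The following sketch shows how an Unload action might be defined with the same connection profile. This is an illustrative example only, not a verified job definition: the bucket path, workgroup, and the exact combination of attributes used for an Unload action are assumptions based on the parameter descriptions in the table below.

"AWS Athena_Unload_Job":
{
   "Type": "Job:AWS Athena",
   "ConnectionProfile": "AWSATHENA",
   "Athena Client Request Token": "aws-athena-client-request-token-%%ORDERID-%%TIME",
   "DB Catalog Name": "DB_Catalog_Athena",
   "Database Name": "DB_Athena",
   "Action": "Unload",
   "Query": "Select * from Athena_Table",
   "Unload File Type": "CSV",
   "Output Location": "s3://{BucketPath}",
   "Workgroup": "Primary",
   "Show JSON Output": "unchecked",
   "Status Polling Frequency": "10",
   "Tolerance": "2"
}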

The following table describes the AWS Athena job parameters.

Parameter

Description

Connection Profile

Defines the ConnectionProfile:AWS Athena name that connects Control-M to AWS Athena.

Athena Client Request Token

Defines a unique ID (idempotency token), which guarantees that the job executes only once.

Default: aws-athena-client-request-token-%%ORDERID-%%TIME

DB Catalog Name

Defines the name of the group of databases (catalog) that the query references.

Database Name

Defines the name of the database that the query references.

Action

Determines which of the following query actions to perform:

  • Query: Executes the query that you enter in the Query attribute.

  • Run Prepared Query: Executes a predefined query that is stored in the AWS Athena platform.

  • Query and Create Table: Executes the query that you enter in the Query attribute and saves the results to a new table.

  • Unload: Executes the query that you enter in the Query attribute and saves the results to a file in an Amazon S3 bucket.

Query

Defines the SQL-based query that executes.

Prepared Query Name

Defines the name of the predefined query that is stored in the AWS Athena platform.

Table Name

Defines the name of the table that is created, which is populated by the results of a query in AWS Athena.

Unload File Type

Determines the file format that the query results are saved in, as follows:

  • JSON

  • CSV

  • ORC

  • Parquet

  • Avro

  • Text File

Output Location

Defines the AWS S3 bucket path where the file is saved, as follows:

s3://<path>

AWS Athena automatically generates a filename that incorporates the Query Execution ID, which is a unique ID applied to each query that is executed.

Workgroup

Defines the workgroup for this job.

Workgroups can consist of users, teams, applications, or workloads, and can set limits on the data that each query or group processes.

Add Configurations

Determines whether to add more configuration parameters to the job, such as the S3 ACL, encryption, and bucket owner options.

Valid Values:

  • checked

  • unchecked

Default: unchecked

S3 ACL Option

Defines the Amazon S3 canned access control list (ACL), which is a predefined set of grantees and permissions assigned to your stored query results.

BUCKET_OWNER_FULL_CONTROL is the only canned ACL that is currently supported in AWS Athena. This setting gives you and the bucket owner full control of the query results.

Encryption Options

Determines one of the following ways to encrypt the query results:

  • SSE_S3: Encrypts the data in Amazon S3 with Server-Side Encryption (SSE) and Amazon S3-managed encryption keys.

  • SSE_KMS: Encrypts the data in Amazon S3 with SSE and the AWS Key Management Service (KMS), which enables you to manage the encryption keys.

  • CSE_KMS: Encrypts the data in Amazon S3 with client-side encryption (CSE) and AWS KMS, which enables you to provide and manage your own encryption keys.

KMS Key

(SSE_KMS and CSE_KMS only) Defines the Amazon Resource Name (ARN) of the KMS key.

An ARN is a standardized AWS resource address.

arn:aws:kms:us-west-2:123456789012:key/abcd1234-5678-9012-efgh-ijklmnopqrst

Bucket Owner

Defines the AWS account ID of the Amazon S3 bucket owner.

Show JSON Output

Determines whether to show the full JSON API response in the job output.

Valid Values:

  • checked

  • unchecked

Default: unchecked

Status Polling Frequency

Determines the number of seconds to wait before checking the status of the job.

Default: 10

Tolerance

Determines the number of times to check the job status before ending Not OK.

Default: 2

Job:AWS Data Pipeline

AWS Data Pipeline is a cloud-based extract, transform, load (ETL) service that enables you to automate the transfer, processing, and storage of your data.

To deploy and run an AWS Data Pipeline job, ensure that you have done the following:

The following examples show how to define an AWS Data Pipeline job.

  • This JSON-based job creates a pipeline:

    Copy
    "AWS Data Pipeline_Job":
    {
       "Type": "Job:AWS Data Pipeline",
       "ConnectionProfile": "AWSDATAPIPELINE",
       "Action": "Create Pipeline",
       "Pipeline Name": "demo-pipeline",
       "Pipeline Unique Id": "235136145",
       "Parameters"
       {
          "parameterObjects": [
          {
             "attributes": [
             {
                "key": "description",
                "stringValue": "S3outputfolder"
             } ],
             "id": "myS3OutputLoc"
          } ],
          "parameterValues": [
          {
             "id": "myShellCmd",
             "stringValue": "grep -rc \"GET\" ${INPUT1_STAGING_DIR}/* > ${OUTPUT1_STAGING_DIR}/output.txt"
          } ],
          "pipelineObjects": [
          {
             "fields": [
             {
                "key":"input",
                "refValue":"S3InputLocation"
             },
             {
                "key":"stage",
                "stringValue":"true"
             } ],
             "id": "ShellCommandActivityObj",
             "name": "ShellCommandActivityObj"
          } ]
       },
        "Trigger Created Pipeline": "checked",
        "Status Polling Frequency": "20",
        "Failure Tolerance": "3"
    }
  • This JSON-based job triggers an existing pipeline:

    Copy
    "AWS Data Pipeline_Job":
    {
       "Type": "Job:AWS Data Pipeline",
       "ConnectionProfile": "AWSDATAPIPELINE",
       "Action": "Trigger Pipeline",
       "Pipeline ID": "df-020488024DNBVFN1S2U",
       "Trigger Created Pipeline": "unchecked",
       "Status Polling Frequency": "20",
       "Failure Tolerance": "3"
    }

The following table describes the AWS Data Pipeline job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:AWS Data Pipeline name that connects Control-M to AWS Data Pipeline.

Action

Determines one of the following AWS Data Pipeline actions:

  • Trigger Pipeline: Runs an existing AWS Data Pipeline.

  • Create Pipeline: Creates a new AWS Data Pipeline.

Pipeline Name

(Create Pipeline) Defines the name of the new AWS Data Pipeline.

Pipeline Unique ID

(Create Pipeline) Defines the unique ID (idempotency key) that guarantees the pipeline is created only once. After successful execution, this ID cannot be used again.

Valid Values: Any alphanumeric characters.

Parameters

(Create Pipeline) Defines the parameter objects, which define the variables, for your AWS Data Pipeline in JSON format.

For more information about the available parameter objects, see the descriptions of the PutPipelineDefinition and GetPipelineDefinition actions in the AWS Data Pipeline API Reference.

Trigger Created Pipeline

(Create Pipeline) Determines whether to run (trigger) the newly created AWS Data Pipeline.

Valid Values:

  • checked

  • unchecked

This parameter is relevant only for a creation action. For a trigger action, set it to unchecked.

Pipeline ID

(Trigger Pipeline) Determines which pipeline to run (trigger).

Status Polling Frequency

Determines the number of seconds to wait before checking the status of the Data Pipeline job.

Default: 20

Failure Tolerance

Determines the number of times to check the job status before ending Not OK.

Default: 2

Job:AWS DynamoDB

AWS DynamoDB is a NoSQL database service that enables you to create database tables, execute statements and transactions, and export and import data to and from the Amazon S3 storage service.

To deploy and run an AWS DynamoDB job, ensure that you have done the following:

The following examples show how to define an AWS DynamoDB job.

  • This JSON-based job executes a statement:

    Copy
    "AWS DynamoDB_Execute_Statement"
    {
       "Type": "Job:AWS DynamoDB",
       "ConnectionProfile": "ADY",
       "Action": "Execute Statement",
       "Run Statement with Parameter": "checked",
       "Statement": "Select * From IFteam where Id=? OR Name=?",
       "Statement Parameters": "[{\"N\": \"20\"},{\"S\":\"Stas30\"}]"
    }
  • This JSON-based job executes a transaction:

    Copy
    "AWS DynamoDB_Transaction"
    {
       "Type": "Job:AWS DynamoDB",
       "ConnectionProfile": "ADY",
       "Action": "Execute Transaction",
       "Transaction Statments": "[%4E{%4E \"Parameters\": [{\"N\": \"20\"},{\"S\":\"Stas30\"}],%4E\"Statement\": \"Select * From IFteam where Id=17\"%4E },%4E{%4E \"Parameters\": [{\"N\": \"20\"},{\"S\":\"Stas30\"}],%4E\"Statement\": \"Select * From IFteam where Id=18\"%4E]",
       "Host": "dba-tlv-wcpg35",
       "CreatedBy": "emuser",
       "RunAs": "ADY",
       "When"
       {
          "WeekDays": ["NONE"],
          "MonthDays": ["ALL"],
          "DaysRelation": "OR"
       },
       "eventsToWaitFor"
       {
          "Type": "WaitForEvents",
          "Events": [
          {
             "Event": "AWS_DynamoDB_Execute_Statement-TO-AWS_DynamoDB_Transaction"
          }]
       }
    }
  • This JSON-based job exports a table to S3:

    Copy
    "AWS DynamoDB_Export"
    {
       "Type": "Job:AWS DynamoDB",
       "ConnectionProfile": "ADY",
       "Action": "Export Table To S3",
       "Idempotency Token": "5364@#gert423",
       "Export Format": "DynamoDB JSON",
       "S3 Bucket Name": "stasbucket1",
       "S3 Path Prefix": "TestDynmoExport",
       "S3 Bucket Owner ID": "122343283363",
       "Table ARN": "arn:aws:dynamodb:us-east-1:122343283363:table/IFteam",
       "Host": "dba-tlv-wcpg35",
       "CreatedBy": "emuser",
       "RunAs": "ADY",
       "When"
       {
          "WeekDays": ["NONE"],
          "MonthDays": ["ALL"],
          "DaysRelation": "OR"
       },
       "eventsToWaitFor"
       {
          "Type": "WaitForEvents",
          "Events": [
          {
             "Event": "AWS_DynamoDB_Transaction-TO-AWS_DynamoDB_Export"
          }]
       }
    }
  • This JSON-based job imports a table from S3:

    Copy
    "AWS DynamoDB_Import":
    {
       "Type": "Job:AWS DynamoDB",
       "ConnectionProfile": "ADY",
       "Action": "Import Table from S3",
       "Idempotency Token": "5364@#gert423",
       "Import Format": "DynamoDB JSON",
       "S3 Bucket Name": "stasbucket1",
       "S3 Path Prefix": "AWSDynamoDB/01690368915115be3974ee/data/vejljoqgiqyexew2cxgetylg6u.json.gz",
       "S3 Bucket Owner ID": "122343283363",
       "Table Creation Parameters": "\"AttributeDefinitions\": [%4E {%4E\"AttributeName\": \"Id\",%4E\"AttributeType\": \"N\"%4E}%4E ],%4E\"KeySchema\": [%4E{%4E\"AttributeName\": \"Id\",%4E\"KeyType\": \"HASH\"%4E}%4E],%4E \"BillingMode\": \"PROVISIONED\",%4E\"ProvisionedThroughput\": {%4E\"ReadCapacityUnits\": 1,%4E \"WriteCapacityUnits\": 1%4E}",
       "Table Name": "NewTAB",
       "Host": "dba-tlv-wcpg35",
       "CreatedBy": "emuser",
       "RunAs": "ADY",
       "When"
       {
          "WeekDays": ["NONE"],
          "MonthDays": ["ALL"],
          "DaysRelation": "OR"
       },
       "eventsToWaitFor"
       {
          "Type": "WaitForEvents",
          "Events": [
          {
             "Event": "AWS_DynamoDB_Export-TO-AWS_DynamoDB_Import"
          }]
       }
    }

The following table describes the AWS DynamoDB job type attributes.

Attribute

Action

Description

ConnectionProfile

All Actions

Defines the ConnectionProfile:AWS DynamoDB name that connects Control-M to AWS DynamoDB.

Action

All Actions

Determines one of the following AWS DynamoDB actions to perform:

  • Execute Statement

  • Execute Transaction

  • Export Job to S3 Bucket

  • Import Job from S3 Bucket

Run Statement with Parameter

Execute Statement

Determines whether to execute the statement with your own JSON parameters.

Valid Values:

  • checked

  • unchecked

Default: unchecked

Statement

Execute Statement

Defines one or more PartiQL statements that are supported by AWS DynamoDB.

Statement Parameters

Execute Statement

Defines the parameters for the AWS DynamoDB job, in JSON format, which enable you to control how the job executes, as shown in the following example:

Copy
[{\"N\": \"20\"},{\"S\":\"Stas30\"}]

Transaction Statements

Execute Transaction

Defines one or more PartiQL transaction statements, as shown in the following example:

Copy
[%4E{%4E \"Parameters\": [{\"N\": \"20\"},{\"S\":\"Stas30\"}],%4E\"Statement\": \"Select * From IFteam where Id=17\"%4E },%4E{%4E \"Parameters\": [{\"N\": \"20\"},{\"S\":\"Stas30\"}],%4E\"Statement\": \"Select * From IFteam where Id=18\"%4E }%4E]

Idempotency Token

  • Export Job to S3 Bucket

  • Import Job from S3 Bucket

Defines the unique ID (idempotency token) that guarantees the job is executed only once. After successful execution, this ID cannot be used again.

Export Format

Export Job to S3 Bucket

Determines one of the following formats to export data:

  • DYNAMODB JSON

  • ION

Import Format

Import Job from S3 Bucket

Determines one of the following formats of the source data:

  • CSV

  • DYNAMODB JSON

  • ION

S3 Bucket Name

  • Export Job to S3 Bucket

  • Import Job from S3 Bucket

Defines the name of the Amazon S3 bucket that the table is exported to or imported from.

S3 Path Prefix

  • Export Job to S3 Bucket

  • Import Job from S3 Bucket

Defines the Amazon S3 bucket prefix to use as the filename and path of the table.

AWSDynamoDB/01654668915125-be3574ee/data/vejljoqgiqyexew2cxgetylg6u.json.gz

S3 Bucket Owner ID

  • Export Job to S3 Bucket

  • Import Job from S3 Bucket

Defines the ID of the AWS account that owns the bucket.

Table ARN

  • Export Job to S3 Bucket

  • Import Job from S3 Bucket

Defines the Amazon Resource Name (ARN) associated with the table to export.

Import Compression Type

Import Job from S3 Bucket

Determines one of the following compression types used for the imported data:

  • GZIP

  • ZSTD

  • No Compression

Table Creation Parameters

Import Job from S3 Bucket

Defines the creation parameters of the new table where the data is imported, such as attribute definitions, key schema, and billing mode, as shown in the following example:

Copy
"Attribute Definitions": [
{
   "AttributeName": "Id".
   "AttributeType": "N"
}]
"KeySchema": [
{
   "AttributeName": "Id".
   "KeyType": "HASH"
}]
"BillingMode": "PROVISIONED",
"ProvisionedThroughput":
{
   "RealCapacityUnits": 1,
   "WriteCapacityUnits": 1
}

Table Name

Import Job from S3 Bucket

Defines the name of the new table where the data is imported.

Status Polling Frequency

All Actions

Determines the number of seconds to wait before checking the status of the job.

Default: 20

Failure Tolerance

  • Export Job to S3 Bucket

  • Import Job from S3 Bucket

Determines the number of times to check the job status before ending Not OK.

Default: 0

Job:AWS EMR

Amazon Web Services (AWS) EMR is a managed cluster platform that enables you to execute big data frameworks, such as Apache Hadoop and Apache Spark, to process and analyze vast amounts of data.

To deploy and run an AWS EMR job, ensure that you have done the following:

The following example shows how to define an AWS EMR job:

Copy
"AWS EMR_Job_2":
{
   "Type": "Job:AWS EMR",
   "ConnectionProfile": "AWS_EMR",
   "Cluster ID": "j-21PO60WBW77GX",
   "Notebook ID": "e-DJJ0HFJKU71I9DWX8GJAOH734",
   "Relative Path": "ShowWaitingAndRunningClusters.ipynb",
   "Notebook Execution Name": "TestExec",
   "Service Role": "EMR_Notebooks_DefaultRole",
   "Use Advanced JSON Format": "unchecked",
}

The following table describes the AWS EMR job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:AWS EMR name that connects Control-M to AWS EMR.

Cluster ID

Defines the name of the AWS EMR cluster that connects to the Notebook.

In the EMR API, the cluster ID is also known as the Execution Engine ID.

Notebook ID

Determines which Notebook ID executes the script.

In the EMR API, the Notebook ID is also known as the Editor ID.

Relative Path

Defines the full directory path and filename of the script in the Notebook.

Notebook Execution Name

Defines the job execution name.

Service Role

Defines the service role that connects to the Notebook.

Use Advanced JSON Format

Determines whether to provide Notebook execution information through JSON code.

Valid Values:

  • checked

  • unchecked

Default: unchecked

If you set this parameter to checked, the JSON Body parameter replaces several other parameters discussed above (Cluster ID, Notebook ID, Relative Path, Notebook Execution Name, and Service Role).

JSON Body

Defines Notebook execution settings in JSON format. For a description of the syntax of this JSON, see the description of StartNotebookExecution in the Amazon EMR API Reference.

JSON Body is relevant only if you set Use Advanced JSON Format to checked.

Copy
"EditorId": "e-DJJ0HFJKU71I9DWX8GJAOH734",
"RelativePath": "ShowWaitingAndRunningClustersTest2.ipynb",
"NotebookExecutionName":"Tests",
"ExecutionEngine"
{
   "Id": "j-AR2G6DPQSGUB"
},
"ServiceRole": "EMR_Notebooks_DefaultRole"

Job:Azure Databricks

Azure Databricks is a cloud-based data analytics platform that enables you to process and analyze large workloads of data.

To deploy and run an Azure Databricks job, ensure that you have done the following:

The following example shows how to define an Azure Databricks job:

Copy
"Azure Databricks notebook":
{
   "Type": "Job:Azure Databricks",
   "ConnectionProfile": "AZURE_DATABRICKS",
   "Databricks Job ID: "65",
   "Parameters": "\"notebook_params\":{\"param1\":\"val1\", \"param2\":\"val2\"}",
   "Idempotency Token": "Control-M-Idem_%%ORDERID",
   "Status Polling Frequency": "30"
}

The following table describes the Azure Databricks job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Azure Databricks name that connects Control-M to Azure Databricks.

Databricks Job ID

Defines the job ID created in your Databricks workspace.

Parameters

Defines task parameters to override when the job runs, according to the Databricks convention. The list of parameters must begin with the name of the parameter type.

"notebook_params":<"param1":"val1", "param2":"val2">

"jar_params": ["param1", "param2"]

For more information about the parameter types, review the properties of RunParameters in the OpenAPI specification provided through the Azure Databricks documentation.

For no parameters, specify a value of "params": {}.

"Parameters": "params": {}

Idempotency Token

(Optional) Defines a token to use to rerun job runs that timed out in Databricks.

Valid Values:

  • Control-M-Idem_%%ORDERID: With this token, upon rerun, Control-M invokes the monitoring of the existing job run in Databricks. Default.

  • Any other value: Replaces the Control-M idempotency token. When you rerun a job using a different token, Databricks creates a new job run with a new unique run ID.

Status Polling Frequency

(Optional) Defines the number of seconds to wait before checking the status of the job.

Default: 30

Job:Azure HDInsight

Azure HDInsight enables you to execute an Apache Spark batch job and perform big data analytics.

To deploy and run an Azure HDInsight job, ensure that you have done the following:

The following example shows how to define an Azure HDInsight job:

Copy
"Azure HDInsight_Job"
{
   "Type": "Job:Azure HDInsight",
   "ConnectionProfile": "AZUREHDINSIGHT",
   "Parameters": "
   {
      "file" : "wasb://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/sample.jar",
      "args" : ["arg0", "arg1"],
      "className" : "com.sample.Job1",
      "driverMemory" : "1G",
      "driverCores" : 2,
      "executorMemory" : "1G",
      "executorCores" : 10,
      "numExecutors" : 10
   },
   "Status Polling Interval": "10",
   "Bring job logs to output": "checked"
}

The following table describes the Azure HDInsight job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Azure HDInsight name that connects Control-M to Azure HDInsight.

Parameters

Defines parameters to be passed to the Apache Spark application during job execution, in JSON format (name:value pairs).

This JSON must include the file and className elements.

For more information about common parameters, see Batch Job in the Azure HDInsight documentation.

Status Polling Interval

Determines the number of seconds to wait before checking the status of the Apache Spark batch job.

Default: 10

Bring job logs to output

Determines whether logs from Apache Spark are shown in the job output.

Valid Values:

  • checked

  • unchecked

Default: unchecked

Job:Azure Synapse

Azure Synapse Analytics enables you to perform data integration and big data analytics.

To deploy and run an Azure Synapse job, ensure that you have done the following:

The following example shows how to define an Azure Synapse job:

Copy
"Azure Synapse_Job"
{
   "Type": "Job:Azure Synapse",
   "ConnectionProfile": "AZURE_SYNAPSE",
   "Pipeline Name": "ncu_synapse_pipeline",
   "Parameters": "{\"periodinseconds\":\"40\", \"param2\":\"val2\"}",
   "Status Polling Interval": "20"
}

The following table describes the Azure Synapse job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Azure Synapse name that connects Control-M to Azure Synapse.

Pipeline Name

Defines the name of a pipeline that you defined in your Azure Synapse workspace.

Parameters

Defines pipeline parameters to override when the job runs, defined in JSON format as name:value pairs, as follows:

{\"param1\":\"val1\", \"param2\":\"val2\"}

For no parameters, specify {}.

Status Polling Interval

(Optional) Determines the number of seconds to wait before checking the status of the job.

Default: 20

Job:Databricks

Databricks enables you to integrate jobs created in the Databricks environment with your existing Control-M workflows.

To deploy and run a Databricks job, ensure that you have done the following:

The following example shows how to define a Databricks job:

Copy
"Databricks_Job":
{
   "Type": "Job:Databricks",
   "ConnectionProfile": "DATABRICKS",
   "Databricks Job ID": "91",
   "Parameters": "\"notebook_params\":{\"param1\":\"val1\", \"param2\":\"val2\"}",
   "Idempotency Token": "Control-M-Idem_%%ORDERID",
   "Status Polling Frequency": "30"
}

The following table describes the Databricks job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Databricks name that connects Control-M to Databricks.

Databricks Job ID

Determines the job ID created in your Databricks workspace.

Parameters

Defines task parameters to override when the job runs, according to the Databricks convention. The list of parameters must begin with the name of the parameter type.

"notebook_params":{"param1":"val1", "param2":"val2"}  

"jar_params": ["param1", "param2"]

For more information about the parameter types, review RunParameters properties in the OpenAPI specification provided through the Databricks documentation.

For no parameters, specify a value of "params": {}.

"Parameters": "params": {}

Idempotency Token

(Optional) Defines a token to use to rerun job runs that timed out in Databricks.

Valid Values:

  • Control-M-Idem_%%ORDERID: With this token, upon rerun, Control-M invokes the monitoring of the existing job run in Databricks.

  • <Any Other Value>:  Replaces the Control-M idempotency token. When you rerun a job using a different token, Databricks creates a new job run with a new unique run ID.

Default: Control-M-Idem_%%ORDERID

Status Polling Frequency

(Optional) Determines the number of seconds to wait before checking the status of the job.

Default: 30

Job:DBT

Data Build Tool (dbt) is a cloud-based computing platform that enables you to develop, test, schedule, document, and analyze data models.

To deploy and run a dbt job, ensure that you have done the following:

The following example shows how to define a dbt job:

Copy
"DBT_Job_2":
{
   "Type": "Job:DBT",
   "ConnectionProfile": "DBT_CP",
   "DBT Job Id": "12345",
   "Run Comment": "A DBT job",
   "Override Job Commands": "checked",
   "Variables": [
   {
      "UCM-DefineCommands-N001-element": "dbt test"
   },
   {
      "UCM-DefineCommands-N002-element": "dbt run"
   } ],
   "Status Polling Frequency": "10",
   "Failure Tolerance": "2"
}

The following table describes the dbt job parameters.

Parameter

Description

Connection Profile

Defines the ConnectionProfile:DBT name that connects Control-M to dbt.

DBT Job ID

Defines the ID of the preexisting job in the dbt platform that you want to run.

Run Comment

Defines a free-text description of the job.

Override Job Commands

Determines whether to override the predefined dbt job commands.

Valid Values:

  • checked

  • unchecked

Default: unchecked

Variables

Defines the new dbt job commands as variable pairs, as follows:

"UCM-DefineCommands-Nnnn-element": "command string"

where nnn is a counter for the sequential position of each command.

Status Polling Frequency

Determines the number of seconds to wait before checking the status of the job.

Default: 10

Failure Tolerance

Determines the number of times to check the job status before ending Not OK.

Default: 2

Job:GCP BigQuery

Google Cloud Platform (GCP) BigQuery is a cloud-computing platform that enables you to process, analyze, and store your data.

To deploy and run a GCP BigQuery job, ensure that you have done the following:

The following example shows how to define a GCP BigQuery job for a Query action in GCP BigQuery:

Copy
"GCP BigQuery_query":
{
   "Type": "Job:GCP BigQuery",
   "ConnectionProfile": "BIGQSA",
   "Action": "Query",
   "Project Name": "proj",
   "Dataset Name": "Test",
   "Run Select Query and Copy to Table": "checked",
   "Table Name": "IFTEAM",
   "SQL Statement": "select user from IFTEAM2",
   "Query Parameters":
   {
      "name": "IFteam",
      "paramterType":
      { 
         "type": "STRING"
      },
      "parameterValue":
      {
         "value": "BMC"
      }
   },   
   "Job Timeout": "30000",
   "Connection Timeout": "10",
   "Status Polling Frequency": "5"
}
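
The following sketch shows how an Extract action might be defined, based on the parameters described in the table below. It is an illustrative example only: the project, dataset, table, bucket URI, and the exact value format of Destination/Source Bucket URIs are assumptions.

"GCP BigQuery_extract":
{
   "Type": "Job:GCP BigQuery",
   "ConnectionProfile": "BIGQSA",
   "Action": "Extract",
   "Project Name": "proj",
   "Dataset Name": "Test",
   "Table Name": "IFTEAM",
   "Destination/Source Bucket URIs": "\"gs://source1_site1/source1.json\"",
   "Extract As": "JSON",
   "Job Timeout": "30000",
   "Connection Timeout": "10",
   "Status Polling Frequency": "5"
}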

The following table describes the GCP BigQuery job parameters.

Parameter

Action

Description

ConnectionProfile

All Actions

Defines the ConnectionProfile:GCP BigQuery name that connects Control-M to GCP BigQuery.

Action

N/A

Determines one of the following GCP BigQuery actions to perform:

  • Query:  Runs one or more SQL statements that are supported by GCP BigQuery.

  • Copy: Creates a copy of an existing table.

  • Load: Loads source data into an existing table.

  • Extract: Exports data from an existing table into Google Cloud Storage.

  • Routine: Runs a stored procedure, table function, or previously defined function.

Project Name

All Actions

Determines the project that the job uses.

Dataset Name

  • Query

  • Extract

  • Routine

Determines the database that the job uses.

Run Select Query and Copy to Table

Query

(Optional) Determines whether to paste the results of a SELECT statement into a new table.

Table Name

  • Query

  • Extract

Defines the new table name.

SQL Statement

Query

Defines one or more SQL statements supported by GCP BigQuery.

Rule: The statement must be written on a single line, with character strings separated by a single space.

Query Parameters

Query

Defines the query parameters, which enable you to control the presentation of the data.

Copy
"name": "IFteam",
"paramterType":

   "type": "STRING"
},
"parameterValue":
{
   "value": "BMC"
}

Copy Operation Type

Copy

Determines one of the following copy operations:

  • Clone: Creates a copy of a base table that has write access.

  • Snapshot: Creates a read-only copy of a base table.

  • Copy: Creates a copy of a snapshot.

  • Restore:  Creates a writable table from a snapshot.

Source Table Properties

Copy

Defines the properties of the table that is cloned, backed up, or copied, in JSON format.

You can copy or back up one or more tables at a time.

Copy
   {
      "datasetId": "Test1",
      "projectId": "SomeProj1",
      "tableId": "IFteam1"
   },
   {
      "datasetId": "Test2",
      "projectId": "SomeProj2",
      "tableId": "IFteam2"
   }

Destination Table Properties

  • Copy

  • Load

Defines the properties of a new table, in JSON format.

Copy
{
   "datasetId": "Test3",
   "projectId": "SomeProj3",
   "tableId": "IFteam3"
}

Destination/Source Bucket URIs

  • Load

  • Extract

Defines the source or destination data URI for the table that you are loading or extracting.

You can load or extract multiple tables.

Rule: Separate multiple URIs with commas.

"gs://source1_site1/source1.json"

Show Load Options

Load

Determines whether to add more fields to a table that you are loading.

Load Options

Load

Defines additional fields for the table that you are loading.

Copy
"schema"
{
   "fields": [
   {
      "name": "name1",
      "type": "STRING1"
   }
   {
      "name": "name2",
      "type": "STRING2"
   }
   {
      "name": "name3",
      "type": "STRING3"
   } ]
}

Extract As

Extract

Determines one of the following file formats to export the data to:

  • CSV

  • JSON

Routine

Routine

Defines a routine and the values that it must run.

Call new_r('value1')

Job Timeout

All Actions

Determines the maximum number of milliseconds to run the GCP BigQuery job.

Connection Timeout

All Actions

Determines the number of seconds to wait for a connection to be established before the job ends Not OK.

Default: 10

Status Polling Frequency

All Actions

Determines the number of seconds to wait before checking the status of the job.

Default: 5

Job:GCP DataFlow

Google Cloud Platform (GCP) Dataflow enables you to perform cloud-based data processing for batch and real-time data streaming applications.

To deploy and run a GCP Dataflow job, ensure that you have done the following:

The following example shows how to define a GCP Dataflow job:

Copy
"Google DataFlow_Job_1":
{
   "Type": "Job:GCP DataFlow",
   "ConnectionProfile": "GCPDATAFLOW",
   "Project ID": "applied-lattice-11111",
   "Location": "us-central1",
   "Template Type": "Classic Template",
   "Template Location (gs://)": "gs://dataflow-templates-us-central1/latest/Word_Count",
   "Parameters (JSON Format)"
   {
      "jobName": "wordcount",
      "parameters"
      {
         "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
         "output": "gs://controlmbucket/counts"
      }
   },
   "Verification Poll Interval (in seconds)": "10",
   "output Level": "INFO",
   "Host": "host1"
}
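
A Flex Template job can be defined in a similar way. The following is a minimal, illustrative sketch only: it assumes a Flex Template spec file that was previously built and saved in Cloud Storage, and the template path and pipeline parameter names are assumptions.

"Google DataFlow_Flex_Job":
{
   "Type": "Job:GCP DataFlow",
   "ConnectionProfile": "GCPDATAFLOW",
   "Project ID": "applied-lattice-11111",
   "Location": "us-central1",
   "Template Type": "Flex Template",
   "Template Location (gs://)": "gs://controlmbucket/templates/wordcount-flex-template.json",
   "Parameters (JSON Format)":
   {
      "jobName": "wordcount-flex",
      "parameters":
      {
         "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
         "output": "gs://controlmbucket/counts"
      }
   },
   "Verification Poll Interval (in seconds)": "10",
   "Output Level": "INFO",
   "Host": "host1"
}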

The following table describes the GCP Dataflow job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:GCP DataFlow name that connects Control-M to GCP DataFlow.

Project ID

Defines the project ID for your Google Cloud project.

Location

Defines the Google Compute Engine region to create the job.

Template Type

Defines one of the following types of GCP Dataflow templates:

  • Classic Template: Developers run the pipeline and create a template. The Apache Beam SDK stages files in Cloud Storage, creates a template file (similar to a job request), and saves the template file in Cloud Storage.

  • Flex Template: Developers package the pipeline into a Docker image and then use the Google Cloud CLI to build and save the Flex Template spec file in Cloud Storage.

Template Location (gs://)

Defines the path for temporary files. This must be a valid Google Cloud Storage URL that begins with gs://.

The default pipeline option tempLocation is used if it has been set in the GCP Dataflow service.

Parameters (JSON Format)

Defines input parameters to be passed on to job execution, in JSON format (name:value pairs).

This JSON must include the jobName and parameters elements.

Verification Poll Interval (in seconds)

(Optional) Determines the number of seconds to wait before checking the status of the job.

Default: 10

Output Level

Determines one of the following levels of detail to retrieve from the GCP outputs in case of job failure:

  • TRACE

  • DEBUG

  • INFO

  • WARN

  • ERROR

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

Job:GCP Dataproc

Google Cloud Platform (GCP) Dataproc enables you to perform cloud-based big data processing and machine learning.

To deploy and run a GCP Dataproc job, ensure that you have done the following:

The following examples show how to define a GCP Dataproc job.

  • This JSON defines a job for a GCP Dataproc task of type Workflow Template:

    Copy
    "Google Dataproc_Job":
    {
       "Type": "Job:GCP Dataproc",
       "ConnectionProfile": "GCPDATAPROC",
       "Project ID": "gcp_projectID",
       "Account Region": "us-central1",
       "Dataproc task type": "Workflow Template",
       "Workflow Template": "Template2",
       "Verification Poll Interval (in seconds)": "20",
       "Tolerance": "2"
    }
  • This JSON defines a job for a Dataproc task of type Job:

    Copy
    "Google Dataproc_Job":
    {
       "Type": "Job:GCP Dataproc",
       "ConnectionProfile": "GCPDATAPROC",
       "Project ID": "gcp_projectID",
       "Account Region": "us-central1",
       "Dataproc task type": "Job",
       "Parameters (JSON Format)":
       {
       "job"
       {
          "placement": {},
          "statusHistory": [],
          "reference":
          {
             "jobId": "job-e241f6be",
             "projectId": "gcp_projectID"
          },
          "labels":
          {
             "goog-dataproc-workflow-instance-id": "44f2b59b-a303-4e57-82e5-e1838019a812",
             "goog-dataproc-workflow-template-id": "template-d0a7c"
          },
          "sparkJob":
          {
             "mainClass": "org.apache.spark.examples.SparkPi",
             "properties": {},
             "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"], 
             "args": ["1000"]
          }
       }
       },
       "Verification Poll Interval (in seconds)": "20",
       "Tolerance": "2"
    }

The following table describes the GCP Dataproc job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:GCP Dataproc name that connects Control-M to GCP Dataproc.

Project ID

Defines the project ID for your Google Cloud project.

Account Region

Defines the Google Compute Engine region to create the job.

Dataproc task type

Determines one of the following Dataproc task types to execute:

  • Workflow Template: A reusable workflow configuration that defines a graph of jobs with information on where to run those jobs.

  • Job: A single Dataproc job.

Workflow Template

(Workflow Template) Defines the ID of a Workflow Template.

Parameters

(Job) Defines input parameters to be passed on to job execution, in JSON format.

You retrieve this JSON content from the GCP Dataproc UI, using the EQUIVALENT REST option in job settings.

Verification Poll Interval

(Optional) Determines the number of seconds to wait before checking the status of the job.

Default: 20

Tolerance

Determines the number of times to check the job status before ending Not OK.

Default: 2

Job:Hadoop

The Hadoop job connects to the Hadoop framework, which enables the distributed processing of large data sets across clusters of commodity servers. You can expand your enterprise business workflows to include tasks that execute in your Big Data Hadoop cluster from Control-M, using the various Hadoop-supported tools, including Pig, Hive, HDFS File Watcher, MapReduce, and Sqoop.

Various types of Hadoop jobs are available for you to define using the Job:Hadoop objects:

Job:Hadoop:Spark:Python

To deploy and run a Hadoop Spark Python job, ensure that you have done the following:

The following example shows how to use Job:Hadoop:Spark:Python to run a Spark Python program:

Copy
"ProcessData":
{
   "Type": "Job:Hadoop:Spark:Python",
   "Host" : "edgenode",
   "ConnectionProfile": "DEV_CLUSTER"
   "SparkScript": "/home/user/processData.py"
}

The following table describes the Hadoop Spark Python job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Spark.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

This JSON defines the Hadoop Spark Python job optional parameters:

Copy
"ProcessData1":
{
   "Type": "Job:Hadoop:Spark:Python",
   "Host" : "edgenode",
   "ConnectionProfile": "DEV_CLUSTER",
   "SparkScript": "/home/user/processData.py",
   "Arguments": ["1000", "120" ],
   "PreCommands":
   {
      "FailJobOnCommandFailure" :false,
      "Commands" : [
      {
         "get" : "hdfs://nn.example.com/user/hadoop/file localfile"
      },
      {
         "rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
      } ]
   },
   "PostCommands":
   {
      "FailJobOnCommandFailure" :true,
      "Commands" : [{"put" : "localfile hdfs://nn.example.com/user/hadoop/file"}]
   },
   "SparkOptions": [
   {
      "--master": "yarn"
   },
   {
      "--num":"-executors 50"
   } ]
}

The following table describes the Hadoop Spark Python job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether to ignore failure in the pre- or post- commands.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK, even when a post-command fails.

Job:Hadoop:Spark:ScalaJava

To deploy and run a Hadoop Scala Java job, ensure that you have done the following:

The following example shows how to use Job:Hadoop:Spark:ScalaJava to run a Spark Java or Scala program:

Copy
"ProcessData":
{
   "Type": "Job:Hadoop:Spark:ScalaJava",
   "Host" : "edgenode",
   "ConnectionProfile": "DEV_CLUSTER",
   "ProgramJar": "/home/user/ScalaProgram.jar",
   "MainClass" : "com.mycomp.sparkScalaProgramName.mainClassName"
}

The following table describes the Hadoop Scala Java job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Spark.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

This JSON defines the Hadoop Scala Java job optional parameters:

Copy
"ProcessData1":
{
   "Type": "Job:Hadoop:Spark:ScalaJava",
   "Host" : "edgenode",
   "ConnectionProfile": "DEV_CLUSTER",
   "ProgramJar": "/home/user/ScalaProgram.jar"
   "MainClass" : "com.mycomp.sparkScalaProgramName.mainClassName",
   "Arguments": ["1000", "120" ],
   "PreCommands":
   {
      "FailJobOnCommandFailure" :false,
      "Commands" : [
      {
         "get" : "hdfs://nn.example.com/user/hadoop/file localfile"
      },
      {
         "rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
      } ]
   },
   "PostCommands"
   {
      "FailJobOnCommandFailure" :true,
      "Commands" : [{"put" : "localfile hdfs://nn.example.com/user/hadoop/file"}]
   },
   "SparkOptions": [
   {
      "--master": "yarn"
   },
   {
      "--num":"-executors 50"
   } ]
}

The following table describes the Hadoop Scala Java job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether to ignore failure in the pre- or post- commands.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK, even when a post-command fails.

Job:Hadoop:Pig

To deploy and run a Hadoop Pig job, ensure that you have done the following:

The following example shows how to use Hadoop Pig to run a Pig script:

Copy
"ProcessDataPig":
{
   "Type" : "Job:Hadoop:Pig",
   "Host" : "edgenode",
   "ConnectionProfile": "DEV_CLUSTER",
   "PigScript" : "/home/user/script.pig"
}

The following table describes the Hadoop Pig job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Pig.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

This JSON defines the Hadoop Pig job optional parameters:

Copy
"ProcessDataPig1"
{
   "Type" : "Job:Hadoop:Pig",
   "ConnectionProfile": "DEV_CLUSTER",
   "PigScript" : "/home/user/script.pig",
   "Host" : "edgenode",
   "Parameters" : [
   {
      "amount":"1000"
   },
   {
      "volume":"120"
   } ],
   "PreCommands":
   {
      "FailJobOnCommandFailure" :false,
      "Commands" : [
      {
         "get" : "hdfs://nn.example.com/user/hadoop/file localfile"
      },
      {
         "rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
      } ]
   },
   "PostCommands":
   {
      "FailJobOnCommandFailure" :true,
      "Commands" : [
      {
         "put" : "localfile hdfs://nn.example.com/user/hadoop/file"
      } ]
   }
}

The following table describes the Hadoop Pig job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether to ignore failure in the pre- or post- commands.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK, even when a post-command fails.

Job:Hadoop:Sqoop

To deploy and run a Hadoop Sqoop job, ensure that you have done the following:

The following example shows how to define a Hadoop Sqoop job:

Copy
"LoadDataSqoop":
{
   "Type" : "Job:Hadoop:Sqoop",
   "Host" : "edgenode",
   "ConnectionProfile" : "SQOOP_CONNECTION_PROFILE",
   "SqoopCommand" : "import --table foo --target-dir /dest_dir"
}

The following table describes the Hadoop Sqoop job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Sqoop.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

This JSON defines the Hadoop Sqoop job optional parameters:

Copy
"LoadDataSqoop1" :
{
   "Type" : "Job:Hadoop:Sqoop",
   "Host" : "edgenode",
   "ConnectionProfile" : "SQOOP_CONNECTION_PROFILE",
   "SqoopCommand" : "import --table foo",
   "SqoopOptions" : [
   {
      "--warehouse-dir":"/shared"
   },
   {
      "--default-character-set":"latin1"
   } ],
   "SqoopArchives" : "",
   "SqoopFiles": "",
   "PreCommands":
   {
      "FailJobOnCommandFailure" :false,
      "Commands" :[
      {
         "get" : "hdfs://nn.example.com/user/hadoop/file localfile"
      },
      {
         "rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
      } ]
   },
   "PostCommands":
   {
      "FailJobOnCommandFailure" :true,
      "Commands" : [
      {
         "put" : "localfile hdfs://nn.example.com/user/hadoop/file"
      } ]
   }
}

The following table describes the Hadoop Sqoop job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether to ignore failure in the pre- or post- commands.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK, even when a post-command fails.

SqoopOptions

Defines the parameters to pass as arguments to the specific Sqoop tool.

SqoopArchives

Determines the location of the Hadoop archives.

SqoopFiles

Determines the location of the Sqoop files.

Job:Hadoop:Hive

To deploy and run a Hadoop Hive job, ensure that you have done the following:

The following example shows how to use Hadoop Hive to run a Hive beeline job:

Copy
"ProcessHive":
{
   "Type" : "Job:Hadoop:Hive",
   "Host" : "edgenode",
   "ConnectionProfile" : "HIVE_CONNECTION_PROFILE",
   "HiveScript" : "/home/user1/hive.script"
}

The following table describes the Hadoop Hive job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Hive.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

This JSON defines the Hadoop Hive job optional parameters:

Copy
"ProcessHive1" :
{
   "Type" : "Job:Hadoop:Hive",
   "Host" : "edgenode",
   "ConnectionProfile" : "HIVE_CONNECTION_PROFILE",
   "HiveScript" : "/home/user1/hive.script",
   "Parameters" : [
   {
      "ammount": "1000"
   },
   {
      "topic": "food"
   } ],
   "HiveArchives" : "",
   "HiveFiles": "",
   "HiveOptions" : [
   {
      "hive.root.logger": "INFO,console"
   } ],
   "PreCommands"
   {
      "FailJobOnCommandFailure" :false,
      "Commands" : [
      {
         "get" : "hdfs://nn.example.com/user/hadoop/file localfile"
      },
      {
         "rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
      } ]
   },
   "PostCommands":
   {
      "FailJobOnCommandFailure" :true,
      "Commands" : [
      {
         "put" : "localfile hdfs://nn.example.com/user/hadoop/file"
      } ]
   }
}

The following table describes the Hadoop Hive job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether to ignore failure in the pre- or post- commands.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK, even when a post-command fails.

HiveScriptParameters

Defines the additional Hadoop command options to pass to beeline as hivevar "name"="value".

HiveProperties

Defines the additional Hadoop command options to pass to beeline as hiveconf "key"="value".

HiveArchives

Defines the additional Hadoop command options to pass to beeline as hiveconf mapred.cache.archives="value".

HiveFiles

Defines the additional Hadoop command options to pass to beeline as hiveconf mapred.cache.files="value".

Job:Hadoop:DistCp

To deploy and run a Hadoop Distributed Copy (DistCp) job, which uses the DistCp tool for large inter-cluster and intra-cluster copying, ensure that you have done the following:

The following example shows how to define a Hadoop DistCp job:

Copy
"DistCpJob" :
{
   "Type" : "Job:Hadoop:DistCp",
   "Host" : "edgenode",
   "ConnectionProfile": "DEV_CLUSTER",
   "TargetPath" : "hdfs://nns2:8020/foo/bar",
   "SourcePaths" : ["hdfs://nn1:8020/foo/a"]
}

The following table describes the Hadoop DistCp job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Distributed Copy.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine .

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

This JSON defines the Hadoop DistCp job optional parameters:

Copy
"DistCpJob" :
{
    "Type" : "Job:Hadoop:DistCp",
    "Host" : "edgenode",
    "ConnectionProfile" : "HADOOP_CONNECTION_PROFILE",
    "TargetPath" : "hdfs://nns2:8020/foo/bar",
    "SourcePaths" : ["hdfs://nn1:8020/foo/a", "hdfs://nn1:8020/foo/b" ],
    "DistcpOptions" : [
    {
       "-m":"3"
    },
    {
       "-filelimit ":"100"
    } ]
}

The following table describes the Hadoop DistCp job optional parameters.

Parameter

Description

TargetPath, SourcePaths, and DistcpOptions

Defines the additional Hadoop command options to pass to the distcp tool, as follows:

distcp <Options> <SourcePaths> <TargetPath>

Job:Hadoop:HDFSCommands

To deploy and run a Hadoop HDFS Commands job, ensure that you have done the following:

The following example shows how to define a Hadoop HDFS Commands job that executes one or more HDFS commands:

Copy
"HdfsJob":
{
   "Type" : "Job:Hadoop:HDFSCommands",
   "Host" : "edgenode",
   "ConnectionProfile": "DEV_CLUSTER",
   "Commands": [
   {
      "get": "hdfs://nn.example.com/user/hadoop/file localfile"
   },
   {
      "rm": "hdfs://nn.example.com/file /user/hadoop/emptydir"
   } ]
}

The following table describes the Hadoop HDFS Commands job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache HDFS Commands.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

Job:Hadoop:HDFSFileWatcher

Hadoop HDFS File Watcher runs a job that waits for HDFS file arrival.

To deploy and run a Hadoop HDFS FileWatcher job, ensure that you have done the following:

The following example shows how to define a Hadoop HDFS File Watcher job that waits for HDFS file arrival:

Copy
"HdfsFileWatcherJob" :
{
   "Type" : "Job:Hadoop:HDFSFileWatcher",
   "Host" : "edgenode",
   "ConnectionProfile" : "DEV_CLUSTER",
   "HdfsFilePath" : "/inputs/filename",
   "MinDetecedSize" : "1",
   "MaxWaitTime" : "2"
}

The following table describes the Hadoop HDFS File Watcher job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache HDFS FileWatcher.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

HdfsFilePath

Defines the full path of the file being watched.

MinDetecedSize

Defines the minimum file size in bytes to meet the criteria and finish the job as OK. If the file arrives, but the size is not met, the job continues to watch the file.

MaxWaitTime

Defines the maximum number of minutes to wait for the file to meet the watching criteria. If criteria are not met (file did not arrive, or minimum size was not reached) the job fails after this maximum number of minutes.

Job:Hadoop:Oozie

To deploy and run a Hadoop Oozie job, ensure that you have done the following:

The following example shows how to define a Hadoop Oozie job that submits an Oozie workflow:

Copy
"OozieJob":
{
   "Type" : "Job:Hadoop:Oozie",
   "Host" : "edgenode",
   "ConnectionProfile": "DEV_CLUSTER",
   "JobPropertiesFile" : "/home/user/job.properties",
   "OozieOptions" : [
   {
      "inputDir":"/usr/tucu/inputdir"
   },
   {
      "outputDir":"/usr/tucu/outputdir"
   } ]
}

The following table describes the Hadoop Oozie job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Oozie.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

JobPropertiesFile

Defines the path to the job properties file.
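
This JSON is a minimal sketch of the Hadoop Oozie job optional parameters. The command values are illustrative only and follow the PreCommands and PostCommands pattern shown for the other Hadoop job types:

"OozieJob1":
{
   "Type" : "Job:Hadoop:Oozie",
   "Host" : "edgenode",
   "ConnectionProfile": "DEV_CLUSTER",
   "JobPropertiesFile" : "/home/user/job.properties",
   "OozieOptions" : [
   {
      "inputDir":"/usr/tucu/inputdir"
   },
   {
      "outputDir":"/usr/tucu/outputdir"
   } ],
   "PreCommands":
   {
      "FailJobOnCommandFailure" :false,
      "Commands" : [
      {
         "get" : "hdfs://nn.example.com/user/hadoop/file localfile"
      } ]
   },
   "PostCommands":
   {
      "FailJobOnCommandFailure" :true,
      "Commands" : [
      {
         "put" : "localfile hdfs://nn.example.com/user/hadoop/file"
      } ]
   }
}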

The following table describes the Hadoop Oozie job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether to ignore failure in the pre- or post- commands.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK, even when a post-command fails.

OozieOptions

Defines values to set or override for the specified job properties.

Job:Hadoop:MapReduce

To deploy and run a Hadoop MapReduce job, ensure that you have done the following:

The following example shows how to define a Hadoop MapReduce job:

Copy
"MapReduceJob" :
{
   "Type" : "Job:Hadoop:MapReduce",
   "Host" : "edgenode",
   "ConnectionProfile": "DEV_CLUSTER",
   "ProgramJar" : "/home/user1/hadoop-jobs/hadoop-mapreduce-examples.jar",
   "MainClass" : "pi",
   "Arguments" :["1","2"]
}

The following table describes the Hadoop MapReduce job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache MapReduce.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

This JSON defines the Hadoop MapReduce job optional parameters:

Copy
"MapReduceJob1" :
{
   "Type" : "Job:Hadoop:MapReduce",
   "Host" : "edgenode",
   "ConnectionProfile": "DEV_CLUSTER",
   "ProgramJar" : "/home/user1/hadoop-jobs/hadoop-mapreduce-examples.jar",
   "MainClass" : "pi",
   "Arguments" :["1","2"],
   "PreCommands":
   {
      "FailJobOnCommandFailure" :false,
      "Commands" : [
      {
         "get" : "hdfs://nn.example.com/user/hadoop/file localfile"
      },
      {
         "rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"} ]
      },
   "PostCommands":
   {
      "FailJobOnCommandFailure" :true,
      "Commands" : [
      {
         "put" : "localfile hdfs://nn.example.com/user/hadoop/file"
      } ]
   }
}

The following table describes the Hadoop MapReduce job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether to ignore failure in the pre- or post- commands.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK, even when a post-command fails.

Job:Hadoop:MapredStreaming

To deploy and run a Hadoop Mapred Streaming job, ensure that you have done the following:

The following example shows how to define a Hadoop Mapred Streaming job:

Copy
"MapredStreamingJob1":
{
   "Type": "Job:Hadoop:MapredStreaming",
   "Host" : "edgenode",
   "ConnectionProfile": "DEV_CLUSTER",
   "InputPath": "/user/robot/input/*",
   "OutputPath": "/tmp/output",
   "MapperCommand": "mapper.py",
   "ReducerCommand": "reducer.py",
   "GeneralOptions": [
   {
      "-D": "fs.permissions.umask-mode=000"
   },
   {
      "-files": "/home/user/hadoop-streaming/mapper.py,/home/user/hadoop-streaming/reducer.py"
   } ]
}

The following table describes the Hadoop Mapred Streaming job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache MapReduce Streaming.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

The following table describes the Hadoop Mapred Streaming job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether the job fails when a pre- or post-command fails.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK even when a post-command fails.

GeneralOptions

Defines the additional Hadoop command options to pass to the hadoop-streaming.jar, including generic options and streaming options.
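
As with the other Hadoop job types, you can also add PreCommands and PostCommands to a streaming job. The following sketch uses a pre-command to remove a leftover output directory, if present, before the job runs; the paths are illustrative:

Copy
"MapredStreamingJob2":
{
   "Type": "Job:Hadoop:MapredStreaming",
   "Host" : "edgenode",
   "ConnectionProfile": "DEV_CLUSTER",
   "InputPath": "/user/robot/input/*",
   "OutputPath": "/tmp/output",
   "MapperCommand": "mapper.py",
   "ReducerCommand": "reducer.py",
   "GeneralOptions": [
   {
      "-files": "/home/user/hadoop-streaming/mapper.py,/home/user/hadoop-streaming/reducer.py"
   } ],
   "PreCommands":
   {
      "FailJobOnCommandFailure" : false,
      "Commands" : [
      {
         "rm" : "/tmp/output"
      } ]
   }
}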

Job:Hadoop:Tajo:InputFile

To deploy and run a Hadoop Tajo InputFile job, ensure that you have done the following:

The following example shows how to define a Hadoop Tajo InputFile job based on an input file:

Copy
"HadoopTajo_InputFile_Job" :
{
   "Type" : "Job:Hadoop:Tajo:InputFile",
   "ConnectionProfile" : "TAJO_CONNECTION_PROFILE",
   "Host" : "edgenode",
   "FullFilePath" : "/home/user/tajo_command.sh",
   "Parameters" : [
   {
      "amount":"1000"
   },
   {
      "volume":"120"
   } ]
}

The following table describes the Hadoop Tajo InputFile job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Tajo.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

FullFilePath

Defines the full path to the input file used as the Tajo command source.

Parameters

Defines optional parameters for the script, expressed as name:value pairs.

The following table describes the Hadoop Tajo InputFile job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether the job fails when a pre- or post-command fails.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK even when a post-command fails.

Job:Hadoop:Tajo:Query

To deploy and run a Hadoop Tajo Query job, ensure that you have done the following:

The following example shows how to define a Hadoop Tajo Query job:

Copy
"HadoopTajo_Query_Job":
{
   "Type" : "Job:Hadoop:Tajo:Query",
   "ConnectionProfile" : "TAJO_CONNECTION_PROFILE",
   "Host" : "edgenode",
   "OpenQuery" : "SELECT %%firstParamName AS VAR1 \\n FROM DUMMY \\n ORDER BY \\t VAR1 DESC",
}

The following table describes the Hadoop Tajo Query job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Tajo.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

OpenQuery

Defines an ad hoc query to the Apache Tajo warehouse system.

The following table describes the Hadoop Tajo Query job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether the job fails when a pre- or post-command fails.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK even when a post-command fails.

Job:Snowflake

Snowflake is a cloud-computing platform that enables you to process, analyze, and store your data.

To deploy and run a Snowflake job, ensure that you have done the following:

The following example shows how to define a Snowflake job that performs a SQL Statement action:

Copy
"Snowflake_Job":
{
   "Type": "Job:Snowflake",
   "ConnectionProfile": "SNOWFLAKE_CONNECTION_PROFILE",
   "Database": "FactoryDB",
   "Schema": "Public",
   "Action": "SQL Statement",
   "Snowflake SQL Statement": "Select * From Table1",
   "Statement Timeout": "60",
   "Show More Options": "unchecked",
   "Show Output": "unchecked",
   "Polling Interval": "20"
}

The following table describes the Snowflake job parameters.

Parameter

Action

Description

Connection Profile

All Actions

Defines one of the following connection profile types that connects Control-M to Snowflake:

Database

All Actions

Determines the database that the job uses.

Schema

All Actions

Determines the schema that the job uses.

A schema is an organizational model that describes the layout and definition of the fields and tables in a database, and their relationships to each other.

Action

N/A

Determines one of the following Snowflake actions to perform:

  • SQL Statement: Runs any number of Snowflake-supported SQL commands, such as queries, calling or creating procedures, database maintenance tasks, and creating and editing tables.

  • Copy from Query: Copies a queried database and schema into an existing or new file in cloud storage.

  • Copy from Table: Copies from an existing table.

  • Create Table and Query: Creates a table, populated by a query, in the specified database and schema.

  • Copy into Table: Copies data from a cloud storage location into an existing table in Snowflake.

  • Start or Pause Snowpipe: Starts or pauses an existing Snowpipe.

  • Stored Procedure: Calls an existing procedure and its arguments.

  • Snowpipe Load Status: Monitors the status of a Snowpipe for a set period of time.

Snowflake SQL Statement

SQL Statement

Defines one or more Snowflake-supported SQL commands.

Rule: Must be written in a single line, with strings separated by one space only.
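
For example, a statement that would normally be formatted across several lines is written as a single line (a sketch; the table and column names are illustrative):

"Snowflake SQL Statement": "INSERT INTO Sales_Archive SELECT * FROM Sales WHERE Sale_Date < '2023-01-01'"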

Query to Location

Copy from Query

Defines the cloud storage location.

Query Input

Copy from Query

Defines the query used for copying the data.

Storage Integration

  • Copy from Query

  • Copy from Table

  • Copy into Table

Defines the storage integration object, which stores an Identity and Access Management (IAM) entity and an optional set of allowed or blocked cloud storage locations.

Overwrite

  • Copy from Query

  • Copy from Table

Determines whether to overwrite an existing file in the cloud storage, as follows:

  • Yes

  • No

File Format

  • Copy from Query

  • Copy from Table

Determines one of the following file formats for the saved file:

  • JSON

  • CSV

Copy Destination

Copy from Table

Determines where the JSON or CSV file is saved.

You can save to Amazon Web Services, Google Cloud Platform, or Microsoft Azure.

s3://<bucket name>/

From Table

Copy from Table

Defines the name of the copied table.

Create Table Name

Create Table and Query

Defines the name of the new or existing table that is populated by the query results.

Query

Create Table and Query

Defines the query used for the copied data.

Snowpipe Name

  • Start or Pause Snowpipe

  • Snowpipe Load Status

Defines the name of the Snowpipe.

A Snowpipe loads data from files when they are ready or staged.

Table Name

Copy into Table

Defines the name of the table that the data is copied into.

From Location

Copy into Table

Defines the cloud storage location from where the data is copied, in CSV or JSON format.

s3://location-path/FileName.csv

Start or Pause Snowpipe

Start or Pause Snowpipe

Determines whether to start or pause the Snowpipe, as follows:

  • Start Snowpipe

  • Pause Snowpipe

Stored Procedure Name

Stored Procedure

Defines the name of the stored procedure.

Procedure Argument

Stored Procedure

Defines the value of the argument in the stored procedure.

Table Name

Snowpipe Load Status

Defines the table that is monitored when loaded by the Snowpipe.

Stage Location

Snowpipe Load Status

Defines the cloud storage location.

A stage is a pointer that indicates where data is stored, or staged.

s3://CloudStorageLocation/

Days Back

Snowpipe Load Status

Determines the number of days back to monitor the Snowpipe load status.

Status File Cloud Location Path

Snowpipe Load Status

Defines the cloud storage location where a CSV file log is created.

The CSV file log details the load status for each Snowpipe.

Storage Integration

Snowpipe Load Status

Defines the Snowflake configuration for the cloud storage location (as defined in the previous parameter, Status File Cloud Location Path).

S3_INT

Statement Timeout

All Actions

Determines the maximum number of seconds to run the job in Snowflake.

Show More Options

All Actions

Determines whether the following job-defining attributes are displayed:

  • Parameters

  • Role

  • Bindings

  • Warehouse

Parameters

All Actions

Defines Snowflake-provided parameters that let you control how data is presented, as follows.

<"param1":"value1", "param2":"value2">

Role

All Actions

Determines the Snowflake role used for this Snowflake job.

A role is an entity that can be assigned privileges on secure objects. You can be assigned one or more roles from a limited selection.

Bindings

All Actions

Defines the values to bind to the variables used in the Snowflake job, in JSON format. For more information about bindings, see the Snowflake documentation.

The following JSON defines two binding variables:

Copy
"1":

   "type": "FIXED"
   "value": "123" 

"2":

   "type": "TEXT"
   "value": "String" 
}
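
These bound values typically replace question mark (?) placeholders in the SQL statement, in positional order. For example (a sketch; the table and column names are illustrative):

"Snowflake SQL Statement": "INSERT INTO Table1 (ID, Name) VALUES (?, ?)"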

Warehouse

All Actions

Determines the warehouse used in the Snowflake job.

A warehouse is a cluster of virtual machines that processes a Snowflake job.

Show Output

All Actions

Determines whether to show a full JSON response in the log output.

Valid Values:

  • checked

  • unchecked

Default: unchecked

Status Polling Frequency

All Actions

Determines the number of seconds to wait before checking the status of the job.

Default: 20