Data Processing and Analytics Jobs

The following topics describe job types for data processing and analytics platforms and services:

Job:AWS Athena

AWS Athena enables you to process, analyze, and store your data in the cloud.

To deploy and run an AWS Athena job, ensure that you have installed the AWS Athena plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for Amazon Athena.

The following example shows how to define an AWS Athena job. This JSON-based job executes a SQL-based query:

"AWS Athena_Job_2":
{
"Type": "Job:AWS Athena",
"ConnectionProfile": "AWSATHENA",
"Athena Client Request Token": "aws-athena-client-request-token-%%ORDERID-%%TIME",
"DB Catalog Name": "DB_Catalog_Athena",
"Database Name": "DB_Athena",
"Action": "Query",
"Query": "Select * from Athena_Table",
"Output Location": "s3://{BucketPath}",
"Workgroup": "Primary",
"Add Configurations": "checked",
"S3 ACL Option": "BUCKET_OWNER_FULL_CONTROL",
"Encryption Options": "SSE_KMS",
"KMS Key": "arn:aws:kms:us-west-2:123456789012:key/abcd1234-5678-9012-efgh-ijklmnopqrst",
"Bucket Owner": "Account_ID",
"Show JSON Output": "unchecked",
"Status Polling Frequency": "10",
"Tolerance": "2"
}

The following table describes the AWS Athena job parameters.

Parameter

Description

Connection Profile

Defines the ConnectionProfile:AWS Athena name that connects Control-M to AWS Athena.

Athena Client Request Token

Defines a unique ID (idempotency token), which guarantees that the job executes only once.

Default: aws-athena-client-request-token-%%ORDERID-%%TIME

DB Catalog Name

Defines the name of the group of databases (catalog) that the query references.

Database Name

Defines the name of the database that the query references.

Action

Determines which of the following queries executes:

  • Query: Executes the query that you enter in the Query attribute.

  • Run Prepared Query: Executes a predefined query that is stored in the Amazon Athena platform.

  • Query and Create Table: Executes the query that you enter in the Query attribute and saves the results to a new table.

  • Unload: Executes the query that you enter in the Query attribute and saves the results to a file in an Amazon S3 bucket.

Query

Defines the SQL-based query that executes.

Prepared Query Name

Defines the name of the predefined query that is stored in the Amazon Athena platform.

Table Name

Defines the name of the table that is created, which is populated by the results of a query in Amazon Athena.

Unload File Type

Determines one of the following file formats of the query results:

  • JSON

  • CSV

  • ORC

  • Parquet

  • Avro

  • Text File

Output Location

Defines the AWS S3 bucket path where the file is saved, in the following format:

s3://<path>

Amazon Athena automatically generates a filename that incorporates the Query Execution ID, which is a unique ID applied to each query that is executed.

Workgroup

Defines the workgroup for this job.

Workgroups can consist of users, teams, applications, or workloads, and they can set limits on the data that each query or group processes.

Add Configurations

Determines whether to add additional job definitions.

Valid Values:

  • checked

  • unchecked

Default: unchecked

S3 ACL Option

Defines the Amazon S3 canned access control list (ACL), which is a predefined set of grantees and permissions assigned to your stored query results.

BUCKET_OWNER_FULL_CONTROL is the only canned ACL that is currently supported in Amazon Athena. This setting gives you and the bucket owner full control of the query results.

Encryption Options

Determines one of the following ways to encrypt the query results:

  • SSE_S3: Encrypts the data in Amazon S3 with Server-Side Encryption (SSE) and Amazon S3-managed encryption keys.

  • SSE_KMS: Encrypts the data in Amazon S3 with SSE and the AWS Key Management Service (KMS), which enables you to manage the encryption keys.

  • CSE_KMS: Encrypts the data in Amazon S3 with Client-Side Encryption (CSE) and AWS KMS, which enables you to provide your own encryption keys.

KMS Key

(SSE_KMS and CSE_KMS only) Defines the Amazon Resource Name (ARN) of the KMS key.

An ARN is a standardized AWS resource address.

arn:aws:kms:us-west-2:123456789012:key/abcd1234-5678-9012-efgh-ijklmnopqrst

Bucket Owner

Defines the AWS account ID of the Amazon S3 bucket owner.

Show JSON Output

Determines whether to show the full JSON API response in the job output.

Valid Values:

  • checked

  • unchecked

Default: unchecked

Status Polling Frequency

Determines the number of seconds to wait before checking the job status.

Default: 10

Tolerance

Determines the number of times to check the job status before the job ends Not OK.

Default: 2
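
The example above shows only the Query action. The following is a minimal sketch of an Unload action, assembled from the parameters in this table rather than taken from a product sample; the query, bucket path, and workgroup are placeholder values:

"AWS Athena_Unload_Job":
{
"Type": "Job:AWS Athena",
"ConnectionProfile": "AWSATHENA",
"Athena Client Request Token": "aws-athena-client-request-token-%%ORDERID-%%TIME",
"DB Catalog Name": "DB_Catalog_Athena",
"Database Name": "DB_Athena",
"Action": "Unload",
"Query": "Select * from Athena_Table",
"Unload File Type": "CSV",
"Output Location": "s3://{BucketPath}",
"Workgroup": "Primary",
"Show JSON Output": "unchecked",
"Status Polling Frequency": "10",
"Tolerance": "2"
}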

Job:AWS Data Pipeline

AWS Data Pipeline is a cloud-based extract, transform, load (ETL) service that enables you to automate the transfer, processing, and storage of your data.

To deploy and run an AWS Data Pipeline job, ensure that you have installed the AWS Data Pipeline plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for AWS Data Pipeline.

The following examples show how to define an AWS Data Pipeline job.

  • This JSON-based job creates a pipeline:

    "AWS Data Pipeline_Job":
    {
    "Type": "Job:AWS Data Pipeline",
    "ConnectionProfile": "AWSDATAPIPELINE",
    "Action": "Create Pipeline",
    "Pipeline Name": "demo-pipeline",
    "Pipeline Unique Id": "235136145",
    "Parameters":
    {
    "parameterObjects": [
    {
    "attributes": [
    {
    "key": "description",
    "stringValue": "S3outputfolder"
    } ],
    "id": "myS3OutputLoc"
    } ],
    "parameterValues": [
    {
    "id": "myShellCmd",
    "stringValue": "grep -rc \"GET\" ${INPUT1_STAGING_DIR}/* > ${OUTPUT1_STAGING_DIR}/output.txt"
    } ],
    "pipelineObjects": [
    {
    "fields": [
    {
    "key":"input",
    "refValue":"S3InputLocation"
    },
    {
    "key":"stage",
    "stringValue":"true"
    } ],
    "id": "ShellCommandActivityObj",
    "name": "ShellCommandActivityObj"
    } ]
    },
    "Trigger Created Pipeline": "checked",
    "Status Polling Frequency": "20",
    "Failure Tolerance": "3"
    }
  • This JSON-based job triggers an existing pipeline:

    "AWS Data Pipeline_Job":
    {
    "Type": "Job:AWS Data Pipeline",
    "ConnectionProfile": "AWSDATAPIPELINE",
    "Action": "Trigger Pipeline",
    "Pipeline ID": "df-020488024DNBVFN1S2U",
    "Trigger Created Pipeline": "unchecked",
    "Status Polling Frequency": "20",
    "Failure Tolerance": "3"
    }

The following table describes the AWS Data Pipeline job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:AWS Data Pipeline name that connects Control-M to AWS Data Pipeline.

Action

Determines one of the following AWS Data Pipeline actions:

  • Trigger Pipeline: Runs an existing AWS Data Pipeline.

  • Create Pipeline: Creates a new AWS Data Pipeline.

Pipeline Name

(Create Pipeline) Defines the name of the new AWS Data Pipeline.

Pipeline Unique ID

(Create Pipeline) Defines the unique ID (idempotency key) that guarantees the pipeline is created only once. After successful execution, this ID cannot be reused.

Valid Values: Any alphanumeric characters.

Parameters

(Create Pipeline) Defines the parameter objects, which define the variables, for your AWS Data Pipeline in JSON format.

For more information about the available parameter objects, see the descriptions of the PutPipelineDefinition and GetPipelineDefinition actions in the AWS Data Pipeline API Reference.

Trigger Created Pipeline

(Create Pipeline) Determines whether to run (trigger) the newly created AWS Data Pipeline.

Valid Values:

  • checked

  • unchecked

This parameter is relevant only for a creation action. For a trigger action, set it to unchecked.

Pipeline ID

(Trigger Pipeline) Determines which pipeline to run (trigger).

Status Polling Frequency

Determines the number of seconds to wait before checking the job status.

Default: 20

Failure Tolerance

Determines the number of times to check the job status before the job ends Not OK.

Default: 2

Job:AWS DynamoDB

AWS DynamoDB is a NoSQL database service that enables you to create database tables, execute statements and transactions, and export and import data to and from the Amazon S3 storage service.

To deploy and run an AWS DynamoDB job, ensure that you have installed the AWS DynamoDB plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for Amazon DynamoDB.

The following examples show how to define an AWS DynamoDB job.

  • This JSON-based job executes a statement:

    "AWS DynamoDB_Execute_Statement": 
    {
    "Type": "Job:AWS DynamoDB",
    "ConnectionProfile": "ADY",
    "Action": "Execute Statement",
    "Run Statement with Parameter": "checked",
    "Statement": "Select * From IFteam where Id=? OR Name=?",
    "Statement Parameters": "[{\"N\": \"20\"},{\"S\":\"Stas30\"}]"
    }
  • This JSON-based job executes a transaction:

    "AWS DynamoDB_Transaction": 
    {
    "Type": "Job:AWS DynamoDB",
    "ConnectionProfile": "ADY",
    "Action": "Execute Transaction",
    "Transaction Statments": "[%4E{%4E \"Parameters\": [{\"N\": \"20\"},{\"S\":\"Stas30\"}],%4E\"Statement\": \"Select * From IFteam where Id=17\"%4E },%4E{%4E \"Parameters\": [{\"N\": \"20\"},{\"S\":\"Stas30\"}],%4E\"Statement\": \"Select * From IFteam where Id=18\"%4E]",
    "eventsToWaitFor":
    {
    "Type": "WaitForEvents",
    "Events": [
    {
    "Event": "AWS_DynamoDB_Execute_Statement-TO-AWS_DynamoDB_Transaction"
    }]
    }
    }
  • This JSON-based job exports a table to S3:

    "AWS DynamoDB_Export": 
    {
    "Type": "Job:AWS DynamoDB",
    "ConnectionProfile": "ADY",
    "Action": "Export Table To S3",
    "Idempotency Token": "5364@#gert423",
    "Export Format": "DynamoDB JSON",
    "S3 Bucket Name": "stasbucket1",
    "S3 Path Prefix": "TestDynmoExport",
    "S3 Bucket Owner ID": "122343283363",
    "Table ARN": "arn:aws:dynamodb:us-east-1:122343283363:table/IFteam",
    "eventsToWaitFor":
    {
    "Type": "WaitForEvents",
    "Events": [
    {
    "Event": "AWS_DynamoDB_Transaction-TO-AWS_DynamoDB_Export"
    }]
    }
    }
  • This JSON-based job imports a table from S3:

    "AWS DynamoDB_Import":
    {
    "Type": "Job:AWS DynamoDB",
    "ConnectionProfile": "ADY",
    "Action": "Import Table from S3",
    "Idempotency Token": "5364@#gert423",
    "Import Format": "DynamoDB JSON",
    "S3 Bucket Name": "stasbucket1",
    "S3 Path Prefix": "AWSDynamoDB/01690368915115be3974ee/data/vejljoqgiqyexew2cxgetylg6u.json.gz",
    "S3 Bucket Owner ID": "122343283363",
    "Table Creation Parameters": "\"AttributeDefinitions\": [%4E {%4E\"AttributeName\": \"Id\",%4E\"AttributeType\": \"N\"%4E}%4E ],%4E\"KeySchema\": [%4E{%4E\"AttributeName\": \"Id\",%4E\"KeyType\": \"HASH\"%4E}%4E],%4E \"BillingMode\": \"PROVISIONED\",%4E\"ProvisionedThroughput\": {%4E\"ReadCapacityUnits\": 1,%4E \"WriteCapacityUnits\": 1%4E}",
    "Table Name": "NewTAB",
    "eventsToWaitFor":
    {
    "Type": "WaitForEvents",
    "Events": [
    {
    "Event": "AWS_DynamoDB_Export-TO-AWS_DynamoDB_Import"
    }]
    }
    }

The following table describes the AWS DynamoDB job type parameters.

Parameters

Action

Description

ConnectionProfile

All Actions

Defines the ConnectionProfile:AWS DynamoDB name that connects Control-M to AWS DynamoDB.

Action

All Actions

Determines one of the following Amazon DynamoDB actions to perform:

  • Execute Statement

  • Execute Transaction

  • Export Job to S3 Bucket

  • Import Job from S3 Bucket

Run Statement with Parameter

Execute Statement

Determines whether to execute the statement with your own JSON parameters.

Valid Values:

  • checked

  • unchecked

Default: unchecked

Statement

Execute Statement

Defines one or more PartiQL statements that are supported by Amazon DynamoDB.

Statement Parameters

Execute Statement

Defines the job parameters that enable you to control how the job runs, as shown in the following example:

[{\"N\": \"20\"},{\"S\":\"Stas30\"}]

Transaction Statements

Execute Transaction

Defines one or more PartiQL transaction statements, as shown in the following example:

[%4E{%4E \"Parameters\": [{\"N\": \"20\"},{\"S\":\"Stas30\"}],%4E\"Statement\": \"Select * From IFteam where Id=17\"%4E },%4E{%4E \"Parameters\": [{\"N\": \"20\"},{\"S\":\"Stas30\"}],%4E\"Statement\": \"Select * From IFteam where Id=18\"%4E]

Idempotency Token

  • Export Job to S3 Bucket

  • Import Job from S3 Bucket

Defines the unique ID (idempotency token) that guarantees the job runs only once.

After it runs successfully, this ID cannot be reused.

Export Format

Export Job to S3 Bucket

Determines one of the following formats to export data:

  • DYNAMODB JSON

  • ION

Import Format

Import Job from S3 Bucket

Determines one of the following formats of the source data:

  • CSV

  • DYNAMODB JSON

  • ION

S3 Bucket Name

  • Export Job to S3 Bucket

  • Import Job from S3 Bucket

Defines the name of the Amazon S3 bucket that the table is exported to or imported from.

S3 Path Prefix

  • Export Job to S3 Bucket

  • Import Job from S3 Bucket

Defines the Amazon S3 bucket prefix to use as the filename and path of the table.

AWSDynamoDB/01654668915125-be3574ee/data/vejljoqgiqyexew2cxgetylg6u.json.gz

S3 Bucket Owner ID

  • Export Job to S3 Bucket

  • Import Job from S3 Bucket

Defines the ID of the AWS account that owns the bucket.

Table ARN

  • Export Job to S3 Bucket

  • Import Job from S3 Bucket

Defines the Amazon Resource Name (ARN) associated with the table to export.

Import Compression Type

Import Job from S3 Bucket

Determines one of the following compression types to compress the data from the imported table:

  • GZIP

  • ZSTD

  • No Compression

Table Creation Parameters

Import Job from S3 Bucket

Defines the creation parameters of the new table where the data is imported, as shown in the following example:

"Attribute Definitions": [
{
"AttributeName": "Id".
"AttributeType": "N"
}]
"KeySchema": [
{
"AttributeName": "Id".
"KeyType": "HASH"
}]
"BillingMode": "PROVISIONED",
"ProvisionedThroughput":
{
"RealCapacityUnits": 1,
"WriteCapacityUnits": 1
}

Table Name

Import Job from S3 Bucket

Defines the name of the new table where the data is imported.

Status Polling Frequency

All Actions

Determines the number of seconds to wait before checking the job status.

Default: 20

Failure Tolerance

  • Export Job to S3 Bucket

  • Import Job from S3 Bucket

Determines the number of times to check the job status before the job ends Not OK.

Default: 0

Job:AWS EMR

Amazon Web Services (AWS) EMR is a managed cluster platform that enables you to execute big data frameworks, such as Apache Hadoop and Apache Spark, to process and analyze vast amounts of data.

To deploy and run an AWS EMR job, ensure that you have installed the AWS EMR plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for Amazon EMR.

The following example shows how to define an AWS EMR job:

"AWS EMR_Job_2":
{
"Type": "Job:AWS EMR",
"ConnectionProfile": "AWS_EMR",
"Cluster ID": "j-21PO60WBW77GX",
"Notebook ID": "e-DJJ0HFJKU71I9DWX8GJAOH734",
"Relative Path": "ShowWaitingAndRunningClusters.ipynb",
"Notebook Execution Name": "TestExec",
"Service Role": "EMR_Notebooks_DefaultRole",
"Use Advanced JSON Format": "unchecked",
}

The following table describes the AWS EMR job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:AWS EMR name that connects Control-M to AWS EMR.

Cluster ID

Defines the ID of the Amazon EMR cluster to connect to the Notebook.

In the EMR API, this field is called the Execution Engine ID.

Notebook ID

Defines the ID of the Notebook that executes the script.

In the EMR API, this field is called the Editor ID.

Relative Path

Defines the full directory path and filename of the script in the Notebook.

Notebook Execution Name

Defines the job execution name.

Service Role

Defines the service role that connects to the Notebook.

Use Advanced JSON Format

Determines whether to provide Notebook execution information through JSON code.

Valid Values:

  • checked

  • unchecked

Default: unchecked

If you set this parameter to checked, the JSON Body parameter replaces several other parameters discussed above (Cluster ID, Notebook ID, Relative Path, Notebook Execution Name, and Service Role).

JSON Body

Defines Notebook execution settings in JSON format.

For a description of the syntax of this JSON, see the description of StartNotebookExecution in the Amazon EMR API Reference.

JSON Body is relevant only if you set Use Advanced JSON Format to checked.

"EditorId": "e-DJJ0HFJKU71I9DWX8GJAOH734",
"RelativePath": "ShowWaitingAndRunningClustersTest2.ipynb",
"NotebookExecutionName":"Tests",
"ExecutionEngine":
{
"Id": "j-AR2G6DPQSGUB"
},
"ServiceRole": "EMR_Notebooks_DefaultRole"

Job:AWS Redshift

AWS Redshift is a cloud data warehouse service that handles large-scale data analytics.

To deploy and run an AWS Redshift job, ensure that you have installed the AWS Redshift plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for Amazon Redshift.

The following example shows how to define an AWS Redshift job to run all Redshift commands:

"AWS Redshift_Job":
{
"Type": "Job:AWS Redshift",
"ConnectionProfile": "REDSHIFT_CCP",
"Load Redshift SQL Statement" : "select * from Redshift_table",
"Actions" : "Redshift SQL Statement",
"Workgroup Name" : "Workgroup_Name",
"Secret Manager ARN" : "Secret_Manager_ARN",
"Database" : "Database_Redshift",
"Status Polling Frequency": "10",
"Failure Tolerance": "2"
}

The following table describes the AWS Redshift job parameters.

Parameter

Action

Description

ConnectionProfile

All

Defines the ConnectionProfile:AWS Redshift name that connects Control-M to AWS Redshift.

Load Redshift SQL Statement

  • Redshift SQL Statement

  • Unload Data Into S3

Defines a query generated in the database as an .sql file.

Actions

All

Determines one of the following Amazon Redshift actions to perform:

  • Redshift SQL Statement: Enables you to run all Redshift commands.

  • Unload Data Into S3: Enables you to move data from Redshift to an S3 bucket in CSV or JSON format.

  • Copy Data Into Redshift: Enables you to copy CSV files from an S3 bucket to a Redshift table.

  • Run Procedure: Enables you to run a stored procedure.

Workgroup Name

All

Defines the workgroup for this job.

Workgroups can consist of users, teams, applications, or workloads, and they can set limits for the data that each query or group processes.

Secret Manager ARN

All

Defines the Amazon Resource Name (ARN) associated with the AWS Secrets Manager, which securely stores and manages the database credentials.

Database

All

Defines the database in Amazon Redshift.

Show Statement Results

Redshift SQL Statement

Determines whether to display the statement results.

S3 Bucket URI

  • Unload Data Into S3

  • Copy Data Into Redshift

Defines the full URI of the S3 bucket that contains the extracted query results.

File Format

Unload Data Into S3

Determines one of the following file formats of the file placed in the S3 bucket:

  • csv

  • json

Use IAM Role for S3 Access

  • Unload Data Into S3

  • Copy Data Into Redshift

Determines whether to use an IAM Role to access the S3 bucket.

IAM Role ARN

  • Unload Data Into S3

  • Copy Data Into Redshift

Defines the Amazon Resource Name (ARN) of the AWS IAM Role.

An ARN is a standardized AWS resource address.

The AWS IAM role must be granted read and write privileges to create or update any of the AWS resources that are in the stack.

arn:aws:iam::12345678910:role/AWS-QuickSetup-StackSet-Local-AdministrationRole

Table Name

Copy Data Into Redshift

Defines the name of the new table where the data is imported.

Procedure Name

Run Procedure

Defines the name of an existing procedure in Amazon Redshift.

Procedure Arguments

Run Procedure

Defines the arguments for the procedure that you run.

If you do not add an argument, type ().

Status Polling Frequency

All

(Optional) Determines the number of seconds to wait before checking the job status.

Default: 10

Failure Tolerance

All

Determines the number of times to check the job status before the job ends Not OK.

Default: 2
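
The example above runs a Redshift SQL statement. The following is a minimal sketch of an Unload Data Into S3 job, assembled from the parameters in this table rather than taken from a product sample; the SQL statement, S3 URI, and IAM role ARN are placeholder values, and the checked/unchecked convention for Use IAM Role for S3 Access is assumed from the other parameters in this plug-in:

"AWS Redshift_Unload_Job":
{
"Type": "Job:AWS Redshift",
"ConnectionProfile": "REDSHIFT_CCP",
"Actions" : "Unload Data Into S3",
"Load Redshift SQL Statement" : "select * from Redshift_table",
"Workgroup Name" : "Workgroup_Name",
"Secret Manager ARN" : "Secret_Manager_ARN",
"Database" : "Database_Redshift",
"S3 Bucket URI" : "s3://{BucketPath}",
"File Format" : "csv",
"Use IAM Role for S3 Access" : "checked",
"IAM Role ARN" : "arn:aws:iam::12345678910:role/AWS-QuickSetup-StackSet-Local-AdministrationRole",
"Status Polling Frequency": "10",
"Failure Tolerance": "2"
}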

Job:Azure Databricks

Azure Databricks is a cloud-based data analytics platform that enables you to process and analyze large workloads of data.

To deploy and run an Azure Databricks job, ensure that you have installed the Azure Databricks plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for Azure Databricks.

The following example shows how to define an Azure Databricks job:

"Azure Databricks notebook":
{
"Type": "Job:Azure Databricks",
"ConnectionProfile": "AZURE_DATABRICKS",
"Databricks Job ID: "65",
"Parameters": "\"notebook_params\":{\"param1\":\"val1\", \"param2\":\"val2\"}",
"Idempotency Token": "Control-M-Idem_%%ORDERID",
"Status Polling Frequency": "30",
"Failure Tolerance": "1"
}

The following table describes the Azure Databricks job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Azure Databricks name that connects Control-M to Azure Databricks.

Databricks Job ID

Defines the job ID created in a Databricks workspace.

Parameters

Defines task parameters to override when the job runs, according to the Databricks convention. The list of parameters must begin with the name of the parameter type.

"notebook_params":<"param1":"val1", "param2":"val2">

"jar_params": ["param1", "param2"]

For more information about the parameter types, review the properties of RunParameters in the OpenAPI specification provided through the Azure Databricks documentation.

For no parameters, specify a value of "params": {}.

"Parameters": "params": {}

Idempotency Token

(Optional) Defines a token to use to rerun job runs that timed out in Databricks.

Valid Values:

  • Control-M-Idem_%%ORDERID: With this token, upon rerun, Control-M invokes the monitoring of the existing job run in Databricks.

  • Any other value: Replaces the Control-M idempotency token. When you rerun a job using a different token, Databricks creates a new job run with a new unique run ID.

Default: Control-M-Idem_%%ORDERID

Status Polling Frequency

(Optional) Determines the number of seconds to wait before checking the job status.

Default: 30

Failure Tolerance

Determines the number of times to check the job status before the job ends Not OK.

Default: 1

Job:Azure HDInsight

Azure HDInsight enables you to execute an Apache Spark batch job and perform big data analytics.

To deploy and run an Azure HDInsight job, ensure that you have installed the Azure HDInsight plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for Azure HDInsight.

The following example shows how to define an Azure HDInsight job:

"Azure HDInsight_Job": 
{
"Type": "Job:Azure HDInsight",
"ConnectionProfile": "AZUREHDINSIGHT",
"Parameters": "
{
"file" : "wasb://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/sample.jar",
"args" : ["arg0", "arg1"],
"className" : "com.sample.Job1",
"driverMemory" : "1G",
"driverCores" : 2,
"executorMemory" : "1G",
"executorCores" : 10,
"numExecutors" : 10
},
"Status Polling Interval": "10",
"Bring job logs to output": "checked"
}

The following table describes the Azure HDInsight job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Azure HDInsight name that connects Control-M to Azure HDInsight.

Parameters

Determines which parameters are passed to the Apache Spark Application when the job runs, in JSON format (name:value pairs).

This JSON must include the file and className elements.

For more information about common parameters, see Batch Job in the Azure HDInsight documentation.

Status Polling Interval

Determines the number of seconds to wait before checking the job status.

Default: 10

Bring job logs to output

Determines whether logs from Apache Spark appear in the job output.

Valid Values:

  • checked

  • unchecked

Default: unchecked

Job:Azure Synapse

Azure Synapse Analytics enables you to perform data integration and big data analytics.

To deploy and run an Azure Synapse job, ensure that you have installed the Azure Synapse plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for Azure Synapse.

The following example shows how to define an Azure Synapse job:

"Azure Synapse_Job": 
{
"Type": "Job:Azure Synapse",
"ConnectionProfile": "AZURE_SYNAPSE",
"Pipeline Name": "ncu_synapse_pipeline",
"Parameters": "{\"periodinseconds\":\"40\", \"param2\":\"val2\"}",
"Status Polling Interval": "20"
}

The following table describes the Azure Synapse job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Azure Synapse name that connects Control-M to Azure Synapse.

Pipeline Name

Defines the name of a pipeline that you defined in your Azure Synapse workspace.

Parameters

Defines pipeline parameters to override when the job runs, defined in JSON format as pairs of name and value, as follows.

{\"param1\":\"val1\", \"param2\":\"val2\"}

For no parameters, specify {}.

Status Polling Interval

(Optional) Determines the number of seconds to wait before checking the job status.

Default: 20

Job:Databricks

Databricks enables you to integrate jobs created in the Databricks environment with your existing Control-M workflows.

To deploy and run a Databricks job, ensure that you have installed the Databricks plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for Databricks.

The following example shows how to define a Databricks job:

"Databricks_Job":
{
"Type": "Job:Databricks",
"ConnectionProfile": "DATABRICKS",
"Databricks Job ID": "91",
"Parameters": "\"notebook_params\":{\"param1\":\"val1\", \"param2\":\"val2\"}",
"Idempotency Token": "Control-M-Idem_%%ORDERID",
"Status Polling Frequency": "30",
"Failure Tolerance": "2"
}

The following table describes the Databricks job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Databricks name that connects Control-M to Databricks.

Databricks Job ID

Defines the job ID created in a Databricks workspace.

Parameters

Defines task parameters to override when the job runs, according to the Databricks convention. The list of parameters must begin with the name of the parameter type.

"notebook_params":{"param1":"val1", "param2":"val2"}  

"jar_params": ["param1", "param2"]

For more information about the parameter types, review RunParameters properties in the OpenAPI specification provided through the Azure Databricks documentation.

For no parameters, specify a value of "params": {}.

"Parameters": "params": {}

Idempotency Token

(Optional) Defines a token to use to rerun job runs that timed out in Databricks.

Valid Values:

  • Control-M-Idem_%%ORDERID: With this token, upon rerun, Control-M invokes the monitoring of the existing job run in Databricks.

  • <Any Other Value>:  Replaces the Control-M idempotency token. When you rerun a job using a different token, Databricks creates a new job run with a new unique run ID.

Default: Control-M-Idem_%%ORDERID

Status Polling Frequency

(Optional) Determines the number of seconds to wait before checking the job status.

Default: 30

Failure Tolerance

Determines the number of times to check the job status before the job ends Not OK.

Default: 2

Job:DBT

Data Build Tool (dbt) is a cloud-based computing platform that enables you to develop, test, schedule, document, and analyze data models.

To deploy and run a dbt job, ensure that you have installed the dbt plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for dbt.

The following example shows how to define a dbt job:

"DBT_Job_2":
{
"Type": "Job:DBT",
"ConnectionProfile": "DBT_CP",
"DBT Job Id": "12345",
"Run Comment": "A DBT job",
"Override Job Commands": "checked",
"Variables": [
{
"UCM-DefineCommands-N001-element": "dbt test"
},
{
"UCM-DefineCommands-N002-element": "dbt run"
} ],
"Status Polling Frequency": "10",
"Failure Tolerance": "2"
}

The following table describes the dbt job parameters.

Parameter

Description

Connection Profile

Defines the ConnectionProfile:DBT name that connects Control-M to dbt.

DBT Job ID

Defines the ID of the preexisting job in the dbt platform that you want to run.

Run Comment

Defines a free-text description of the job.

Override Job Commands

Determines whether to override the predefined dbt job commands.

Valid Values:

  • checked

  • unchecked

Default: unchecked

Variables

Defines the new dbt job commands as variable pairs, as follows:

"UCM-DefineCommands-Nnnn-element": "command string"

where nnn is a counter for the sequential position of each command.

Status Polling Frequency

Determines the number of seconds to wait before checking the job status.

Default: 10

Failure Tolerance

Determines the number of times to check the job status before the job ends Not OK.

Default: 2

Job:GCP BigQuery

Google Cloud Platform (GCP) BigQuery is a cloud-computing platform that enables you to process, analyze, and store your data.

To deploy and run a GCP BigQuery job, ensure that you have installed the GCP BigQuery plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for GCP BigQuery.

The following example shows how to define a GCP BigQuery job for a Query action in GCP BigQuery:

"GCP BigQuery_query":
{
"Type": "Job:GCP BigQuery",
"ConnectionProfile": "BIGQSA",
"Action": "Query",
"Project Name": "proj",
"Dataset Name": "Test",
"Run Select Query and Copy to Table": "checked",
"Table Name": "IFTEAM",
"SQL Statement": "select user from IFTEAM2",
"Query Parameters":
{
"name": "IFteam",
"paramterType":
{
"type": "STRING"
},
"parameterValue":
{
"value": "BMC"
}
},
"Job Timeout": "30000",
"Connection Timeout": "10",
"Status Polling Frequency": "5"
}

The following table describes the GCP BigQuery job parameters.

Parameter

Action

Description

ConnectionProfile

All Actions

Defines the ConnectionProfile:GCP BigQuery name that connects Control-M to GCP BigQuery.

Action

 

Determines one of the following GCP BigQuery actions to perform:

  • Query:  Runs one or more SQL statements that are supported by GCP BigQuery.

  • Copy: Creates a copy of an existing table.

  • Load: Loads source data into an existing table.

  • Extract: Exports data from an existing table into Google Cloud Storage.

  • Routine: Runs a stored procedure, table function, or previously defined function.

Project Name

All Actions

Defines the name of the predefined Google Cloud project with configured APIs, authentication information, billing details, and job resources.

Dataset Name

  • Query

  • Extract

  • Routine

Determines the database that the job uses.

Run Select Query and Copy to Table

Query

(Optional) Determines whether to copy the results of a SELECT statement into a new table.

Table Name

  • Query

  • Extract

Defines the new table name.

SQL Statement

Query

Defines one or more  SQL statements supported by GCP BigQuery.

Rule: It must be written in a single line, with character strings separated by one space only.

Query Parameters

Query

Defines the query parameters, which enable you to control the presentation of the data.

"name": "IFteam",
"paramterType":
{
"type": "STRING"
},
"parameterValue":
{
"value": "BMC"
}

Copy Operation Type

Copy

Determines one of the following copy operations:

  • Clone: Creates a copy of a base table that has write access.

  • Snapshot: Creates a read-only copy of a base table.

  • Copy: Creates a copy of a snapshot.

  • Restore:  Creates a writable table from a snapshot.

Source Table Properties

Copy

Defines the properties of the table that is cloned, backed up, or copied, in JSON format.

You can copy or back up one or more tables at a time.

   {
"datasetId": "Test1",
"projectId": "SomeProj1",
"tableId": "IFteam1"
}
{
"datasetId": "Test2",
"projectId": "SomepProj2",
"tableId": "IFteam2"
}

Destination Table Properties

  • Copy

  • Load

Defines the properties of a new table, in JSON format.

{ 
"datasetId": "Test3",
"projectId": "SomeProj3",
"tableId": "IFteam3"
}

Destination/Source Bucket URIs

  • Load

  • Extract

Defines the source or destination data URI for the table that you are loading or extracting.

You can load or extract multiple tables.

Rule: Separate elements with ,.

"gs://source1_site1/source1.json"

Show Load Options

Load

Determines whether to add more fields to a table that you are loading.

Load Options

Load

Defines additional fields for the table that you are loading.

"schema": 
{
"fields": [
{
"name": "name1",
"type": "STRING1"
},
{
"name": "name2",
"type": "STRING2"
},
{
"name": "name3",
"type": "STRING3"
} ]
}

Extract As

Extract

Determines one of the following file formats to export the data:

  • CSV

  • JSON

Routine

Routine

Defines a routine and the values that it must run.

Call new_r('value1')

Job Timeout

All Actions

Determines the maximum number of milliseconds to run the GCP BigQuery job.

Default: 30,000 milliseconds (30 seconds)

Connection Timeout

All Actions

Determines the number of seconds to wait after Control-M initiates a connection request before a timeout occurs.

Default: 10

Status Polling Frequency

All Actions

Determines the number of seconds to wait before checking the job status.

Default: 5
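
The example above shows the Query action. The following is a minimal sketch of an Extract action, assembled from the parameters in this table rather than taken from a product sample; the project, dataset, table, and bucket URI are placeholder values:

"GCP BigQuery_extract":
{
"Type": "Job:GCP BigQuery",
"ConnectionProfile": "BIGQSA",
"Action": "Extract",
"Project Name": "proj",
"Dataset Name": "Test",
"Table Name": "IFTEAM",
"Destination/Source Bucket URIs": "gs://source1_site1/source1.json",
"Extract As": "JSON",
"Job Timeout": "30000",
"Connection Timeout": "10",
"Status Polling Frequency": "5"
}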

Job:GCP DataFlow

Google Cloud Platform (GCP) Dataflow enables you to perform cloud-based data processing for batch and real-time data streaming applications.

To deploy and run a GCP Dataflow job, ensure that you have installed the GCP Dataflow plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for GCP Dataflow.

The following example shows how to define a GCP Dataflow job:

"Google DataFlow_Job_1":
{
"Type": "Job:GCP DataFlow",
"ConnectionProfile": "GCPDATAFLOW",
"Project ID": "applied-lattice-11111",
"Location": "us-central1",
"Template Type": "Classic Template",
"Template Location (gs://)": "gs://dataflow-templates-us-central1/latest/Word_Count",
"Parameters (JSON Format)":
{
"jobName": "wordcount",
"parameters":
{
"inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
"output": "gs://controlmbucket/counts"
}
},
"Verification Poll Interval (in seconds)": "10",
"output Level": "INFO",
"Host": "host1"
}

The following table describes the GCP Dataflow job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:GCP DataFlow name that connects Control-M to GCP DataFlow.

Project ID

Defines the identifier of the GCP project where the job runs.

A project is a set of configuration settings that define the resources the jobs utilize and how they interact with GCP.

Location

Determines the region where the job runs.

us-central1

Template Type

Defines one of the following types of GCP Dataflow templates:

  • Classic Template: Developers run the pipeline and create a template. The Apache Beam SDK stages files in Cloud Storage, creates a template file (similar to a job request), and saves the template file in Cloud Storage.

  • Flex Template: Developers package the pipeline into a Docker image and then use the Google Cloud CLI to build and save the Flex Template spec file in Cloud Storage.

Template Location (gs://)

Defines the path for temporary files. This must be a valid Google Cloud Storage URL that begins with gs://.

The default pipeline option tempLocation is used if it has been set in the GCP Dataflow service.

Parameters (JSON Format)

Defines input parameters to be passed on to job execution, in JSON format (name:value pairs).

This JSON must include the jobName and parameters elements.

Verification Poll Interval (in seconds)

(Optional) Determines the number of seconds to wait before checking the job status.

Default: 10

Output Level

Determines one of the following levels of details to retrieve from the GCP logs in the case of job failure:

  • TRACE

  • DEBUG

  • INFO

  • WARN

  • ERROR

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.
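
The example above uses a Classic Template. The following is a minimal sketch of a Flex Template job, assuming that the same fields apply and that only the Template Type and the template location (which points to the Flex Template spec file) change; the spec file path is a placeholder:

"Google DataFlow_Flex_Job":
{
"Type": "Job:GCP DataFlow",
"ConnectionProfile": "GCPDATAFLOW",
"Project ID": "applied-lattice-11111",
"Location": "us-central1",
"Template Type": "Flex Template",
"Template Location (gs://)": "gs://controlmbucket/templates/wordcount_flex_spec.json",
"Parameters (JSON Format)":
{
"jobName": "wordcount",
"parameters":
{
"inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
"output": "gs://controlmbucket/counts"
}
},
"Verification Poll Interval (in seconds)": "10",
"Output Level": "INFO",
"Host": "host1"
}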

Job:GCP Dataproc

Google Cloud Platform (GCP) Dataproc enables you to perform cloud-based big data processing and machine learning.

To deploy and run a GCP Dataproc job, ensure that you have installed the GCP Dataproc plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for GCP Dataproc.

The following examples show how to define a GCP Dataproc job.

  • This JSON defines a job for a GCP Dataproc task of type Workflow Template:

    "Google Dataproc_Job":
    {
    "Type": "Job:GCP Dataproc",
    "ConnectionProfile": "GCPDATAPROC",
    "Project ID": "gcp_projectID",
    "Account Region": "us-central1",
    "Dataproc task type": "Workflow Template",
    "Workflow Template": "Template2",
    "Verification Poll Interval (in seconds)": "20",
    "Tolerance": "2"
    }
  • This JSON defines a job for a Dataproc task of type Job:

    "Google Dataproc_Job":
    {
    "Type": "Job:GCP Dataproc",
    "ConnectionProfile": "GCPDATAPROC",
    "Project ID": "gcp_projectID",
    "Account Region": "us-central1",
    "Dataproc task type": "Job",
    "Parameters (JSON Format)":
    {
    "job":
    {
    "placement": {},
    "statusHistory": [],
    "reference":
    {
    "jobId": "job-e241f6be",
    "projectId": "gcp_projectID"
    },
    "labels":
    {
    "goog-dataproc-workflow-instance-id": "44f2b59b-a303-4e57-82e5-e1838019a812",
    "goog-dataproc-workflow-template-id": "template-d0a7c"
    },
    "sparkJob":
    {
    "mainClass": "org.apache.spark.examples.SparkPi",
    "properties": {},
    "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
    "args": ["1000"]
    }
    }
    },
    "Verification Poll Interval (in seconds)": "20",
    "Tolerance": "2"
    }

The following table describes the GCP Dataproc job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:GCP Dataproc name that connects Control-M to GCP Dataproc.

Project ID

Defines the identifier of the GCP project where the job runs.

A project is a set of configuration settings that define the resources the jobs utilize and how they interact with GCP.

Account Region

Defines the Google Compute Engine region to create the job.

Dataproc task type

Defines one of the following Dataproc task types to execute:

  • Workflow Template: A reusable workflow configuration that defines a graph of jobs with information on where to run those jobs.

  • Job: A single Dataproc job.

Workflow Template

(Workflow Template) Defines the ID of a Workflow Template.

Parameters

(Job) Defines input parameters to be passed on to job execution, in JSON format.

You retrieve this JSON content from the GCP Dataproc UI, using the EQUIVALENT REST option in job settings.

Verification Poll Interval

(Optional) Determines the number of seconds to wait before checking the job status.

Default: 20

Tolerance

Determines the number of times to check the job status before the job ends Not OK.

Default: 2

Job:Hadoop

The Hadoop job connects to the Hadoop framework, and it enables the distributed processing of large data sets across clusters of commodity servers. You can expand your enterprise business workflows to include tasks that execute in your Big Data Hadoop cluster from Control-M with the different Hadoop-supported tools, including Pig, Hive, HDFS File Watcher, Map Reduce Jobs, and Sqoop.

To deploy and run Hadoop jobs, ensure that you have done the following:

  • Installed the Application Pack, which includes the Control-M for Hadoop plug-in.

  • Created the appropriate type of Hadoop connection profile, as described in ConnectionProfile:Hadoop.

Various types of Hadoop jobs are available for you to define using the Job:Hadoop objects:

Job:Hadoop:Spark:Python

The following example shows how to use Job:Hadoop:Spark:Python to run a Spark Python program:

"ProcessData":
{
"Type": "Job:Hadoop:Spark:Python",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"SparkScript": "/home/user/processData.py"
}

The following table describes the Hadoop Spark Python job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Spark.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

This JSON defines the Hadoop Spark Python job optional parameters:

"ProcessData1":
{
"Type": "Job:Hadoop:Spark:Python",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"SparkScript": "/home/user/processData.py",
"Arguments": ["1000", "120" ],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [{"put" : "localfile hdfs://nn.example.com/user/hadoop/file"}]
},
"SparkOptions": [
{
"--master": "yarn"
},
{
"--num":"-executors 50"
} ]
}

The following table describes the Hadoop Spark Python job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether to ignore failure in the pre- or post- commands.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK, even when a post-command fails.

Job:Hadoop:Spark:ScalaJava

The following example shows how to use a Hadoop Scala Java job to run a Spark Java or Scala program:

"ProcessData":
{
"Type": "Job:Hadoop:Spark:ScalaJava",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"ProgramJar": "/home/user/ScalaProgram.jar",
"MainClass" : "com.mycomp.sparkScalaProgramName.mainClassName"
}

The following table describes the Hadoop Scala Java job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Spark.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

This JSON defines the Hadoop Scala Java job optional parameters:

"ProcessData1":
{
"Type": "Job:Hadoop:Spark:ScalaJava",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"ProgramJar": "/home/user/ScalaProgram.jar"
"MainClass" : "com.mycomp.sparkScalaProgramName.mainClassName",
"Arguments": ["1000", "120" ],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [{"put" : "localfile hdfs://nn.example.com/user/hadoop/file"}]
},
"SparkOptions": [
{
"--master": "yarn"
},
{
"--num":"-executors 50"
} ]
}

The following table describes the Hadoop Scala Java job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether to ignore failure in the pre- or post- commands.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK, even when a post-command fails.

Job:Hadoop:Pig

The following example shows how to use Hadoop Pig to run a Pig script:

"ProcessDataPig":
{
"Type" : "Job:Hadoop:Pig",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"PigScript" : "/home/user/script.pig"
}

The following table describes the Hadoop Pig job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Pig.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

This JSON defines the Hadoop Pig job optional parameters:

"ProcessDataPig1": 
{
"Type" : "Job:Hadoop:Pig",
"ConnectionProfile": "DEV_CLUSTER",
"PigScript" : "/home/user/script.pig",
"Host" : "edgenode",
"Parameters" : [
{
"amount":"1000"
},
{
"volume":"120"
} ],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [
{
"put" : "localfile hdfs://nn.example.com/user/hadoop/file"
} ]
}
}

The following table describes the Hadoop Pig job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether to ignore failure in the pre- or post- commands.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK, even when a post-command fails.

Job:Hadoop:Sqoop

The following example shows how to define a Hadoop Sqoop job:

"LoadDataSqoop":
{
"Type" : "Job:Hadoop:Sqoop",
"Host" : "edgenode",
"ConnectionProfile" : "SQOOP_CONNECTION_PROFILE",
"SqoopCommand" : "import --table foo --target-dir /dest_dir"
}

The following table describes the Hadoop Sqoop job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Sqoop.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

This JSON defines the Hadoop Sqoop job optional parameters:

"LoadDataSqoop1" :
{
"Type" : "Job:Hadoop:Sqoop",
"Host" : "edgenode",
"ConnectionProfile" : "SQOOP_CONNECTION_PROFILE",
"SqoopCommand" : "import --table foo",
"SqoopOptions" : [
{
"--warehouse-dir":"/shared"
},
{
"--default-character-set":"latin1"
} ],
"SqoopArchives" : "",
"SqoopFiles": "",
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" :[
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [
{
"put" : "localfile hdfs://nn.example.com/user/hadoop/file"
} ]
}
}

The following table describes the Hadoop Sqoop job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether to ignore failure in the pre- or post- commands.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK, even when a post-command fails.

SqoopOptions

Defines the parameters to pass as arguments to the specific Sqoop tool.

SqoopArchives

Determines the location of the Hadoop archives.

SqoopFiles

Determines the location of the Sqoop files.

Job:Hadoop:Hive

The following example shows how to use Hadoop Hive to run a Hive beeline job:

"ProcessHive":
{
"Type" : "Job:Hadoop:Hive",
"Host" : "edgenode",
"ConnectionProfile" : "HIVE_CONNECTION_PROFILE",
"HiveScript" : "/home/user1/hive.script"
}

The following table describes the Hadoop Hive job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Hive.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

This JSON defines the Hadoop Hive job optional parameters:

"ProcessHive1" :
{
"Type" : "Job:Hadoop:Hive",
"Host" : "edgenode",
"ConnectionProfile" : "HIVE_CONNECTION_PROFILE",
"HiveScript" : "/home/user1/hive.script",
"Parameters" : [
{
"ammount": "1000"
},
{
"topic": "food"
} ],
"HiveArchives" : "",
"HiveFiles": "",
"HiveOptions" : [
{
"hive.root.logger": "INFO,console"
} ],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [
{
"put" : "localfile hdfs://nn.example.com/user/hadoop/file"
} ]
}
}

The following table describes the Hadoop Hive job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether to ignore failure in the pre- or post- commands.

PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

PostCommands: Defaults to false, which ends the job OK, even when a post-command fails.

HiveSciptParameters

Defines the additional Hadoop command options to pass to beeline as hivevar "name"="value".

HiveProperties

Defines the additional Hadoop command options to pass to beeline as hiveconf "key"="value".

HiveArchives

Defines the additional Hadoop command options to pass to beeline as hiveconf mapred.cache.archives="value".

HiveFiles

Defines the additional Hadoop command options to pass to beeline as hiveconf mapred.cache.files="value".

Job:Hadoop:DistCp

The Hadoop Distributed Copy (DistCp) job is used for large inter/intra-cluster copying.

The following example shows how to define a Hadoop DistCp job:

"DistCpJob" :
{
"Type" : "Job:Hadoop:DistCp",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"TargetPath" : "hdfs://nns2:8020/foo/bar",
"SourcePaths" : ["hdfs://nn1:8020/foo/a"]
}

The following table describes the Hadoop DistCp job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Distributed Copy.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

This JSON defines the Hadoop DistCp job optional parameters:

"DistCpJob" :
{
"Type" : "Job:Hadoop:DistCp",
"Host" : "edgenode",
"ConnectionProfile" : "HADOOP_CONNECTION_PROFILE",
"TargetPath" : "hdfs://nns2:8020/foo/bar",
"SourcePaths" : ["hdfs://nn1:8020/foo/a", "hdfs://nn1:8020/foo/b" ],
"DistcpOptions" : [
{
"-m":"3"
},
{
"-filelimit ":"100"
} ]
}

The following table describes the Hadoop DistCp job optional parameters.

Parameter

Description

TargetPath, SourcePaths, and DistcpOptions

Defines the additional Hadoop command options to pass to the distcp tool, as follows:

distcp <Options> <TargetPath> <SourcePaths>.

Job:Hadoop:HDFSCommands

The following example shows how to define a Hadoop HDFS Commands job that executes one or more HDFS commands:

"HdfsJob":
{
"Type" : "Job:Hadoop:HDFSCommands",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"Commands": [
{
"get": "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm": "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
}

The following table describes the Hadoop HDFS Commands job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache HDFS Commands.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

Job:Hadoop:HDFSFileWatcher

Hadoop HDFS File Watcher runs a job that waits for HDFS file arrival.

The following example shows how to define a Hadoop HDFS File Watcher job that waits for HDFS file arrival:

"HdfsFileWatcherJob" :
{
"Type" : "Job:Hadoop:HDFSFileWatcher",
"Host" : "edgenode",
"ConnectionProfile" : "DEV_CLUSTER",
"HdfsFilePath" : "/inputs/filename",
"MinDetecedSize" : "1",
"MaxWaitTime" : "2"
}

The following table describes the Hadoop HDFS File Watcher job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache HDFS FileWatcher.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

HdfsFilePath

Defines the full path of the file being watched.

MinDetecedSize

Defines the minimum file size in bytes to meet the criteria and finish the job as OK. If the file arrives, but the size is not met, the job continues to watch the file.

MaxWaitTime

Defines the maximum number of minutes to wait for the file to meet the watching criteria. If the criteria are not met (the file did not arrive, or the minimum size was not reached), the job fails after this maximum number of minutes.

Job:Hadoop:Oozie

The following example shows how to define a Hadoop Oozie job that submits an Oozie workflow:

"OozieJob":
{
"Type" : "Job:Hadoop:Oozie",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"JobPropertiesFile" : "/home/user/job.properties",
"OozieOptions" : [
{
"inputDir":"/usr/tucu/inputdir"
},
{
"outputDir":"/usr/tucu/outputdir"
} ]
}

The following table describes the Hadoop Oozie job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Oozie.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

JobPropertiesFile

Defines the path to the job properties file.
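
The Oozie job supports the same optional PreCommands and PostCommands as the other Hadoop job types. This JSON is a minimal sketch that adds them to the example above; the HDFS paths are placeholders:

"OozieJob1":
{
"Type" : "Job:Hadoop:Oozie",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"JobPropertiesFile" : "/home/user/job.properties",
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [
{
"put" : "localfile hdfs://nn.example.com/user/hadoop/file"
} ]
}
}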

The following table describes the Hadoop Oozie job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether the job fails when a pre-command or post-command fails, as follows:

  • PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

  • PostCommands: Defaults to false, which ends the job OK even when a post-command fails.

OozieOptions

Defines the values to set or override for given job properties, as shown in the sketch after this table.
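The following sketch shows how these optional attributes might be combined with the Oozie example above, reusing the PreCommands structure that the MapReduce example below demonstrates; the command and path are placeholders:

"OozieJob1":
{
"Type": "Job:Hadoop:Oozie",
"Host": "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"JobPropertiesFile": "/home/user/job.properties",
"PreCommands":
{
"FailJobOnCommandFailure": false,
"Commands": [
{
"rm": "hdfs://nn.example.com/user/hadoop/old-output"
} ]
}
}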

Job:Hadoop:MapReduce

The following example shows how to define a Hadoop MapReduce job:

"MapReduceJob" :
{
"Type" : "Job:Hadoop:MapReduce",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"ProgramJar" : "/home/user1/hadoop-jobs/hadoop-mapreduce-examples.jar",
"MainClass" : "pi",
"Arguments" :["1","2"]
}

The following table describes the Hadoop MapReduce job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache MapReduce.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

The following JSON defines a Hadoop MapReduce job with optional parameters:

"MapReduceJob1" :
{
"Type" : "Job:Hadoop:MapReduce",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"ProgramJar" : "/home/user1/hadoop-jobs/hadoop-mapreduce-examples.jar",
"MainClass" : "pi",
"Arguments" :["1","2"],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [
{
"put" : "localfile hdfs://nn.example.com/user/hadoop/file"
} ]
}
}

The following table describes the Hadoop MapReduce job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether the job fails when a pre-command or post-command fails, as follows:

  • PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

  • PostCommands: Defaults to false, which ends the job OK even when a post-command fails.

Job:Hadoop:MapredStreaming

The following example shows how to define a Hadoop Mapred Streaming job:

"MapredStreamingJob1":
{
"Type": "Job:Hadoop:MapredStreaming",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"InputPath": "/user/robot/input/*",
"OutputPath": "/tmp/output",
"MapperCommand": "mapper.py",
"ReducerCommand": "reducer.py",
"GeneralOptions": [
{
"-D": "fs.permissions.umask-mode=000"
},
{
"-files": "/home/user/hadoop-streaming/mapper.py,/home/user/hadoop-streaming/reducer.py"
} ]
}

The following table describes the Hadoop Mapred Streaming job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache MapReduce Streaming.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

The following table describes the Hadoop Mapred Streaming job optional parameters.

Parameter

Description

PreCommands and PostCommands

Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup.

FailJobOnCommandFailure

Determines whether the job fails when a pre-command or post-command fails, as follows:

  • PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails.

  • PostCommands: Defaults to false, which ends the job OK even when a post-command fails.

GeneralOptions

Defines the additional Hadoop command options to pass to the hadoop-streaming.jar, including generic options and streaming options.


The following table describes the Hadoop Tajo Query job parameters.

Parameter

Description

ConnectionProfile

Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Tajo.

Host

Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine.

If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host.

OpenQuery

Defines an ad hoc query to the Apache Tajo warehouse system.
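A minimal sketch of a Tajo query job definition, assuming the type name Job:Hadoop:Tajo:Query and the attributes described in the table above; the host, connection profile, and query are placeholders:

"TajoQueryJob":
{
"Type": "Job:Hadoop:Tajo:Query",
"Host": "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"OpenQuery": "SELECT count(*) FROM table1"
}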

Job:OCI Data Flow

Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed Apache Spark service that performs processing tasks on extremely large datasets.

To deploy and run an OCI Data Flow job, ensure that you have installed the OCI Data Flow plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for OCI Data Flow.

The following example shows how to define an OCI Data Flow job:

"OCI Data Flow": 
{
"Type": "Job:OCI Data Flow",
"ConnectionProfile": "OCI_DATAFLOW",
"Run Name": "CM test run",
"Compartment OCID": "ocid1.compartment.oc1..aaaaaaaahjoxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"Application OCID": "ocid1.dataflowapplication.oc1.phx.anyhqljrtxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"Additional Run Details": "Yes",
"Run Details Configuration": "
{
\"displayName\":\"run_name\",
\"applicationId\":\"application_ocid\",
\"compartmentId\":\"compartment_ocid\"
}",
"Status Polling Frequency":"60",
"Failure Tolerance":"2"
}

The following table describes the OCI Data Flow job attributes.

Attribute

Description

ConnectionProfile

Defines the ConnectionProfile:OCI Data Flow name that connects Control-M to OCI Data Flow.

Run Name

Defines the name of a new Run.

Compartment OCID

Defines the compartment Oracle Cloud Identifier (OCID), which is a unique identifier assigned to each compartment that is created within Oracle Cloud Infrastructure.

Application OCID

Defines the application Oracle Cloud Identifier (OCID), which is a unique identifier assigned to each application that is created within OCI Data Flow.

Additional Run Details

(Optional) Determines whether to add more parameters to the new job run.

Valid Values:

  • No

  • Yes

Default: No

Run Details Configuration

(Optional) Defines specific parameters, in JSON format, that are passed when you create a new Run.

For more information about the run parameters, see CreateRunDetails Reference 20200129 in the Oracle Cloud Infrastructure Documentation.

{
"displayName": "<run_name>",
"applicationId": "<application_ocid>",
"compartmentId": "<compartment_ocid>",
"driverShape": "VM.Standard.E4.Flex",
"executorShape": "VM.Standard.E4.Flex",
"numExecutors": 1,
"arguments": [],
"parameters": [],
"configuration": {}
}

Status Polling Frequency

Determines the number of seconds to wait before checking the job status.

Default: 60

Failure Tolerance

Determines the number of times to check the job status before the job ends Not OK.

Default: 2
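Because both Additional Run Details and Run Details Configuration are optional, a run that does not need extra parameters can presumably set Additional Run Details to No and omit the configuration block. The following sketch illustrates this simpler form; the OCIDs and run name are placeholders:

"OCI Data Flow_Simple":
{
"Type": "Job:OCI Data Flow",
"ConnectionProfile": "OCI_DATAFLOW",
"Run Name": "<run_name>",
"Compartment OCID": "<compartment_ocid>",
"Application OCID": "<application_ocid>",
"Additional Run Details": "No",
"Status Polling Frequency": "60",
"Failure Tolerance": "2"
}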

Job:Snowflake

Snowflake is a cloud-computing platform that enables you to process, analyze, and store your data.

To deploy and run a Snowflake job, ensure that you have installed the Snowflake plug-in with the provision image command or the provision agent::update command.

For more information about this plug-in, see Control-M for Snowflake.

The following example shows how to define a Job:Snowflake job for a SQL Statement action in Snowflake:

"Snowflake_Job":
{
"Type": "Job:Snowflake",
"ConnectionProfile": "SNOWFLAKE_CONNECTION_PROFILE",
"Database": "FactoryDB",
"Schema": "Public",
"Action": "SQL Statement",
"Snowflake SQL Statement": "Select * From Table1",
"Statement Timeout": "60",
"Show More Options": "unchecked",
"Show Output": "unchecked",
"Polling Interval": "20"
}

The following table describes the Job:Snowflake job parameters.

Parameter

Action

Description

Connection Profile

All Actions

Defines the connection profile type that connects Control-M to Snowflake.

Database

All Actions

Determines the database that the job uses.

Schema

All Actions

Determines the schema that the job uses.

A schema is an organizational model that describes the database layout, its table and field definitions, and their relationships to each other.

Action

 

Determines one of the following Snowflake actions to perform:

  • SQL Statement: Runs any number of Snowflake-supported SQL commands, such as queries, calling or creating procedures, database maintenance tasks, and creating and editing tables.

  • Copy from Query: Copies a queried database and schema into an existing or new file in cloud storage.

  • Copy from Table: Copies from an existing table.

  • Create Table and Query: Creates a table, populated by a query, in the specified database and schema.

  • Copy into Table: Copies data from a cloud storage location into an existing table in Snowflake.

  • Start or Pause Snowpipe: Starts or pauses an existing Snowpipe.

  • Stored Procedure: Calls an existing procedure and its arguments.

  • Snowpipe Load Status: Monitors the status of a Snowpipe for a set period of time.

  • Run SQL File: Uploads a file that contains Snowflake-supported SQL commands.

Snowflake SQL Statement

SQL Statement

Determines one or more Snowflake-supported SQL commands.

Rule: Must be written in a single line, with strings separated by one space only.

Query to Location

Copy from Query

Defines the cloud storage location.

Query Input

Copy from Query

Defines the query used to copy the data.

Storage Integration

  • Copy from Query

  • Copy from Table

  • Copy into Table

Defines the storage integration object, which stores an Identity and Access Management (IAM) entity and an optional set of blocked cloud storage locations.

Overwrite

  • Copy from Query

  • Copy from Table

Determines whether to overwrite an existing file in the cloud storage, as follows:

  • Yes

  • No

File Format

  • Copy from Query

  • Copy from Table

Determines one of the following file formats for the saved file:

  • JSON

  • CSV

Copy Destination

Copy from Table

Defines where the JSON or CSV file is saved.

You can save to Amazon Web Services, Google Cloud Platform, or Microsoft Azure.

s3://<bucket name>/

From Table

Copy from Table

Defines the name of the copied table.

Create Table Name

Create Table and Query

Defines the name of the new or existing table that is populated by the query results.

Query

Create Table and Query

Defines the query used for the copied data.

Snowpipe Name

  • Start or Pause Snowpipe

  • Snowpipe Load Status

Defines the name of the Snowpipe.

A Snowpipe loads data from files when they are ready or staged.

Table Name

Copy into Table

Defines the name of the table that the data is copied into.

From Location

Copy into Table

Defines the cloud storage location from where the data is copied, in CSV or JSON format.

s3://location-path/FileName.csv

Start or Pause Snowpipe

Start or Pause Snowpipe

Determines whether to start or pause the Snowpipe, as follows:

  • Start Snowpipe

  • Pause Snowpipe

Stored Procedure Name

Stored Procedure

Defines the name of the stored procedure.

Procedure Argument

Stored Procedure

Defines the value of the argument in the stored procedure.

Table Name

Snowpipe Load Status

Defines the table that is monitored when loaded by the Snowpipe.

Stage Location

Snowpipe Load Status

Defines the cloud storage location.

A stage is a pointer that indicates where data is stored or staged.

s3://CloudStorageLocation/

Days Back

Snowpipe Load Status

Determines the number of days to monitor the Snowpipe load status.

Status File Cloud Location Path

Snowpipe Load Status

Defines the cloud storage location where a CSV file log is created.

The CSV file log details the load status for each Snowpipe.

Storage Integration

Snowpipe Load Status

Defines the Snowflake configuration for the cloud storage location (as defined in the previous parameter, Status File Cloud Location Path).

S3_INT

Load SQL File

Run SQL File

Defines the full path to the file that contains Snowflake-supported SQL commands.

Statement Timeout

All Actions

Determines the maximum number of seconds to run the job in Snowflake.

Show More Options

All Actions

Determines whether the following job-defining parameters are displayed:

  • Parameters

  • Role

  • Bindings

  • Warehouse

Parameters

All Actions

Defines Snowflake-provided parameters that let you control how data is presented, in the following format:

<"param1":"value1", "param2":"value2">

Role

All Actions

Determines the Snowflake role used for this Snowflake job.

A role is an entity that can be assigned privileges on secure objects. You can be assigned one or more roles from a limited selection.

Bindings

All Actions

Defines the values, in JSON format, that are bound to the variables used in the Snowflake job.

The following JSON script defines two binding variables:

"1": 
{
"type": "FIXED",
"value": "123"
},
"2":
{
"type": "TEXT",
"value": "String"
}

For more information on bindings, see the Snowflake documentation.
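The following sketch shows one way the binding values above might be embedded in a job definition, assuming the Bindings attribute accepts the JSON as an escaped string (the same pattern used by the Run Details Configuration attribute of the OCI Data Flow job above):

"Bindings": "{\"1\": {\"type\": \"FIXED\", \"value\": \"123\"}, \"2\": {\"type\": \"TEXT\", \"value\": \"String\"}}"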

Warehouse

All Actions

Determines the warehouse used in the Snowflake job.

A warehouse is a cluster of virtual machines that processes a Snowflake job.

Show Output

All Actions

Determines whether to show a full JSON response in the log output.

Valid Values:

  • checked

  • unchecked

Default: unchecked

Polling Interval

All Actions

Determines the number of seconds to wait before checking the job status.

Default: 20
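As a further illustration, a Copy into Table action might be defined as follows; the attribute names follow the table above, and the database, schema, table name, location, and storage integration values are placeholders:

"Snowflake_Copy_Job":
{
"Type": "Job:Snowflake",
"ConnectionProfile": "SNOWFLAKE_CONNECTION_PROFILE",
"Database": "FactoryDB",
"Schema": "Public",
"Action": "Copy into Table",
"Table Name": "Table1",
"From Location": "s3://location-path/FileName.csv",
"Storage Integration": "S3_INT",
"Statement Timeout": "60",
"Show More Options": "unchecked",
"Show Output": "unchecked",
"Polling Interval": "20"
}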