Data Processing and Analytics Jobs
The following topics describe job types for data processing and analytics platforms and services:
Job:AWS Athena
AWS Athena enables you to process, analyze, and store your data in the cloud.
To deploy and run an AWS Athena job, ensure that you have installed the AWS Athena plug-in with the provision image command or the provision agent::update command.
The following example shows how to define an AWS Athena job. This JSON-based job executes a SQL-based query:
"AWS Athena_Job_2":
{
"Type": "Job:AWS Athena",
"ConnectionProfile": "AWSATHENA",
"Athena Client Request Token": "aws-athena-client-request-token-%%ORDERID-%%TIME",
"DB Catalog Name": "DB_Catalog_Athena",
"Database Name": "DB_Athena",
"Action": "Query",
"Query": "Select * from Athena_Table",
"Output Location": "s3://{BucketPath}",
"Workgroup": "Primary",
"Add Configurations": "checked",
"S3 ACL Option": "BUCKET_OWNER_FULL_CONTROL",
"Encryption Options": "SSE_KMS",
"KMS Key": "arn:aws:kms:us-west-2:123456789012:key/abcd1234-5678-9012-efgh-ijklmnopqrst",
"Bucket Owner": "Account_ID",
"Show JSON Output": "unchecked",
"Status Polling Frequency": "10",
"Tolerance": "2"
}
The following table describes the AWS Athena job parameters.
Parameter |
Description |
---|---|
Connection Profile |
Defines the ConnectionProfile:AWS Athena name that connects Control-M to AWS Athena. |
Athena Client Request Token |
Defines a unique ID (idempotency token), which guarantees that the job executes only once. Default: aws-athena-client-request-token-%%ORDERID-%%TIME |
DB Catalog Name |
Defines the name of the group of databases (catalog) that the query references. |
Database Name |
Defines the name of the database that the query references. |
Action |
Determines which of the following queries executes:
|
Query |
Defines the SQL-based query that executes. |
Prepared Query Name |
Defines the name of the predefined query that is stored in the AWS Athena platform. |
Table Name |
Defines the name of the table that is created, which is populated by the results of a query in AWS Athena. |
Unload File Type |
Determines the file format that the query results are saved in, as follows:
|
Output Location |
Defines the AWS S3 bucket path where the file is saved, as follows: s3://<path>. AWS Athena automatically generates a filename that incorporates the Query Execution ID, which is a unique ID applied to each query that is executed. |
Workgroup |
Defines the workgroup for this job. Workgroups can consist of users, teams, applications, or workloads, and can set limits on the data that each query or group processes. |
Add Configurations |
Determines whether to add additional job definitions. Valid Values: checked | unchecked. Default: unchecked |
S3 ACL Option |
Defines the Amazon S3 canned access control list (ACL), which is a predefined set of grantees and permissions assigned to your stored query results. BUCKET_OWNER_FULL_CONTROL is the only canned ACL that is currently supported in AWS Athena. This setting gives you and the bucket owner full control of the query results. |
Encryption Options |
Determines one of the following ways to encrypt the query results:
|
KMS Key |
(SSE_KMS and CSE_KMS only) Defines the Amazon Resource Name (ARN) of the KMS key. An ARN is a standardized AWS resource address. arn:aws:kms:us-west-2:123456789012:key/abcd1234-5678-9012-efgh-ijklmnopqrst |
Bucket Owner |
Defines the AWS account ID of the Amazon S3 bucket owner. |
Show JSON Output |
Determines whether to show the full JSON API response in the job output. Valid Values: checked | unchecked. Default: unchecked |
Status Polling Frequency |
Determines the number of seconds to wait before checking the status of the job. Default: 10 |
Tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
Job:AWS Data Pipeline
AWS Data Pipeline is a cloud-based extract, transform, load (ETL) service that enables you to automate the transfer, processing, and storage of your data.
To deploy and run an AWS Data Pipeline job, ensure that you have installed the AWS Data Pipeline plug-in with the provision image command or the provision agent::update command.
The following examples show how to define an AWS Data Pipeline job.
- This JSON-based job creates a pipeline:
"AWS Data Pipeline_Job":
{
"Type": "Job:AWS Data Pipeline",
"ConnectionProfile": "AWSDATAPIPELINE",
"Action": "Create Pipeline",
"Pipeline Name": "demo-pipeline",
"Pipeline Unique Id": "235136145",
"Parameters":
{
"parameterObjects": [
{
"attributes": [
{
"key": "description",
"stringValue": "S3outputfolder"
} ],
"id": "myS3OutputLoc"
} ],
"parameterValues": [
{
"id": "myShellCmd",
"stringValue": "grep -rc \"GET\" ${INPUT1_STAGING_DIR}/* > ${OUTPUT1_STAGING_DIR}/output.txt"
} ],
"pipelineObjects": [
{
"fields": [
{
"key":"input",
"refValue":"S3InputLocation"
},
{
"key":"stage",
"stringValue":"true"
} ],
"id": "ShellCommandActivityObj",
"name": "ShellCommandActivityObj"
} ]
},
"Trigger Created Pipeline": "checked",
"Status Polling Frequency": "20",
"Failure Tolerance": "3"
}
- This JSON-based job triggers an existing pipeline:
"AWS Data Pipeline_Job":
{
"Type": "Job:AWS Data Pipeline",
"ConnectionProfile": "AWSDATAPIPELINE",
"Action": "Trigger Pipeline",
"Pipeline ID": "df-020488024DNBVFN1S2U",
"Trigger Created Pipeline": "unchecked",
"Status Polling Frequency": "20",
"Failure Tolerance": "3"
}
The following table describes the AWS Data Pipeline job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:AWS Data Pipeline name that connects Control-M to AWS Data Pipeline. |
Action |
Determines one of the following AWS Data Pipeline actions: Create Pipeline | Trigger Pipeline |
Pipeline Name |
(Create Pipeline) Defines the name of the new AWS Data Pipeline. |
Pipeline Unique ID |
(Create Pipeline) Defines the unique ID (idempotency key) that guarantees the pipeline is created only once. After successful execution, this ID cannot be used again. Valid Values: Any alphanumeric characters. |
Parameters |
(Create Pipeline) Defines the parameter objects, which define the variables, for your AWS Data Pipeline in JSON format. For more information about the available parameter objects, see the descriptions of the PutPipelineDefinition and GetPipelineDefinition actions in the AWS Data Pipeline API Reference. |
Trigger Created Pipeline |
(Create Pipeline) Determines whether to run (trigger) the newly created AWS Data Pipeline. Valid Values: checked | unchecked. This parameter is relevant only for a creation action. For a trigger action, set it to unchecked. |
Pipeline ID |
(Trigger Pipeline) Determines which pipeline to run (trigger). |
Status Polling Frequency |
Determines the number of seconds to wait before checking the status of the Data Pipeline job. Default: 20 |
Failure Tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
Job:AWS DynamoDB
AWS DynamoDB is a NoSQL database service that enables you to create database tables, execute statements and transactions, export and import data to and from the Amazon S3 storage service.
To deploy and run an AWS DynamoDB job, ensure that you have installed the AWS DynamoDB plug-in with the provision image command or the provision agent::update command.
The following examples show how to define an AWS DynamoDB job.
- This JSON-based job executes a statement:
"AWS DynamoDB_Execute_Statement":
{
"Type": "Job:AWS DynamoDB",
"ConnectionProfile": "ADY",
"Action": "Execute Statement",
"Run Statement with Parameter": "checked",
"Statement": "Select * From IFteam where Id=? OR Name=?",
"Statement Parameters": "[{\"N\": \"20\"},{\"S\":\"Stas30\"}]"
}
- This JSON-based job executes a transaction:
"AWS DynamoDB_Transaction":
{
"Type": "Job:AWS DynamoDB",
"ConnectionProfile": "ADY",
"Action": "Execute Transaction",
"Transaction Statments": "[%4E{%4E \"Parameters\": [{\"N\": \"20\"},{\"S\":\"Stas30\"}],%4E\"Statement\": \"Select * From IFteam where Id=17\"%4E },%4E{%4E \"Parameters\": [{\"N\": \"20\"},{\"S\":\"Stas30\"}],%4E\"Statement\": \"Select * From IFteam where Id=18\"%4E]",
"Host": "dba-tlv-wcpg35",
"CreatedBy": "emuser",
"RunAs": "ADY",
"When":
{
"WeekDays": ["NONE"],
"MonthDays": ["ALL"],
"DaysRelation": "OR"
},
"eventsToWaitFor":
{
"Type": "WaitForEvents",
"Events": [
{
"Event": "AWS_DynamoDB_Execute_Statement-TO-AWS_DynamoDB_Transaction"
}]
}
}
- This JSON-based job exports a table to S3:
"AWS DynamoDB_Export":
{
"Type": "Job:AWS DynamoDB",
"ConnectionProfile": "ADY",
"Action": "Export Table To S3",
"Idempotency Token": "5364@#gert423",
"Export Format": "DynamoDB JSON",
"S3 Bucket Name": "stasbucket1",
"S3 Path Prefix": "TestDynmoExport",
"S3 Bucket Owner ID": "122343283363",
"Table ARN": "arn:aws:dynamodb:us-east-1:122343283363:table/IFteam",
"Host": "dba-tlv-wcpg35",
"CreatedBy": "emuser",
"RunAs": "ADY",
"When":
{
"WeekDays": ["NONE"],
"MonthDays": ["ALL"],
"DaysRelation": "OR"
},
"eventsToWaitFor":
{
"Type": "WaitForEvents",
"Events": [
{
"Event": "AWS_DynamoDB_Transaction-TO-AWS_DynamoDB_Export"
}]
}
}
- This JSON-based job imports a table from S3:
"AWS DynamoDB_Import":
{
"Type": "Job:AWS DynamoDB",
"ConnectionProfile": "ADY",
"Action": "Import Table from S3",
"Idempotency Token": "5364@#gert423",
"Import Format": "DynamoDB JSON",
"S3 Bucket Name": "stasbucket1",
"S3 Path Prefix": "AWSDynamoDB/01690368915115be3974ee/data/vejljoqgiqyexew2cxgetylg6u.json.gz",
"S3 Bucket Owner ID": "122343283363",
"Table Creation Parameters": "\"AttributeDefinitions\": [%4E {%4E\"AttributeName\": \"Id\",%4E\"AttributeType\": \"N\"%4E}%4E ],%4E\"KeySchema\": [%4E{%4E\"AttributeName\": \"Id\",%4E\"KeyType\": \"HASH\"%4E}%4E],%4E \"BillingMode\": \"PROVISIONED\",%4E\"ProvisionedThroughput\": {%4E\"ReadCapacityUnits\": 1,%4E \"WriteCapacityUnits\": 1%4E}",
"Table Name": "NewTAB",
"Host": "dba-tlv-wcpg35",
"CreatedBy": "emuser",
"RunAs": "ADY",
"When":
{
"WeekDays": ["NONE"],
"MonthDays": ["ALL"],
"DaysRelation": "OR"
},
"eventsToWaitFor":
{
"Type": "WaitForEvents",
"Events": [
{
"Event": "AWS_DynamoDB_Export-TO-AWS_DynamoDB_Import"
}]
}
}
The following table describes the AWS DynamoDB job type attributes.
Attribute |
Action |
Description |
---|---|---|
ConnectionProfile |
All Actions |
Defines the ConnectionProfile:AWS DynamoDB name that connects Control-M to AWS DynamoDB. |
Action |
All Actions |
Determines one of the following AWS DynamoDB actions to perform:
|
Run Statement with Parameter |
Execute Statement |
Determines whether to execute the statement with your own JSON parameters. Valid Values: checked | unchecked. Default: unchecked |
Statement |
Execute Statement |
Defines one or more PartiQL statements that are supported by AWS DynamoDB. |
Statement Parameters |
Execute Statement |
Defines the parameters for the AWS DynamoDB job, in JSON format, that enable you to control how the job executes. For example: [{"N": "20"},{"S":"Stas30"}] |
Transaction Statements |
Execute Transaction |
Defines one or more PartiQL transaction statements, each with its own Parameters and Statement values. |
Idempotency Token |
|
Defines the unique ID (idempotency token) that guarantees the job is executed only once. After successful execution, this ID cannot be used again. |
Export Format |
Export Job to S3 Bucket |
Determines one of the following formats to export data:
|
Import Format |
Import Job from S3 Bucket |
Determines one of the following formats of the source data:
|
S3 Bucket Name |
|
Defines the name of the Amazon S3 bucket that the table is exported to or imported from. |
S3 Path Prefix |
|
Defines the Amazon S3 bucket prefix to use as the filename and path of the table. AWSDynamoDB/01654668915125-be3574ee/data/vejljoqgiqyexew2cxgetylg6u.json.gz |
S3 Bucket Owner ID |
|
Defines the ID of the AWS account that owns the bucket. |
Table ARN |
|
Defines the Amazon Resource Name (ARN) associated with the table to export. |
Import Compression Type |
Import Job from S3 Bucket |
Determines one of the following compression types to compress the data from the imported table:
|
Table Creation Parameters |
Import Job from S3 Bucket |
Defines the creation parameters of the new table where the data is imported, such as attribute definitions, key schema, billing mode, and provisioned throughput, in JSON format. |
Table Name |
Import Job from S3 Bucket |
Defines the name of the new table where the data is imported. |
Status Polling Frequency |
All Actions |
Determines the number of seconds to wait before checking the status of the job. Default: 20 |
Failure Tolerance |
|
Determines the number of times to check the job status before ending Not OK. Default: 0 |
Job:AWS EMR
Amazon Web Services (AWS) EMR is a managed cluster platform that enables you to execute big data frameworks, such as Apache Hadoop and Apache Spark, to process and analyze vast amounts of data.
To deploy and run an AWS EMR job, ensure that you have installed the AWS EMR plug-in with the provision image command or the provision agent::update command.
The following example shows how to define an AWS EMR job:
"AWS EMR_Job_2":
{
"Type": "Job:AWS EMR",
"ConnectionProfile": "AWS_EMR",
"Cluster ID": "j-21PO60WBW77GX",
"Notebook ID": "e-DJJ0HFJKU71I9DWX8GJAOH734",
"Relative Path": "ShowWaitingAndRunningClusters.ipynb",
"Notebook Execution Name": "TestExec",
"Service Role": "EMR_Notebooks_DefaultRole",
"Use Advanced JSON Format": "unchecked",
}
The following table describes the AWS EMR job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:AWS EMR name that connects Control-M to AWS EMR. |
Cluster ID |
Defines the name of the AWS EMR cluster that connects to the Notebook. In the EMR API, the cluster ID is also known as the Execution Engine ID. |
Notebook ID |
Determines which Notebook ID executes the script. In the EMR API, the Notebook ID is also known as the Editor ID. |
Relative Path |
Defines the full directory path and filename of the script in the Notebook. |
Notebook Execution Name |
Defines the job execution name. |
Service Role |
Defines the service role that connects to the Notebook. |
Use Advanced JSON Format |
Determines whether to provide Notebook execution information through JSON code. Valid Values: checked | unchecked. Default: unchecked. If you set this parameter to checked, the JSON Body parameter replaces several other parameters discussed above (Cluster ID, Notebook ID, Relative Path, Notebook Execution Name, and Service Role). |
JSON Body |
Defines Notebook execution settings in JSON format. For a description of the syntax of this JSON, see the description of StartNotebookExecution in the Amazon EMR API Reference. JSON Body is relevant only if you set Use Advanced JSON Format to checked. |
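The following example is a sketch of an AWS EMR job that uses the advanced JSON format. The field names inside JSON Body (NotebookExecutionName, EditorId, RelativePath, ExecutionEngine, ServiceRole) are assumptions based on the StartNotebookExecution request in the Amazon EMR API Reference, and the IDs are placeholders reused from the previous example:
"AWS EMR_Job_Advanced":
{
"Type": "Job:AWS EMR",
"ConnectionProfile": "AWS_EMR",
"Use Advanced JSON Format": "checked",
"JSON Body": "{\"NotebookExecutionName\":\"TestExec\",\"EditorId\":\"e-DJJ0HFJKU71I9DWX8GJAOH734\",\"RelativePath\":\"ShowWaitingAndRunningClusters.ipynb\",\"ExecutionEngine\":{\"Id\":\"j-21PO60WBW77GX\"},\"ServiceRole\":\"EMR_Notebooks_DefaultRole\"}"
}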
Job:Azure Databricks
Azure Databricks is a cloud-based data analytics platform that enables you to process and analyze large workloads of data.
To deploy and run an Azure Databricks job, ensure that you have installed the Azure Databricks plug-in with the provision image command or the provision agent::update command.
The following example shows how to define an Azure Databricks job:
"Azure Databricks notebook":
{
"Type": "Job:Azure Databricks",
"ConnectionProfile": "AZURE_DATABRICKS",
"Databricks Job ID: "65",
"Parameters": "\"notebook_params\":{\"param1\":\"val1\", \"param2\":\"val2\"}",
"Idempotency Token": "Control-M-Idem_%%ORDERID",
"Status Polling Frequency": "30"
}
The following table describes the Azure Databricks job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Azure Databricks name that connects Control-M to Azure Databricks. |
Databricks Job ID |
Defines the job ID created in your Databricks workspace. |
Parameters |
Defines task parameters to override when the job runs, according to the Databricks convention. The list of parameters must begin with the name of the parameter type, for example: "notebook_params":{"param1":"val1", "param2":"val2"} or "jar_params": ["param1", "param2"]. For more information about the parameter types, review the properties of RunParameters in the OpenAPI specification provided through the Azure Databricks documentation. For no parameters, specify a value of "params": {}, as follows: "Parameters": "params": {} |
Idempotency Token |
(Optional) Defines a token to use to rerun job runs that timed out in Databricks. Valid Values:
|
Status Polling Frequency |
(Optional) Defines the number of seconds to wait before checking the status of the job. Default: 30 |
Job:Azure HDInsight
Azure HDInsight enables you to execute an Apache Spark batch job and perform big data analytics.
To deploy and run an Azure HDInsight job, ensure that you have installed the Azure HDInsight plug-in with the provision image command or the provision agent::update command.
The following example shows how to define an Azure HDInsight job:
"Azure HDInsight_Job":
{
"Type": "Job:Azure HDInsight",
"ConnectionProfile": "AZUREHDINSIGHT",
"Parameters": "
{
"file" : "wasb://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/sample.jar",
"args" : ["arg0", "arg1"],
"className" : "com.sample.Job1",
"driverMemory" : "1G",
"driverCores" : 2,
"executorMemory" : "1G",
"executorCores" : 10,
"numExecutors" : 10
},
"Status Polling Interval": "10",
"Bring job logs to output": "checked"
}
The following table describes the Azure HDInsight job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Azure HDInsight name that connects Control-M to Azure HDInsight. |
Parameters |
Defines parameters to be passed on the Apache Spark application during job execution, in JSON format (name:value pairs). This JSON must include the file and className elements. For more information about common parameters, see Batch Job in the Azure HDInsight documentation. |
Status Polling Interval |
Defines the number of seconds to wait before verification of the Apache Spark batch job. Default: 10 |
Bring job logs to output |
Determines whether logs from Apache Spark are shown in the job output. Valid Values: checked | unchecked. Default: unchecked |
Job:Azure Synapse
Azure Synapse Analytics enables you to perform data integration and big data analytics.
To deploy and run an Azure Synapse job, ensure that you have installed the Azure Synapse plug-in with the provision image command or the provision agent::update command.
The following example shows how to define an Azure Synapse job:
"Azure Synapse_Job":
{
"Type": "Job:Azure Synapse",
"ConnectionProfile": "AZURE_SYNAPSE",
"Pipeline Name": "ncu_synapse_pipeline",
"Parameters": "{\"periodinseconds\":\"40\", \"param2\":\"val2\"}",
"Status Polling Interval": "20"
}
The following table describes the Azure Synapse job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Azure Synapse name that connects Control-M to Azure Synapse. |
Pipeline Name |
Defines the name of a pipeline that you defined in your Azure Synapse workspace. |
Parameters |
Defines pipeline parameters to override when the job runs, defined in JSON format as pairs of name and value, as follows: {\"param1\":\"val1\", \"param2\":\"val2\"}. For no parameters, specify {}. |
Status Polling Interval |
(Optional) Determines the number of seconds to wait before checking the status of the job. Default: 20 |
Job:Databricks
Databricks enables you to integrate jobs created in the Databricks environment with your existing Control-M workflows.
To deploy and run a Databricks job, ensure that you have installed the Databricks plug-in with the provision image command or the provision agent::update command.
The following example shows how to define a Databricks job:
"Databricks_Job":
{
"Type": "Job:Databricks",
"ConnectionProfile": "DATABRICKS",
"Databricks Job ID": "91",
"Parameters": "\"notebook_params\":{\"param1\":\"val1\", \"param2\":\"val2\"}",
"Idempotency Token": "Control-M-Idem_%%ORDERID",
"Status Polling Frequency": "30"
}
The following table describes the Databricks job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Databricks name that connects Control-M to Databricks. |
Databricks Job ID |
Determines the job ID created in your Databricks workspace. |
Parameters |
Defines task parameters to override when the job runs, according to the Databricks convention. The list of parameters must begin with the name of the parameter type. "notebook_params":{"param1":"val1", "param2":"val2"} "jar_params": ["param1", "param2"] For more information about the parameter types, review RunParameters properties in the OpenAPI specification provided through the Azure Databricks documentation. For no parameters, specify a value of "params": {}. "Parameters": "params": {} |
Idempotency Token |
(Optional) Defines a token to use to rerun job runs that timed out in Databricks. Valid Values:
Default: Control-M-Idem_%%ORDERID |
Status Polling Frequency |
(Optional) Determines the number of seconds to wait before checking the status of the job. Default: 30 |
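Based on the jar_params format described in the table above, the following sketch shows a Databricks job that overrides JAR task parameters; the job ID and parameter values are placeholders:
"Databricks_Jar_Job":
{
"Type": "Job:Databricks",
"ConnectionProfile": "DATABRICKS",
"Databricks Job ID": "92",
"Parameters": "\"jar_params\": [\"param1\", \"param2\"]",
"Idempotency Token": "Control-M-Idem_%%ORDERID",
"Status Polling Frequency": "30"
}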
Job:DBT
Data Build Tool (dbt) is a cloud-based computing platform that enables you to develop, test, schedule, document, and analyze data models.
To deploy and run a dbt job, ensure that you have installed the dbt plug-in with the provision image command or the provision agent::update command.
The following example shows how to define a dbt job:
"DBT_Job_2":
{
"Type": "Job:DBT",
"ConnectionProfile": "DBT_CP",
"DBT Job Id": "12345",
"Run Comment": "A DBT job",
"Override Job Commands": "checked",
"Variables": [
{
"UCM-DefineCommands-N001-element": "dbt test"
},
{
"UCM-DefineCommands-N002-element": "dbt run"
} ],
"Status Polling Frequency": "10",
"Failure Tolerance": "2"
}
The following table describes the dbt job parameters.
Parameter |
Description |
---|---|
Connection Profile |
Defines the ConnectionProfile:DBT name that connects Control-M to dbt. |
DBT Job ID |
Defines the ID of the preexisting job in the dbt platform that you want to run. |
Run Comment |
Defines a free-text description of the job. |
Override Job Commands |
Determines whether to override the predefined dbt job commands. Valid Values: checked | unchecked. Default: unchecked |
Variables |
Defines the new dbt job commands as variable pairs, as follows: "UCM-DefineCommands-Nnnn-element": "command string" where nnn is a counter for the sequential position of each command. |
Status Polling Frequency |
Determines the number of seconds to wait before checking the status of the job. Default: 10 |
Failure Tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
Job:GCP BigQuery
Google Cloud Platform (GCP) BigQuery is a cloud-computing platform that enables you to process, analyze, and store your data.
To deploy and run a GCP BigQuery job, ensure that you have installed the GCP BigQuery plug-in with the provision image command or the provision agent::update command.
The following example shows how to define a GCP BigQuery job for a Query action in GCP BigQuery:
"GCP BigQuery_query":
{
"Type": "Job:GCP BigQuery",
"ConnectionProfile": "BIGQSA",
"Action": "Query",
"Project Name": "proj",
"Dataset Name": "Test",
"Run Select Query and Copy to Table": "checked",
"Table Name": "IFTEAM",
"SQL Statement": "select user from IFTEAM2",
"Query Parameters":
{
"name": "IFteam",
"paramterType":
{
"type": "STRING"
},
"parameterValue":
{
"value": "BMC"
}
},
"Job Timeout": "30000",
"Connection Timeout": "10",
"Status Polling Frequency": "5"
}
The following table describes the GCP BigQuery job parameters.
Parameter |
Action |
Description |
---|---|---|
ConnectionProfile |
All Actions |
Defines the ConnectionProfile:GCP BigQuery name that connects Control-M to GCP BigQuery. |
Action |
N/A |
Determines one of the following GCP BigQuery actions to perform: Query | Copy | Load | Extract | Routine |
Project Name |
All Actions |
Determines the project that the job uses. |
Dataset Name |
|
Determines the database that the job uses. |
Run Select Query and Copy to Table |
Query |
(Optional) Determines whether to paste the results of a SELECT statement into a new table. |
Table Name |
|
Defines the new table name. |
SQL Statement |
Query |
Defines one or more SQL statements supported by GCP BigQuery. Rule: It must be written in a single line, with character strings separated by one space only. |
Query Parameters |
Query |
Defines the query parameters, which enable you to control the presentation of the data. |
Copy Operation Type |
Copy |
Determines one of the following copy operations:
|
Source Table Properties |
Copy |
Defines the properties of the table that is cloned, backed up, or copied, in JSON format. You can copy or back up one or more tables at a time. |
Destination Table Properties |
|
Defines the properties of a new table, in JSON format. |
Destination/Source Bucket URIs |
|
Defines the source or destination data URI for the table that you are loading or extracting. You can load or extract multiple tables. Rule: Separate multiple URIs with commas. For example: "gs://source1_site1/source1.json" |
Show Load Options |
Load |
Determines whether to add more fields to a table that you are loading. |
Load Options |
Load |
Defines additional fields for the table that you are loading. |
Extract As |
Extract |
Determines one of the following file formats to export the data to:
|
Routine |
Routine |
Defines a routine and the values that it must run. For example: Call new_r('value1') |
Job Timeout |
All Actions |
Determines the maximum number of milliseconds to run the GCP BigQuery job. |
Connection Timeout |
All Actions |
Determines the number of seconds to wait before the job ends Not OK. Default: 10 |
Status Polling Frequency |
All Actions |
Determines the number of seconds to wait before checking the status of the job. Default: 5 |
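The following sketch shows a possible GCP BigQuery job for a Routine action, using only the parameters documented above; the Action value, project, dataset, and routine call are illustrative placeholders:
"GCP BigQuery_routine":
{
"Type": "Job:GCP BigQuery",
"ConnectionProfile": "BIGQSA",
"Action": "Routine",
"Project Name": "proj",
"Dataset Name": "Test",
"Routine": "Call new_r('value1')",
"Job Timeout": "30000",
"Connection Timeout": "10",
"Status Polling Frequency": "5"
}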
Job:GCP DataFlow
Google Cloud Platform (GCP) Dataflow enables you to perform cloud-based data processing for batch and real-time data streaming applications.
To deploy and run a GCP Dataflow job, ensure that you have installed the GCP Dataflow plug-in with the provision image command or the provision agent::update command.
The following example shows how to define a GCP Dataflow job:
"Google DataFlow_Job_1":
{
"Type": "Job:GCP DataFlow",
"ConnectionProfile": "GCPDATAFLOW",
"Project ID": "applied-lattice-11111",
"Location": "us-central1",
"Template Type": "Classic Template",
"Template Location (gs://)": "gs://dataflow-templates-us-central1/latest/Word_Count",
"Parameters (JSON Format)":
{
"jobName": "wordcount",
"parameters":
{
"inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
"output": "gs://controlmbucket/counts"
}
},
"Verification Poll Interval (in seconds)": "10",
"output Level": "INFO",
"Host": "host1"
}
The following table describes the GCP Dataflow job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:GCP DataFlow name that connects Control-M to GCP DataFlow. |
Project ID |
Defines the project ID for your Google Cloud project. |
Location |
Defines the Google Compute Engine region to create the job. |
Template Type |
Defines one of the following types of GCP Dataflow templates:
|
Template Location (gs://) |
Defines the path for temporary files. This must be a valid Google Cloud Storage URL that begins with gs://. The default pipeline option tempLocation is used if it has been set in the GCP Dataflow service. |
Parameters (JSON Format) |
Defines input parameters to be passed on to job execution, in JSON format (name:value pairs). This JSON must include the jobName and parameters elements. |
Verification Poll Interval (in seconds) |
(Optional) Determines the number of seconds to wait before checking the status of the job. Default: 10 |
Output Level |
Determines one of the following levels of details to retrieve from the GCP outputs in the case of job failure:
|
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
Job:GCP Dataproc
Google Cloud Platform (GCP) Dataproc enables you to perform cloud-based big data processing and machine learning.
To deploy and run a GCP Dataproc job, ensure that you have installed the GCP Dataproc plug-in with the provision image command or the provision agent::update command.
The following examples show how to define a GCP Dataproc job.
- This JSON defines a job for a GCP Dataproc task of type Workflow Template:
"Google Dataproc_Job":
{
"Type": "Job:GCP Dataproc",
"ConnectionProfile": "GCPDATAPROC",
"Project ID": "gcp_projectID",
"Account Region": "us-central1",
"Dataproc task type": "Workflow Template",
"Workflow Template": "Template2",
"Verification Poll Interval (in seconds)": "20",
"Tolerance": "2"
}
- This JSON defines a job for a Dataproc task of type Job:
"Google Dataproc_Job":
{
"Type": "Job:GCP Dataproc",
"ConnectionProfile": "GCPDATAPROC",
"Project ID": "gcp_projectID",
"Account Region": "us-central1",
"Dataproc task type": "Job",
"Parameters (JSON Format)":
{
"job":
{
"placement": {},
"statusHistory": [],
"reference":
{
"jobId": "job-e241f6be",
"projectId": "gcp_projectID"
},
"labels":
{
"goog-dataproc-workflow-instance-id": "44f2b59b-a303-4e57-82e5-e1838019a812",
"goog-dataproc-workflow-template-id": "template-d0a7c"
},
"sparkJob":
{
"mainClass": "org.apache.spark.examples.SparkPi",
"properties": {},
"jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
"args": ["1000"]
}
}
},
"Verification Poll Interval (in seconds)": "20",
"Tolerance": "2"
}
The following table describes the GCP Dataproc job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:GCP Dataproc name that connects Control-M to GCP Dataproc. |
Project ID |
Defines the project ID for your Google Cloud project. |
Account Region |
Defines the Google Compute Engine region to create the job. |
Dataproc task type |
Defines one of the following Dataproc task types to execute: Workflow Template | Job |
Workflow Template |
(Workflow Template) Defines the ID of a Workflow Template. |
Parameters |
(Job) Defines input parameters to be passed on to job execution, in JSON format. You retrieve this JSON content from the GCP Dataproc UI, using the EQUIVALENT REST option in job settings. |
Verification Poll Interval |
(Optional) Determines the number of seconds to wait before checking the status of the job. Default: 20 |
Tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
Job:Hadoop
The Hadoop job connects to the Hadoop framework, and it enables the distributed processing of large data sets across clusters of commodity servers. You can expand your enterprise business workflows to include tasks that execute in your Big Data Hadoop cluster from Control-M with the different Hadoop-supported tools, including Pig, Hive, HDFS File Watcher, Map Reduce Jobs, and Sqoop.
To deploy and run Hadoop jobs, ensure that you have done the following:
- Installed the Application Pack, which includes the Control-M for Hadoop plug-in.
- Created the appropriate type of Hadoop connection profile, as described in ConnectionProfile:Hadoop.
Various types of Hadoop jobs are available for you to define using the Job:Hadoop objects, as described in the following sections:
- Job:Hadoop:Spark:Python
- Job:Hadoop:Spark:ScalaJava
- Job:Hadoop:Pig
- Job:Hadoop:Sqoop
- Job:Hadoop:Hive
- Job:Hadoop:DistCp (distributed copy)
- Job:Hadoop:HDFSCommands
- Job:Hadoop:HDFSFileWatcher
- Job:Hadoop:Oozie
- Job:Hadoop:MapReduce
- Job:Hadoop:MapredStreaming
- Job:Hadoop:Tajo
Job:Hadoop:Spark:Python
The following example shows how to use Job:Hadoop:Spark:Python to run a Spark Python program:
"ProcessData":
{
"Type": "Job:Hadoop:Spark:Python",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"SparkScript": "/home/user/processData.py"
}
The following table describes the Hadoop Spark Python job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Spark. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
This JSON defines the Hadoop Spark Python job optional parameters:
"ProcessData1":
{
"Type": "Job:Hadoop:Spark:Python",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"SparkScript": "/home/user/processData.py",
"Arguments": ["1000", "120" ],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [{"put" : "localfile hdfs://nn.example.com/user/hadoop/file"}]
},
"SparkOptions": [
{
"--master": "yarn"
},
{
"--num":"-executors 50"
} ]
}
The following table describes the Hadoop Spark Python job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
Job:Hadoop:Spark:ScalaJava
The following example shows how to use a Hadoop Spark ScalaJava job to run a Spark Java or Scala program:
"ProcessData":
{
"Type": "Job:Hadoop:Spark:ScalaJava",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"ProgramJar": "/home/user/ScalaProgram.jar",
"MainClass" : "com.mycomp.sparkScalaProgramName.mainClassName"
}
The following table describes the Hadoop Scala Java job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Spark. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
This JSON defines the Hadoop Scala Java job optional parameters:
"ProcessData1":
{
"Type": "Job:Hadoop:Spark:ScalaJava",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"ProgramJar": "/home/user/ScalaProgram.jar"
"MainClass" : "com.mycomp.sparkScalaProgramName.mainClassName",
"Arguments": ["1000", "120" ],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [{"put" : "localfile hdfs://nn.example.com/user/hadoop/file"}]
},
"SparkOptions": [
{
"--master": "yarn"
},
{
"--num":"-executors 50"
} ]
}
The following table describes the Hadoop Scala Java job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
Job:Hadoop:Pig
The following example shows how to use Hadoop Pig to run a Pig script:
"ProcessDataPig":
{
"Type" : "Job:Hadoop:Pig",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"PigScript" : "/home/user/script.pig"
}
The following table describes the Hadoop Pig job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Pig. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
This JSON defines the Hadoop Pig job optional parameters:
"ProcessDataPig1":
{
"Type" : "Job:Hadoop:Pig",
"ConnectionProfile": "DEV_CLUSTER",
"PigScript" : "/home/user/script.pig",
"Host" : "edgenode",
"Parameters" : [
{
"amount":"1000"
},
{
"volume":"120"
} ],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [
{
"put" : "localfile hdfs://nn.example.com/user/hadoop/file"
} ]
}
}
The following table describes the Hadoop Pig job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
Job:Hadoop:Sqoop
The following example shows how to define a Hadoop Sqoop job:
"LoadDataSqoop":
{
"Type" : "Job:Hadoop:Sqoop",
"Host" : "edgenode",
"ConnectionProfile" : "SQOOP_CONNECTION_PROFILE",
"SqoopCommand" : "import --table foo --target-dir /dest_dir"
}
The following table describes the Hadoop Sqoop job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Sqoop. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
This JSON defines the Hadoop Sqoop job optional parameters:
"LoadDataSqoop1" :
{
"Type" : "Job:Hadoop:Sqoop",
"Host" : "edgenode",
"ConnectionProfile" : "SQOOP_CONNECTION_PROFILE",
"SqoopCommand" : "import --table foo",
"SqoopOptions" : [
{
"--warehouse-dir":"/shared"
},
{
"--default-character-set":"latin1"
} ],
"SqoopArchives" : "",
"SqoopFiles": "",
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" :[
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [
{
"put" : "localfile hdfs://nn.example.com/user/hadoop/file"
} ]
}
}
The following table describes the Hadoop Sqoop job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
SqoopOptions |
Defines the parameters to pass as arguments to the specific Sqoop tool. |
SqoopArchives |
Determines the location of the Hadoop archives. |
SqoopFiles |
Determines the location of the Sqoop files. |
Job:Hadoop:Hive
The following example shows how to use Hadoop Hive to run a Hive beeline job:
"ProcessHive":
{
"Type" : "Job:Hadoop:Hive",
"Host" : "edgenode",
"ConnectionProfile" : "HIVE_CONNECTION_PROFILE",
"HiveScript" : "/home/user1/hive.script"
}
The following table describes the Hadoop Hive job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Hive. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
This JSON defines the Hadoop Hive job optional parameters:
"ProcessHive1" :
{
"Type" : "Job:Hadoop:Hive",
"Host" : "edgenode",
"ConnectionProfile" : "HIVE_CONNECTION_PROFILE",
"HiveScript" : "/home/user1/hive.script",
"Parameters" : [
{
"ammount": "1000"
},
{
"topic": "food"
} ],
"HiveArchives" : "",
"HiveFiles": "",
"HiveOptions" : [
{
"hive.root.logger": "INFO,console"
} ],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [
{
"put" : "localfile hdfs://nn.example.com/user/hadoop/file"
} ]
}
}
The following table describes the Hadoop Hive job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
HiveSciptParameters |
Defines the additional Hadoop command options to pass to beeline as hivevar "name"="value". |
HiveProperties |
Defines the additional Hadoop command options to pass to beeline as hiveconf "key"="value". |
HiveArchives |
Defines the additional Hadoop command options to pass to beeline as hiveconf mapred.cache.archives="value". |
HiveFiles |
Defines the additional Hadoop command options to pass to beeline as hiveconf mapred.cache.files="value". |
Job:Hadoop:DistCp
The Hadoop Distributed Copy (DistCp) job is used for large inter/intra-cluster copying.
The following example shows how to define a Hadoop DistCp job:
"DistCpJob" :
{
"Type" : "Job:Hadoop:DistCp",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"TargetPath" : "hdfs://nns2:8020/foo/bar",
"SourcePaths" : ["hdfs://nn1:8020/foo/a"]
}
The following table describes the Hadoop DistCp job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Distributed Copy. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine . |
This JSON defines the Hadoop DistCp job optional parameters:
"DistCpJob" :
{
"Type" : "Job:Hadoop:DistCp",
"Host" : "edgenode",
"ConnectionProfile" : "HADOOP_CONNECTION_PROFILE",
"TargetPath" : "hdfs://nns2:8020/foo/bar",
"SourcePaths" : ["hdfs://nn1:8020/foo/a", "hdfs://nn1:8020/foo/b" ],
"DistcpOptions" : [
{
"-m":"3"
},
{
"-filelimit ":"100"
} ]
}
The following table describes the Hadoop DistCp job optional parameters.
Parameter |
Description |
---|---|
TargetPath, SourcePaths, and DistcpOptions |
Defines the additional Hadoop command options to pass to the distcp tool, as follows: distcp <Options> <TargetPath> <SourcePaths>. |
Job:Hadoop:HDFSCommands
The following example shows how to define the Hadoop HDFS job that executes one or more HDFS commands:
"HdfsJob":
{
"Type" : "Job:Hadoop:HDFSCommands",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"Commands": [
{
"get": "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm": "hdfs://nn.example.com/file /user/hadoop/emptydir"
} ]
}
The following table describes the Hadoop HDFS Commands job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache HDFS Commands. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
Job:Hadoop:HDFSFileWatcher
Hadoop HDFS File Watcher runs a job that waits for HDFS file arrival.
The following example shows how to define a Hadoop HDFS File Watcher to run a job that waits for HDFS file arrival:
"HdfsFileWatcherJob" :
{
"Type" : "Job:Hadoop:HDFSFileWatcher",
"Host" : "edgenode",
"ConnectionProfile" : "DEV_CLUSTER",
"HdfsFilePath" : "/inputs/filename",
"MinDetecedSize" : "1",
"MaxWaitTime" : "2"
}
The following table describes the Hadoop HDFS File Watcher job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache HDFS FileWatcher. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
HdfsFilePath |
Defines the full path of the file being watched. |
MinDetecedSize |
Defines the minimum file size in bytes to meet the criteria and finish the job as OK. If the file arrives, but the size is not met, the job continues to watch the file. |
MaxWaitTime |
Defines the maximum number of minutes to wait for the file to meet the watching criteria. If criteria are not met (file did not arrive, or minimum size was not reached) the job fails after this maximum number of minutes. |
Job:Hadoop:Oozie
The following example shows how to define a Hadoop Oozie job that submits an Oozie workflow:
"OozieJob":
{
"Type" : "Job:Hadoop:Oozie",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"JobPropertiesFile" : "/home/user/job.properties",
"OozieOptions" : [
{
"inputDir":"/usr/tucu/inputdir"
},
{
"outputDir":"/usr/tucu/outputdir"
} ]
}
The following table describes the Hadoop Oozie job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Oozie. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
JobPropertiesFile |
Defines the path to the job properties file. |
The following table describes the Hadoop Oozie job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
OozieOptions |
Defines values to set or override for the given job properties. |
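The following sketch shows how these optional parameters might be combined in an Oozie job definition, following the same PreCommands and PostCommands pattern used by the other Hadoop job types in this section:
"OozieJob1":
{
"Type" : "Job:Hadoop:Oozie",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"JobPropertiesFile" : "/home/user/job.properties",
"OozieOptions" : [
{
"inputDir":"/usr/tucu/inputdir"
} ],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [{"put" : "localfile hdfs://nn.example.com/user/hadoop/file"}]
}
}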
Job:Hadoop:MapReduce
The following example shows how to define a Hadoop MapReduce job:
"MapReduceJob" :
{
"Type" : "Job:Hadoop:MapReduce",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"ProgramJar" : "/home/user1/hadoop-jobs/hadoop-mapreduce-examples.jar",
"MainClass" : "pi",
"Arguments" :["1","2"]
}
The following table describes the Hadoop MapReduce job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache MapReduce. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
This JSON defines the Hadoop MapReduce job optional parameters:
"MapReduceJob1" :
{
"Type" : "Job:Hadoop:MapReduce",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"ProgramJar" : "/home/user1/hadoop-jobs/hadoop-mapreduce-examples.jar",
"MainClass" : "pi",
"Arguments" :["1","2"],
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"get" : "hdfs://nn.example.com/user/hadoop/file localfile"
},
{
"rm" : "hdfs://nn.example.com/file /user/hadoop/emptydir"} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [
{
"put" : "localfile hdfs://nn.example.com/user/hadoop/file"
} ]
}
}
The following table describes the Hadoop MapReduce job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
Job:Hadoop:MapredStreaming
The following example shows how to define a Hadoop Mapred Streaming job:
"MapredStreamingJob1":
{
"Type": "Job:Hadoop:MapredStreaming",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"InputPath": "/user/robot/input/*",
"OutputPath": "/tmp/output",
"MapperCommand": "mapper.py",
"ReducerCommand": "reducer.py",
"GeneralOptions": [
{
"-D": "fs.permissions.umask-mode=000"
},
{
"-files": "/home/user/hadoop-streaming/mapper.py,/home/user/hadoop-streaming/reducer.py"
} ]
}
The following table describes the Hadoop Mapred Streaming job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache MapReduce Streaming. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. |
The following table describes the Hadoop Mapred Streaming job optional parameters.
Parameter |
Description |
---|---|
PreCommands and PostCommands |
Defines the HDFS commands to perform before and after running the job. For example, you can use them for preparation and cleanup. |
FailJobOnCommandFailure |
Determines whether to ignore failure in the pre- or post- commands. PreCommands: Defaults to true, which ends the job Not OK if a pre-command fails. PostCommands: Defaults to false, which ends the job OK, even when a post-command fails. |
GeneralOptions |
Defines the additional Hadoop command options to pass to the hadoop-streaming.jar, including generic options and streaming options. |
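The following sketch shows how PreCommands and PostCommands might be added to a Mapred Streaming job, mirroring the optional-parameter pattern used by the other Hadoop job types in this section; the paths are placeholders:
"MapredStreamingJob2":
{
"Type": "Job:Hadoop:MapredStreaming",
"Host" : "edgenode",
"ConnectionProfile": "DEV_CLUSTER",
"InputPath": "/user/robot/input/*",
"OutputPath": "/tmp/output",
"MapperCommand": "mapper.py",
"ReducerCommand": "reducer.py",
"PreCommands":
{
"FailJobOnCommandFailure" :false,
"Commands" : [
{
"rm" : "hdfs://nn.example.com/tmp/output"
} ]
},
"PostCommands":
{
"FailJobOnCommandFailure" :true,
"Commands" : [{"get" : "hdfs://nn.example.com/tmp/output/part-00000 localresult"}]
}
}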
Job:Hadoop:Tajo
The following table describes the Hadoop Tajo query job parameters.
Parameter |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:Hadoop name that connects Control-M to Apache Tajo. |
Host |
Defines the name of the host machine where the job runs. An Agent must be installed on this host. Optionally, you can define a host group instead of a host machine. If this parameter is left blank, the job is submitted for execution on the Control-M Scheduling Server host. |
OpenQuery |
Defines an ad hoc query to the Apache Tajo warehouse system. |
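The following sketch shows how a Tajo open-query job might be defined, assuming the Job:Hadoop:Tajo type name implied by this section's heading convention; the query itself is a placeholder:
"TajoQueryJob":
{
"Type" : "Job:Hadoop:Tajo",
"Host" : "edgenode",
"ConnectionProfile" : "DEV_CLUSTER",
"OpenQuery" : "SELECT * FROM table1"
}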
Job:OCI Data Flow
Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed Apache Spark service that performs processing tasks on extremely large datasets.
To deploy and run an OCI Data Flow job, ensure that you have installed the OCI Data Flow plug-in with the provision image command or the provision agent::update command.
The following example shows how to define an OCI Data Flow job:
"OCI Data Flow":
{
"Type": "Job:OCI Data Flow",
"ConnectionProfile": "OCI_DATAFLOW",
"Run Name": "CM test run",
"Compartment OCID": "ocid1.compartment.oc1..aaaaaaaahjoxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"Application OCID": "ocid1.dataflowapplication.oc1.phx.anyhqljrtxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"Additional Run Details": "Yes",
"Run Details Configuration": "
{
\"displayName\":\"run_name\",
\"applicationId\":\"application_ocid\",
\"compartmentId\":\"compartment_ocid\"
}",
"Status Polling Frequency":"60",
"Failure Tolerance":"2"
}
The following table describes the OCI Data Flow job attributes.
Attribute |
Description |
---|---|
ConnectionProfile |
Defines the ConnectionProfile:OCI Data Flow name that connects Control-M to OCI Data Flow. |
Run Name |
Defines the name of a new Run. |
Compartment OCID |
Defines the compartment Oracle Cloud Identifier (OCID) which is a unique identifier assigned to each compartment that is created within the Oracle Data Flow Infrastructure. |
Application OCID |
Defines the application Oracle Cloud Identifier (OCID) which is a unique identifier assigned to each application that is created within the Oracle Data Flow Infrastructure. |
Additional Run Details |
(Optional) Determines whether to add more parameters to the new job run. Valid Values: Yes | No. Default: No |
Run Details Configuration |
(Optional) Defines specific parameters, in JSON format, that are passed when you create a new Run. For more information about the run parameters, see CreateRunDetails Reference 20200129 in the Oracle Cloud Infrastructure Documentation. |
Status Polling Frequency |
Determines the number of seconds to wait before checking the job status. Default: 60 |
Failure tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
Job:Snowflake
Snowflake is a cloud-computing platform that enables you to process, analyze, and store your data.
To deploy and run a Snowflake job, ensure that you have installed the Snowflake plug-in with the provision image command or the provision agent::update command.
The following example shows how to define a Job:Snowflake job for a SQL Statement action in Snowflake:
"Snowflake_Job":
{
"Type": "Job:Snowflake",
"ConnectionProfile": "SNOWFLAKE_CONNECTION_PROFILE",
"Database": "FactoryDB",
"Schema": "Public",
"Action": "SQL Statement",
"Snowflake SQL Statement": "Select * From Table1",
"Statement Timeout": "60",
"Show More Options": "unchecked",
"Show Output": "unchecked",
"Polling Interval": "20"
}
The following table describes the Job:Snowflake job parameters.
Parameter |
Action |
Description |
---|---|---|
Connection Profile |
All Actions |
Defines one of the following connection profile types that connects Control-M to Snowflake: |
Database |
All Actions |
Determines the database that the job uses. |
Schema |
All Actions |
Determines the schema that the job uses. A schema is an organizational model that describes layout and definition of the fields and tables, and their relationships to each other, in a database. |
Action |
N/A |
Determines one of the following Snowflake actions to perform: SQL Statement | Copy from Query | Copy from Table | Create Table and Query | Copy into Table | Start or Pause Snowpipe | Stored Procedure | Snowpipe Load Status | Run SQL File |
Snowflake SQL Statement |
SQL Statement |
Determines one or more Snowflake-supported SQL commands. Rule: Must be written in a single line, with strings separated by one space only. |
Query to Location |
Copy from Query |
Defines the cloud storage location. |
Query Input |
Copy from Query |
Defines the query used for copying the data. |
Storage Integration |
|
Defines the storage integration object, which stores an Identity and Access Management (IAM) entity and an optional set of blocked cloud storage locations. |
Overwrite |
|
Determines whether to overwrite an existing file in the cloud storage, as follows:
|
File Format |
|
Determines one of the following file formats for the saved file: CSV | JSON |
Copy Destination |
Copy from Table |
Determines where the JSON or CSV file is saved. You can save to Amazon Web Services, Google Cloud Platform, or Microsoft Azure. s3://<bucket name>/ |
From Table |
Copy from Table |
Defines the name of the copied table. |
Create Table Name |
Create Table and Query |
Defines the name of the new or existing table where the data is queried. |
Query |
Create Table and Query |
Defines the query used for the copied data. |
Snowpipe Name |
|
Defines the name of the Snowpipe. A Snowpipe loads data from files when they are ready or staged. |
Table Name |
Copy into Table |
Defines the name of the table that the data is copied into. |
From Location |
Copy into Table |
Defines the cloud storage location from where the data is copied, in CSV or JSON format. s3://location-path/FileName.csv |
Start or Pause Snowpipe |
Start or Pause Snowpipe |
Determines whether to start or pause the Snowpipe, as follows: Start | Pause |
Stored Procedure Name |
Stored Procedure |
Defines the name of the stored procedure. |
Procedure Argument |
Stored Procedure |
Defines the value of the argument in the stored procedure. |
Table Name |
Snowpipe Load Status |
Defines the table that is monitored when loaded by the Snowpipe. |
Stage Location |
Snowpipe Load Status |
Defines the cloud storage location. A stage is a pointer that indicates where data is stored, or staged. s3://CloudStorageLocation/ |
Days Back |
Snowpipe Load Status |
Determines the number of days to monitor the Snowpipe load status. |
Status File Cloud Location Path |
Snowpipe Load Status |
Defines the cloud storage location where a CSV file log is created. The CSV file log details the load status for each Snowpipe. |
Storage Integration |
Snowpipe Load Status |
Defines the Snowflake configuration for the cloud storage location (as defined in the previous parameter, Status File Cloud Location Path). S3_INT |
Load SQL File |
Run SQL File |
Defines the full path to the file that contains Snowflake-supported SQL commands. |
Statement Timeout |
All Actions |
Determines the maximum number of seconds to run the job in Snowflake. |
Show More Options |
All Actions |
Determines whether the following job-defining attributes are displayed:
|
Parameters |
All Actions |
Defines Snowflake-provided parameters that let you control how data is presented, as follows: {"param1":"value1", "param2":"value2"} |
Role |
All Actions |
Determines the Snowflake role used for this Snowflake job. A role is an entity that can be assigned privileges on secure objects. You can be assigned one or more roles from a limited selection. |
Bindings |
All Actions |
Defines the values to bind to the variables used in the Snowflake job, in JSON format. For more information about bindings, see the Snowflake documentation. |
Warehouse |
All Actions |
Determines the warehouse used in the Snowflake job. A warehouse is a cluster of virtual machines that processes a Snowflake job. |
Show Output |
All Actions |
Determines whether to show a full JSON response in the log output. Valid Values: checked | unchecked. Default: unchecked |
Status Polling Frequency |
All Actions |
Determines the number of seconds to wait before checking the status of the job. Default: 20 |
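The following sketch shows a possible Job:Snowflake definition for a Stored Procedure action, using only the parameters documented above; the procedure name and argument are illustrative placeholders:
"Snowflake_Stored_Procedure_Job":
{
"Type": "Job:Snowflake",
"ConnectionProfile": "SNOWFLAKE_CONNECTION_PROFILE",
"Database": "FactoryDB",
"Schema": "Public",
"Action": "Stored Procedure",
"Stored Procedure Name": "MyProcedure",
"Procedure Argument": "'2023-01-01'",
"Statement Timeout": "60",
"Show More Options": "unchecked",
"Show Output": "unchecked",
"Polling Interval": "20"
}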