Data Processing and Analytics Jobs
The following topics describe job attributes that work with data processing platforms and services:
AWS Athena Job
AWS Athena enables you to process, analyze, and store your data in the cloud.
To create an AWS Athena job, see Creating a Job. For more information about this plug-in, see
The following table describes the AWS Athena job attributes.
Attribute |
Description |
---|---|
Connection Profile |
Determines the authorization credentials that are used to connect Control-M to AWS Athena, as described in AWS Athena Connection Profile Parameters. Rules:
|
Athena Client Request Token |
Defines a unique ID (idempotency token), which guarantees that the job executes only once. Default: aws-athena-client-request-token-%%ORDERID-%%TIME |
DB Catalog Name |
Defines the name of the group of databases (catalog) that the query references. |
Database Name |
Defines the name of the database that the query references. |
Action |
Determines which of the following queries executes:
|
Query |
Defines the SQL-based query that executes. |
Prepared Query Name |
Defines the name of the predefined query that is stored in the AWS Athena platform. |
Table Name |
Defines the name of the table that is created, which is populated by the results of a query in AWS Athena. |
Unload File Type |
Determines which of the following file formats the query results are saved in:
|
Output Location |
Defines the AWS S3 bucket path where the file is saved, in the following format: s3://<path>. AWS Athena automatically generates a filename that incorporates the Query Execution ID, which is a unique ID applied to each query that is executed. |
Workgroup |
Defines the workgroup for this job. Workgroups can consist of users, teams, applications, or workloads, and they can set limits on the data that each query or group processes. |
Add Configurations |
Determines whether to add additional job definitions, as follows:
|
S3 ACL Option |
Defines the Amazon S3 canned access control list (ACL), which is a predefined set of grantees and permissions assigned to your stored query results. BUCKET_OWNER_FULL_CONTROL is the only canned ACL that is currently supported in AWS Athena. This setting gives you and the bucket owner full control of the query results. |
Encryption Options |
Determines one of the following ways to encrypt the query results:
|
KMS Key |
(SSE_KMS and CSE_KMS only) Defines the Amazon Resource Name (ARN) of the KMS key. An ARN is a standardized AWS resource address, as in the following example: arn:aws:kms:us-west-2:123456789012:key/abcd1234-5678-9012-efgh-ijklmnopqrst |
Bucket Owner |
Defines the AWS account ID of the Amazon S3 bucket owner. |
Show JSON Output |
Determines whether to show the full JSON API response in the job output. |
Status Polling Frequency |
Determines the number of seconds to wait before checking the status of the job. Default: 10 |
Tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
AWS Data Pipeline Job
AWS Data Pipeline is a cloud-based extract, transform, load (ETL) service that enables you to automate the transfer, processing, and storage of your data.
To create an AWS Data Pipeline job, see Creating a Job. For more information about this plug-in, see
The following table describes the AWS Data Pipeline job attributes.
Attribute |
Action |
Description |
---|---|---|
Connection Profile |
|
Determines the authorization credentials that are used to connect Control-M to AWS Data Pipeline, as described in AWS Data Pipeline Connection Profile Parameters. Rules:
|
Action |
|
Determines one of the following AWS Data Pipeline actions:
|
Pipeline Name |
Create Pipeline |
Defines the name of the new AWS Data Pipeline. |
Pipeline Unique ID |
Create Pipeline |
Defines the unique ID (idempotency key) that guarantees the pipeline is created only once. After successful execution, this ID cannot be used again. Valid Characters: Any alphanumeric characters. |
Parameters |
Create Pipeline |
Defines the parameter objects, in JSON format, which define the variables for your AWS Data Pipeline.
For more information about the available parameter objects, see the descriptions of the PutPipelineDefinition and GetPipelineDefinition actions in the AWS Data Pipeline API Reference. |
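For illustration, a hypothetical parameter definition might look as follows (the IDs, attribute values, and bucket path are placeholders; the field names follow the PutPipelineDefinition request shape):

```json
{
  "parameterObjects": [
    {
      "id": "myS3InputLocation",
      "attributes": [
        { "key": "description", "stringValue": "S3 input location" },
        { "key": "type", "stringValue": "AWS::S3::ObjectKey" }
      ]
    }
  ],
  "parameterValues": [
    { "id": "myS3InputLocation", "stringValue": "s3://example-bucket/input" }
  ]
}
```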
Trigger Created Pipeline |
Create Pipeline |
Determines whether to execute, or trigger, the newly created AWS Data Pipeline. |
Pipeline ID |
Trigger Pipeline |
Determines which pipeline to execute, or trigger. |
Status Polling Frequency |
All Actions |
Determines the number of seconds to wait before checking the status of the Data Pipeline job. Default: 20 |
Failure Tolerance |
All Actions |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
AWS DynamoDB Job
AWS DynamoDB is a NoSQL database service that enables you to create database tables, execute statements and transactions, and export and import data to and from the Amazon S3 storage service.
To create an AWS DynamoDB job, see Creating a Job. For more information about this plug-in, see Control-M for AWS DynamoDB.
The following table describes the AWS DynamoDB job type attributes.
Attribute |
Action |
Description |
---|---|---|
Connection Profile |
|
Determines the authorization credentials that are used to connect Control-M to AWS DynamoDB, as described in AWS DynamoDB Connection Profile Parameters. Rules:
|
Action |
|
Determines one of the following AWS DynamoDB actions to perform:
|
Run Statement with Parameter |
Execute Statement |
Determines whether to execute the statement with your own JSON parameters. |
Statement |
Execute Statement |
Defines one or more PartiQL statements that are supported by AWS DynamoDB. |
Statement Parameters |
Execute Statement |
Defines the parameters for the AWS DynamoDB job, in JSON format, that enable you to control how the job executes.
|
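For illustration, a statement such as SELECT * FROM Orders WHERE CustomerName = ? AND OrderYear = ? could take a hypothetical parameter list in the DynamoDB AttributeValue format (the values are placeholders):

```json
[
  { "S": "Jones" },
  { "N": "2023" }
]
```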
Transaction Statements |
Execute Transaction |
Defines one or more PartiQL transaction statements, in JSON format.
|
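A hypothetical transaction body, following the DynamoDB ExecuteTransaction request shape (table and attribute names are placeholders):

```json
{
  "TransactStatements": [
    {
      "Statement": "UPDATE Orders SET Shipped = ? WHERE OrderID = ?",
      "Parameters": [ { "BOOL": true }, { "S": "ORD-101" } ]
    }
  ]
}
```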
Idempotency Token |
|
Defines the unique ID (idempotency token) that guarantees the job is executed only once. After successful execution, this ID cannot be used again. |
Export Format |
Export Job to S3 Bucket |
Determines one of the following formats to export data:
|
Import Format |
Import Job from S3 Bucket |
Determines one of the following formats of the source data:
|
S3 Bucket Name |
|
Defines the name of the Amazon S3 bucket that the table is exported to or imported from. |
S3 Path Prefix |
|
Defines the Amazon S3 bucket prefix to use as the filename and path of the table. For example: AWSDynamoDB/01654668915125-be3574ee/data/vejljoqgiqyexew2cxgetylg6u.json.gz |
S3 Bucket Owner ID |
|
Defines the ID of the AWS account that owns the bucket. |
Table ARN |
|
Defines the Amazon Resource Name (ARN) associated with the table to export. |
Import Compression Type |
Import Job from S3 Bucket |
Determines one of the following compression types to compress the data from the imported table:
|
Table Creation Parameters |
Import Job from S3 Bucket |
Defines the parameters, in JSON format, for the new table where the data is imported.
|
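A hypothetical set of table creation parameters, following the DynamoDB ImportTable request shape (table and attribute names are placeholders):

```json
{
  "TableCreationParameters": {
    "TableName": "ImportedOrders",
    "AttributeDefinitions": [ { "AttributeName": "OrderID", "AttributeType": "S" } ],
    "KeySchema": [ { "AttributeName": "OrderID", "KeyType": "HASH" } ],
    "BillingMode": "PAY_PER_REQUEST"
  }
}
```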
Table Name |
Import Job from S3 Bucket |
Defines the name of the new table where the data is imported. |
Status Polling Frequency |
All Actions |
Determines the number of seconds to wait before checking the status of the job. Default: 20 |
Failure Tolerance |
|
Determines the number of times to check the job status before ending Not OK. Default: 0 |
AWS EMR Job
AWS EMR is a managed cluster platform that enables you to execute big data frameworks, such as Apache Hadoop and Apache Spark, to process and analyze vast amounts of data.
To create an AWS EMR job, see Creating a Job. For more information about this plug-in, see
The following table describes AWS EMR job attributes.
Attribute |
Description |
---|---|
Connection Profile |
Determines the authorization credentials that are used to connect Control-M to AWS EMR, as described in AWS EMR Connection Profile Parameters. Rules:
|
Cluster ID |
Defines the name of the AWS EMR cluster to connect to the Notebook. In the EMR API, this field is called the Execution Engine ID. |
Notebook ID |
Determines which Notebook ID executes the script. In the EMR API, this field is called the Editor ID. |
Relative Path |
Defines the full path and name of the script file in the Notebook. |
Notebook Execution Name |
Defines the job execution name. |
Service Role |
Defines the service role to connect to the Notebook. |
Use Advanced JSON Format |
Enables you to provide Notebook execution information through JSON code. The JSON Body attribute replaces the values of the Cluster ID, Notebook ID, Relative Path, Notebook Execution Name, and Service Role attributes. |
JSON Body |
Defines Notebook execution settings, in JSON format.
For a description of the syntax of this JSON, see the description of StartNotebookExecution in the Amazon EMR API Reference. |
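A hypothetical JSON Body, based on the fields of the StartNotebookExecution action (all IDs and names are placeholders):

```json
{
  "NotebookExecutionName": "example-execution",
  "EditorId": "e-ABC123DEF456GHI789",
  "RelativePath": "notebooks/analysis.ipynb",
  "ExecutionEngine": { "Id": "j-1ABC2DEF3GHI4" },
  "ServiceRole": "EMR_Notebooks_DefaultRole"
}
```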
Azure Databricks Job
Azure Databricks is a cloud-based data analytics platform that enables you to process and analyze large workloads of data.
To create an Azure Databricks job, see Creating a Job. For more information about this plug-in, see
The following table describes the Azure Databricks job type attributes.
Attribute |
Description |
---|---|
Connection Profile |
Determines the authorization credentials that are used to connect Control-M to Azure Databricks, as described in Azure Databricks Connection Profile Parameters. Rules:
|
Databricks Job ID |
Determines the ID of the Azure Databricks job that is created in a Databricks workspace. |
Parameters |
Defines task parameters to override when the job executes, according to the Databricks convention. Your list of parameters must begin with the name of the parameter type.
For more information about the parameter types, review the properties of RunParameters in the OpenAPI specification provided in the Azure Databricks documentation. For no parameters, type the following code:
|
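For illustration, parameters for a notebook task might be defined as follows (notebook_params is one of the Databricks parameter-type names; the key-value pairs are placeholders):

```json
{
  "notebook_params": {
    "sourceTable": "raw_events",
    "runDate": "2024-01-01"
  }
}
```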
Idempotency Token |
(Optional) Defines a token to use to re-execute job executions that timed out in Databricks. Values:
|
Status Polling Frequency |
(Optional) Determines the number of seconds to wait before checking the status of the job. Default: 30 |
Failure Tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 1 |
Azure HDInsight Job
Azure HDInsight enables you to execute an Apache Spark batch job and perform big data analytics.
To create an Azure HDInsight job, see Creating a Job. For more information about this plug-in, see
The following table describes Azure HDInsight job parameters:
Attribute |
Description |
---|---|
Connection Profile |
Determines the authorization credentials that are used to connect Control-M to Azure HDInsight, as described in Azure HDInsight Connection Profile Parameters. Rules:
|
Parameters |
Determines which parameters are passed to the Apache Spark Application during job execution, in JSON format (name:value pairs). This JSON must include the file and className elements. |
Status Polling Interval |
Determines the number of seconds to wait before checking the status of the Apache Spark batch job. Default: 10 seconds |
Bring job logs to output |
Determines whether logs from Apache Spark appear in the job output. |
Azure Synapse Job
Azure Synapse Analytics enables you to perform data integration and big data analytics.
To create an Azure Synapse job, see Creating a Job. For more information about this plug-in, see
The following table describes Azure Synapse job parameters:
Attribute |
Description |
---|---|
Connection Profile |
Determines the authorization credentials that are used to connect Control-M to Azure Synapse, as described in Azure Synapse Connection Profile Parameters. |
Pipeline Name |
Defines the name of a pipeline that you defined in your Azure Synapse workspace. |
Parameters |
Defines pipeline parameters, in JSON format, to override when the job executes.
For no parameters, type {}. |
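A hypothetical parameters object for a Synapse pipeline (the parameter names, values, and storage path are placeholders):

```json
{
  "inputPath": "abfss://data@examplestorage.dfs.core.windows.net/in",
  "batchSize": "100"
}
```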
Status Polling Interval |
(Optional) Defines the number of seconds to wait before checking the status of the job. Default: 20 seconds |
Databricks Job
Databricks enables you to integrate jobs created in the Databricks environment with your existing Control-M workflows.
To create a Databricks job, see Creating a Job. For more information about this plug-in, see
The following table describes the Databricks job type attributes:
Attribute |
Description |
---|---|
Connection Profile |
Determines the authorization credentials that are used to connect Control-M to Databricks, as described in Databricks Connection Profile Parameters. Rules:
|
Databricks Job ID |
Determines the ID of the Databricks job that is created in a Databricks workspace. |
Parameters |
Defines task parameters, in JSON format, to override when the job executes, according to the Databricks convention. Your list of parameters must begin with the name of the parameter type.
For more information about the parameter types, review the properties of RunParameters in the OpenAPI specification provided through the Azure Databricks documentation. For no parameters, type the following code:
|
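For illustration, parameters for a JAR task might be defined as follows (jar_params is one of the Databricks parameter-type names; the argument values are placeholders):

```json
{
  "jar_params": [ "2024-01-01", "full-load" ]
}
```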
Idempotency Token |
(Optional) Defines a token to use to re-execute job executions that timed out in Databricks. Values:
Default: Control-M-Idem_%%ORDERID |
Status Polling Frequency |
(Optional) Determines the number of seconds to wait before checking the status of the job. Default: 30 |
dbt Job
dbt (Data Build Tool) is a cloud-based computing platform that enables you to develop, test, schedule, document, and analyze data models.
To create a dbt job, see Creating a Job. For more information about this plug-in, see Control-M for dbt.
The following table describes the dbt job type attributes.
Attribute |
Description |
---|---|
Connection Profile |
Determines the authorization credentials that are used to connect Control-M to dbt, as described in dbt Connection Profile Parameters. Rules:
|
DBT Job ID |
Defines the ID of the preexisting job in the dbt platform that you want to execute. |
Run Comment |
Defines a free-text description of the job. |
Override Job Commands |
Determines whether to override the predefined dbt job commands. |
Define Commands |
Defines the new dbt job commands, such as dbt test and dbt run. |
Status Polling Frequency |
Determines the number of seconds to wait before checking the status of the job. Default: 10 |
Failure Tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
GCP BigQuery Job
Google Cloud Platform (GCP) BigQuery is a cloud-computing platform that enables you to process, analyze, and store your data.
To create a GCP BigQuery job, see Creating a Job. For more information about this plug-in, see
The following table describes the GCP BigQuery job type attributes.
Attribute |
Action |
Description |
---|---|---|
Connection Profile |
|
Determines the authorization credentials that are used to connect Control-M to GCP BigQuery, as described in GCP BigQuery Connection Profile Parameters. Rules:
|
Project Name |
All Actions |
Determines the project that the job uses. |
Dataset Name |
|
Determines the database that the job uses. |
Action |
|
Determines one of the following GCP BigQuery actions to perform:
|
Run Select Query and Copy to Table |
Query |
(Optional) Determines whether to paste the results of a SELECT statement into a new table. |
Table Name |
|
Defines the new table name. |
SQL Statement |
Query |
Defines one or more SQL statements supported by GCP BigQuery. Rule: It must be written in a single line, with character strings separated by one space only. |
Query Parameters |
Query |
Defines the query parameters, in JSON format, which enable you to control the presentation of the data.
|
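A hypothetical query parameter list, following the BigQuery queryParameters shape (the parameter name and value are placeholders):

```json
[
  {
    "name": "country",
    "parameterType": { "type": "STRING" },
    "parameterValue": { "value": "US" }
  }
]
```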
Copy Operation Type |
Copy |
Determines one of the following copy operations:
|
Source Table Properties |
Copy |
Defines the properties, in JSON format, of the table that is cloned, backed up, or copied. You can copy or back up one or more tables at a time.
|
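For illustration, two hypothetical source tables (the project, dataset, and table names are placeholders):

```json
[
  { "projectId": "example-project", "datasetId": "sales", "tableId": "orders_2024" },
  { "projectId": "example-project", "datasetId": "sales", "tableId": "customers" }
]
```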
Destination Table Properties |
|
Defines the properties of a new table, in JSON format.
|
Destination/Source Bucket URIs |
|
Defines the source or destination data URI for the table that you are loading or extracting. You can load or extract multiple tables; you must use commas to separate the URIs. For example: "gs://source1_site1/source1.json" |
Show Load Options |
Load |
Determines whether to add more fields to a table that you are loading. |
Load Options |
Load |
Defines additional fields, in JSON format, for the table that you are loading.
|
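A hypothetical set of load options, based on the fields of the BigQuery load job configuration (the schema and field names are placeholders):

```json
{
  "sourceFormat": "CSV",
  "skipLeadingRows": 1,
  "writeDisposition": "WRITE_APPEND",
  "schema": {
    "fields": [
      { "name": "id", "type": "INTEGER" },
      { "name": "name", "type": "STRING" }
    ]
  }
}
```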
Extract As |
Extract |
Determines one of the following file formats to export the data to:
|
Routine |
Routine |
Defines a routine and the values that it must execute.
|
Job Timeout |
All Actions |
Determines the maximum number of milliseconds to execute the GCP BigQuery job. Default: 30,000 milliseconds (30 seconds) |
Connection Timeout |
All Actions |
Determines the number of seconds to wait before the job ends Not OK. Default: 10 |
Status Polling Frequency |
All Actions |
Determines the number of seconds to wait before checking the status of the job. Default: 5 |
GCP Dataflow Job
Google Cloud Platform (GCP) Dataflow enables you to perform cloud-based data processing for batch and real-time data streaming applications.
To create a GCP Dataflow job, see Creating a Job. For more information about this plug-in, see
The following table describes the GCP Dataflow job type attributes.
Parameter |
Description |
---|---|
Connection profile |
Determines the authorization credentials that are used to connect Control-M to GCP Dataflow, as described in GCP Dataflow Connection Profile Parameters. |
Project ID |
Defines the project ID for your Google Cloud project. |
Location |
Defines the Google Compute Engine region to create the job. |
Template Type |
Defines one of the following types of GCP Dataflow templates:
|
Template Location (gs://) |
Defines the URL on Google Cloud Storage for the file that contains the Template definition, in the following format: gs://bucketname/filename |
Parameters (JSON Format) |
Defines input parameters, in JSON format, to be passed on to job execution. You must include the jobname and parameters elements.
|
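A hypothetical parameters object (the field capitalization follows the Dataflow template launch request body; the job name, parameter names, and bucket paths are placeholders):

```json
{
  "jobName": "example-dataflow-job",
  "parameters": {
    "inputFile": "gs://example-bucket/input.txt",
    "output": "gs://example-bucket/results/output"
  }
}
```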
Verification Poll Interval (in seconds) |
(Optional) Defines the number of seconds to wait before checking the status of the job. Default: 10 |
Log Level |
Determines one of the following levels of details to retrieve from the GCP logs in the case of job failure:
|
GCP Dataproc Job
Google Cloud Platform (GCP) Dataproc enables you to perform cloud-based big data processing and machine learning.
To create a GCP Dataproc job, see Creating a Job. For more information about this plug-in, see
The following table describes the GCP Dataproc job type attributes.
Parameter |
Description |
---|---|
Connection profile |
Determines the authorization credentials that are used to connect Control-M to GCP Dataproc, as described in GCP Dataproc Connection Profile Parameters. |
Project ID |
Defines the project ID for your Google Cloud project. |
Account Region |
Defines the Google Compute Engine region to create the job. |
Dataproc task type |
Defines one of the following Dataproc task types to execute:
|
Workflow Template |
(For a Workflow Template task type) Defines the ID of a Workflow Template. |
Batch ID |
Defines the ID that becomes the final component of the batch resource name. Valid Values: 4-63 characters, using the letters a-z and the numbers 0-9. For example: batch-e7f10 |
Requested ID |
(Optional) Defines the unique ID that is used to identify the request. If the service receives two CreateBatchRequests with the same Requested ID, the second request is ignored, and the operation that corresponds to the first batch created and stored in the backend is returned. Valid Values: 0-40 characters, using the letters a-z and A-Z, the numbers 0-9, _, and -. |
Parameters (JSON Format) |
(For a Job task type) Defines input parameters to be passed on to job execution, in JSON format. You retrieve this JSON content from the GCP Dataproc UI, using the EQUIVALENT REST option in job settings. |
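For illustration, the equivalent REST JSON for a hypothetical Spark job might look as follows (the project, cluster, and jar names are placeholders):

```json
{
  "reference": { "projectId": "example-project", "jobId": "example-job" },
  "placement": { "clusterName": "example-cluster" },
  "sparkJob": {
    "mainClass": "org.apache.spark.examples.SparkPi",
    "jarFileUris": [ "file:///usr/lib/spark/examples/jars/spark-examples.jar" ],
    "args": [ "1000" ]
  }
}
```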
Verification Poll Interval (in seconds) |
(Optional) Defines the number of seconds to wait before checking the status of the job. Default: 20 |
Tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
Hadoop Job
The Hadoop job connects to the Hadoop framework, which enables you to split up and process large data sets on clusters of commodity servers. In Control-M, you can expand your enterprise business workflows to include tasks that execute in your Big Data Hadoop cluster with the various Hadoop-supported tools, including Pig, Hive, HDFS File Watcher, MapReduce, and Sqoop.
To create a Hadoop job, see Creating a Job. For more information about this plug-in, see Control-M for Hadoop.
The following table describes the Hadoop job type attributes.
Attribute |
Description |
---|---|
Connection Profile |
Determines the authorization credentials that are used to connect Control-M to Hadoop, as described in Hadoop Connection Profile Parameters. Rules:
Variable Name: %%HDP-ACCOUNT |
Execution Type |
Determines the Hadoop job execution type, as follows: Variable Name: %%HDP-EXEC_TYPE |
Pre Commands |
Defines the Pre commands performed before job execution (not for HDFS Commands jobs and Oozie Extractor jobs), and the argument for each command. |
Fail the job if the command fails |
Determines whether the entire job fails if any of the Pre commands fail (not for HDFS Commands jobs and Oozie Extractor jobs). |
Post Commands |
Defines the Post commands performed before job execution (not for HDFS Commands jobs and Oozie Extractor jobs), and the argument for each command. |
Fail the job if the command fails |
Determines whether the entire job fails if any of the Post commands fail (not for HDFS Commands jobs and Oozie Extractor jobs). |
DistCp Job Attributes
The following table describes the DistCp job attributes.
Attribute |
Description |
---|---|
Target Path |
Defines the absolute destination path. Variable Name: %%HDP-DISTCP_TARGET_PATH |
Source Path |
Defines the source paths. Variable Name: %%HDP-DISTCP_SOURCE_PATH-Nxxx_ARG |
Command Line Options |
Defines the sets of attributes and values that are added to the command line. Variable Names:
|
Append Yarn aggregated logs to output |
Determines whether to add Yarn aggregated logs to the job output. |
Distributed Shell Job Attributes
The following table describes the Distributed Shell job attributes.
Attribute |
Description |
---|---|
Shell Type |
Determines what the Distributed Shell job executes, as follows:
Variable Name: %%HDP-SHELL_TYPE |
Command |
Defines the shell command entry to execute for the job execution. Variable Name: %%HDP-SHELL_COMMAND |
Script Full Path |
Defines the full path to the script file which is executed. The script file is located in the HDFS. Variable Name: %%HDP-SHELL_SCRIPT_FULL_PATH |
Shell Script Arguments |
Defines the shell script arguments. Variable Name: %%HDP-SHELL-Nxxx-ARG |
More Options |
Opens more attributes. |
Files/Archives |
Defines the full path to the file or archive to upload as a dependency to the HDFS working directory. Variable Names:
|
Options |
Defines the additional option (Name and Value) to set when executing the job. Variable Names:
|
Environment Variables |
Defines the environment variables for the shell script/command. Variable Name: %%HDP-SHELL_ENV_VARIABLE-Nxxx-ARG |
Append Yarn aggregated logs to output |
Determines whether to add Yarn aggregated logs to the job output. |
HDFS Commands Job Attributes
The following table describes the HDFS Commands job attributes.
Attribute |
Description |
---|---|
Command |
Defines the command that is performed during job execution. Variable Name: %%HDP-HDFS_CMD_ACTION-Nxxx-CMD |
Arguments |
Defines the argument used by the command. Variable Name: %%HDP-HDFS_CMD_ACTION-Nxxx-ARG |
HDFS File Watcher Job Attributes
The following table describes the HDFS File Watcher job attributes.
Attribute |
Description |
---|---|
File name full path |
Defines the full path of the file being watched. Variable Name: %%HDP-HDFS_FILE_PATH |
Min detected size |
Determines the minimum file size in bytes to meet the criteria and finish the job as OK. If the file arrives, but the size is not met, the job continues to watch the file. Variable Name: %%HDP-MIN_DETECTED_SIZE |
Max time to wait |
Determines the maximum number of minutes to wait for the file to meet the watching criteria. If criteria are not met (file did not arrive, or minimum size was not reached) the job fails after this maximum number of minutes. Variable Name: %%HDP-MAX_WAIT_TIME |
File Name Variable |
Defines the variable name that is used in succeeding jobs. Variable Name: %%HDP-FW_DETECTED_FILE_NAME_VAR |
Impala Job Attributes
The following table describes the Impala job attributes.
Attribute |
Description |
---|---|
Source |
Determines the source type to execute the queries, as follows:
Variable Name: %%HDP-IMPALA_QUERY_SOURCE |
Query File Full Path |
Defines the location of the file used to execute the queries. Variable Name: %%HDP-IMPALA_QUERY_FILE_PATH |
Query |
Defines the query command used to execute the queries. Variable Name: %%HDP-IMPALA_OPEN_QUERY |
Command Line Options |
Defines the sets of attributes and values that are added to the command line. Variable Name: %%HDP-IMPALA_CMD_OPTION-Nxxx-ARG |
Hive Job Attributes
The following table describes the Hive job attributes.
Attribute |
Description |
---|---|
Full path to Hive script |
Defines the full path to the Hive script on the Hadoop host. Variable Name: %%HDP-HIVE_SCRIPT_NAME |
Script Parameters |
Defines the list of parameters for the script. Variable Names:
|
Append Yarn aggregated logs to output |
Determines whether to add Yarn aggregated logs to the job output. |
Java-Map-Reduce Job Attributes
The following table describes the Java Map-Reduce job attributes.
Attribute |
Description |
---|---|
Full path to Jar |
Defines the full path to the jar containing the Map Reduce Java program on the Hadoop host. Variable Name: %%HDP-JAVA_JAR_NAME |
Main Class |
Defines the class that is included in the jar containing a main function and the map reduce implementation. Variable Name: %%HDP-JAVA_MAIN_CLASS |
Arguments |
Defines the argument used by the command. Variable Name: %%HDP-JAVA_Nxxx_ARG |
Append Yarn aggregated logs to output |
Determines whether to add Yarn aggregated logs to the job output. |
Oozie Job Attributes
The following table describes the Oozie job attributes.
Attribute |
Description |
---|---|
Job Properties File |
Defines the job properties file path. Variable Name: %%HDP-OOZIE_JOB_PROPERTIES_FILE |
Job Properties (Add/Overwrite) |
Defines the Oozie job properties. A set of properties consists of the following:
You can add new properties or override property values defined in the Job Properties File. |
Rerun from point of failure |
Determines whether to re-execute an Oozie job from the point of its failure. |
Pig Job Attributes
The following table describes the Pig job attributes.
Attribute |
Description |
---|---|
Full Path to Pig Program |
Defines the full path to the Pig program on the Hadoop host. Variable Name: %%HDP-PIG_PROG_NAME |
Pig Program Parameters |
Defines the list of program parameters. |
Append Yarn aggregated logs to output |
Determines whether to add Yarn aggregated logs to the job output. |
Properties |
Defines a list of properties (Name and Value) to be executed with the job. These properties override the Hadoop defaults. |
Archives |
Defines the location of the Hadoop archives. |
Files |
Defines the location of the Hadoop files. |
Spark Job Attributes
The following table describes the Spark job attributes.
Attribute |
Description |
---|---|
Program Type |
Determines the Spark program type, as follows:
Variable Name: %%HDP-SPARK_PROG_TYPE |
Full Path to Script |
Defines the full path to the python script to execute. Variable Name: %%HDP-SPARK_FULL_PATH_TO_PYTHON_SCRIPT |
Application Jar File |
Defines the path to the jar including your application and all the dependencies. Variable Name: %%HDP-SPARK_APP_JAR_FULL_PATH |
Main Class to Run |
Defines the main class of the application. Variable Name: %%HDP-SPARK_MAIN_CLASS_TO_RUN |
Application Arguments |
Defines the application arguments that are added at the end of the Spark command line, either after the main class for Java/Scala applications or after the Python script. Variable Name: %%HDP-SPARK_Nxxx_ARG |
Command Line Options |
Defines the sets of attributes and values that are added to the command line. Variable Names:
|
Append Yarn aggregated logs to output |
Determines whether to add Yarn aggregated logs to the job output. |
Sqoop Job Attributes
The following table describes the Sqoop job attributes.
Attribute |
Description |
---|---|
Command Editor |
Defines any valid Sqoop command necessary for job execution. Sqoop can only be used for job execution if it is defined in the Sqoop connection attributes. Variable Name: %%HDP-SQOOP_COMMAND |
Append Yarn aggregated logs to output |
Determines whether to add Yarn aggregated logs to the job output. |
Properties |
Defines a list of properties (Name and Value) to be executed with the job. These properties override the Hadoop defaults. |
Archives |
Defines the location of the Hadoop archives. |
Files |
Defines the location of the Hadoop files. |
Streaming Job Attributes
The following table describes the Streaming job attributes.
Attribute |
Description |
---|---|
Input Path |
Defines the input file for the Mapper step. Variable Name: %%HDP-INPUT_PATH |
Output Path |
Defines the HDFS output path for the Reducer step. Variable Name: %%HDP-OUTPUT_PATH |
Mapper Command |
Defines the command that executes as a mapper. Variable Name: %%HDP-MAPPER_COMMAND |
Reducer Command |
Defines the command that executes as a reducer. Variable Name: %%HDP-REDUCER_COMMAND |
Streaming Options |
Defines the sets of attributes (Name and Value) that are added to the end of the Streaming command line. Variable Names:
|
Generic Options |
Defines the sets of attributes (Name and Value) that are added to the Streaming command line. Variable Names:
|
Append Yarn aggregated logs to output |
Determines whether to add Yarn aggregated logs to the job output. |
Tajo Job Attributes
The following table describes the Tajo job attributes.
Attribute |
Description |
---|---|
Command Source |
Determines the source of the Tajo command, as follows:
|
Full File Path |
Defines the file path of the input file that executes the Tajo command. |
Open Query |
Defines the query. Variable Name: %%HDP-TAJO_OPEN_QUERY |
OCI Data Flow Job
Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed Apache Spark service that performs processing tasks on extremely large datasets.
To create an OCI Data Flow job, see Creating a Job. For more information about this plug-in, see Control-M for OCI Data Flow.
The following table describes the OCI Data Flow job attributes.
Attribute |
Description |
---|---|
Connection Profile |
Determines the authorization credentials that are used to connect Control-M to OCI Data Flow, as described in OCI Data Flow Connection Profile Parameters. Rules:
|
Run Name |
Defines the name of a new Run. |
Compartment OCID |
Defines the compartment Oracle Cloud Identifier (OCID), which is a unique identifier assigned to each compartment that is created within OCI Data Flow. |
Application OCID |
Defines the application Oracle Cloud Identifier (OCID), which is a unique identifier assigned to each application that is created within OCI Data Flow. |
Additional Run Details |
(Optional) Determines whether to add more parameters to the new job run. Valid Values:
Default: Unchecked |
Run Details Configuration |
(Optional) Defines specific parameters, in JSON format, that are passed when you create a new Run. For more information about the run parameters, see CreateRunDetails Reference 20200129 in the Oracle Cloud Infrastructure Documentation.
|
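A Run Details Configuration might look like the following sketch. The field names are taken from the OCI CreateRunDetails API; the values are illustrative assumptions:

```json
{
  "displayName": "daily-etl-run",
  "driverShape": "VM.Standard2.1",
  "executorShape": "VM.Standard2.1",
  "numExecutors": 2,
  "arguments": ["--date", "2024-01-01"]
}
```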
Status Polling Frequency |
Determines the number of seconds to wait before checking the job status. Default: 60 |
Failure tolerance |
Determines the number of times to check the job status before ending Not OK. Default: 2 |
Snowflake Job
Snowflake is a cloud-computing platform that enables you to process, analyze, and store your data.
To create a Snowflake job, see Creating a Job. For more information about this plug-in, see
The following table describes the Snowflake job type attributes.
Attribute |
Action |
Description |
---|---|---|
Connection Profile |
|
Determines one of the following types of authorization credentials, which are used to connect Control-M to Snowflake:
Rules:
|
Database |
|
Determines the database that the job uses. |
Schema |
|
Determines the schema that the job uses. A schema is an organizational model that describes the layout and definition of fields and tables, and their relationships to each other, in a database. |
Action |
|
Determines one of the following Snowflake actions to perform:
|
Snowflake SQL Statement |
SQL Statement |
Determines one or more Snowflake-supported SQL commands. Rule: Must be written in a single line, with strings separated by one space only. |
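For example, two commands written as a single line, separated by single spaces (table and column names are illustrative):

```sql
CREATE TABLE IF NOT EXISTS sales_summary (region STRING, total NUMBER); INSERT INTO sales_summary SELECT region, SUM(amount) FROM sales GROUP BY region;
```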
Load SQL File |
Run SQL File |
Defines the full path to the file that contains Snowflake-supported SQL commands. |
Statement Timeout |
All Actions |
Determines the maximum number of seconds to execute the job in Snowflake. |
Show More Options |
All Actions |
Determines whether the following job-defining attributes are displayed:
|
Parameters |
All Actions |
Defines Snowflake-provided parameters, in JSON format, that control how data is presented.
|
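The Parameters attribute above takes standard Snowflake session parameters. A sketch follows; the parameter names are documented Snowflake session parameters, but the exact set accepted by the plug-in should be verified against your environment:

```json
{
  "DATE_OUTPUT_FORMAT": "YYYY-MM-DD",
  "TIMEZONE": "America/New_York",
  "QUERY_TAG": "controlm-daily-load"
}
```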
Role |
All Actions |
Determines the Snowflake role used for this Snowflake job. A role is an entity that can be assigned privileges on secure objects. You can be assigned one or more roles from a limited selection. |
Bindings |
All Actions |
Defines the values, in JSON format, that are bound to the variables used in the Snowflake job.
For more information on bindings, see the Snowflake documentation. |
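A bindings definition with two variables might look like the following sketch, using the type/value shape from the Snowflake SQL API binding format (the values themselves are illustrative):

```json
{
  "1": { "type": "FIXED", "value": "123" },
  "2": { "type": "TEXT", "value": "hello" }
}
```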
Warehouse |
All Actions |
Determines the warehouse used in the Snowflake job. A warehouse is a cluster of virtual machines that processes a Snowflake job. |
Show Output |
All Actions |
Determines whether to show a full JSON response in the log output. |
Status Polling Frequency |
All Actions |
Determines the number of seconds to wait before checking the status of the job. Default: 20 |
Query to Location |
Copy from Query |
Defines the cloud storage location. |
Query Input |
Copy from Query |
Defines the query used for copying the data. |
Storage Integration |
|
Defines the storage integration object, which stores an Identity and Access Management (IAM) entity and an optional set of blocked cloud storage locations. |
Overwrite |
|
Determines whether to overwrite an existing file in the cloud storage, as follows:
|
File Format |
|
Determines one of the following file formats for the saved file:
|
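Taken together, the Copy from Query attributes correspond to a Snowflake `COPY INTO <location>` statement. A sketch follows, with an illustrative bucket path, storage integration name, and query:

```sql
COPY INTO 's3://my-bucket/results/'
  FROM (SELECT id, amount FROM sales WHERE sale_date = CURRENT_DATE)
  STORAGE_INTEGRATION = S3_INT
  FILE_FORMAT = (TYPE = CSV)
  OVERWRITE = TRUE;
```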
Copy Destination |
Copy from Table |
Defines where the JSON or CSV file is saved. You can save to Amazon Web Services, Google Cloud Platform, or Microsoft Azure. s3://<bucket name>/ |
From Table |
Copy from Table |
Defines the name of the copied table. |
Create Table Name |
Create Table and Query |
Defines the name of the new or existing table where the data is queried. |
Query |
Create Table and Query |
Defines the query used for the copied data. |
Snowpipe Name |
|
Defines the name of the Snowpipe. A Snowpipe loads data from files when they are ready, or staged. |
Table Name |
Copy into Table |
Defines the name of the table that the data is copied into. |
From Location |
Copy into Table |
Defines the cloud storage location from where the data is copied, in CSV or JSON format. s3://location-path/FileName.csv |
Start or Pause Snowpipe |
Start or Pause Snowpipe |
Determines whether to start or pause the Snowpipe, as follows:
|
Stored Procedure Name |
Stored Procedure |
Defines the name of the stored procedure. |
Procedure Argument |
Stored Procedure |
Defines the value of the argument in the stored procedure. |
Table Name |
Snowpipe Load Status |
Defines the table that is monitored as it is loaded by the Snowpipe. |
Stage Location |
Snowpipe Load Status |
Defines the cloud storage location. A stage is a pointer that indicates where data is stored, or staged. s3://CloudStorageLocation/ |
Days Back |
Snowpipe Load Status |
Determines the number of days to monitor the Snowpipe load status. |
Status File Cloud Location Path |
Snowpipe Load Status |
Defines the cloud storage location where a CSV file log is created. The CSV file log details the load status for each Snowpipe. |
Storage Integration |
Snowpipe Load Status |
Defines the Snowflake configuration for the cloud storage location that is defined in the Status File Cloud Location Path attribute. S3_INT |