Google Cloud Batch
This guide is split into two parts:
- How to configure your Google Cloud account to use the Batch API.
- How to create a Google Cloud Batch compute environment in Seqera.
Configure Google Cloud
Create a project
Go to the Google Project Selector page and select an existing project, or select Create project.
Enter a name for your new project, e.g., tower-nf.
If you are part of an organization, the location will default to your organization.
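If you prefer the command line, the same project can be created with the gcloud CLI. This is a sketch using the example project ID `tower-nf` from above; replace it with your own (project IDs must be globally unique).

```shell
# Create a new project with the example ID used in this guide.
gcloud projects create tower-nf --name="tower-nf"

# Set it as the default project for subsequent gcloud commands.
gcloud config set project tower-nf
```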
Enable billing
See here to enable billing in your Google Cloud account.
Enable APIs
See here to enable the following APIs for your project:
- Batch API
- Compute Engine API
- Cloud Storage API
Select your project from the dropdown menu and select Enable.
Alternatively, you can enable each API manually by selecting your project in the navigation bar and visiting each API page.
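The three APIs can also be enabled in one step with the gcloud CLI, using their standard service identifiers:

```shell
# Enable the Batch, Compute Engine, and Cloud Storage APIs
# for the currently configured project.
gcloud services enable \
  batch.googleapis.com \
  compute.googleapis.com \
  storage.googleapis.com
```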
IAM
Seqera requires a service account with appropriate permissions to interact with your Google Cloud resources. As an IAM user, you must have access to the service account that will be submitting Batch jobs.
By default, Google Cloud Batch uses the default Compute Engine service account to submit jobs. This service account is granted the Editor (`roles/editor`) role. While this role includes the permissions Seqera needs, it is not recommended for production environments. Instead, control job access with a custom service account that has only the permissions necessary for Seqera to execute Batch jobs.
Service account permissions
Create a custom service account with at least the following permissions:
- Batch Agent Reporter (`roles/batch.agentReporter`) on the project
- Batch Job Editor (`roles/batch.jobsEditor`) on the project
- Logs Writer (`roles/logging.logWriter`) on the project (to let jobs generate logs in Cloud Logging)
- Service Account User (`roles/iam.serviceAccountUser`)
If your Google Cloud project does not require access restrictions on any of its Cloud Storage buckets, you can grant project Storage Admin (`roles/storage.admin`) permissions to your service account to simplify setup. To grant access only to specific buckets, add the service account as a principal on each bucket individually. See Cloud Storage bucket below.
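The service account creation and role grants above can be scripted with the gcloud CLI. The project ID and service account name below are hypothetical; replace them with your own values.

```shell
# Hypothetical identifiers -- replace with your own.
PROJECT_ID="tower-nf"
SA_NAME="seqera-batch"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

# Create the custom service account.
gcloud iam service-accounts create "$SA_NAME" --project "$PROJECT_ID"

# Grant each role listed above on the project.
for ROLE in roles/batch.agentReporter roles/batch.jobsEditor \
            roles/logging.logWriter roles/iam.serviceAccountUser; do
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="$ROLE"
done
```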
User permissions
Ask your Google Cloud administrator to grant you the following IAM user permissions to interact with your custom service account:
- Batch Job Editor (`roles/batch.jobsEditor`) on the project
- Service Account User (`roles/iam.serviceAccountUser`) on the job's service account (default: Compute Engine service account)
- View Service Accounts (`roles/iam.serviceAccountViewer`) on the project
To configure a credential in Seqera, you must first create a service account JSON key file:
1. In the Google Cloud navigation menu, select IAM & Admin > Service Accounts.
2. Select the email address of the service account.
   The Compute Engine default service account is not recommended for production environments due to its powerful permissions. To use a service account other than the Compute Engine default, specify the service account email address under Advanced options on the Seqera compute environment creation form.
3. Select Keys > Add key > Create new key.
4. Select JSON as the key type.
5. Select Create.

A JSON file is downloaded to your computer. This file contains the credential needed to configure the compute environment in Seqera. You can manage your key from the Service Accounts page.
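The key file can also be created with the gcloud CLI. The service account email below is the hypothetical example used earlier; substitute your own.

```shell
# Download a JSON key for the service account to the current directory.
gcloud iam service-accounts keys create seqera-key.json \
  --iam-account="seqera-batch@tower-nf.iam.gserviceaccount.com"
```

Treat this file as a secret: it grants all of the service account's permissions to anyone who holds it.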
Cloud Storage bucket
Google Cloud Storage is a type of object storage. To access files and store the results for your pipelines, create a Cloud bucket that your Seqera service account can access.
Create a Cloud Storage bucket
1. In the hamburger menu (≡), select Cloud Storage.
2. From the Buckets tab, select Create.
3. Enter a name for your bucket. You will reference this name when you create the compute environment in Seqera.
4. Select Region for the Location type and select the Location for your bucket. You'll reference this location when you create the compute environment in Seqera.
   The Batch API is available in a limited number of locations. These locations are only used to store metadata about the pipeline operations. The storage bucket and compute resources can be in any region.
5. Select Standard for the default storage class.
6. Under Access control, select Uniform.
7. Select any additional object data protection tools, per your organization's data protection requirements.
8. Select Create. After the bucket is created, you are redirected to the Bucket details page.
9. Select Permissions, then Grant access under View by principals.
10. Copy the email address of your service account into New principals.
11. Select the Storage Admin role.
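The bucket creation and per-bucket role grant above can be sketched with the gcloud CLI. The bucket name, region, and service account email are example values; replace them with your own.

```shell
# Create a single-region bucket with uniform access control.
gcloud storage buckets create gs://my-bucket \
  --location=europe-north1 \
  --default-storage-class=STANDARD \
  --uniform-bucket-level-access

# Grant the service account Storage Admin on this bucket only.
gcloud storage buckets add-iam-policy-binding gs://my-bucket \
  --member="serviceAccount:seqera-batch@tower-nf.iam.gserviceaccount.com" \
  --role=roles/storage.admin
```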
You've created a project, enabled the necessary Google APIs, created a bucket, and created a service account JSON key file with the required credentials. You now have what you need to set up a new compute environment in Seqera.
Seqera compute environment
Your Seqera compute environment uses resources that you may be charged for in your Google Cloud account. See Cloud costs for guidelines to manage cloud resources effectively and prevent unexpected costs.
After your Google Cloud resources have been created, create a new Seqera compute environment:
1. In a workspace, select Compute Environments > New Environment.
2. Enter a descriptive name for this environment, e.g., Google Cloud Batch (europe-north1).
3. Select Google Cloud Batch as the target platform.
4. From the Credentials drop-down, select existing Google credentials, or select + to add new credentials. If you choose to use existing credentials, skip to step 7.
5. Enter a name for the credentials, e.g., Google Cloud Credentials.
6. Enter the Service account key from the JSON key file you created earlier.
7. Select the Location where you will execute your pipelines. See Location to learn more.
8. In the Pipeline work directory field, enter your storage bucket URL, e.g., `gs://my-bucket`. This bucket must be accessible in the location selected in the previous step.
   When you specify a Cloud Storage bucket as your work directory, this bucket is used for the Nextflow cloud cache by default. You can specify an alternative cache location with the Nextflow config file field on the pipeline launch form.
9. (Optional) Select Enable Wave containers to facilitate access to private container repositories and provision containers in your pipelines using the Wave containers service. See Wave containers for more information.
10. (Optional) Select Enable Fusion v2 to allow access to your GCS-hosted data via the Fusion v2 virtual distributed file system. This speeds up most data operations. The Fusion v2 file system requires Wave containers to be enabled in the previous step. See Fusion file system for configuration details.
11. Enable Spot to use spot instances, which have significantly reduced cost compared to on-demand instances.
12. Apply Resource labels to the cloud resources consumed by this compute environment. Workspace default resource labels are prefilled.
13. Expand Staging options to include optional pre- or post-run Bash scripts that execute before or after the Nextflow pipeline execution in your environment.
14. Specify custom Environment variables for the Head job and/or Compute jobs.
15. Configure any advanced options you need:
    - Enable Use Private Address to ensure that your Google Cloud VMs aren't accessible to the public internet.
    - Use Boot disk size to control the boot disk size of VMs.
    - Use Head Job CPUs and Head Job Memory to specify the CPUs and memory allocated for head jobs.
    - Use Service Account email to specify a service account email other than the Compute Engine default to execute workflows with this compute environment (recommended for production environments).
    - Use VPC to specify the name of a VPC network to be used by this compute environment.
    - Use Subnet to specify the name of a subnet to be used by this compute environment.
      VPC and Subnet must both be specified for your compute environment to use either.
16. Select Create to finish the compute environment setup.
Your new compute environment may take a few minutes to become available for pipeline execution after it is added to your workspace.
See Launch pipelines to start running pipelines in your Google Cloud Batch compute environment.
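Seqera generates the Nextflow configuration for this compute environment automatically. For reference, running Nextflow directly against Google Cloud Batch with equivalent settings would look roughly like the following config fragment, using the example bucket, project, and location from this guide (replace them with your own values):

```groovy
// Sketch of a nextflow.config for Google Cloud Batch.
// All values below are the hypothetical examples from this guide.
process.executor = 'google-batch'
workDir          = 'gs://my-bucket/work'   // Pipeline work directory

google.project    = 'tower-nf'             // Google Cloud project ID
google.location   = 'europe-north1'        // Location selected in step 7
google.batch.spot = true                   // Corresponds to the Enable Spot option
```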