1/6/2025 Update: Google Cloud Run now supports autoscaling by CPU Utilization, which is critical for running Temporal Workers. The code has also been updated to use newer versions of Spring Boot and the Temporal SDK.
Unlike Kubernetes, Google Cloud Run makes it trivial to deploy container-based applications. Have a web API that you want to deploy? Package it up in a container and let Cloud Run take over provisioning the underlying infrastructure, load balancer, and DNS endpoint, and your application is ready to receive traffic. As traffic to your popular API goes up and down, Cloud Run automatically scales based on the inbound traffic and CPU utilization. It's truly amazing how easy it is to deploy, run, and scale web-based applications.
Temporal Worker applications, however, operate differently. They long-poll Temporal Cloud and process tasks as they become available. Because of this inherent difference, Cloud Run, by default, will not see any activity (no inbound requests) and stop the Worker from running. Clearly not an optimal situation.
Another limitation is that Cloud Run only allows a single port to be publicly exposed, which means you need to decide what to bind to that port. Temporal Workers that also bind a UI or API to that port can't expose SDK metrics. Temporal Workers that do not have a UI or API could theoretically expose SDK metrics, but you likely don't want to expose internal details of your application to the public.
With some configuration and a sidecar container, getting a Temporal Worker running on Cloud Run is straightforward. This article focuses on the required Cloud Run configuration and the sidecar container; it does not cover handling mTLS secrets or other application-specific details. A complete example that includes the application details can be found here.
Configuring Your Application
One of the nice features of Cloud Run is that most applications need nothing more than a container image. Since this example requires non-default behavior, you need to use a Cloud Run Service YAML file to configure it.
Disable CPU Throttling
To ensure that the Worker stays running, you'll need to disable CPU throttling. Note that this means your Worker will continue to run even when there isn't work to process.
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/cpu-throttling: 'false' # we need to keep the CPU running
Set Minimum Number of Instances
By default, an application running in Cloud Run is scaled down to zero instances if there is no inbound web traffic. To change this, set the minimum number of instances:
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: '1' # keep one instance available
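Taken together, the two annotations sit in the same place in the Service YAML file. The following is a minimal sketch rather than the exact file from the example repository: the service name and the image path are placeholders, and a real file will carry additional settings.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: temporal-worker                 # hypothetical service name
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/cpu-throttling: 'false'   # keep the CPU allocated outside of requests
        autoscaling.knative.dev/minScale: '1'        # never scale down to zero instances
    spec:
      containers:
        - image: <WORKER_IMAGE>         # placeholder for your Worker's container image
```

A file like this can be applied with gcloud run services replace service.yaml --region=<REGION>, where service.yaml is whatever name you saved the file under.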
Sidecar Container
Sidecar containers are a common deployment pattern in the Kubernetes ecosystem. A sidecar is a second application, deployed alongside the primary service or application, that provides additional functionality. Kubernetes service meshes such as Istio and Linkerd use sidecars to control traffic in and out of the application. Cloud Run recently added support for sidecar containers.
Temporal Workers can be configured to emit metrics. These metrics are made available as a Prometheus scrape endpoint. Since the application won't be making these metrics publicly visible, a sidecar container will be deployed to read the metrics endpoint and send the metrics to Google Cloud Managed Service for Prometheus using the OpenTelemetry Collector.
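Since the example application is built on Spring Boot, one way such a scrape endpoint could be served is through Spring Boot Actuator on a separate management port. The sketch below is an assumption for illustration, not necessarily how the example repository does it. It presumes that Actuator and the Micrometer Prometheus registry are on the classpath, and it uses port 8081 and the path /prometheus to match the Collector configuration shown later.

```yaml
# application.yaml (hypothetical): serve metrics on a port separate from the public one
management:
  server:
    port: 8081              # management endpoints listen here, not on the public port
  endpoints:
    web:
      base-path: /          # drop the default /actuator prefix, so the endpoint is /prometheus
      exposure:
        include: prometheus # expose only the Prometheus scrape endpoint
```

Because Cloud Run exposes only one port publicly, an endpoint on a second port like this is reachable only from inside the instance.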
OpenTelemetry Collector
OpenTelemetry is an observability framework and toolkit designed to manage telemetry data such as traces, metrics, and logs. It is vendor- and tool-agnostic. The OpenTelemetry Collector acts like a proxy that receives, processes, and exports data to a supported platform. In addition to Google Cloud Managed Service for Prometheus, exporters exist for Datadog, Splunk, and Google Cloud Pub/Sub. A full list of exporters is available here.
Collector Configuration
The OpenTelemetry Collector uses a configuration file with four main sections: receivers, processors, exporters, and service. The receivers section needs to look similar to this:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'temporal-metrics-app'
          scrape_interval: 5s
          metrics_path: '/prometheus'
          static_configs:
            - targets: ['127.0.0.1:8081']
The two important lines are metrics_path, which is the path used to read the application's metrics, and targets, which indicates the IP address and port number of the application. Notice that the IP refers to localhost, and the port must match the port of the application that exposes the metrics.
In Cloud Run, sidecar containers share the same network namespace and communicate with each other using localhost and the corresponding port.
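To make the shared network namespace concrete, here is a rough sketch of how the two containers could be declared in the Service YAML file. The container names, image placeholders, and port 8080 are assumptions for illustration; the example repository's file may differ.

```yaml
spec:
  template:
    spec:
      containers:
        - name: app                   # the Temporal Worker (hypothetical name)
          image: <WORKER_IMAGE>       # placeholder image
          ports:
            - containerPort: 8080     # the only container with a port, so it receives ingress
        - name: collector             # the OpenTelemetry Collector sidecar (hypothetical name)
          image: <COLLECTOR_IMAGE>    # placeholder image containing the collector and its config
```

Only the container that declares a port receives external traffic. The collector declares none and reaches the Worker's metrics endpoint over 127.0.0.1, as configured in the receivers section.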
For the exporters section, the configuration is trivial:
exporters:
  googlemanagedprometheus:
Simply defining googlemanagedprometheus is sufficient. The OpenTelemetry Collector supports multiple exporters (and multiple receivers), so if you want to send the metrics to an additional destination, or to somewhere other than Google Cloud Managed Service for Prometheus, add the appropriate configuration here.
The configuration file uses other sections as well, which are left out here for brevity. The full configuration file is available here.
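For orientation, the service section is what connects receivers to exporters. A minimal version, assuming no processors are needed, might look like the sketch below. The actual file in the repository likely includes processors and other settings not shown here.

```yaml
service:
  pipelines:
    metrics:
      receivers: [prometheus]                  # the receiver defined in the receivers section
      exporters: [googlemanagedprometheus]     # the exporter defined in the exporters section
```

A receiver or exporter that is defined but not referenced in a pipeline is not used, so each additional destination has to be listed here as well.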
Viewing the Metrics
Once the application is deployed and Workers have been triggered, metrics will be sent to Google Cloud Managed Service for Prometheus. To view them, open the Google Cloud Console and navigate to Monitoring, Metrics Explorer. In the Metric drop-down under Select a Metric, scroll down to Prometheus Target, Temporal to see a list of active metrics.
Click a metric, such as Prometheus/temporal_long_request_total/counter, and click Apply. In the time box near the upper right of the screen, click the down arrow and select Last 30 minutes. If you have activity, you should see a graph that might look similar to this:
Feel free to experiment with adding additional metrics. The Temporal documentation on SDK metrics provides detailed information on metrics, their type, and other key information. Key metrics for tuning performance on workers can be found here.
Scaling
Cloud Run scales instances based on incoming requests, CPU utilization, or both. Which of these applies depends on the billing type.
When request-based billing is configured, CPU utilization scaling only works in conjunction with incoming requests. Since Temporal Workers run continuously, this approach will not work. With instance-based billing, Cloud Run scales based solely on CPU utilization, which works better for Temporal Workers. Additional details on scaling and billing settings can be found here.
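In the Service YAML file, instance-based billing corresponds to the CPU throttling annotation from earlier in this article, and the range the autoscaler works within can be bounded with the minScale and maxScale annotations. The values below are arbitrary examples rather than recommendations.

```yaml
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/cpu-throttling: 'false'   # instance-based billing: CPU always allocated
        autoscaling.knative.dev/minScale: '1'        # lower bound for the autoscaler
        autoscaling.knative.dev/maxScale: '5'        # upper bound; 5 is an example value
```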
Cloud Run handles the scaling of Temporal Workers well when CPU utilization is a good signal for adding or removing instances. For workloads that need a different signal, such as task queue backlog, you will need a custom scaling solution that reads the appropriate metrics and updates the number of instances using the gcloud command:
gcloud run services update <SERVICE_NAME> --region=<REGION> --min-instances=X
Wrapping It All Up
When should you use Cloud Run, and when should you use Kubernetes? The answer to this question is not that simple because it depends on a number of factors. How much experience does your team have with Kubernetes? How quickly will you need to scale? How many distinct Temporal Workers will you be running? Are you using a microservices architecture?
My general recommendation, without knowing the specifics of your requirements, skills, and objectives, would be to start with Cloud Run. If you outgrow Cloud Run, then use GKE Autopilot, and if you run into a limitation on Autopilot, use GKE Standard.
Hopefully, this post—plus some help from a sidecar container—will prepare you for deploying Temporal Workers to Cloud Run and viewing SDK metrics. For a full working example, be sure to check out the repository here.
What types of Temporal Workers will you be deploying on Cloud Run? Let us know in our Community Slack Channel!