This is a guest post from Xiangmin Xu and Tanuj Tiwari, Members of Technical Staff at Character.AI.
At Character.AI, our mission is to create immersive and personalized AI experiences. As our platform grew, we faced significant challenges in scaling our notification system. This blog post explores how we leveraged Temporal to overcome these challenges and build a robust, scalable notification infrastructure.
The Challenge: Scaling Notifications
As Character.AI's user base expanded, we encountered several issues with our notification system:
- Volume: We needed to handle a massive increase in notification volume.
- Personalization: Each notification needed to be tailored to individual users.
- Reliability: The system had to be robust and capable of handling failures gracefully.
- Scalability: We needed to process notifications for millions of users efficiently.
The Solution: Temporal Workflows
After evaluating various options, we chose to integrate Temporal into our notification system. Temporal is a workflow engine that helps build scalable, resilient applications. Here's how it addressed our challenges:
1. Centralized Orchestration
We built a centralized system for delivering personalized notifications using Temporal as the backbone. This system consists of two main components:
- Engagement Service: Decides which notifications to send to users, when to send them, and what content to include.
- Delivery Service: Schedules, compiles, and delivers notifications at scale.
Both services are modeled as Temporal Workflows, each with multiple Activities.
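To make the split between the two components concrete, here is a minimal sketch in plain Python. It is not the production code and does not use the Temporal SDK; the `NotificationRequest` shape, the function names, the template text, and the one-hour delay are all assumptions made for illustration. In a Temporal deployment, steps like these would typically run as Activities called from a Workflow.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class NotificationRequest:
    """Hypothetical hand-off from the engagement side to the delivery side."""
    user_id: str
    send_at: datetime
    template: str
    params: dict


def decide_notification(user_id: str, display_name: str, now: datetime) -> NotificationRequest:
    """Engagement step: choose which notification to send, when, and with what content."""
    return NotificationRequest(
        user_id=user_id,
        send_at=now + timedelta(hours=1),  # illustrative delay, not a real policy
        template="Hi {name}, your character sent you a message.",
        params={"name": display_name},
    )


def is_due(request: NotificationRequest, now: datetime) -> bool:
    """Delivery step (scheduling): a request is ready once its send time has passed."""
    return request.send_at <= now


def compile_notification(request: NotificationRequest) -> str:
    """Delivery step (compiling): render the final text from template and parameters."""
    return request.template.format(**request.params)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    request = decide_notification("user-1", "Ada", now)
    print(is_due(request, now), compile_notification(request))
```

Keeping the decision (what and when) separate from the delivery (schedule, compile, send) means either side can change without touching the other.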
2. Workflow Management
Temporal allowed us to define and manage complex workflows for notification generation and delivery. Key benefits include:
- Fault Tolerance: Temporal's built-in retry mechanisms ensure that temporary failures don't disrupt the entire process.
- State Management: Workflows can be paused, resumed, or restarted from their last known state, improving system resilience.
- Visibility: Temporal provides excellent observability into Workflow execution, making it easier to monitor and debug issues.
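The fault-tolerance point is easier to picture with a small example. The sketch below shows the general retry-with-exponential-backoff pattern in plain Python. It is an illustration of the idea only, not Temporal's implementation or API: the function name, the `TransientError` class, and the default values are assumptions. The parameter names loosely mirror the concepts a Temporal retry policy exposes (initial interval, backoff coefficient, maximum interval, maximum attempts).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def run_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    initial_interval: float = 1.0,
    backoff_coefficient: float = 2.0,
    max_interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or max_attempts is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(delay)
            # Grow the wait between attempts, capped at max_interval.
            delay = min(delay * backoff_coefficient, max_interval)
    raise AssertionError("unreachable")
```

With a workflow engine, this loop does not have to be written by hand for every call: the retry behavior is declared as a policy and applied by the engine.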
3. Scalable Batching
To handle the massive volume of notifications, we implemented a batching system using Temporal:
- BigQuery Integration: We use BigQuery to efficiently fetch user data in streams.
- Parallel Processing: Multiple streams are processed concurrently, significantly improving throughput.
- Kafka Integration: Processed data is dispatched to Kafka for further handling.
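The shape of this pipeline (several streams read concurrently, each row turned into a message and handed off) can be sketched with `asyncio`. This is a simplified stand-in under stated assumptions: `read_stream` yields rows from an in-memory list instead of calling BigQuery, `publish` is whatever coroutine the caller supplies instead of a Kafka producer, and the message fields are invented for the example.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Iterable


async def read_stream(rows: list[dict]) -> AsyncIterator[dict]:
    """Stand-in for one stream of user rows (hypothetical; no BigQuery call is made)."""
    for row in rows:
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield row


async def process_stream(
    stream: AsyncIterator[dict],
    publish: Callable[[dict], Awaitable[None]],
) -> int:
    """Consume one stream and hand each resulting message to publish."""
    count = 0
    async for row in stream:
        await publish({"user_id": row["user_id"], "kind": "notification"})
        count += 1
    return count


async def process_all(
    streams: Iterable[AsyncIterator[dict]],
    publish: Callable[[dict], Awaitable[None]],
) -> int:
    """Process every stream concurrently and return the total rows handled."""
    counts = await asyncio.gather(*(process_stream(s, publish) for s in streams))
    return sum(counts)
```

Because each stream is consumed independently, adding streams increases concurrency without changing the per-row logic.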
4. Scheduled Notifications
We implemented a ScheduledNotificationsWorkflow that runs every 5 minutes to process pending notifications:
- Batch Processing: Notifications are fetched and processed in batches of up to 1000.
- Parallelization: Each batch is divided into 20 chunks and processed concurrently.
- Automatic Retries: Temporal's retry policies ensure that transient failures are handled gracefully.
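The batch-and-chunk arithmetic above works out to 50 notifications per chunk for a full batch of 1000. The sketch below shows one way to split a batch and process the chunks concurrently. Only the numbers 1000 and 20 come from this post; the function names, the contiguous near-equal split, and the `send_chunk` callback are assumptions, and the code uses plain `asyncio` instead of Temporal Activities.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

BATCH_SIZE = 1000  # from the post: batches of up to 1000
NUM_CHUNKS = 20    # from the post: each batch is divided into 20 chunks


def split_into_chunks(batch: Sequence[str], num_chunks: int = NUM_CHUNKS) -> list[list[str]]:
    """Split a batch into at most num_chunks contiguous, near-equal chunks."""
    chunk_size = max(1, -(-len(batch) // num_chunks))  # ceiling division
    return [list(batch[i:i + chunk_size]) for i in range(0, len(batch), chunk_size)]


async def process_batch(
    batch: Sequence[str],
    send_chunk: Callable[[list[str]], Awaitable[int]],
) -> int:
    """Process one batch: split it, run all chunks concurrently, sum what was sent."""
    chunks = split_into_chunks(batch[:BATCH_SIZE])
    results = await asyncio.gather(*(send_chunk(chunk) for chunk in chunks))
    return sum(results)
```

A batch smaller than 1000 simply yields smaller or fewer chunks; an empty batch yields none, so the scheduled run becomes a no-op.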
Results and Benefits
By integrating Temporal into our notification system, we achieved:
- 100x Increase in Notification Volume: The system now handles 100 times the notification volume it did before.
- Improved Reliability: The system is more robust and can recover from failures automatically.
- Better Observability: Temporal provides detailed insights into Workflow execution.
- Flexibility: We can easily add new notification channels and experiment with different notification types.
Conclusion
Integrating Temporal into Character.AI's notification system has allowed us to scale our operations, improve reliability, and maintain the personalized experience our users expect. As we continue to grow, Temporal will play a crucial role in helping us deliver immersive AI experiences to millions of users worldwide.
By leveraging Temporal's powerful workflow engine, we've built a notification system that's not just scalable and reliable, but also flexible enough to evolve with our rapidly growing platform.