Guidance for Atlas Resiliency
MongoDB Atlas is a highly performant database that is designed to maintain uptime regardless of infrastructure outages, system maintenance, and more. Use the guidance on this page to plan settings that maximize the resiliency of your application and database.
Features for Atlas Resiliency
Database Replication
Atlas clusters consist of a replica set with a minimum of three nodes, and you can increase the node count to any odd number of nodes you require. Atlas first writes data from your application to a primary node, and then Atlas incrementally replicates and stores that data across all secondary nodes within your cluster. To control the durability of your data storage, you can adjust the write concern of your application code to complete the write only once a certain number of secondaries have committed the write. To learn more, see Configure Read and Write Concerns.
By default, Atlas distributes cluster nodes across availability zones within one of your chosen cloud provider's availability regions. For example, if your cluster is deployed to the cloud provider region us-east, Atlas deploys nodes to us-east-a, us-east-b, and us-east-c by default.
To learn more about high availability and node distribution across regions, see Guidance for Atlas High Availability.
Self-Healing Deployments
Atlas clusters must consist of an odd number of voting nodes, because the node pool must elect a primary node to which your application writes and from which it reads directly. A cluster consisting of an even number of voting nodes could produce a tied election that prevents a primary node from being elected.
If a primary node becomes unavailable because of an infrastructure outage, a maintenance window, or any other reason, Atlas clusters self-heal by promoting an existing secondary node to primary in order to maintain database availability. To learn more about this process, see How does MongoDB Atlas deliver high availability?
Maintenance Window Uptime
Atlas maintains uptime during scheduled maintenance by applying updates in a rolling fashion to one node at a time. During this process, Atlas elects a new primary when necessary just as it does during any other unplanned primary node outage.
When you configure a maintenance window, select a time that corresponds to when your application has the lowest amount of traffic.
Monitoring
Atlas provides built-in tools to monitor cluster performance, query performance and more. Additionally, Atlas integrates easily with third-party services.
By actively monitoring your clusters, you can gain valuable insights into query and deployment performance. To learn more about monitoring in Atlas, see Monitor Your Clusters and Monitoring and Alerts.
Deployment Resilience Testing
You can simulate various scenarios that require disaster recovery workflows in order to measure your preparedness for such events. Specifically, with Atlas you can test primary node failover and simulate regional outages. We strongly recommend that you run these tests before deploying an application to production.
Cluster Termination Safeguards
You can prevent accidental deletion of Atlas clusters by enabling termination protection. Enabling termination protection is especially important when leveraging IaC tools like Terraform to ensure that a redeployment does not provision new infrastructure. To delete a cluster that has termination protection enabled, you must first disable termination protection. By default, Atlas disables termination protection for all clusters.
Database Backups
Atlas Cloud Backups facilitate cloud backup storage using the native snapshot functionality of the cloud service provider on which your cluster is deployed. For example, if you deploy your cluster on AWS, you can elect to back up your cluster's data with snapshots taken at configurable intervals and stored in AWS S3.
To learn more about database backup and snapshot retrieval, see Back Up Your Cluster.
For recommendations on backups, see Guidance for Atlas Backups.
Recommendations for Atlas Resiliency
Use MongoDB 8.0 or Later
To improve the resiliency of your cluster, upgrade your cluster to MongoDB 8.0. MongoDB 8.0 introduces the following performance improvements and new features related to resilience:
Operation rejection filters to reactively mitigate expensive queries
Cluster-level timeouts for proactive protection against expensive read operations
Better workload isolation with the moveCollection command
Connecting Your Application to Atlas
We recommend that you use a connection method built on the most current driver version for your application's programming language whenever possible. And while the default connection string Atlas provides is a good place to start, you might want to tune it for performance in the context of your specific application and deployment architecture.
For example, you might want to set a short maxTimeMS for a microservice that provides a login capability, whereas you might want to set maxTimeMS to a much larger value for a long-running analytics job that runs against the cluster.
Tuning your connection pool settings is particularly important in the context of enterprise-level application deployments.
Connection Pool Considerations for Performant Applications
Opening a database client connection is one of the most resource-intensive operations involved in maintaining the client connection pool that gives your application access to your Atlas cluster. Because of this, it is worth planning how and when your application opens client connections.
For example, if you are scaling your Atlas cluster to meet user demand, consider the minimum pool size of connections that your application will consistently need. That way, when the application pool scales up, the additional networking and compute load of opening new client connections doesn't undermine your application's time-sensitive need for increased database operations.
Min and Max Connection Pool Size
If your minPoolSize and maxPoolSize values are similar, the majority of your database client connections open at application startup. For example, if your minPoolSize is set to 10 and your maxPoolSize is set to 12, 10 client connections open at application startup, and only 2 more connections can then be opened during application runtime. However, if your minPoolSize is set to 10 and your maxPoolSize is set to 100, up to 90 additional connections can be opened as needed during application runtime.
Opening a new client connection carries additional network overhead. Consider whether you would prefer to incur that network cost at application startup, or dynamically on an as-needed basis during application runtime. The latter can impact operational latency and perceived performance for end users if a sudden spike in requests requires a large number of additional connections to be opened at once.
Your application's architecture is central to this consideration. If, for example, you deploy your application as microservices, consider which services should call Atlas directly as a means of controlling the dynamic expansion and contraction of your connection pool. Alternatively, if your application deployment leverages single-threaded resources, like AWS Lambda, your application will only ever be able to open and use one client connection, so your minPoolSize and maxPoolSize should both be set to 1.
Query Timeout
Almost invariably, workload-specific queries from your application will vary in terms of the amount of time they take to execute in Atlas and in terms of the amount of time your application can wait for a response.
You can set query timeout behavior globally in Atlas, and you can also define it at the query level.
Retryable Database Reads and Writes
Atlas supports retryable read and retryable write operations. When enabled, Atlas retries read and write operations once as a safeguard against intermittent network outages.
Configure Read and Write Concerns
Atlas clusters eventually replicate all data across all nodes. However, you can configure the number of nodes across which data must be replicated before a read or write operation is reported to have been successful. You can define read concerns and write concerns globally in Atlas, and you can also define them at the client level in your connection string. Atlas has a default write concern of majority, meaning that data must be replicated across more than half of the nodes in your cluster before Atlas reports success. Conversely, Atlas has a default read concern of local, which means that when queried, Atlas retrieves data from only one node in your cluster.
Isolate the Impact of Busy, Unsharded Collections
Sharding allows you to scale your cluster horizontally. With MongoDB, you can shard some collections, while allowing other collections in the same cluster to remain unsharded. When you create a new database, the shard in the cluster with the least amount of data is picked as that database's primary shard by default. All of the unsharded collections of that database live in that primary shard by default. This can cause increased traffic to the primary shard as your workload grows, especially if the workload growth focuses on the unsharded collections on the primary shard.
To distribute this workload better, MongoDB 8.0 allows you to move an unsharded collection from the primary shard to other shards with the moveCollection command. This allows you to place active, busy collections onto shards with less expected resource usage. With this, you can:
Optimize performance on larger, complex workloads.
Achieve better resource utilization.
Distribute data more evenly across shards.
We recommend isolating your collection in the following circumstances:
Your primary shard experiences a significant workload due to the presence of multiple high-throughput unsharded collections.
You anticipate that an unsharded collection will experience future growth, which could become a bottleneck for other collections.
You are running a one-collection-per-cluster deployment design and want to isolate those customers based on priority or workloads.
Your shards have more than a proportional amount of data due to the number of unsharded collections located on them.
To learn how to move an unsharded collection with mongosh, see Move a Collection.
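In mongosh, the move takes roughly the following shape; the namespace and shard name here are placeholders for your own deployment:

```javascript
// Move the unsharded collection "app.invoices" off the primary shard
// onto the shard named "shard02" (both names are placeholders).
db.adminCommand( { moveCollection: "app.invoices", toShard: "shard02" } )
```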
Disaster Recovery
For recommendations on disaster recovery best practices for Atlas, see Guidance for Atlas Disaster Recovery and Recommended Configurations for High Availability and Recovery.
Resilient Example Application
The example application brings together the following recommendations to ensure resilience against network outages and failover events:
Use the Atlas-provided connection string with retryable writes, majority write concern, and default read concern.
Specify an operation time limit with the maxTimeMS method. For instructions on how to set maxTimeMS, refer to your specific Driver Documentation.
Handle errors for duplicate keys and timeouts.
The application is an HTTP API that allows clients to create or list user records. It exposes an endpoint at http://localhost:3000 that accepts GET and POST requests:
Method | Endpoint | Description |
---|---|---|
GET | /users | Gets a list of user names from a users collection. |
POST | /users | Requires a name in the request body. Adds a new user to a users collection. |
Note
The following server application uses NanoHTTPD, which you need to add to your project as a dependency before you can run it.
```java
// File: App.java

import java.util.Map;
import java.util.logging.Logger;

import org.bson.Document;
import org.json.JSONArray;

import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;

import fi.iki.elonen.NanoHTTPD;

public class App extends NanoHTTPD {
    private static final Logger LOGGER = Logger.getLogger(App.class.getName());

    static int port = 3000;
    static MongoClient client = null;

    public App() throws Exception {
        super(port);

        // Replace the uri string with your MongoDB deployment's connection string
        String uri = "<atlas-connection-string>";
        client = MongoClients.create(uri);

        start(NanoHTTPD.SOCKET_READ_TIMEOUT, false);
        LOGGER.info("\nStarted the server: http://localhost:" + port + "/ \n");
    }

    public static void main(String[] args) {
        try {
            new App();
        } catch (Exception e) {
            LOGGER.severe("Couldn't start server:\n" + e);
        }
    }

    public Response serve(IHTTPSession session) {
        StringBuilder msg = new StringBuilder();
        Map<String, String> params = session.getParms();

        Method reqMethod = session.getMethod();
        String uri = session.getUri();

        if (Method.GET == reqMethod) {
            if (uri.equals("/")) {
                msg.append("Welcome to my API!");
            } else if (uri.equals("/users")) {
                msg.append(listUsers(client));
            } else {
                msg.append("Unrecognized URI: ").append(uri);
            }
        } else if (Method.POST == reqMethod) {
            try {
                String name = params.get("name");
                if (name == null) {
                    throw new Exception("Unable to process POST request: 'name' parameter required");
                } else {
                    insertUser(client, name);
                    msg.append("User successfully added!");
                }
            } catch (Exception e) {
                msg.append(e);
            }
        }

        return newFixedLengthResponse(msg.toString());
    }

    static String listUsers(MongoClient client) {
        MongoDatabase database = client.getDatabase("test");
        MongoCollection<Document> collection = database.getCollection("users");

        final JSONArray jsonResults = new JSONArray();
        collection.find().forEach((result) -> jsonResults.put(result.toJson()));

        return jsonResults.toString();
    }

    static String insertUser(MongoClient client, String name) throws MongoException {
        MongoDatabase database = client.getDatabase("test");
        MongoCollection<Document> collection = database.getCollection("users");

        collection.insertOne(new Document().append("name", name));
        return "Successfully inserted user: " + name;
    }
}
```
Note
The following server application uses Express, which you need to add to your project as a dependency before you can run it.
```javascript
const express = require('express');
const bodyParser = require('body-parser');

// Use the latest drivers by installing & importing them
const MongoClient = require('mongodb').MongoClient;

const app = express();
app.use(bodyParser.json());
app.use(bodyParser.urlencoded({ extended: true }));

const uri = "mongodb+srv://<db_username>:<db_password>@cluster0-111xx.mongodb.net/test?retryWrites=true&w=majority";

const client = new MongoClient(uri, {
  useNewUrlParser: true,
  useUnifiedTopology: true
});

// ----- API routes ----- //
app.get('/', (req, res) => res.send('Welcome to my API!'));

app.get('/users', (req, res) => {
  const collection = client.db("test").collection("users");

  collection
    .find({})
    .maxTimeMS(5000)
    .toArray((err, data) => {
      if (err) {
        res.send("The request has timed out. Please check your connection and try again.");
      }
      return res.json(data);
    });
});

app.post('/users', (req, res) => {
  const collection = client.db("test").collection("users");
  collection.insertOne({ name: req.body.name })
    .then(result => {
      res.send("User successfully added!");
    }, err => {
      res.send("An application error has occurred. Please try again.");
    })
});
// ----- End of API routes ----- //

app.listen(3000, () => {
  console.log(`Listening on port 3000.`);
  client.connect(err => {
    if (err) {
      console.log("Not connected: ", err);
      process.exit(0);
    }
    console.log('Connected.');
  });
});
```
Note
The following web application uses FastAPI. To create a new application, use the FastAPI sample file structure.
```python
# File: main.py

from fastapi import FastAPI, Body, Request, Response, HTTPException, status
from fastapi.encoders import jsonable_encoder

from typing import List
from models import User

import pymongo
from pymongo import MongoClient
from pymongo import errors

# Replace the uri string with your Atlas connection string
uri = "<atlas-connection-string>"
db = "test"

app = FastAPI()


@app.on_event("startup")
def startup_db_client():
    app.mongodb_client = MongoClient(uri)
    app.database = app.mongodb_client[db]


@app.on_event("shutdown")
def shutdown_db_client():
    app.mongodb_client.close()

##### API ROUTES #####

@app.get("/users", response_model=List[User])
def list_users(request: Request):
    try:
        users = list(request.app.database["users"].find().max_time_ms(5000))
        return users
    except pymongo.errors.ExecutionTimeout:
        raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE, detail="The request has timed out. Please check your connection and try again.")


@app.post("/users")
def new_user(request: Request, user: User = Body(...)):
    user = jsonable_encoder(user)
    try:
        new_user = request.app.database["users"].insert_one(user)
        return {"message": "User successfully added!"}
    except pymongo.errors.DuplicateKeyError:
        raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Could not create user due to existing '_id' value in the collection. Try again with a different '_id' value.")
```