
Guidance for Atlas for Resiliency

MongoDB Atlas is a highly performant database designed to maintain uptime through infrastructure outages, system maintenance, and more. Use the guidance on this page to plan settings that maximize the resiliency of your application and database.

Atlas clusters consist of a replica set with a minimum of three nodes, and you can increase the node count to any odd number of nodes you require. Atlas first writes data from your application to a primary node, and then Atlas incrementally replicates and stores that data across all secondary nodes within your cluster. To control the durability of your data storage, you can adjust the write concern of your application code to complete the write only once a certain number of secondaries have committed the write. To learn more, see Configure Read and Write Concerns.

By default, Atlas distributes cluster nodes across availability zones within one of your chosen cloud provider's availability regions. For example, if your cluster is deployed to the cloud provider region us-east, Atlas deploys nodes to us-east-a, us-east-b and us-east-c by default.

To learn more about high availability and node distribution across regions, see Guidance for Atlas High Availability.

Atlas clusters must consist of an odd number of nodes, because the node pool must elect a primary node that your application reads from and writes to directly. A cluster with an even number of nodes risks a tied election that prevents a primary node from being elected.

If a primary node becomes unavailable because of infrastructure outages, maintenance windows, or any other reason, Atlas clusters self-heal by promoting an existing secondary node to the role of primary node to maintain database availability. To learn more about this process, see How does MongoDB Atlas deliver high availability?

Atlas maintains uptime during scheduled maintenance by applying updates in a rolling fashion to one node at a time. During this process, Atlas elects a new primary when necessary just as it does during any other unplanned primary node outage.

When you configure a maintenance window, select a time that corresponds to when your application has the lowest amount of traffic.

Atlas provides built-in tools to monitor cluster performance, query performance and more. Additionally, Atlas integrates easily with third-party services.

By actively monitoring your clusters, you can gain valuable insights into query and deployment performance. To learn more about monitoring in Atlas, see Monitor Your Clusters and Monitoring and Alerts.

You can simulate various scenarios that require disaster recovery workflows in order to measure your preparedness for such events. Specifically, with Atlas you can test primary node failover and simulate regional outages. We strongly recommend that you run these tests before deploying an application to production.

You can prevent accidental deletion of Atlas clusters by enabling termination protection. Enabling termination protection is especially important when leveraging IaC tools like Terraform to ensure that a redeployment does not provision new infrastructure. To delete a cluster that has termination protection enabled, you must first disable termination protection. By default, Atlas disables termination protection for all clusters.

Atlas Cloud Backups facilitate cloud backup storage using the native snapshot functionality of the cloud service provider on which your cluster is deployed. For example, if you deploy your cluster on AWS, you can elect to back up your cluster's data with snapshots taken at configurable intervals and stored in AWS S3.

To learn more about database backup and snapshot retrieval, see Back Up Your Cluster.

For recommendations on backups, see Guidance for Atlas Backups.

To improve the resiliency of your cluster, upgrade your cluster to MongoDB 8.0, which introduces performance improvements and new features related to resilience.

We recommend that you use a connection method built on the most current driver version for your application's programming language whenever possible. And while the default connection string Atlas provides is a good place to start, you might want to tune it for performance in the context of your specific application and deployment architecture.
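For illustration only, a tuned connection string might layer retry, write concern, and pool options onto the Atlas-provided SRV string (the hostname and credentials here are placeholders):

```
mongodb+srv://<db_username>:<db_password>@<cluster>.mongodb.net/?retryWrites=true&w=majority&maxPoolSize=100
```

Each of these options can also be set programmatically when constructing the client in your driver.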

For example, you might want to set a short maxTimeMS for a microservice that provides a login capability, whereas you may want to set the maxTimeMS to a much larger value if the application code is a long-running analytics job request against the cluster.

Tuning your connection pool settings is particularly important for enterprise-level application deployments.

Opening database client connections is one of the most resource-intensive operations involved in maintaining the client connection pool that gives your application access to your Atlas cluster.

Because of this, it is worth thinking about how and when you would like this process of opening client connections to unfold in the context of your specific application.

For example, if you are scaling your Atlas cluster to meet user demand, consider the minimum pool size of connections your application will consistently need, so that when the application pool scales, the additional network and compute load of opening new client connections doesn't undermine your application's time-sensitive need for increased database operations.

If your minPoolSize and maxPoolSize values are similar, the majority of your database client connections open at application startup. For example, if your minPoolSize is set to 10 and your maxPoolSize is set to 12, 10 client connections open at application startup, and only 2 more connections can then be opened during application runtime. However, if your minPoolSize is set to 10 and your maxPoolSize is set to 100, up to 90 additional connections can be opened as needed during application runtime.

Opening new client connections incurs additional network overhead. Consider whether you would prefer to incur that cost at application startup, or dynamically on an as-needed basis during application runtime, which can affect operational latency and perceived performance for end users if a sudden spike in requests forces a large number of additional connections to open at once.

Your application's architecture is central to this consideration. If, for example, you deploy your application as microservices, consider which services should call Atlas directly as a means of controlling the dynamic expansion and contraction of your connection pool. Alternatively, if your application deployment is leveraging single-threaded resources, like AWS Lambda, your application will only ever be able to open and use one client connection, so your minPoolSize and your maxPoolSize should both be set to 1.

Queries from your application will almost invariably vary both in how long they take to execute in Atlas and in how long your application can wait for a response.

You can set query timeout behavior globally in Atlas, and you can also define it at the query level.

Atlas supports retryable read and retryable write operations. When enabled, Atlas retries read and write operations once as a safeguard against intermittent network outages.

Atlas clusters eventually replicate all data across all nodes. However, you can configure the number of nodes across which data must be replicated before a read or write operation is reported to have been successful. You can define read concerns and write concerns globally in Atlas, and you can also define them at the client level in your connection string. Atlas has a default write concern of majority, meaning that data must be replicated across more than half of the nodes in your cluster before Atlas reports success. Conversely, Atlas has a default read concern of local, which means that when queried, Atlas retrieves data from only one node in your cluster.

Sharding allows you to scale your cluster horizontally. With MongoDB, you can shard some collections, while allowing other collections in the same cluster to remain unsharded. When you create a new database, the shard in the cluster with the least amount of data is picked as that database's primary shard by default. All of the unsharded collections of that database live in that primary shard by default. This can cause increased traffic to the primary shard as your workload grows, especially if the workload growth focuses on the unsharded collections on the primary shard.

To distribute this workload better, MongoDB 8.0 allows you to move an unsharded collection to other shards from the primary shard with the moveCollection command. This allows you to place active, busy collections onto shards with less expected resource usage. With this, you can:

  • Optimize performance on larger, complex workloads.

  • Achieve better resource utilization.

  • Distribute data more evenly across shards.

We recommend isolating your collection in the following circumstances:

  • Your primary shard experiences significant workload due to the presence of multiple high-throughput unsharded collections.

  • You anticipate that an unsharded collection will experience future growth, which could become a bottleneck for other collections.

  • You are running a one-collection-per-customer deployment design and you want to isolate those customers based on priority or workloads.

  • Your shards hold a disproportionate amount of data due to the number of unsharded collections located on them.

To learn how to move an unsharded collection with mongosh, see Move a Collection.

For recommendations on disaster recovery best practices for Atlas, see Guidance for Atlas Disaster Recovery and Recommended Configurations for High Availability and Recovery.

The example application brings together the following recommendations to ensure resilience against network outages and failover events:

  • Use the Atlas-provided connection string with retryable writes, majority write concern, and default read concern.

  • Specify an operation time limit with the maxTimeMS method. For instructions on how to set maxTimeMS, refer to your specific Driver Documentation.

  • Handle errors for duplicate keys and timeouts.

The application is an HTTP API that allows clients to create or list user records. It exposes an endpoint at http://localhost:3000 that accepts GET and POST requests:

Method   Endpoint   Description

GET      /users     Gets a list of user names from a users collection.

POST     /users     Requires a name in the request body. Adds a new user to a users collection.

Note

The following server application uses NanoHTTPD and org.json, which you need to add to your project as dependencies before you can run it.

// File: App.java

import java.util.Map;
import java.util.logging.Logger;

import org.bson.Document;
import org.json.JSONArray;

import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;

import fi.iki.elonen.NanoHTTPD;

public class App extends NanoHTTPD {
    private static final Logger LOGGER = Logger.getLogger(App.class.getName());

    static int port = 3000;
    static MongoClient client = null;

    public App() throws Exception {
        super(port);

        // Replace the uri string with your MongoDB deployment's connection string
        String uri = "<atlas-connection-string>";
        client = MongoClients.create(uri);

        start(NanoHTTPD.SOCKET_READ_TIMEOUT, false);
        LOGGER.info("\nStarted the server: http://localhost:" + port + "/ \n");
    }

    public static void main(String[] args) {
        try {
            new App();
        } catch (Exception e) {
            LOGGER.severe("Couldn't start server:\n" + e);
        }
    }

    @Override
    public Response serve(IHTTPSession session) {
        StringBuilder msg = new StringBuilder();
        Map<String, String> params = session.getParms();

        Method reqMethod = session.getMethod();
        String uri = session.getUri();

        if (Method.GET == reqMethod) {
            if (uri.equals("/")) {
                msg.append("Welcome to my API!");
            } else if (uri.equals("/users")) {
                msg.append(listUsers(client));
            } else {
                msg.append("Unrecognized URI: ").append(uri);
            }
        } else if (Method.POST == reqMethod) {
            try {
                String name = params.get("name");
                if (name == null) {
                    throw new Exception("Unable to process POST request: 'name' parameter required");
                } else {
                    insertUser(client, name);
                    msg.append("User successfully added!");
                }
            } catch (Exception e) {
                msg.append(e);
            }
        }

        return newFixedLengthResponse(msg.toString());
    }

    static String listUsers(MongoClient client) {
        MongoDatabase database = client.getDatabase("test");
        MongoCollection<Document> collection = database.getCollection("users");

        final JSONArray jsonResults = new JSONArray();
        collection.find().forEach((result) -> jsonResults.put(result.toJson()));

        return jsonResults.toString();
    }

    static String insertUser(MongoClient client, String name) throws MongoException {
        MongoDatabase database = client.getDatabase("test");
        MongoCollection<Document> collection = database.getCollection("users");

        collection.insertOne(new Document().append("name", name));
        return "Successfully inserted user: " + name;
    }
}

Note

The following server application uses Express, which you need to add to your project as a dependency before you can run it.

const express = require('express');
const bodyParser = require('body-parser');

// Use the latest drivers by installing & importing them
const MongoClient = require('mongodb').MongoClient;

const app = express();
app.use(bodyParser.json());
app.use(bodyParser.urlencoded({ extended: true }));

const uri = "mongodb+srv://<db_username>:<db_password>@cluster0-111xx.mongodb.net/test?retryWrites=true&w=majority";

const client = new MongoClient(uri, {
  useNewUrlParser: true,
  useUnifiedTopology: true
});

// ----- API routes ----- //
app.get('/', (req, res) => res.send('Welcome to my API!'));

app.get('/users', (req, res) => {
  const collection = client.db("test").collection("users");

  collection
    .find({})
    .maxTimeMS(5000)
    .toArray((err, data) => {
      if (err) {
        // Return here so the timeout response isn't followed by a second reply
        return res.send("The request has timed out. Please check your connection and try again.");
      }
      return res.json(data);
    });
});

app.post('/users', (req, res) => {
  const collection = client.db("test").collection("users");
  collection.insertOne({ name: req.body.name })
    .then(result => {
      res.send("User successfully added!");
    }, err => {
      res.send("An application error has occurred. Please try again.");
    });
});
// ----- End of API routes ----- //

app.listen(3000, () => {
  console.log(`Listening on port 3000.`);
  client.connect(err => {
    if (err) {
      console.log("Not connected: ", err);
      process.exit(0);
    }
    console.log('Connected.');
  });
});

Note

The following web application uses FastAPI. To create a new application, use the FastAPI sample file structure.

# File: main.py

from fastapi import FastAPI, Body, Request, Response, HTTPException, status
from fastapi.encoders import jsonable_encoder

from typing import List
from models import User

import pymongo
from pymongo import MongoClient
from pymongo import errors

# Replace the uri string with your Atlas connection string
uri = "<atlas-connection-string>"
db = "test"

app = FastAPI()

@app.on_event("startup")
def startup_db_client():
    app.mongodb_client = MongoClient(uri)
    app.database = app.mongodb_client[db]

@app.on_event("shutdown")
def shutdown_db_client():
    app.mongodb_client.close()

##### API ROUTES #####
@app.get("/users", response_description="List all users", response_model=List[User])
def list_users(request: Request):
    try:
        users = list(request.app.database["users"].find().max_time_ms(5000))
        return users
    except pymongo.errors.ExecutionTimeout:
        raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE, detail="The request has timed out. Please check your connection and try again.")

@app.post("/users", response_description="Create a new user", status_code=status.HTTP_201_CREATED)
def new_user(request: Request, user: User = Body(...)):
    user = jsonable_encoder(user)
    try:
        new_user = request.app.database["users"].insert_one(user)
        return {"message": "User successfully added!"}
    except pymongo.errors.DuplicateKeyError:
        raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Could not create user due to existing '_id' value in the collection. Try again with a different '_id' value.")
