
Guidance for Atlas for Resiliency

MongoDB Atlas is a highly performant database designed to maintain uptime through infrastructure outages, system maintenance, and more. Use the guidance on this page to plan settings that maximize the resiliency of your application and database.

Atlas clusters consist of a replica set with a minimum of three nodes, and you can increase the node count to any odd number of nodes you require. Atlas first writes data from your application to a primary node, and then Atlas incrementally replicates and stores that data across all secondary nodes within your cluster. To control the durability of your data storage, you can adjust the write concern of your application code to complete the write only once a certain number of secondaries have committed the write. To learn more, see Configure Read and Write Concerns.

By default, Atlas distributes cluster nodes across availability zones within one of your chosen cloud provider's availability regions. For example, if your cluster is deployed to the cloud provider region us-east, Atlas deploys nodes to us-east-a, us-east-b and us-east-c by default.

To learn more about high availability and node distribution across regions, see Guidance for Atlas High Availability.

Atlas clusters must consist of an odd number of nodes, because the node pool must elect a primary node that your application reads from and writes to directly. A cluster with an even number of nodes risks a tied election that prevents a primary node from being elected.

If a primary node becomes unavailable because of infrastructure outages, maintenance windows, or any other reason, Atlas clusters self-heal by promoting an existing secondary node to the role of primary node to maintain database availability. To learn more about this process, see How does MongoDB Atlas deliver high availability?

Atlas maintains uptime during scheduled maintenance by applying updates in a rolling fashion to one node at a time. During this process, Atlas elects a new primary when necessary just as it does during any other unplanned primary node outage.

When you configure a maintenance window, select a time that corresponds to when your application has the lowest amount of traffic.

Atlas provides built-in tools to monitor cluster performance, query performance and more. Additionally, Atlas integrates easily with third-party services.

By actively monitoring your clusters, you can gain valuable insights into query and deployment performance. To learn more about monitoring in Atlas, see Monitor Your Clusters and Monitoring and Alerts.

You can simulate various scenarios that require disaster recovery workflows in order to measure your preparedness for such events. Specifically, with Atlas you can test primary node failover and simulate regional outages. We strongly recommend that you run these tests before deploying an application to production.

You can prevent accidental deletion of Atlas clusters by enabling termination protection. Enabling termination protection is especially important when leveraging IaC tools like Terraform to ensure that a redeployment does not provision new infrastructure. To delete a cluster that has termination protection enabled, you must first disable termination protection. By default, Atlas disables termination protection for all clusters.

Atlas Cloud Backups facilitate cloud backup storage using the native snapshot functionality of the cloud service provider on which your cluster is deployed. For example, if you deploy your cluster on AWS, you can elect to back up your cluster's data with snapshots taken at configurable intervals and stored in AWS S3.

To learn more about database backup and snapshot retrieval, see Back Up Your Cluster.

For recommendations on backups, see Guidance for Atlas Backups.

To improve the resiliency of your cluster, upgrade your cluster to MongoDB 8.0, which introduces performance improvements and new features related to resilience.

We recommend that you use a connection method built on the most current driver version for your application's programming language whenever possible. And while the default connection string Atlas provides is a good place to start, you might want to tune it for performance in the context of your specific application and deployment architecture.
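For illustration only, a tuned connection string might layer retry, write concern, and pool options onto the Atlas-provided SRV string (the hostname and credentials here are placeholders):

```
mongodb+srv://<db_username>:<db_password>@<cluster>.mongodb.net/?retryWrites=true&w=majority&maxPoolSize=100
```

Each of these options can also be set programmatically when constructing the client in your driver.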

For example, you might want to set a short maxTimeMS for a microservice that provides a login capability, whereas you may want to set the maxTimeMS to a much larger value if the application code is a long-running analytics job request against the cluster.

Tuning your connection pool settings is particularly important for enterprise-level application deployments.

Opening database client connections is one of the most resource-intensive operations involved in maintaining the client connection pool that gives your application access to your Atlas cluster.

Because of this, it is worth thinking about how and when you would like this process of opening client connections to unfold in the context of your specific application.

For example, if you are scaling your Atlas cluster to meet user demand, consider the minimum pool size of connections your application will consistently need, so that when the application pool scales, the additional network and compute load of opening new client connections doesn't undermine your application's time-sensitive need for increased database operations.

If your minPoolSize and maxPoolSize values are similar, the majority of your database client connections open at application startup. For example, if your minPoolSize is set to 10 and your maxPoolSize is set to 12, 10 client connections open at application startup, and only 2 more connections can then be opened during application runtime. However, if your minPoolSize is set to 10 and your maxPoolSize is set to 100, up to 90 additional connections can be opened as needed during application runtime.

Opening new client connections incurs additional network overhead. Consider whether you would prefer to incur that cost at application startup, or dynamically on an as-needed basis during application runtime, which can affect operational latency and perceived performance for end users if a sudden spike in requests forces a large number of additional connections to open at once.

Your application's architecture is central to this consideration. If, for example, you deploy your application as microservices, consider which services should call Atlas directly as a means of controlling the dynamic expansion and contraction of your connection pool. Alternatively, if your application deployment is leveraging single-threaded resources, like AWS Lambda, your application will only ever be able to open and use one client connection, so your minPoolSize and your maxPoolSize should both be set to 1.

Queries from your application will almost invariably vary both in how long they take to execute in Atlas and in how long your application can wait for a response.

You can set query timeout behavior globally in Atlas, and you can also define it at the query level.

Atlas supports retryable read and retryable write operations. When enabled, Atlas retries read and write operations once as a safeguard against intermittent network outages.

Atlas clusters eventually replicate all data across all nodes. However, you can configure the number of nodes across which data must be replicated before a read or write operation is reported to have been successful. You can define read concerns and write concerns globally in Atlas, and you can also define them at the client level in your connection string. Atlas has a default write concern of majority, meaning that data must be replicated across more than half of the nodes in your cluster before Atlas reports success. Conversely, Atlas has a default read concern of local, which means that when queried, Atlas retrieves data from only one node in your cluster.

Sharding allows you to scale your cluster horizontally. With MongoDB, you can shard some collections, while allowing other collections in the same cluster to remain unsharded. When you create a new database, the shard in the cluster with the least amount of data is picked as that database's primary shard by default. All of the unsharded collections of that database live in that primary shard by default. This can cause increased traffic to the primary shard as your workload grows, especially if the workload growth focuses on the unsharded collections on the primary shard.

To distribute this workload better, MongoDB 8.0 allows you to move an unsharded collection to other shards from the primary shard with the moveCollection command. This allows you to place active, busy collections onto shards with less expected resource usage. With this, you can:

  • Optimize performance on larger, complex workloads.

  • Achieve better resource utilization.

  • Distribute data more evenly across shards.

We recommend isolating your collection in the following circumstances:

  • Your primary shard experiences significant workload due to the presence of multiple high-throughput unsharded collections.

  • You anticipate that an unsharded collection will experience future growth, which could become a bottleneck for other collections.

  • You are running a one-collection-per-customer deployment design and you want to isolate those customers based on priority or workloads.

  • Your shards hold a disproportionate amount of data due to the number of unsharded collections located on them.

To learn how to move an unsharded collection with mongosh, see Move a Collection.

For recommendations on disaster recovery best practices for Atlas, see Guidance for Atlas Disaster Recovery and Recommended Configurations for High Availability and Recovery.

The example application brings together the following recommendations to ensure resilience against network outages and failover events:

  • Use the Atlas-provided connection string with retryable writes, majority write concern, and default read concern.

  • Specify an operation time limit with the maxTimeMS method. For instructions on how to set maxTimeMS, refer to your specific Driver Documentation.

  • Handle errors for duplicate keys and timeouts.

The application is an HTTP API that allows clients to create or list user records. It exposes an endpoint at http://localhost:3000 that accepts GET and POST requests:

Method   Endpoint   Description

GET      /users     Gets a list of user names from a users collection.

POST     /users     Requires a name in the request body. Adds a new user to a users collection.

Note

The following server application uses NanoHTTPD and org.json, which you need to add to your project as dependencies before you can run it.

// File: App.java

import java.util.Map;
import java.util.logging.Logger;

import org.bson.Document;
import org.json.JSONArray;

import com.mongodb.MongoException;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;

import fi.iki.elonen.NanoHTTPD;

public class App extends NanoHTTPD {
    private static final Logger LOGGER = Logger.getLogger(App.class.getName());

    static int port = 3000;
    static MongoClient client = null;

    public App() throws Exception {
        super(port);

        // Replace the uri string with your MongoDB deployment's connection string
        String uri = "<atlas-connection-string>";
        client = MongoClients.create(uri);

        start(NanoHTTPD.SOCKET_READ_TIMEOUT, false);
        LOGGER.info("\nStarted the server: http://localhost:" + port + "/ \n");
    }

    public static void main(String[] args) {
        try {
            new App();
        } catch (Exception e) {
            LOGGER.severe("Couldn't start server:\n" + e);
        }
    }

    @Override
    public Response serve(IHTTPSession session) {
        StringBuilder msg = new StringBuilder();
        Map<String, String> params = session.getParms();

        Method reqMethod = session.getMethod();
        String uri = session.getUri();

        if (Method.GET == reqMethod) {
            if (uri.equals("/")) {
                msg.append("Welcome to my API!");
            } else if (uri.equals("/users")) {
                msg.append(listUsers(client));
            } else {
                msg.append("Unrecognized URI: ").append(uri);
            }
        } else if (Method.POST == reqMethod) {
            try {
                String name = params.get("name");
                if (name == null) {
                    throw new Exception("Unable to process POST request: 'name' parameter required");
                } else {
                    insertUser(client, name);
                    msg.append("User successfully added!");
                }
            } catch (Exception e) {
                msg.append(e);
            }
        }

        return newFixedLengthResponse(msg.toString());
    }

    static String listUsers(MongoClient client) {
        MongoDatabase database = client.getDatabase("test");
        MongoCollection<Document> collection = database.getCollection("users");

        final JSONArray jsonResults = new JSONArray();
        collection.find().forEach((result) -> jsonResults.put(result.toJson()));

        return jsonResults.toString();
    }

    static String insertUser(MongoClient client, String name) throws MongoException {
        MongoDatabase database = client.getDatabase("test");
        MongoCollection<Document> collection = database.getCollection("users");

        collection.insertOne(new Document().append("name", name));
        return "Successfully inserted user: " + name;
    }
}

Note

The following server application uses Express, which you need to add to your project as a dependency before you can run it.

const express = require('express');
const bodyParser = require('body-parser');

// Use the latest drivers by installing & importing them
const MongoClient = require('mongodb').MongoClient;

const app = express();
app.use(bodyParser.json());
app.use(bodyParser.urlencoded({ extended: true }));

const uri = "mongodb+srv://<db_username>:<db_password>@cluster0-111xx.mongodb.net/test?retryWrites=true&w=majority";

const client = new MongoClient(uri, {
  useNewUrlParser: true,
  useUnifiedTopology: true
});

// ----- API routes ----- //
app.get('/', (req, res) => res.send('Welcome to my API!'));

app.get('/users', (req, res) => {
  const collection = client.db("test").collection("users");

  collection
    .find({})
    .maxTimeMS(5000)
    .toArray((err, data) => {
      if (err) {
        // Return here so the timeout response isn't followed by a second reply
        return res.send("The request has timed out. Please check your connection and try again.");
      }
      return res.json(data);
    });
});

app.post('/users', (req, res) => {
  const collection = client.db("test").collection("users");
  collection.insertOne({ name: req.body.name })
    .then(result => {
      res.send("User successfully added!");
    }, err => {
      res.send("An application error has occurred. Please try again.");
    });
});
// ----- End of API routes ----- //

app.listen(3000, () => {
  console.log(`Listening on port 3000.`);
  client.connect(err => {
    if (err) {
      console.log("Not connected: ", err);
      process.exit(0);
    }
    console.log('Connected.');
  });
});

Note

The following web application uses FastAPI. To create a new application, use the FastAPI sample file structure.

# File: main.py

from fastapi import FastAPI, Body, Request, Response, HTTPException, status
from fastapi.encoders import jsonable_encoder

from typing import List
from models import User

import pymongo
from pymongo import MongoClient
from pymongo import errors

# Replace the uri string with your Atlas connection string
uri = "<atlas-connection-string>"
db = "test"

app = FastAPI()

@app.on_event("startup")
def startup_db_client():
    app.mongodb_client = MongoClient(uri)
    app.database = app.mongodb_client[db]

@app.on_event("shutdown")
def shutdown_db_client():
    app.mongodb_client.close()

##### API ROUTES #####
@app.get("/users", response_description="List all users", response_model=List[User])
def list_users(request: Request):
    try:
        users = list(request.app.database["users"].find().max_time_ms(5000))
        return users
    except pymongo.errors.ExecutionTimeout:
        raise HTTPException(status_code=status.HTTP_503_SERVICE_UNAVAILABLE, detail="The request has timed out. Please check your connection and try again.")

@app.post("/users", response_description="Create a new user", status_code=status.HTTP_201_CREATED)
def new_user(request: Request, user: User = Body(...)):
    user = jsonable_encoder(user)
    try:
        new_user = request.app.database["users"].insert_one(user)
        return {"message": "User successfully added!"}
    except pymongo.errors.DuplicateKeyError:
        raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Could not create user due to existing '_id' value in the collection. Try again with a different '_id' value.")
