
How to find and remove duplicates in MongoDB

This post is part of the series "A Developer's Notes". In it, we record useful MongoDB operations for finding and removing duplicate documents, a common problem developers face.

The Problem

When you try to add a unique index to a MongoDB collection that already contains duplicate documents, the operation fails with an E11000 (duplicate key) error.

This happens because two or more documents already hold the same value, or the same combination of values, in the fields named in the index specification. MongoDB therefore refuses to create the index.

What's a unique index?

Indexes are a table of contents for the actual data, much like a telephone directory. Having them in your database makes data lookups a lot faster than without them. They are very important for using your database efficiently!

Some indexes have specific properties that make them behave in certain ways. A unique index requires that no two documents hold the same value in the indexed field. If the index covers several fields (a compound index), the combination of values must be unique, while each individual field may still repeat. That's a feature you'll find in almost all databases nowadays, and it helps keep your data consistent with the way you modelled it.

An example

Let's have a look at an example: imagine a database collection for storing users. By adding a unique index on the username field, which might contain the email address or an actual username, you make sure that there's only one user document for every registered email address or username.

This means that whatever happens during the user registration process in your application, MongoDB ensures that at most one user document exists for any given username.
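In mongosh, such an index could be created roughly as sketched below. The collection name users is an assumption made for illustration; the guard around db only exists so the snippet does nothing outside the shell.

```javascript
// Index specification for the example: one document per username.
// "users" is an assumed collection name; adjust it to your schema.
const uniqueUsernameKeys = { username: 1 };      // ascending index on username
const uniqueUsernameOptions = { unique: true };  // reject duplicate usernames

// Only runs inside mongosh, where the global `db` exists.
if (typeof db !== "undefined") {
  // Fails with E11000 (duplicate key) if duplicates already exist.
  db.users.createIndex(uniqueUsernameKeys, uniqueUsernameOptions);
}
```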

The solution

Use the following MongoDB operations to identify duplicate documents in your collections and remove them safely. Replace collectionName with the name of your collection, and set the _id of the $group stage to the field or fields that are supposed to be unique.

To check several fields at once, change _id from '$username' to an object such as {username: '$username', email: '$email'}.
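If the set of fields changes often, the grouping key can be built by a small helper. This is only a sketch; groupKey is a hypothetical function name, not part of MongoDB.

```javascript
// Build the $group key (_id) from a list of field names.
// One field   -> "$username"
// Many fields -> { username: "$username", email: "$email" }
// groupKey is a hypothetical helper for illustration.
function groupKey(fields) {
  if (fields.length === 1) {
    return "$" + fields[0];
  }
  const key = {};
  for (const field of fields) {
    key[field] = "$" + field;
  }
  return key;
}
```

The result can be used directly as the _id of a $group stage, for example { $group: { _id: groupKey(["username", "email"]), count: { $sum: 1 } } }.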

Get a count of all the duplicates
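A minimal sketch of such a count for mongosh, assuming username is the field that should be unique. The output field name duplicatedValues is our own choice, and the null check is an assumption that documents without a username should not be reported as duplicates of each other.

```javascript
// Reports how many distinct username values occur more than once.
const countDuplicatesPipeline = [
  // Group by the field that should be unique and count each group.
  { $group: { _id: "$username", count: { $sum: 1 } } },
  // Keep groups with more than one document; skip documents without a username.
  { $match: { _id: { $ne: null }, count: { $gt: 1 } } },
  // Count the remaining groups (emits nothing if there are none).
  { $count: "duplicatedValues" }
];

// Only runs inside mongosh, where the global `db` exists.
if (typeof db !== "undefined") {
  printjson(db.collectionName.aggregate(countDuplicatesPipeline).toArray());
}
```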

List duplicates with counts individually
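A sketch of a listing query under the same assumptions (mongosh, username as the field that should be unique). It returns one document per duplicated value together with the number of documents sharing it, most frequent first.

```javascript
// Lists every duplicated username with the number of documents sharing it.
const listDuplicatesPipeline = [
  // Group by the field that should be unique and count each group.
  { $group: { _id: "$username", count: { $sum: 1 } } },
  // Keep groups with more than one document; skip documents without a username.
  { $match: { _id: { $ne: null }, count: { $gt: 1 } } },
  // Show the most frequent duplicates first.
  { $sort: { count: -1 } },
  // Rename _id back to username for readability.
  { $project: { _id: 0, username: "$_id", count: 1 } }
];

// Only runs inside mongosh, where the global `db` exists.
if (typeof db !== "undefined") {
  db.collectionName.aggregate(listDuplicatesPipeline).forEach(printjson);
}
```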

Remove duplicates and keep only one document
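A sketch of the removal for mongosh, again assuming username is the field that should be unique. The function name removeDuplicates is ours, and the function expects a mongosh collection handle, where calls behave synchronously. The operation is destructive, so it is worth trying on a copy of the collection first.

```javascript
// Removes all but one document per duplicated username.
// `collection` is a mongosh collection handle, e.g. db.collectionName.
function removeDuplicates(collection) {
  let removed = 0;
  collection.aggregate([
    // Collect the _id of every document sharing the same username.
    { $group: { _id: "$username", dups: { $push: "$_id" }, count: { $sum: 1 } } },
    // Keep groups with more than one document; skip documents without a username.
    { $match: { _id: { $ne: null }, count: { $gt: 1 } } }
  ], { allowDiskUse: true }).forEach(function (doc) {
    // Drop the first _id from the list so that this document is kept.
    doc.dups.shift();
    // Delete every other document of the group.
    const result = collection.deleteMany({ _id: { $in: doc.dups } });
    removed += result.deletedCount;
  });
  return removed;
}

// Usage in mongosh:
// removeDuplicates(db.collectionName);
```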

The above operation keeps an arbitrary document from each group of duplicates. If you want to keep a specific document, for example the oldest one, add a $sort stage to your query.

And that's it!

We hope you find this useful.

Thank you very much.

Photo by Jørgen Håland on Unsplash

[EDIT: 28/04/2023]

Thanks to everybody who got in touch with us after we published this post and asked how to keep a specific duplicate, for example the oldest document.

We decided to distill our answers into this post, so that other people can benefit from them too. So here you go:

Remove duplicates and keep the oldest document
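A sketch of this variant for mongosh. The function name removeDuplicatesKeepOldest and the sortSpec parameter are ours. The default sort on _id assumes default ObjectId values, whose leading timestamp makes ascending order roughly "oldest first"; if your documents carry their own creation date, pass that field instead.

```javascript
// Removes duplicates per username and keeps the first document in sort order.
// `collection` is a mongosh collection handle, e.g. db.collectionName.
// `sortSpec` decides which document survives; { _id: 1 } is an assumption
// that only approximates creation order for default ObjectId values.
function removeDuplicatesKeepOldest(collection, sortSpec = { _id: 1 }) {
  let removed = 0;
  collection.aggregate([
    // Sort first, so each "dups" array is filled in the desired order.
    { $sort: sortSpec },
    { $group: { _id: "$username", dups: { $push: "$_id" }, count: { $sum: 1 } } },
    { $match: { _id: { $ne: null }, count: { $gt: 1 } } }
  ], { allowDiskUse: true }).forEach(function (doc) {
    // The first entry is the document to keep; drop it from the list.
    doc.dups.shift();
    // Delete the remaining documents of the group.
    removed += collection.deleteMany({ _id: { $in: doc.dups } }).deletedCount;
  });
  return removed;
}

// Usage in mongosh:
// removeDuplicatesKeepOldest(db.collectionName);
```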

Let's have a look at the key points above:

  • Add a $sort stage in front of the $group stage, so that the documents are sorted before they get grouped and pushed to the "dups" array. Adjust this $sort stage depending on which documents you'd like to keep

  • Afterwards (in the JavaScript part), we iterate over the aggregated documents and take the first item out of each "dups" array (doc.dups.shift()), which is the document we keep. All remaining dups get removed



 
