A database application is incomplete without the ability to delete. Yet, a typical development workflow tends to put more focus on the creation and updating of records. After all, deleting is as simple as a handwaving dismissal, or is it? Even without referential integrity and data retention considerations, deletion deserves more engineering attention.
Deletion is Necessary
At the risk of sounding that we are creating problems to solve, we need to delete more, and deal with its accompanied complications.
A user-initiated deletion is the D in the CRUD (Create, Read, Update and Delete). It is the last-mile feature we often begrudgingly put in. When a record can be created and removed, it has a complete life cycle.
There is another kind of deletion - "periodical purging". Temporary tables must be released, records that are kept beyond the user-consented retention should be removed. Heck, we should proactively protect user's privacy by suggesting an even shorter retention period in the interest of storage and performance.
"But storage is cheap", as one would typically point out. This is not exactly true. We are looking at an unbound growth of data. If 100 Giga Bytes is cheap, will 100 Tera Bytes just be as cheap? Or 100 Peta Bytes? When there is more data, the processor requirement goes up exponentially. For processors to work with disk data effectively, we also need more RAM. CPUs and RAM are not cheap.
Having a capped limit requires the alignment between record keeping policy, data storage layout, and cleanup patrol cadence. This alignment is easily achieved if the designer of the system is mindful of a capped growth. Remember, the architecture of a system reflects the mind of its makers.
Deletion is Wasteful
Here comes a classic argument of Recycle versus Reuse. Deleting data frees up space, but in a MySQL table with an auto-incremented ID, the ID space is nonetheless used up. Record removal does not recover IDs for a good reason. Auto-increment IDs come with the guarantee that the identity of a record is always unique.
Time-sensitive server-side tokens are a good candidate for ID recycling. For example, GS Chat, our live chat system issues an auto ID to every visiting client. If no chat is initiated, the ID goes into hibernation. Since the chat session ID is authenticated by time, and that the chat client actively invalidates a long idling session, the reused ID will not cause any confusion.
In general, ID-reuse requires careful planning and orchestration of a few moving components. When done correctly, the ID space saving is significant.
Deletion is Expensive
If you use MySQL's InnoDB storage engine, which is most likely the case, the performance cost of deleting records may come as a surprise. On a simple table where the only column is the primary key, deleting all but one 15 million records could take up a minute! The date length is only 60 MB, so why does it take this long to drop 60MB?
This is because InnoDB reorganizes the data upon deletion. When many records are deleted, the effort to "re-partition" the table takes exponential amount of time. To make things worse, deleting a record, which is an apparent row-level operation, is now promoted to the table level. When deletion and database exporting (mysqldump) take place at the same time, the database will be locked for a long time.
We can mitigate the performance hit in two ways:
First, limit the spread of affected records. Chunk the delete into smaller batches. In the case of data patrol, increase the frequency as well as reducing the batch size.
Second, reduce the disk I/O cost. This involves tweaking the trx commit level and the buffer pool size.
Another contributing factor is replication. In a master-slave setup, the data_length (as polled in information_schema.tables) is not immediately reduced. The space is freed up only when the binlog is caught up. When making a data dump, it's highly recommended to export from a read-only replica (slave) only.