Thoughts on GDPR: How to delete data after usage?

December 20, 2020 - 4 minutes read - 760 words

This is a collection of thoughts on how to store and organize data so that it can be deleted as required by the GDPR.

This list is “work in progress”

Filesystems

ZFS snapshots

When ZFS snapshots are used, for example as a countermeasure against disk encryption by ransomware, there is no easy way to remove individual data: deleting a file from the live filesystem leaves it in every snapshot that still references it.
Generally speaking, data has to be stored so that a whole ZFS filesystem (including its snapshots) can be destroyed when the data is no longer required.
Fortunately, ZFS makes it easy to create any number of filesystems in a pool. But this layout has to be in place right from the beginning, especially when snapshots are used.
If ZFS replication is used, the data on the target systems must be deleted as well.
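
The idea of one filesystem per project can be sketched as a small helper that only builds the commands and does not run them. This is a minimal sketch under assumptions: the pool name `tank`, the project name `project-a` and the replica host `backup1` are hypothetical, and the replica is assumed to use the same dataset name and to be reachable via ssh.

```python
import shlex
from typing import List


def dataset_name(pool: str, project: str) -> str:
    """Return the dataset that holds all data of one project."""
    return f"{pool}/{project}"


def create_command(pool: str, project: str) -> List[str]:
    # One filesystem per project, created before any data or snapshot exists.
    return ["zfs", "create", dataset_name(pool, project)]


def erase_commands(pool: str, project: str, replica_hosts: List[str]) -> List[List[str]]:
    """Build the commands that destroy the dataset locally and on each replica."""
    # "zfs destroy -r" also destroys the snapshots and children of the dataset.
    destroy = ["zfs", "destroy", "-r", dataset_name(pool, project)]
    commands = [destroy]
    for host in replica_hosts:
        # Assumption: same dataset name on the replica, reachable via ssh.
        commands.append(["ssh", host, shlex.join(destroy)])
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for command in erase_commands("tank", "project-a", ["backup1"]):
        print(shlex.join(command))
```

Printing the commands rather than running them is deliberate: destroying a dataset cannot be undone, so a review step before execution seems prudent.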

Version control systems

When version control systems are used, there is no easy way to delete old versions, except for starting again with the latest version as a new “initial” version. This means that the history is lost. As for me: Why not? How often do we really look at very old versions? But yes, sometimes I do.
I have not seen this in practice, but the only easy option to preserve the history is to share one account for all commits, so that the history contains no personal author data. Merge operations happen only after review (four-eyes principle) anyway, and after ticket acceptance the change is no longer the “problem” of the author. This has another effect: the responsibility effectively moves to the team (and no longer rests on an individual person).
With Subversion it’s possible to have multiple repositories, but these have to be set up right from the beginning (I did not do this back then).
With Git it’s quite normal to have a separate repository for each project.

Ticket systems

Set up a separate instance for each project, so that the whole instance can be removed when the project is over.
What happens when a person leaves a project? Is it possible to “rename” the account (just the way Stack Overflow does it)?

Databases

Relational databases

Foreign key constraints

When foreign key constraints are used (which is normally the case), it might not be possible to delete rows as long as they are referenced by other rows.
Therefore there must be an option to “clear” a record, for example a user record, by replacing the name with a unique but anonymous number etc.
To avoid a main table full of optional attributes (address fields like city, postal code, country code), consider a one-to-zero-or-one relation between the main table and a second table, and move all such attributes into that second table. This way the main table can still be used for foreign keys, and the row in the second table can be removed completely.
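
Both ideas (clearing the main record and removing the optional row) can be sketched with SQLite from the Python standard library. The schema is an assumption made up for illustration: the tables `app_user`, `user_address` and `purchase` and the placeholder format `erased-<id>` do not come from any real system.

```python
import sqlite3


def open_database() -> sqlite3.Connection:
    """Create an in-memory database with the hypothetical example schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE app_user (
            id INTEGER PRIMARY KEY,
            display_name TEXT NOT NULL UNIQUE
        );
        -- One-to-zero-or-one: at most one address row per user.
        CREATE TABLE user_address (
            user_id INTEGER PRIMARY KEY REFERENCES app_user(id),
            city TEXT NOT NULL,
            postal_code TEXT NOT NULL,
            country_code TEXT NOT NULL
        );
        CREATE TABLE purchase (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES app_user(id),
            amount_cents INTEGER NOT NULL
        );
        """
    )
    return conn


def erase_user(conn: sqlite3.Connection, user_id: int) -> None:
    """Remove personal data but keep the row that other tables reference."""
    with conn:  # commit both statements together, or roll both back
        conn.execute("DELETE FROM user_address WHERE user_id = ?", (user_id,))
        # The primary key makes the placeholder unique without naming anyone.
        conn.execute(
            "UPDATE app_user SET display_name = ? WHERE id = ?",
            (f"erased-{user_id}", user_id),
        )


if __name__ == "__main__":
    conn = open_database()
    with conn:
        conn.execute("INSERT INTO app_user VALUES (1, 'Alice Example')")
        conn.execute("INSERT INTO user_address VALUES (1, 'Berlin', '10115', 'DE')")
        conn.execute("INSERT INTO purchase VALUES (1, 1, 1999)")
    erase_user(conn, 1)
    print(conn.execute("SELECT id, display_name FROM app_user").fetchall())
    print(conn.execute("SELECT COUNT(*) FROM user_address").fetchone())
```

The `purchase` row keeps pointing at the cleared `app_user` row, so the foreign key stays valid while the name and the address are gone.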

Materialized views

Obviously these must be deleted as well (or refreshed)

Log files

Transaction log files are normally rotated, so old entries disappear over time.
As long as transaction logs exist, point-in-time recovery is possible, which means that deleted data can still be restored.

Backups

There may be conflicting requirements, for example that data must be stored for 10 years, so old backups cannot simply be dropped.
In this case the data must be deleted from the live system, a new backup must be created (including a restore test), and only then can the old backups be deleted.

Append-only databases

For example Hive databases can use Parquet files stored on HDFS
It’s not possible to update data in Parquet files in place.
A Parquet file normally contains many rows, so generally speaking it’s not possible to simply delete the file in order to remove a single record.
If Hive partitioning is used, it might be possible to drop a partition
- This requires that the partitioning scheme (which is normally used for performance reasons) is compatible with the partitioning scheme which is required to support the delete operation
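
The compatibility condition can be sketched as a simple check: a delete request can be served by dropping partitions only if every column that identifies the data to delete is also a partition column. The column names `customer_id` and `day` are hypothetical, and partitions are modelled as plain dictionaries rather than read from a real Hive metastore.

```python
from typing import Dict, List, Sequence

Partition = Dict[str, str]  # partition column -> value, e.g. {"day": "2020-12-01"}


def can_erase_by_partition_drop(
    partition_columns: Sequence[str], erase_key_columns: Sequence[str]
) -> bool:
    """True if every erase-key column is also a partition column."""
    return set(erase_key_columns) <= set(partition_columns)


def partitions_to_drop(
    partition_columns: Sequence[str],
    partitions: List[Partition],
    erase_key: Partition,
) -> List[Partition]:
    """Return the partitions that contain only data matching the erase key."""
    if not can_erase_by_partition_drop(partition_columns, list(erase_key)):
        # Partitions would mix affected and unaffected rows.
        raise ValueError("partitioning scheme does not support this delete")
    return [
        partition
        for partition in partitions
        if all(partition.get(column) == value for column, value in erase_key.items())
    ]


if __name__ == "__main__":
    existing = [
        {"customer_id": "42", "day": "2020-12-01"},
        {"customer_id": "42", "day": "2020-12-02"},
        {"customer_id": "7", "day": "2020-12-01"},
    ]
    print(partitions_to_drop(["customer_id", "day"], existing, {"customer_id": "42"}))
    print(can_erase_by_partition_drop(["day"], ["customer_id"]))
```

A table partitioned only by `day` fails the check for a delete by `customer_id`, because each daily partition mixes the rows of many customers.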

Delta Lake

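Delta Lake uses Parquet files with additional metadata (a transaction log) to support changes
This means that older versions, including rows that were deleted later, can still be inspected with the time travel feature until the old files are cleaned up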

ETL pipelines

An ETL pipeline effectively creates a redundant copy
Each resulting data mart, OLAP cube etc. must be deleted or recomputed

Distributed commit logs (Kafka)

Consider the retention policy of each topic
Consider all topics

Messaging (Mosquitto)

While messages on a topic are not retained in general, the broker can retain the last message on a topic (retained messages)

Snapshot features of ETL frameworks (Spark, Flink)

These are normally used only temporarily

Manage redundancy

Use a meta-level repository to store references to any location where a copy is stored for a given entity: Files, directories, data marts, data lakes, …
- Hard to imagine that this will work. As soon as there is an interface to another system (ETL pipeline), control is lost.
- This means that there is a shared responsibility: every system owner has to deal with it separately
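
For illustration, such a meta-level repository could look like the following in-memory sketch. All names (`CopyRegistry`, `CopyLocation`, the system and path strings) are hypothetical, and a real registry would need persistent storage and the cooperation of every system owner, which is exactly the doubt raised above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class CopyLocation:
    system: str     # owning system, e.g. "warehouse"
    reference: str  # path, table name, topic, ...


class CopyRegistry:
    """Tracks where copies of an entity are stored until they are erased."""

    def __init__(self) -> None:
        self._copies: Dict[str, Set[CopyLocation]] = defaultdict(set)

    def record(self, entity_id: str, location: CopyLocation) -> None:
        """Register that a copy of the entity exists at the location."""
        self._copies[entity_id].add(location)

    def confirm_erased(self, entity_id: str, location: CopyLocation) -> None:
        """Called by a system owner after the copy has been removed."""
        self._copies[entity_id].discard(location)

    def pending(self, entity_id: str) -> List[CopyLocation]:
        """Copies that still exist, in a stable order."""
        remaining = self._copies.get(entity_id, set())
        return sorted(remaining, key=lambda loc: (loc.system, loc.reference))

    def is_erased(self, entity_id: str) -> bool:
        return not self.pending(entity_id)


if __name__ == "__main__":
    registry = CopyRegistry()
    files = CopyLocation("fileserver", "/projects/a/customer-42")
    mart = CopyLocation("warehouse", "sales_mart.customers")
    registry.record("customer-42", files)
    registry.record("customer-42", mart)
    registry.confirm_erased("customer-42", files)
    print(registry.pending("customer-42"))
    print(registry.is_erased("customer-42"))
```

The registry only knows about copies that were reported to it, so a copy made through an unregistered interface stays invisible.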

Conclusion

GDPR adds another non-functional requirement to every design/architecture: How can we remove data when it’s required by the GDPR?
