This is a collection of thoughts on how to store and organize data so that it can be deleted as required by the GDPR.
This list is a “work in progress”.
- When ZFS snapshots are used, for example as a countermeasure against disk encryption by ransomware, there is no easy way to remove individual files, because snapshots are read-only and keep the data until the snapshot itself is destroyed.
- Generally speaking, data has to be stored in such a way that a full ZFS filesystem can be deleted when it’s no longer required.
- Fortunately it’s easy to create any number of filesystems in a ZFS pool. But this has to be done right from the beginning, especially when snapshots are used.
- In case there is ZFS replication, the data on the target systems must be deleted as well.
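The ZFS points above can be sketched as a small helper that builds the delete commands for one filesystem and its replicas. This is only a sketch: the dataset name `tank/customers/c1042`, the host `backup1.example.org`, and the assumption that replicas use the same dataset name and are reachable over ssh are all made up for illustration. `zfs destroy -r` is the standard command to destroy a dataset together with its snapshots and child datasets.

```python
import subprocess


def destroy_commands(dataset: str, replica_hosts: list[str]) -> list[list[str]]:
    """Build the commands that remove a dataset, its snapshots, and its replicas."""
    # `zfs destroy -r` destroys the dataset including all snapshots and children.
    local = ["zfs", "destroy", "-r", dataset]
    # Assumption: each replication target holds the dataset under the same name.
    remote = [["ssh", host, "zfs", "destroy", "-r", dataset] for host in replica_hosts]
    return [local] + remote


def run(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them one by one."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Hypothetical names; with dry_run=True nothing is deleted.
run(destroy_commands("tank/customers/c1042", ["backup1.example.org"]))
```

This only works if each deletable unit (a customer, a project) got its own filesystem from the start, which is the point made above.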
Version control systems
- When version control systems are used, there is no easy way to delete old versions, except for starting again using the latest version as a new “initial” version. This means that the history is lost. As for me: Why not? How often do we really look at very old versions? But yes, sometimes I do it.
- I have not seen this before, but the only easy option to preserve the history is to share one account for all commits. Merge operations happen only after review (four-eyes principle) anyway, and after ticket acceptance it is no longer the “problem” of the author. This has another effect: The responsibility is effectively shifted to the team (and no longer lies with an individual person).
- As for Subversion, it’s possible to have multiple repositories, but these have to be set up right from the beginning (I did not do this back then).
- As for Git it’s quite normal to have separate repositories for each project
- Set up a separate instance for each project
- What happens when a person leaves a project? Is it possible to “rename” the account (just the way StackOverflow does it)?
Foreign key constraints
- When foreign key constraints are used (which is normally the case), it might not be possible to delete rows as long as they are referenced by other rows.
- Therefore there must be an option to “clear” a record, for example a user record, by replacing the name with a unique but anonymous number etc.
- In order to avoid optional attributes (address fields like city, postal code, country code), consider a one to zero/one relation between the main table and a second table, and move all personal attributes into that second table. This way the main table can be used for foreign keys, and the row in the second table can be removed completely.
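A minimal sketch of both ideas, using SQLite from the Python standard library. The table and column names (`users`, `user_details`, `orders`) and the `deleted-<id>` placeholder format are assumptions chosen for illustration, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- One to zero/one: at most one details row per user, holding the personal attributes.
CREATE TABLE user_details (
    user_id     INTEGER PRIMARY KEY REFERENCES users(id),
    city        TEXT NOT NULL,
    postal_code TEXT NOT NULL,
    country     TEXT NOT NULL
);
CREATE TABLE orders (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id)
);
""")
conn.execute("INSERT INTO users VALUES (1, 'Alice Example')")
conn.execute("INSERT INTO user_details VALUES (1, 'Berlin', '10115', 'DE')")
conn.execute("INSERT INTO orders VALUES (100, 1)")


def erase_user(conn: sqlite3.Connection, user_id: int) -> None:
    """Remove the personal data but keep the row that other tables reference."""
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM user_details WHERE user_id = ?", (user_id,))
        conn.execute(
            "UPDATE users SET name = ? WHERE id = ?",
            (f"deleted-{user_id}", user_id),  # unique but anonymous placeholder
        )


erase_user(conn, 1)
```

The order row still points to user 1, so referential integrity is kept, while the name and the address are gone.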
- Obviously backups must be deleted as well (or refreshed)
- Transaction log files are normally rotated
- As long as transaction logs exist, point-in-time-recovery is possible
- There are other requirements, for example data must be stored for 10 years
- In this case data must be deleted, backups must be created one more time (including restore test), and then the old versions can be deleted
- For example Hive databases can use Parquet files stored on HDFS
- It’s not possible to update data in Parquet files.
- Parquet files may contain multiple rows, so generally speaking it’s not possible to delete a Parquet file.
- If Hive partitioning is used, it might be possible to drop a partition
- This requires that the partitioning scheme (which is normally used for performance reasons) is compatible with the partitioning scheme which is required to support the delete operation
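If the partitioning scheme does match the delete key, dropping a partition can be expressed as a single Hive statement. The sketch below only builds that statement; the table `events` and the partition column `customer_id` are hypothetical, and the values are assumed to be trusted (no escaping is done). `PURGE` is added on the assumption that the data should not be moved to the trash directory first.

```python
def drop_partition_sql(table: str, partition: dict[str, str], purge: bool = True) -> str:
    """Build a Hive ALTER TABLE ... DROP PARTITION statement.

    Assumption: table, column names and values come from trusted configuration.
    """
    spec = ", ".join(f"{col}='{val}'" for col, val in partition.items())
    stmt = f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({spec})"
    # PURGE asks Hive to skip the trash, so the files are not kept around.
    return stmt + (" PURGE" if purge else "")


print(drop_partition_sql("events", {"customer_id": "1042"}))
```

If the table is partitioned by date instead, which is the more common choice for performance, no such statement exists for a single customer, and the affected files have to be rewritten.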
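- Delta Lake uses Parquet files with additional metadata to support changes
- This means that all versions can be inspected by the time travel feature, so deleted data remains readable until the old files have been removed (for example by a vacuum run)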
- An ETL pipeline effectively creates a redundant copy
- Each resulting data mart, OLAP cube etc. must be deleted or recomputed
Distributed commit logs (Kafka)
- Consider the retention policy of each topic
- Consider all topics
- Even when a topic has no time-based retention, the last message for each key can be retained indefinitely (log compaction)
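The remark about the last message appears to refer to Kafka's log compaction, where the latest message per key is kept and a message with a null value (a tombstone) marks the key for removal. The following is a toy model of that behavior in plain Python, not Kafka's API; the keys and values are made up, and unlike a real broker it drops tombstones immediately instead of after a configured delay.

```python
from typing import Optional

# (key, value); a value of None models a tombstone
Message = tuple[str, Optional[str]]


def compact(log: list[Message]) -> list[Message]:
    """Keep the latest message per key and drop keys whose latest message is a tombstone."""
    latest: dict[str, Optional[str]] = {}
    for key, value in log:
        latest.pop(key, None)  # re-insert so the order follows the last write
        latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not None]


log = [
    ("user-1", "alice"),
    ("user-2", "bob"),
    ("user-1", "alice-v2"),
    ("user-2", None),  # tombstone: request to remove user-2
]
print(compact(log))
```

In this model, deleting a person's data from a compacted topic means publishing a tombstone for every key that belongs to that person, in every topic.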
Snapshot features of ETL frameworks (Spark, Flink)
- These are normally used only temporarily
- Use a meta-level repository to store references to any location where a copy is stored for a given entity: Files, directories, data marts, data lakes, …
- Hard to imagine that this will work. If there is an interface to another system (ETL pipeline), control will be lost.
- This means that there is a shared responsibility: Every system owner has to deal with it separately
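To make the meta-level repository idea concrete despite the doubts above, here is a minimal in-memory sketch. The class names `Location` and `CopyRegistry`, the system labels, and the paths are all hypothetical; a real registry would need persistent storage and, as noted, the cooperation of every system owner.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    system: str  # e.g. "zfs", "hive", "kafka" (labels are assumptions)
    path: str


class CopyRegistry:
    """Tracks where copies of an entity's data are stored."""

    def __init__(self) -> None:
        self._copies: dict[str, set[Location]] = defaultdict(set)

    def record(self, entity_id: str, location: Location) -> None:
        """Each system registers the copies it creates."""
        self._copies[entity_id].add(location)

    def pending(self, entity_id: str) -> list[Location]:
        """Locations that still hold data for the entity, in a stable order."""
        found = self._copies.get(entity_id, set())
        return sorted(found, key=lambda loc: (loc.system, loc.path))

    def confirm_deleted(self, entity_id: str, location: Location) -> None:
        """The owning system reports that it removed its copy."""
        self._copies[entity_id].discard(location)


registry = CopyRegistry()
registry.record("user-1042", Location("zfs", "tank/customers/c1042"))
registry.record("user-1042", Location("hive", "events/customer_id=1042"))
registry.confirm_deleted("user-1042", Location("zfs", "tank/customers/c1042"))
print(registry.pending("user-1042"))
```

A delete request is complete only when `pending` returns an empty list, which makes the shared responsibility visible: any system that never calls `record` stays invisible to the registry.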
- GDPR adds another non-functional requirement to every design/architecture: How can we remove data when it’s required by the GDPR?