Does 'expireSnapshots' also remove data files?
Snapshots accumulate until they are expired by the expireSnapshots operation. Regularly expiring snapshots is recommended to delete data files that are no longer needed, and to keep the size of table metadata small.
So when I executed expireSnapshots, did it also automatically mark my data files as deleted?
After expireSnapshots I executed removeOrphanFiles, and some folders in S3 became empty.
I thought that expireSnapshots only removes *.metadata.json and the corresponding manifest-list files, but not the actual data.
I used both ExpireSnapshotsAction and RemoveOrphanFilesAction.
Please help.
Issue Analytics
- State:
- Created 2 years ago
- Comments:8 (8 by maintainers)
ExpireSnapshotsAction will only ever remove files that were previously part of the Iceberg table. RemoveOrphanFilesAction will remove those files and any other files in the table location that are not explicitly part of the table.
Let's take an example.
I have a directory, /myTable/.
I have performed several commits: Snapshot 1 refers to files (A, B), Snapshot 2 to (B, C) and Snapshot 3 to (B, D).
But let's say we had a bit of an error and we also wrote BrokenFile in a run that got canceled. It was never added to the table, but the file was created.
The directory now holds five data files (A, B, C, D and BrokenFile), plus Iceberg's metadata files.
If I run expire snapshots and remove Snapshot 1, it checks for all files that were referred to by Snapshot 1 (A, B) and are not referred to by Snapshot 2 (B, C) or Snapshot 3 (B, D). This is a single file, A.
So only file A should be deleted (along with the metadata for Snapshot 1).
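If I run expire snapshots and remove Snapshots 2 and 1, we do a similar thing: the files referred to by Snapshot 1 or 2 (A, B, C) are compared against those still referred to by Snapshot 3 (B, D).
So in this case expire snapshots removes two data files, A and C.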
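The two expire runs above can be sketched as a set difference. This is a toy model in plain Python, not the Iceberg API: the `SNAPSHOTS` dictionary and the `files_removed_by_expire` helper are hypothetical names made up for illustration, and real expiration works on manifests rather than bare file names.

```python
# Toy model only: plain Python sets, not the Iceberg API.
# Each snapshot ID maps to the data files that snapshot refers to.
SNAPSHOTS = {
    1: {"A", "B"},
    2: {"B", "C"},
    3: {"B", "D"},
}


def files_removed_by_expire(snapshots, expired_ids):
    """Return the data files reachable only from the expired snapshots."""
    expired = set().union(*(snapshots[sid] for sid in expired_ids))
    retained = set().union(
        *(files for sid, files in snapshots.items() if sid not in expired_ids)
    )
    # A file is deleted only if no retained snapshot still refers to it.
    return expired - retained


only_1 = files_removed_by_expire(SNAPSHOTS, {1})
both = files_removed_by_expire(SNAPSHOTS, {1, 2})
print(sorted(only_1))  # ['A']
print(sorted(both))    # ['A', 'C']
```

BrokenFile never appears in any snapshot, so no choice of expired snapshots can make this function return it.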
Neither of these operations was able to remove BrokenFile: it was never listed in any snapshot, so it can never be picked up by expire snapshots. Now let's say that we expired Snapshots 2 and 1, but the delete operation failed, so the files were never removed.
Now the table just looks like this: Snapshot 3 (B, D).
But our directory still has A, B, C, D and BrokenFile.
Remove orphan files can clean this up for us because it does not use the removed snapshots to determine which files to delete. Instead, it lists all the files in the table location (A, B, C, D, BrokenFile) and then deletes every file that is not referenced by the table. Currently the table has only one snapshot, Snapshot 3 (B, D).
So RemoveOrphanFiles will remove three files: A, C and BrokenFile.
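The orphan check in this example comes down to "directory listing minus everything the table still references". A minimal sketch, again as a toy model in plain Python and not the Iceberg API (`find_orphans` and both variables are hypothetical names):

```python
# Toy model only: plain Python sets, not the Iceberg API.
# What a raw listing of the table location returns.
directory_listing = {"A", "B", "C", "D", "BrokenFile"}

# The table's current history: only Snapshot 3 is left.
current_snapshots = {3: {"B", "D"}}


def find_orphans(listing, snapshots):
    """Return listed files that no current snapshot refers to."""
    referenced = set().union(*snapshots.values())
    return listing - referenced


orphans = find_orphans(directory_listing, current_snapshots)
print(sorted(orphans))  # ['A', 'BrokenFile', 'C']
```

Unlike the expire logic, this never looks at the expired snapshots, which is why it also catches files left behind by a failed delete and files that were never committed.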
So this is why I consider Remove Orphan Files to be a superset of what ExpireSnapshots removes. I guess it is more correct to say that RemoveOrphanFiles will remove all files that should have been removed by ExpireSnapshots, even if ExpireSnapshots failed to delete those files for some reason.
Expire snapshots removes the history and then all the files which were only reachable through that history. Remove orphan files looks at all of the current history and compares it to a raw directory listing. Remove orphan files can clean up after a failed ExpireSnapshots, but not the reverse.
Remove orphan files is more dangerous, because your table location may contain files from other projects, or the paths may have changed in some subtle way so that they resolve to the same file but do not string match. For example, if you store paths without an authority and then change authorities, you may have paths which point to the same file but do not string match.
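To make the string-matching hazard concrete, here is a small sketch. The HDFS URIs, the `namenode:8020` authority and the `path_only` helper are all hypothetical, and comparing only the path component is just an illustration of the mismatch, not a description of how Iceberg resolves paths.

```python
from urllib.parse import urlparse

# Hypothetical example: the table metadata stored the path without an
# authority, while the directory listing returns it with one.
referenced = {"hdfs:///warehouse/tbl/data/B.parquet"}
listed = {"hdfs://namenode:8020/warehouse/tbl/data/B.parquet"}

# Naive string comparison: the live file B looks like an orphan.
naive_orphans = listed - referenced


def path_only(uri):
    """Drop scheme and authority, keeping only the path component."""
    return urlparse(uri).path


# Comparing on the path alone recognises both URIs as the same file.
referenced_paths = {path_only(uri) for uri in referenced}
normalized_orphans = {uri for uri in listed if path_only(uri) not in referenced_paths}

print(naive_orphans)       # the live file, wrongly flagged
print(normalized_orphans)  # set()
```

Ignoring the authority is not automatically safe either, since two different clusters can hold files with identical paths, so a dry run over the candidate list remains the cautious choice.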
TL;DR
Expire Snapshots will never remove a file that Iceberg needs in order to read the table, with a small caveat for tables which are created as snapshots of other tables. See GC_ENABLED in table properties.
RemoveOrphanFiles should never remove a file that Iceberg needs in order to read the table, but since it uses string matching to determine which files to remove, there is a chance it can remove necessary files, which is why a dry-run flag was introduced. It will also remove any other files in the table location, even if they were never part of the Iceberg table.
@RussellSpitzer thanks for the help and detailed explanation.