Does 'expireSnapshots' also remove data files?

Snapshots accumulate until they are expired by the expireSnapshots operation. Regularly expiring snapshots is recommended to delete data files that are no longer needed, and to keep the size of table metadata small.

So when I executed expireSnapshots, did it also automatically mark my data files as deleted? After expireSnapshots I executed removeOrphanFiles, and some folders in S3 became empty. I thought that expireSnapshots only removes *.metadata.json and the corresponding manifest-list files, but not the actual data.

I used both ExpireSnapshotsAction and RemoveOrphanFilesAction

Please help

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

2 reactions
RussellSpitzer commented, Jul 8, 2021

ExpireSnapshotsAction will only ever remove files that were previously part of the Iceberg table. RemoveOrphanFiles will remove those files and any other files that are not explicitly part of the table.

Let’s take an example.

I have a directory /myTable/

I have performed several commits:

1. Add File A, B. - Table is A, B
2. Remove File A, Add File C. - Table is B, C
3. Remove File C, Add File D - Table is B, D

But let’s say we had a bit of an error and also wrote BrokenFile in a run that got canceled. It was never added to the table, but the file was created.

The directory now holds five data files (plus metadata files):

A, B, C, D and BrokenFile

If I run expire snapshots and remove Snapshot 1, it checks for all files that were referenced by Snapshot 1 (A, B) but not by Snapshot 2 (B, C) or Snapshot 3 (B, D). This is a single file:

(A, B) but not (B, C, D) = (A)

So only File A should be deleted (along with the metadata for Snapshot 1).

If I run expire snapshots and remove Snapshots 1 and 2, we do a similar thing.

(A, B, C) but not (B, D) = (A, C)

So in this case expire snapshots removes two data files, A and C.
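The two cases above can be written down as plain set arithmetic. This is only a toy model, not Iceberg code: each snapshot is reduced to the set of file names it references, and `files_removed_by_expiring` is a hypothetical helper name.

```python
# Toy model: each snapshot is just the set of data files it references.
snapshots = {
    1: {"A", "B"},
    2: {"B", "C"},
    3: {"B", "D"},
}


def files_removed_by_expiring(snapshots, expired_ids):
    """Files referenced by expired snapshots and by no retained snapshot."""
    expired = set().union(*(snapshots[i] for i in expired_ids))
    retained = set().union(
        *(files for sid, files in snapshots.items() if sid not in expired_ids)
    )
    return expired - retained


print(sorted(files_removed_by_expiring(snapshots, {1})))     # ['A']
print(sorted(files_removed_by_expiring(snapshots, {1, 2})))  # ['A', 'C']
```

BrokenFile never appears in any snapshot, so it can never end up in the result.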

Neither of these operations was able to remove “BrokenFile”. It was never listed in any snapshot, so it can never be picked up by expire snapshots. Now let’s say that we expired Snapshots 1 and 2, but the delete operation failed, so the files were never removed.

Now the table just looks like

3. Remove C, Add D - Table is B, D

But our directory still has A, B, C, D and BrokenFile.

Remove Orphan Files can clean this up for us because it does not use the removed snapshots to determine which files to delete. Instead, it lists all the files in the table location (A, B, C, D, BrokenFile) and then deletes every file that is not referenced by the table. Currently the table has only one snapshot, Snapshot 3 (B, D).

(A, B, C, D, BrokenFile) but not (B, D) = (A, C, BrokenFile)

So RemoveOrphanFiles will remove three files: A, C and BrokenFile.
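As a rough sketch of the same computation, again using a toy model with bare file names rather than real paths (`orphan_files` is a hypothetical helper, not an Iceberg API):

```python
# What a raw listing of the table location returns in the example.
directory_listing = {"A", "B", "C", "D", "BrokenFile"}

# Only Snapshot 3 is left after Snapshots 1 and 2 were expired.
remaining_snapshots = {3: {"B", "D"}}


def orphan_files(directory_listing, snapshots):
    """Files present in the directory but referenced by no remaining snapshot."""
    referenced = set().union(*snapshots.values())
    return directory_listing - referenced


print(sorted(orphan_files(directory_listing, remaining_snapshots)))
# ['A', 'BrokenFile', 'C']
```

Unlike the expire-snapshots calculation, the input here is the directory listing, which is why BrokenFile is caught.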

This is why I consider Remove Orphan Files to be a superset of what ExpireSnapshots removes. It is probably more correct to say that RemoveOrphanFiles will remove all files that should have been removed by ExpireSnapshots, even if ExpireSnapshots failed to delete those files for some reason.

Expire snapshots removes the history and then all the files that were only reachable through that history. Remove orphan files looks at all of the current history and compares it to a raw directory listing. Remove orphan files can clean up after a failed ExpireSnapshots, but not the reverse. Remove orphan files is more dangerous, because your table location may contain files from other projects, or the paths may have changed in some subtle way so that they resolve to the same file but do not string match. For example, if you store paths without an authority and later change authorities, you may have files that are the same but do not string match correctly.
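To make the string-matching risk concrete, here is a small illustration. The paths and namenode names are made up, and comparing only the path component is shown purely to contrast with plain string comparison; it is not a description of how Iceberg resolves paths.

```python
from urllib.parse import urlparse

# Hypothetical paths: the table metadata recorded one authority (namenode-1),
# while the directory listing comes back under another (namenode-2).
referenced = {"hdfs://namenode-1/warehouse/tbl/data/B.parquet"}
listed = {"hdfs://namenode-2/warehouse/tbl/data/B.parquet"}

# Plain string comparison: the live file B looks like an orphan.
orphans_by_string = listed - referenced

# Comparing only the path component ignores the authority.
referenced_paths = {urlparse(p).path for p in referenced}
orphans_by_path = {p for p in listed if urlparse(p).path not in referenced_paths}

print(orphans_by_string)  # the live file is wrongly flagged
print(orphans_by_path)    # set()
```

Ignoring the authority is not automatically correct either, since two different file systems can hold the same path, which is part of why this operation needs care.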


TL;DR

Expire Snapshots will never remove a file that Iceberg needs to read the table, with a small caveat for tables that are created as snapshots of other tables. See GC_ENABLED in table properties.

RemoveOrphanFiles should never remove a file that Iceberg needs to read the table, but since it uses string matching to determine which files to remove, there is a chance it can remove necessary files, which is why a dry-run flag was introduced. It will also remove any other files in the table location, even if they were never part of the Iceberg table.
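The idea behind a dry run can be sketched as follows. This is a generic pattern under assumed names (`remove_orphans`, `delete`, `dry_run` are all hypothetical here), not the signature of the Iceberg action: compute the candidates first, and only delete when explicitly asked to.

```python
def remove_orphans(listing, referenced, delete, dry_run=True):
    """Return orphan candidates; call `delete` on them only if dry_run is False."""
    candidates = sorted(set(listing) - set(referenced))
    if not dry_run:
        for path in candidates:
            delete(path)
    return candidates


deleted = []  # stands in for real file deletion
listing = ["A", "B", "C", "D", "BrokenFile"]
referenced = ["B", "D"]

# Dry run: inspect what would be removed, nothing is deleted yet.
preview = remove_orphans(listing, referenced, delete=deleted.append)
print(preview, deleted)

# Real run, once the preview has been reviewed.
remove_orphans(listing, referenced, delete=deleted.append, dry_run=False)
print(deleted)
```

Reviewing the preview is the chance to spot a live file that was flagged only because its path failed to string match.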

0 reactions
dmgcodevil commented, Jul 9, 2021

@RussellSpitzer thanks for the help and detailed explanation.
