Iceberg DROP table not removing data from s3
Hi, I am using Presto (version 343) with the Iceberg connector and found an issue:
When a table created with CREATE TABLE ... or CREATE TABLE AS SELECT ... is dropped, it is removed only from the Hive Metastore; its data is not removed from S3.
If we then try to create the same table again, we get a "table already exists" exception.
Example:
CREATE TABLE iceberg.test_table (id varchar(25));
INSERT INTO iceberg.test_table values ('1');
DROP TABLE iceberg.test_table;
-- The table is removed from the Hive Metastore, but its files still exist in S3
Note:
In the connector config we have hive.metastore.thrift.delete-files-on-drop=true
Issue investigation results
- CREATE TABLE - when we create a table (see HiveTableOperations), TableType=EXTERNAL_TABLE and the parameter EXTERNAL=TRUE are hardcoded:
Table.Builder builder = Table.builder()
        ...
        .setTableType(TableType.EXTERNAL_TABLE.name())
        ...
        .setParameter("EXTERNAL", "TRUE")
        ...
- DROP TABLE - there are two places responsible for deleting data from S3:
2.1. The first is the Hive Metastore, which deletes the files of an EXTERNAL table only if the external.table.purge=TRUE parameter is passed to it.
2.2. The second is ThriftHiveMetastore, but because TableType=EXTERNAL_TABLE is hardcoded, this code is skipped:
if (deleteFilesOnDrop && deleteData && isManagedTable(table) && !isNullOrEmpty(tableLocation)) {
    deleteDirRecursive(hdfsContext, hdfsEnvironment, new Path(tableLocation));
}
Here isManagedTable(table) returns false, because tableType is set to EXTERNAL_TABLE when the table is created.
As a result, the data is not deleted from S3.
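To make the effect of that guard concrete, here is a minimal, self-contained sketch of the same boolean condition. It is not Presto code: the class DropDecision, the method shouldDeleteFiles, the use of plain strings for the table type, and the bucket name are all assumptions made for illustration. It only shows that with delete-files-on-drop enabled, an EXTERNAL_TABLE still never reaches the delete call.

```java
class DropDecision {
    // Hypothetical stand-in for the isManagedTable(table) check quoted above;
    // here the table type is passed as a plain string.
    static boolean isManagedTable(String tableType) {
        return "MANAGED_TABLE".equals(tableType);
    }

    // Hypothetical helper that mirrors the condition guarding deleteDirRecursive(...).
    static boolean shouldDeleteFiles(
            boolean deleteFilesOnDrop,
            boolean deleteData,
            String tableType,
            String tableLocation) {
        return deleteFilesOnDrop
                && deleteData
                && isManagedTable(tableType)
                && tableLocation != null
                && !tableLocation.isEmpty();
    }

    public static void main(String[] args) {
        // Assumed example location; any non-empty path behaves the same way.
        String location = "s3://example-bucket/test_table";

        // Table type hardcoded by the Iceberg connector: files are kept.
        System.out.println(shouldDeleteFiles(true, true, "EXTERNAL_TABLE", location));

        // A managed table with the same settings: files would be deleted.
        System.out.println(shouldDeleteFiles(true, true, "MANAGED_TABLE", location));
    }
}
```

Under these assumptions the first call evaluates to false and the second to true, so the config flag alone cannot trigger file deletion for a table registered as external.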
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:12 (11 by maintainers)
@sshkvar, that is the correct behavior for Hive tables, but not for Iceberg tables. I think Iceberg tables should continue to use external and the Iceberg connector should clean up the files.
I’m going to put an item on the agenda for the next Iceberg community sync to talk about prefix ownership. There are some good questions here: