Iceberg DROP TABLE not removing data from S3

Hi, I am using Presto (version 343) with the Iceberg connector and found an issue:

Tables that were created with CREATE TABLE ... or CREATE TABLE AS SELECT ... and then dropped are removed only from the Hive Metastore, not from S3. If we then try to create the same table again, we get a "table already exists" exception.

Example:

CREATE TABLE iceberg.test_table (id varchar(25));
INSERT INTO iceberg.test_table values ('1');
DROP TABLE iceberg.test_table;

-- Table removed from Hive Metastore, but the files still exist in S3

Note: in the connector config we have hive.metastore.thrift.delete-files-on-drop=true

Issue investigation results

  1. CREATE TABLE - when we create a table (see HiveTableOperations), TableType=EXTERNAL_TABLE and the parameter EXTERNAL=TRUE are hardcoded:

         Table.Builder builder = Table.builder()
                 ...
                 .setTableType(TableType.EXTERNAL_TABLE.name())
                 ...
                 .setParameter("EXTERNAL", "TRUE")
                 ...

  2. DROP TABLE - there are two places responsible for deleting data from S3:
     2.1. The first is the Hive Metastore: it deletes files of an EXTERNAL table only if the external.table.purge=TRUE parameter is passed to it.
     2.2. The second is ThriftHiveMetastore, but because TableType=EXTERNAL_TABLE is hardcoded, this code is skipped:

         if (deleteFilesOnDrop && deleteData && isManagedTable(table) && !isNullOrEmpty(tableLocation)) {
             deleteDirRecursive(hdfsContext, hdfsEnvironment, new Path(tableLocation));
         }

     isManagedTable(table) returns false here, because tableType is set to EXTERNAL_TABLE when the table is created.

As a result, the data is not deleted from S3.
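
To make the skipped branch concrete, here is a minimal, self-contained sketch of the guard from ThriftHiveMetastore. It is a simplified model, not the actual Presto classes: the class name DropGuardSketch, the enum, and the example S3 location are hypothetical, and it assumes that isManagedTable simply checks for the MANAGED_TABLE type.

```java
// Simplified model of the drop guard discussed in this issue; not the actual Presto code.
final class DropGuardSketch {
    // Hypothetical stand-in for the metastore table types involved here.
    enum TableType { MANAGED_TABLE, EXTERNAL_TABLE }

    // Same shape as the quoted condition:
    // deleteFilesOnDrop && deleteData && isManagedTable(table) && !isNullOrEmpty(tableLocation)
    static boolean shouldDeleteFiles(
            boolean deleteFilesOnDrop,
            boolean deleteData,
            TableType tableType,
            String tableLocation) {
        // Assumption: "managed" means the table type is MANAGED_TABLE.
        boolean managed = tableType == TableType.MANAGED_TABLE;
        boolean hasLocation = tableLocation != null && !tableLocation.isEmpty();
        return deleteFilesOnDrop && deleteData && managed && hasLocation;
    }

    public static void main(String[] args) {
        String location = "s3://example-bucket/warehouse/test_table"; // hypothetical location

        // Iceberg tables are registered as EXTERNAL_TABLE, so the guard is false
        // even with delete-files-on-drop enabled.
        System.out.println(shouldDeleteFiles(true, true, TableType.EXTERNAL_TABLE, location));

        // A managed table with the same settings would pass the guard.
        System.out.println(shouldDeleteFiles(true, true, TableType.MANAGED_TABLE, location));
    }
}
```

In this model, setting hive.metastore.thrift.delete-files-on-drop=true cannot change the outcome for an Iceberg table, because the table type check fails regardless of the flag.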

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 12 (11 by maintainers)

Top GitHub Comments

2 reactions
rdblue commented, Nov 24, 2020

@sshkvar, that is the correct behavior for Hive tables, but not for Iceberg tables. I think Iceberg tables should continue to use external and the Iceberg connector should clean up the files.

1 reaction
rdblue commented, Dec 7, 2020

I’m going to put an item on the agenda for the next Iceberg community sync to talk about prefix ownership. There are some good questions here:

  1. Should we assume that a table owns a prefix and can delete it when it is removed?
  2. If yes, where should that happen? If not, should we have a config property to turn it on?
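
One way to picture the second question is a connector-side cleanup step gated by an opt-in flag. The sketch below is only an illustration of that idea under stated assumptions: the class PrefixCleanupSketch, the method dropTableFiles, and the flag deleteOwnedPrefixOnDrop are all hypothetical, and a local temporary directory stands in for the table's S3 prefix. It does not reflect any decision made by the Iceberg or Presto communities.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative only: a local directory stands in for the table's S3 prefix.
final class PrefixCleanupSketch {
    // deleteOwnedPrefixOnDrop is a hypothetical opt-in flag, not an existing connector property.
    // Returns true only if the prefix was actually removed.
    static boolean dropTableFiles(Path tableLocation, boolean deleteOwnedPrefixOnDrop) throws IOException {
        if (!deleteOwnedPrefixOnDrop || !Files.exists(tableLocation)) {
            return false;
        }
        List<Path> deepestFirst;
        try (Stream<Path> paths = Files.walk(tableLocation)) {
            // Reverse lexicographic order places children before their parent directories.
            deepestFirst = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : deepestFirst) {
            Files.delete(path);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path table = Files.createTempDirectory("test_table");
        Path dataDir = Files.createDirectories(table.resolve("data"));
        Files.write(dataDir.resolve("00000.parquet"), new byte[] {1});

        // With the flag off, the files are left in place, as in the reported behavior.
        System.out.println(dropTableFiles(table, false) + " " + Files.exists(table));

        // With the flag on, the whole prefix is removed.
        System.out.println(dropTableFiles(table, true) + " " + Files.exists(table));
    }
}
```

Making the deletion opt-in keeps the default safe when several tables or other data share a prefix, which is the ownership concern raised in the first question.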