Reading timestamp with type INT64 in parquet file

See original GitHub issue

I am using AWS EMR 5.28.0, which ships Presto 0.227. (Somehow, the same data works in AWS Athena.) I have an AWS DMS job that migrates AWS RDS data to S3, converting it to Parquet along the way. When configuring the S3 target endpoint, I have tried every combination of the parquetTimestampInMillisecond and parquetVersion parameters, but none works.

Ideally, with parquetTimestampInMillisecond set to true, I should be able to read any timestamp column from Presto. Instead, I get the following error:

The column columnName is declared as type timestamp, but the Parquet file declares the column as type INT64

Using parquet-tools, I see the following output for the timestamp column in my Parquet file in S3:

columnName: INT64 UNCOMPRESSED DO:0 FPO:99321 SZ:24381/24381/1.00 VC:3041 ENC:PLAIN,RLE ST:[min: 2019-11-26T04:21:44.403, max: 2019-11-26T04:22:45.932, num_nulls: 0]

When I use the following definition to create the table, it works:

CREATE TABLE hive.bigdata.tableName (
    columnName bigint
)
WITH (
    external_location = 's3://location',
    format = 'PARQUET'
)

But then I have to create a view to read the column as a timestamp:

CREATE VIEW hive.bigdata.tableName2 AS
SELECT cast(date_format(from_unixtime(columnName / 1000), '%Y-%m-%d %H:%i:%s') AS timestamp) AS columnName
FROM hive.bigdata.tableName

Is there any direct way to create a table from the INT64 column, or any way to tell AWS DMS to write a data type that Presto supports?
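For reference, the conversion the view performs can be sketched outside Presto. Assuming the stored int64 values are epoch milliseconds (which is what parquetTimestampInMillisecond implies), a minimal Python equivalent is:

```python
from datetime import datetime, timedelta, timezone

def millis_to_timestamp(millis: int) -> datetime:
    """Convert epoch milliseconds to a UTC datetime.

    Integer arithmetic via timedelta avoids float rounding of the
    millisecond fraction (unlike datetime.fromtimestamp(millis / 1000)).
    """
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=millis)

# Epoch millis corresponding to the min value parquet-tools reported above:
print(millis_to_timestamp(1574742104403).isoformat())
# 2019-11-26T04:21:44.403000+00:00
```

Note that the view's `columnName / 1000` is integer division on a bigint in Presto, so the sub-second part is truncated there regardless.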

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

1 reaction
ryanrupp commented, Dec 10, 2019

@findepi there may be something more going on here, but the error message mentioned comes from a Facebook Presto-specific check (linked above) that is relatively new, so it doesn't exist in PrestoSQL. There are other pending pull requests, such as allowing statistics pruning for int64 timestamps, so I'd be surprised if this is broken. It is possible that AWS DMS creates the type as int64 but doesn't tag it with the TIMESTAMP_MILLIS logical type, or something along those lines. However, the parquet-tools metadata above renders the int64 min/max stats in timestamp format, which I believe means the logical type is applied correctly.

@prakharjain-vogo would it be possible to test locally with the latest PrestoSQL? (I know this is not on EMR, unfortunately.)

0 reactions
ryanrupp commented, Jan 3, 2020

Going to close this, as it was an issue in PrestoDB that was fixed in the previously mentioned linked issue. @prakharjain-vogo feel free to reopen if this is still an issue with PrestoSQL.
