Reading timestamp with type INT64 in parquet file

See original GitHub issue

I am using AWS EMR 5.28.0, which ships Presto 0.227. (Somehow, the same data works in AWS Athena.) I have an AWS DMS job that migrates AWS RDS data to S3, converting it to Parquet along the way. When configuring the S3 target endpoint, I have tried every combination of the parquetTimestampInMillisecond and parquetVersion parameters, but none works.

Ideally, with parquetTimestampInMillisecond set to true, I should be able to read any timestamp column from Presto. Instead, I get the following error:

The column columnName is declared as type timestamp, but the Parquet file declares the column as type INT64

Using parquet-tools, I see the following output for the timestamp column in my Parquet file in S3:

columnName: INT64 UNCOMPRESSED DO:0 FPO:99321 SZ:24381/24381/1.00 VC:3041 ENC:PLAIN,RLE ST:[min: 2019-11-26T04:21:44.403, max: 2019-11-26T04:22:45.932, num_nulls: 0]

When I use the following definition to create the table, it works:

CREATE TABLE hive.bigdata.tableName (
    columnName bigint
)
WITH (
    external_location = 's3://location',
    format = 'PARQUET'
)

But then I have to create a view to read the column as a timestamp:

CREATE VIEW hive.bigdata.tableName2 AS
SELECT cast(date_format(from_unixtime(columnName / 1000), '%Y-%m-%d %H:%i:%s') AS timestamp) AS columnName
FROM hive.bigdata.tableName

Is there any direct way to create a table from the INT64 column, or any way to tell AWS DMS to write a data type that Presto supports?
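For reference, the conversion the view performs can be sketched outside Presto. Assuming the stored int64 values are epoch milliseconds (which is what parquetTimestampInMillisecond implies), a minimal Python equivalent is:

```python
from datetime import datetime, timedelta, timezone

def millis_to_timestamp(millis: int) -> datetime:
    """Convert epoch milliseconds to a UTC datetime.

    Integer arithmetic via timedelta avoids float rounding of the
    millisecond fraction (unlike datetime.fromtimestamp(millis / 1000)).
    """
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=millis)

# Epoch millis corresponding to the min value parquet-tools reported above:
print(millis_to_timestamp(1574742104403).isoformat())
# 2019-11-26T04:21:44.403000+00:00
```

Note that the view's `columnName / 1000` is integer division on a bigint in Presto, so the sub-second part is truncated there regardless.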

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

1 reaction
ryanrupp commented, Dec 10, 2019

@findepi there may be something more going on here, but the error message mentioned comes from a Facebook Presto-specific check (linked above) that is relatively new, so it doesn't exist in PrestoSQL. There are other pending pull requests, such as allowing statistics pruning for int64 timestamps, so I'd be surprised if this is broken. It is possible that AWS DMS creates the type as int64 but doesn't tag it with the TIMESTAMP_MILLIS logical type, or something along those lines. However, the parquet-tools metadata above renders the int64 min/max stats in timestamp format, which I believe means the logical type is applied correctly.

@prakharjain-vogo would it be possible to test locally with the latest PrestoSQL? (I know this is not on EMR, unfortunately.)

0 reactions
ryanrupp commented, Jan 3, 2020

Going to close this, as it was an issue in PrestoDB that was fixed in the previously mentioned linked issue. @prakharjain-vogo feel free to reopen if this is still an issue with PrestoSQL.
