Required field 'uncompressed_page_size' was not found in serialized data! Struct:

See original GitHub issue

Hi, please see the error below when accessing Parquet data; kindly help. I have enclosed a sample file (a Kaggle download) and tested it; the same error below appears in the Presto CLI.

I am using Presto version 326 …

io.prestosql.spi.PrestoException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@33fc99f6
        at io.prestosql.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:167)
        at io.prestosql.spi.block.LazyBlock$LazyData.load(LazyBlock.java:378)
        at io.prestosql.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:357)
        at io.prestosql.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:275)
        at io.prestosql.spi.Page.getLoadedPage(Page.java:261)
        at io.prestosql.operator.TableScanOperator.getOutput(TableScanOperator.java:290)
        at io.prestosql.operator.Driver.processInternal(Driver.java:379)
        at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
        at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
        at io.prestosql.operator.Driver.processFor(Driver.java:276)
        at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1075)
        at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
        at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
        at io.prestosql.$gen.Presto_326____20191205_193016_2.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@33fc99f6
        at org.apache.parquet.format.Util.read(Util.java:216)
        at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
        at io.prestosql.parquet.reader.ParquetColumnChunk.readPageHeader(ParquetColumnChunk.java:57)
        at io.prestosql.parquet.reader.ParquetColumnChunk.readAllPages(ParquetColumnChunk.java:67)
        at io.prestosql.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:256)
        at io.prestosql.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:310)
        at io.prestosql.parquet.reader.ParquetReader.readBlock(ParquetReader.java:293)
        at io.prestosql.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:161)
        ... 16 more
Caused by: io.prestosql.hive.$internal.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@33fc99f6
        at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1055)
        at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:966)
        at org.apache.parquet.format.PageHeader.read(PageHeader.java:843)
        at org.apache.parquet.format.Util.read(Util.java:213)
        ... 23 more
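
The TProtocolException here means the Thrift deserializer reached the end of the data (or hit garbage) before it found the required uncompressed_page_size field, i.e. the bytes at the page-header offset are not a valid page header. That usually points at a truncated or otherwise corrupt column chunk rather than a reader bug. One way to check the file independently of Presto is to scan it with parquet-mr directly. The sketch below is not from the issue; it assumes parquet-hadoop (and its Hadoop dependencies) is on the classpath, and the class name and argument handling are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

// Hypothetical verification tool: decoding every row group forces all page
// headers in the file to be parsed, the same step that fails inside Presto.
public class VerifyParquetFile
{
    public static void main(String[] args) throws Exception
    {
        Path path = new Path(args[0]); // e.g. a local copy of the sample file
        Configuration conf = new Configuration();
        try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
            long rows = 0;
            PageReadStore rowGroup;
            while ((rowGroup = reader.readNextRowGroup()) != null) {
                rows += rowGroup.getRowCount();
            }
            System.out.println("Scanned " + rows + " rows without a page-header error");
        }
    }
}

If this scan throws the same TProtocolException, the file itself is damaged; if it completes cleanly, the bytes Presto reads at query time differ from the file scanned here.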

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 13 (7 by maintainers)

Top GitHub Comments

2 reactions
sib19 commented, Jun 8, 2020

Hi @hashhar, thanks for the information. Yes, I experienced the same kind of problem: the underlying data platform JARs didn't support reading huge table data. The vendor has fixed the issue; no changes were needed on the Presto side.

0 reactions
erikcw commented, May 23, 2022

I’m experiencing the same problem on Trino 381 (on Kubernetes, using the community container image). Everything works fine until I enable caching for the Hive connector. Then, after a few successful queries, they start failing with:

io.trino.spi.TrinoException: Failed reading parquet data; source= s3://*************/master-db/master-email-20201002/part-00617-95db1ec9-45be-47c3-9eb8-5d2906ed5389-c000.snappy.parquet; can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@2e0bf6f0
	at io.trino.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:217)
	at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:400)
	at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:379)
	at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:286)
	at io.trino.operator.project.DictionaryAwarePageProjection$DictionaryAwarePageProjectionWork.setupDictionaryBlockProjection(DictionaryAwarePageProjection.java:211)
	at io.trino.operator.project.DictionaryAwarePageProjection$DictionaryAwarePageProjectionWork.lambda$getResult$0(DictionaryAwarePageProjection.java:197)
	at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:400)
	at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:379)
	at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:286)
	at io.trino.operator.project.PageProcessor$ProjectSelectedPositions.processBatch(PageProcessor.java:345)
	at io.trino.operator.project.PageProcessor$ProjectSelectedPositions.process(PageProcessor.java:208)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
	at io.trino.operator.WorkProcessorUtils.lambda$flatten$7(WorkProcessorUtils.java:296)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:338)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:325)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:325)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
	at io.trino.operator.WorkProcessorUtils.lambda$flatten$7(WorkProcessorUtils.java:296)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:338)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:325)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
	at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:240)
	at io.trino.operator.WorkProcessorUtils.lambda$processStateMonitor$3(WorkProcessorUtils.java:219)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
	at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:240)
	at io.trino.operator.WorkProcessorUtils.lambda$finishWhen$4(WorkProcessorUtils.java:234)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:391)
	at io.trino.operator.WorkProcessorSourceOperatorAdapter.getOutput(WorkProcessorSourceOperatorAdapter.java:150)
	at io.trino.operator.Driver.processInternal(Driver.java:410)
	at io.trino.operator.Driver.lambda$process$10(Driver.java:313)
	at io.trino.operator.Driver.tryWithLock(Driver.java:698)
	at io.trino.operator.Driver.process(Driver.java:305)
	at io.trino.operator.Driver.processForDuration(Driver.java:276)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1092)
	at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:488)
	at io.trino.$gen.Trino_381____20220523_050344_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@2e0bf6f0
	at org.apache.parquet.format.Util.read(Util.java:365)
	at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
	at org.apache.parquet.format.Util.readPageHeader(Util.java:127)
	at io.trino.parquet.reader.ParquetColumnChunk.readPageHeader(ParquetColumnChunk.java:78)
	at io.trino.parquet.reader.ParquetColumnChunk.readAllPages(ParquetColumnChunk.java:91)
	at io.trino.parquet.reader.ParquetReader.createPageReader(ParquetReader.java:385)
	at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:365)
	at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:441)
	at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:424)
	at io.trino.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:211)
	... 42 more
Caused by: io.trino.hive.$internal.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@2e0bf6f0
	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1108)
	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
	at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
	at org.apache.parquet.format.Util.read(Util.java:362)
	... 51 more

Here is my hive catalog config WITHOUT caching enabled (queries work just fine):

                  hive.metastore-refresh-interval=50ms
                  connector.name=hive-hadoop2
                  hive.metastore-cache-ttl=50ms
                  hive.non-managed-table-writes-enabled = true
                  hive.hdfs.impersonation.enabled = true
                  hive.recursive-directories = true
                  hive.parquet.use-column-names = true
                  hive.metastore = glue

And WITH caching enabled (causes the exception):

                  hive.metastore-refresh-interval=50ms
                  connector.name=hive-hadoop2
                  hive.metastore-cache-ttl=50ms
                  hive.non-managed-table-writes-enabled = true
                  hive.hdfs.impersonation.enabled = false
                  hive.recursive-directories = true
                  hive.parquet.use-column-names = true
                  hive.metastore = glue
                  hive.cache.enabled=true
                  hive.cache.location=/mnt/trino-cache

CREATE EXTERNAL TABLE `master_email`(
  `email` string COMMENT '', 
  `sha256_lower` string COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://***********/master-db/master-email-20201002/'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false', 
  'STATS_GENERATED_VIA_STATS_TASK'='workaround for potential lack of HIVE-12730', 
  'has_encrypted_data'='false', 
  'last_modified_by'='hadoop', 
  'last_modified_time'='1601685042', 
  'numFiles'='0', 
  'numRows'='1700455799', 
  'spark.sql.create.version'='2.2 or prior', 
  'spark.sql.sources.schema.numParts'='1', 
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"email\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"comment\":\"\"}},{\"name\":\"sha256_lower\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"comment\":\"\"}}]}', 
  'totalSize'='0', 
  'transient_lastDdlTime'='1601685042')

% parquet-tools meta /tmp/part-00617-95db1ec9-45be-47c3-9eb8-5d2906ed5389-c000.snappy.parquet
file:         file:/tmp/part-00617-95db1ec9-45be-47c3-9eb8-5d2906ed5389-c000.snappy.parquet
creator:      parquet-mr version 1.10.1 (build 0f9df43deffb88fd7f8666d63691842330f46bd9)
extra:        org.apache.spark.version = 3.0.0
extra:        org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"email","type":"string","nullable":true,"metadata":{}},{"name":"sha256_lower","type":"string","nullable":true,"metadata":{}}]}

file schema:  spark_schema
--------------------------------------------------------------------------------
email:        OPTIONAL BINARY L:STRING R:0 D:1
sha256_lower: OPTIONAL BINARY L:STRING R:0 D:1

row group 1:  RC:1574299 TS:149868008 OFFSET:4
--------------------------------------------------------------------------------
email:         BINARY SNAPPY DO:0 FPO:4 SZ:30340825/42801253/1.41 VC:1574299 ENC:PLAIN,BIT_PACKED,RLE ST:[min: "2357259a5a25d8f40e6ae87224a1c13f *******@******.com ""[vp[e>0\'_f`j(+{{$lt@zd_s6&rir", max: ******@******m.net, num_nulls: 0]
sha256_lower:  BINARY SNAPPY DO:0 FPO:30340829 SZ:103870684/107066755/1.03 VC:1574299 ENC:PLAIN,BIT_PACKED,RLE ST:[min: **********, max: ********.us", num_nulls: 0]

row group 2:  RC:1259161 TS:117656247 OFFSET:134211513
--------------------------------------------------------------------------------
email:         BINARY SNAPPY DO:0 FPO:134211513 SZ:20885557/32040363/1.53 VC:1259161 ENC:PLAIN,BIT_PACKED,RLE ST:[min: ********@comcast.com, max: **********@yahoo.com, num_nulls: 0]
sha256_lower:  BINARY SNAPPY DO:0 FPO:155097070 SZ:83155069/85615884/1.03 VC:1259161 ENC:PLAIN,BIT_PACKED,RLE ST:[min: com", max: ystic@msn.com", num_nulls: 0]
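
The parquet-tools meta output above includes the offset of the first data page in each column chunk (FPO, e.g. 4 for the email chunk in row group 1), which makes it possible to replay the exact Thrift call that fails in the stack traces above against any copy of the file. The sketch below is illustrative (the class name and argument handling are mine), assuming the org.apache.parquet.format classes (parquet-format-structures) are on the classpath. Running it against a fresh download of the S3 object and against the copy served through the cache (if it can be located under /mnt/trino-cache) would show whether the caching layer is returning bytes the original object does not contain:

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.parquet.format.PageHeader;
import org.apache.parquet.format.Util;

// Hypothetical helper: reads the page header at a given offset, using the same
// Util.readPageHeader call that appears in the stack traces above.
public class ReadFirstPageHeader
{
    public static void main(String[] args) throws Exception
    {
        String file = args[0];                 // local copy of the Parquet file
        long offset = Long.parseLong(args[1]); // FPO from `parquet-tools meta`, e.g. 4
        try (InputStream in = new FileInputStream(file)) {
            long toSkip = offset;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) {
                    throw new IllegalStateException("file ends before page offset " + offset);
                }
                toSkip -= skipped;
            }
            // Thrift deserialization of the page header; on a healthy copy this
            // prints a readable header, on a truncated or garbled copy it fails
            // with "Required field 'uncompressed_page_size' was not found".
            PageHeader header = Util.readPageHeader(in);
            System.out.println(header);
        }
    }
}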