How to read an ORC file in chunks

  apache, c++, orc, parquet

I am trying to read ORC files in chunks. I have a very large ORC file on disk (say, 100GB) and very limited memory (e.g., I can buffer at most 1MB of data in memory). I want to scan the ORC file incrementally, in this order:

  1. read footer
  2. get addresses of stripes
  3. read first stripe’s metadata (footer) and apply some filters
  4. read first stripe’s index
  5. read first stripe’s data (chunk by chunk – 1MB at a time)
  6. move to the next stripe
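
To make steps 3 to 5 concrete, here is a minimal sketch of how the byte ranges for one stripe could be planned, assuming the stripe's offset and its index, data, and footer lengths have already been taken from the file footer. It relies on the ORC layout in which a stripe stores its index section first, then its data, then the stripe footer. `StripeLayout`, `ByteRange`, `splitRange` and the other helper names are hypothetical and not part of the ORC library; the sketch only computes offsets and does no decoding.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical plain struct holding the per-stripe values recorded in the
// ORC file footer. Not an ORC library type.
struct StripeLayout {
  uint64_t offset;        // absolute file offset where the stripe starts
  uint64_t indexLength;   // bytes occupied by the index section
  uint64_t dataLength;    // bytes occupied by the data section
  uint64_t footerLength;  // bytes occupied by the stripe footer
};

// A contiguous region of the file to fetch with one read call.
struct ByteRange {
  uint64_t offset;
  uint64_t length;
};

// Split [start, start + total) into consecutive ranges of at most maxChunk bytes.
std::vector<ByteRange> splitRange(uint64_t start, uint64_t total, uint64_t maxChunk) {
  assert(maxChunk > 0);
  std::vector<ByteRange> out;
  for (uint64_t done = 0; done < total;) {
    const uint64_t n = std::min(maxChunk, total - done);
    out.push_back({start + done, n});
    done += n;
  }
  return out;
}

// Step 3: the stripe footer sits after the index and data sections.
ByteRange stripeFooterRange(const StripeLayout& s) {
  return {s.offset + s.indexLength + s.dataLength, s.footerLength};
}

// Step 4: the index section starts at the stripe offset.
ByteRange stripeIndexRange(const StripeLayout& s) {
  return {s.offset, s.indexLength};
}

// Step 5: the data section, cut into pieces no larger than maxChunk.
std::vector<ByteRange> stripeDataChunks(const StripeLayout& s, uint64_t maxChunk) {
  return splitRange(s.offset + s.indexLength, s.dataLength, maxChunk);
}
```

With `maxChunk` set to 1048576, `stripeDataChunks` yields the sequence of 1MB reads for step 5. The chunk boundaries are purely positional, so they will generally not line up with stream or compression-block boundaries; whatever decodes the bytes would have to carry partial blocks over from one chunk to the next.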

I have tried to use MemoryInputStream.hh from the ORC repo:

https://github.com/apache/orc/blob/main/c++/test/MemoryInputStream.hh

However, while reading the data, its read method gets called for large amounts of data (well beyond 1MB) in a single request.

    virtual void read(void* buf, uint64_t length, uint64_t offset) override {
      memcpy(buf, buffer + offset, length);
    }
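
For comparison, here is a minimal sketch of a file-backed variant that serves each request with a positional read from disk instead of a memcpy from an in-memory copy of the file. `BoundedFileStream` is a hypothetical name; its methods only mirror the shape of `orc::InputStream` (`getLength`, `getNaturalReadSize`, `read`, `getName`) and the class does not derive from it, so the sketch compiles without the ORC headers. The value returned by `getNaturalReadSize` is only a hint, and the size of the destination buffer is still chosen by whoever calls `read`, so this alone does not cap the requested length at 1MB.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in shaped like orc::InputStream, kept standalone on purpose.
class BoundedFileStream {
 public:
  BoundedFileStream(const std::string& path, uint64_t naturalReadSize)
      : file_(path, std::ios::binary), name_(path), naturalReadSize_(naturalReadSize) {
    if (!file_) throw std::runtime_error("cannot open " + path);
    // Determine the file size once; the file contents are never loaded.
    file_.seekg(0, std::ios::end);
    length_ = static_cast<uint64_t>(file_.tellg());
  }

  uint64_t getLength() const { return length_; }

  // Preferred read size reported to the caller (a hint, not a hard limit).
  uint64_t getNaturalReadSize() const { return naturalReadSize_; }

  const std::string& getName() const { return name_; }

  // Copy `length` bytes starting at file position `offset` into `buf`.
  void read(void* buf, uint64_t length, uint64_t offset) {
    assert(buf != nullptr);
    if (offset > length_ || length > length_ - offset) {
      throw std::out_of_range("read past end of " + name_);
    }
    file_.clear();  // reset any earlier EOF/fail state before seeking
    file_.seekg(static_cast<std::streamoff>(offset));
    file_.read(static_cast<char*>(buf), static_cast<std::streamsize>(length));
    if (static_cast<uint64_t>(file_.gcount()) != length) {
      throw std::runtime_error("short read from " + name_);
    }
  }

 private:
  std::ifstream file_;
  std::string name_;
  uint64_t naturalReadSize_;
  uint64_t length_ = 0;
};
```

A call such as `stream.read(buf, 1048576, offset)` then touches only the requested 1MB of the file. The ORC C++ library also declares `orc::readLocalFile` in `orc/OrcFile.hh`, which returns a disk-backed `orc::InputStream`; whether the reader built on top of it keeps its own requests within a 1MB budget is a separate question that this sketch does not settle.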


So, is there a way to read/access different parts of the ORC file incrementally, using only a limited in-memory buffer?

Thanks!

Source: Windows Questions C++
