feat: Implement an AsyncReader for avro using ObjectStore #8930
Conversation
jecsand838
left a comment
Flushing a partial review with some high-level thoughts.
I'll wait for you to finish before resuming.
Honestly, I think my main blocker is the schema thing here. I don't want to commit to the constructor before it is resolved, as it's a public API and I don't want it to be volatile.
100%, I'm working on that right now and won't stop until I have a PR. That was a solid catch. The schema logic is an area of the code I intend to fully refactor (or would welcome a full refactor of). I knew it would eventually come back.
Sorry, I haven't dropped it, I just found myself in a really busy week! The generic reader support does not seem too hard to implement from the dabbling I did, and I still need to get to the builder pattern change.
…, separate object store file reader into a feature-gated struct and use a generic async file reader trait
@jecsand838 I believe this is now ready for a proper review ^
@EmilyMatt Thank you so much for getting these changes up!
I left a few comments. Let me know what you think.
EDIT: Should have mentioned that this is looking really good overall and I'm very excited for the AsyncReader!
@jecsand838 and @EmilyMatt -- how is this PR looking?
I had actually just returned to work on it 2 days ago. I'm still having some issues with the schema now being provided, due to the problems I've described. @jecsand838 suggested removing the arrow schema, and I'm starting to think that is the only viable way for now.
Hope to push another version today and address some of the things above.
@jecsand838 I've shamelessly plagiarized the API for the object reader from the parquet crate, but that's OK IMO; it lays the foundations for a common API in a few versions.
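For readers following along, the parquet-style approach boils down to an async byte-range fetch. A minimal sketch of that shape is below; the trait name, bounds, and signature here are illustrative assumptions, not necessarily what this PR ends up with.

```rust
use std::ops::Range;

use arrow_schema::ArrowError;
use bytes::Bytes;
use futures::future::BoxFuture;

/// Illustrative shape of a generic async file reader trait, modeled on the
/// parquet crate's `AsyncFileReader`. Names and bounds here are assumptions.
pub trait AsyncFileReader: Send {
    /// Fetch a contiguous byte range from the underlying storage
    /// (e.g. a local file or an `ObjectStore` location).
    fn get_bytes(&mut self, range: Range<u64>) -> BoxFuture<'_, Result<Bytes, ArrowError>>;
}
```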
@EmilyMatt Absolutely, I'll have time to give this a solid review tonight. Ty for getting these changes in!
jecsand838
left a comment
@EmilyMatt Flushing a partial review here. Looking really good overall. I know this is a huge PR and your high-level changes look great.
I left more code-level comments with one architectural call-out you may want to consider. Overall this is looking solid and I'm super stoked about this async reader for arrow-avro.
ReaderState::Limbo => {
    unreachable!("ReaderState::Limbo should never be observed");
}
Is the ReaderState::Limbo variant really necessary? Could we use Finished and if a bug causes an early return without setting state, the stream just ends (which is safer than panicking)?
A user will never know they did not actually finish reading the file; they will think they've just reached the end. This is, in my opinion, orders of magnitude more severe than crashing.
I mean that in the event of that occurring, an ArrowError should be passed back, which would alert the user.
I might be misunderstanding, but you are suggesting two opposing things at the same time.
I can replace the panic with returning an error, but Limbo should never exist. Programmatically, there should never be any code path where the state can be Limbo, and that is the purpose of the unreachable!() macro.
On the other hand, setting it to Finished means adding a risk of returning a Ready(None) by accident.
> I might be misunderstanding, but you are suggesting two opposing things at the same time.
> I can replace the panic with returning an error, but Limbo should never exist. Programmatically, there should never be any code path where the state can be Limbo, and that is the purpose of the unreachable!() macro.
> On the other hand, setting it to Finished means adding a risk of returning a Ready(None) by accident.
Sorry I should have explained this better!
My thinking was it may be better to simply catch the error and terminate early in a manner that bubbles the error back up to the caller (via the ReaderState::Finish or even a new ReaderState::Errored(ArrowError) variant) instead of assuming Limbo is and will always be unreachable. Granted, you raise a fair point about returning a Ready(None) by accident, so that's something to be cautious of.
There are two big reasons this caught my attention. The first is that, despite being a solid idiomatic pattern in Rust, using unreachable! risks turning an internal bug into a process-aborting panic for downstream users. This is compounded by arrow-rs generally operating rather low in many stacks.
The second reason is that there seem to be multiple early-return paths (including ?, etc.) that can (either now or in the future) leave self.reader_state stuck as Limbo and potentially cause a panic on the next poll_next.
As an aside, here are two additional options to consider (if we want to keep going after getting an error):
- Keeping `Limbo` and eliminating `?` inside the replaced-state match, with manual error handling that restores a real state (sketched below).
- Refactoring to not `mem::replace` the entire enum.
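For concreteness, here is a minimal, self-contained sketch of the first option with toy types and hypothetical names (not the PR's actual code): keep the sentinel, avoid `?` inside the replaced-state match, restore a real state on every path, and turn an observed sentinel into an error item rather than a panic.

```rust
use std::mem;

#[derive(Debug)]
enum State {
    /// Sentinel that should only exist while a step is in progress.
    Limbo,
    Idle(u32),
    Finished,
}

#[derive(Debug)]
struct StateError(&'static str);

struct Machine {
    state: State,
}

impl Machine {
    /// Advance the machine; an observed sentinel becomes an error item instead of a panic.
    fn step(&mut self) -> Option<Result<u32, StateError>> {
        match mem::replace(&mut self.state, State::Limbo) {
            State::Limbo => {
                // Internal bug: terminate and surface the problem to the caller.
                self.state = State::Finished;
                Some(Err(StateError("machine observed in sentinel state")))
            }
            State::Idle(n) => {
                // Restore a real state before returning, on every path.
                self.state = if n == 0 { State::Finished } else { State::Idle(n - 1) };
                Some(Ok(n))
            }
            State::Finished => {
                self.state = State::Finished;
                None
            }
        }
    }
}

fn main() {
    let mut m = Machine { state: State::Idle(2) };
    while let Some(item) = m.step() {
        println!("{item:?}");
    }
}
```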
I believe 2 is impossible due to the ownership semantics of async behaviour; this can be observed in the other async readers as well, which all usually do something similar when implementing streams manually.
Regarding 1, I was sure polling again after an error violates the stream contract, but perhaps I was mistaken, in which case I apologize for speaking too confidently about there being no such code paths ^^
Do you think an implementation such as the one I've pushed now is viable? Or perhaps it is best to merge the Error state and the invalid state?
I feel like using Finished is too risky for the reasons I've previously explained.
> Regarding 1, I was sure polling again after an error violates the stream contract, but perhaps I was mistaken, in which case I apologize for speaking too confidently about there being no such code paths ^^
All good! Polling again after an error item is not a Stream contract violation that I'm aware of. The Stream contract only explicitly warns about polling again after the stream has terminated (i.e. after Ready(None)). With Stream<Item = Result<_, _>>, an Err is just another yielded item, and it’s valid for callers to keep polling unless the stream returns None.
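A tiny illustration of that point (generic futures code, not arrow-rs):

```rust
use futures::{stream, StreamExt};

#[tokio::main]
async fn main() {
    // An `Err` is just another yielded item; the stream only terminates at `None`.
    let mut s = stream::iter(vec![Ok::<_, String>(1), Err("boom".to_string()), Ok(2)]);
    while let Some(item) = s.next().await {
        println!("{item:?}"); // Ok(1), Err("boom"), Ok(2)
    }
}
```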
> I believe 2 is impossible due to the ownership semantics of async behaviour; this can be observed in the other async readers as well, which all usually do something similar when implementing streams manually.
I agree that removing mem::replace on the entire enum isn't always practical in these manual state machines -- you can do it, but I get that it would require a large refactor (pin-projection / splitting state / Options, etc.). The mem::replace pattern is totally reasonable here; I just wanted to make sure we're careful about panics in the hot path of such an upstream project.
> Do you think an implementation such as the one I've pushed now is viable? Or perhaps it is best to merge the Error state and the invalid state?
> I feel like using Finished is too risky for the reasons I've previously explained.
On the current implementation: I glanced over the latest changes and I think the updated InvalidState sentinel approach + removing ?/early returns inside the replaced-state match is viable. I'll conduct a follow-up code review right now, but I'm pretty much inclined to approve this.
# Conflicts:
#   arrow-avro/src/reader/mod.rs
I get a failure with the following test:

fn get_int_array_schema() -> SchemaRef {
    let schema = Schema::new(vec![Field::new(
        "int_array",
        DataType::List(Arc::new(Field::new("element", DataType::Int32, true))),
        true,
    )])
    .with_metadata(HashMap::from([("avro.name".into(), "table".into())]));
    Arc::new(schema)
}

#[tokio::test]
async fn test_bad_varint_bug() {
    let file = arrow_test_data("avro/bad-varint-bug.avro");
    let schema = get_int_array_schema();
    let batches = read_async_file(&file, 1024, None, schema).await.unwrap();
    let _batch = &batches[0];
}

The Avro file, readable by Spark: bad-varint-bug.avro.gz
I have checked that the Avro file is readable with Python avro 1.12.1.
I don't think this is a bug in the async reader; I can confirm the corresponding test passes. The issue is probably in the AvroSchema::from conversion.
My test provides the Arrow reader schema and the top-level Avro record name in the metadata, which should be sufficient.
It is not necessarily sufficient, but you should open a bug for this, since a reader schema with nullables and a writer schema with non-nullables should be compatible.
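As an illustration of that compatibility claim (hypothetical schemas, not the ones from the failing file): under Avro schema resolution, a writer's non-nullable type resolves against a reader union that adds null.

```rust
// Writer schema: non-nullable array of non-nullable ints.
const WRITER_SCHEMA: &str = r#"{
  "type": "record", "name": "table", "fields": [
    {"name": "int_array", "type": {"type": "array", "items": "int"}}
  ]
}"#;

// Reader schema: the same field wrapped in nullable unions. Per Avro schema
// resolution rules, the writer's `int` matches the reader's ["null", "int"]
// branch, so these two should be compatible.
const READER_SCHEMA: &str = r#"{
  "type": "record", "name": "table", "fields": [
    {"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}], "default": null}
  ]
}"#;
```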
I think this is a schema resolution bug, based on a quick glance over the details you provided. That being said, there are limitations with using an Arrow schema as the Avro reader schema.
The biggest challenge to overcome relates to the lossy behavior inherent to Arrow -> Avro schema conversion, i.e. Arrow not having the concept of named types, etc.
100%, it's absolutely not related to this PR. Sorry about not jumping in sooner to call that out. As an aside, I just created #9233, which proposes an approach for modularizing the schema logic.
jecsand838
left a comment
@EmilyMatt This looks good! I left some final feedback and recommendations, but I think it's at a place where we can re-run the CI/CD jobs if you want to follow up on these. Once pipelines pass, I'll approve.
CC: @alamb
// If a projection exists, project the reader schema;
// if no reader schema is provided, parse it from the header (the raw writer schema) and project that.
// This projected schema will be the schema used for reading.
let projected_reader_schema = self
    .projection
    .as_deref()
    .map(|projection| {
        let base_schema = if let Some(reader_schema) = &self.reader_schema {
            reader_schema.clone()
        } else {
            let raw = header.get(SCHEMA_METADATA_KEY).ok_or_else(|| {
                ArrowError::ParseError("No Avro schema present in file header".to_string())
            })?;
            let json_string = std::str::from_utf8(raw)
                .map_err(|e| {
                    ArrowError::ParseError(format!(
                        "Invalid UTF-8 in Avro schema header: {e}"
                    ))
                })?
                .to_string();
            AvroSchema::new(json_string)
        };
        base_schema.project(projection)
    })
    .transpose()?;
Added tests; not all the error cases are covered in the builder, but it looks better now.
enum ReaderState<R: AsyncFileReader> {
    /// Intermediate state to fix ownership issues
    InvalidState,
    /// Initial state, fetch initial range
    Idle { reader: R },
    /// Fetching data from the reader
    FetchingData {
        future: BoxFuture<'static, Result<(R, Bytes), ArrowError>>,
        next_behaviour: FetchNextBehaviour,
    },
    /// Decode a block in a loop until completion
    DecodingBlock { data: Bytes, reader: R },
    /// Output batches from a decoded block
    ReadingBatches {
        data: Bytes,
        block_data: Bytes,
        remaining_in_block: usize,
        reader: R,
    },
    /// An error occurred, should not have been polled again.
    Error,
    /// Done, flush decoder and return
    Finished,
}
After looking through the code, I strongly recommend making the stream "fused" so that it terminates after an error, rather than yielding infinite errors.
My concern is that currently, if the reader enters ReaderState::Error (or hits InvalidState), it returns Some(Err(...)) indefinitely. This causes consumers like StreamExt::collect to hang forever.
I recommend fixing this by strictly transitioning to ReaderState::Finished immediately upon encountering any error. This is similar to the Parquet Async Reader.
What do you think about these updates below?
enum ReaderState<R: AsyncFileReader> {
    /// Intermediate state to fix ownership issues
    InvalidState,
    /// Initial state, fetch initial range
    Idle { reader: R },
    /// Fetching data from the reader
    FetchingData {
        future: BoxFuture<'static, Result<(R, Bytes), ArrowError>>,
        next_behaviour: FetchNextBehaviour,
    },
    /// Decode a block in a loop until completion
    DecodingBlock { data: Bytes, reader: R },
    /// Output batches from a decoded block
    ReadingBatches {
        data: Bytes,
        block_data: Bytes,
        remaining_in_block: usize,
        reader: R,
    },
    /// Successfully finished reading file contents; drain any remaining buffered records
    /// from the decoder into (possibly partial) output batches.
    Flushing,
    /// Terminal state. Always yields `None` and never returns items again.
    Finished,
}
Then we can do something like this for AsyncAvroFileReader::read_next:
/// Terminate the stream after returning this error once.
#[inline]
fn finish_with_error(
&mut self,
error: ArrowError,
) -> Poll<Option<Result<RecordBatch, ArrowError>>> {
self.reader_state = ReaderState::Finished;
Poll::Ready(Some(Err(error)))
}
#[inline]
fn start_flushing(&mut self) {
self.reader_state = ReaderState::Flushing;
}
/// Drain any remaining buffered records from the decoder.
#[inline]
fn poll_flush(&mut self) -> Poll<Option<Result<RecordBatch, ArrowError>>> {
match self.decoder.flush() {
Ok(Some(batch)) => {
self.reader_state = ReaderState::Flushing;
Poll::Ready(Some(Ok(batch)))
}
Ok(None) => {
self.reader_state = ReaderState::Finished;
Poll::Ready(None)
}
Err(e) => self.finish_with_error(e),
}
}
fn read_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<Result<RecordBatch, ArrowError>>> {
loop {
match mem::replace(&mut self.reader_state, ReaderState::InvalidState) {
ReaderState::Idle { mut reader } => {
let range = self.range.clone();
if range.start >= range.end {
return self.finish_with_error(ArrowError::AvroError(format!(
"Invalid range specified for Avro file: start {} >= end {}, file_size: {}",
range.start, range.end, self.file_size
)));
}
let future = async move {
let data = reader.get_bytes(range).await?;
Ok((reader, data))
}
.boxed();
self.reader_state = ReaderState::FetchingData {
future,
next_behaviour: FetchNextBehaviour::ReadSyncMarker,
};
}
ReaderState::FetchingData {
mut future,
next_behaviour,
} => {
let (reader, data_chunk) = match future.poll_unpin(cx) {
Poll::Ready(Ok(data)) => data,
Poll::Ready(Err(e)) => return self.finish_with_error(e),
Poll::Pending => {
self.reader_state = ReaderState::FetchingData {
future,
next_behaviour,
};
return Poll::Pending;
}
};
match next_behaviour {
FetchNextBehaviour::ReadSyncMarker => {
let sync_marker_pos = data_chunk
.windows(16)
.position(|slice| slice == self.sync_marker);
let block_start = match sync_marker_pos {
Some(pos) => pos + 16, // Move past the sync marker
None => {
// Sync marker not found, valid if we arbitrarily split the file at its end.
self.reader_state = ReaderState::Finished;
return Poll::Ready(None);
}
};
self.reader_state = ReaderState::DecodingBlock {
reader,
data: data_chunk.slice(block_start..),
};
}
FetchNextBehaviour::DecodeVLQHeader => {
let mut reader = reader;
let mut data = data_chunk;
// Feed bytes one at a time until we reach Data state (VLQ header complete)
while !matches!(self.block_decoder.state(), BlockDecoderState::Data) {
if data.is_empty() {
return self.finish_with_error(ArrowError::AvroError(
"Unexpected EOF while reading Avro block header".into(),
));
}
let consumed = match self.block_decoder.decode(&data[..1]) {
Ok(consumed) => consumed,
Err(e) => return self.finish_with_error(e),
};
if consumed == 0 {
return self.finish_with_error(ArrowError::AvroError(
"BlockDecoder failed to consume byte during VLQ header parsing"
.into(),
));
}
data = data.slice(consumed..);
}
// Now we know the block size. Slice remaining data to what we need.
let bytes_remaining = self.block_decoder.bytes_remaining();
let data_to_use = data.slice(..data.len().min(bytes_remaining));
let consumed = match self.block_decoder.decode(&data_to_use) {
Ok(consumed) => consumed,
Err(e) => return self.finish_with_error(e),
};
if consumed != data_to_use.len() {
return self.finish_with_error(ArrowError::AvroError(
"BlockDecoder failed to consume all bytes after VLQ header parsing"
.into(),
));
}
// May need more data to finish the block.
let range_to_fetch = match self.remaining_block_range() {
Ok(range) if range.is_empty() => {
// All bytes fetched, move to decoding block directly
self.reader_state = ReaderState::DecodingBlock {
reader,
data: Bytes::new(),
};
continue;
}
Ok(range) => range,
Err(e) => return self.finish_with_error(e),
};
let future = async move {
let data = reader.get_bytes(range_to_fetch).await?;
Ok((reader, data))
}
.boxed();
self.reader_state = ReaderState::FetchingData {
future,
next_behaviour: FetchNextBehaviour::ContinueDecoding,
};
continue;
}
FetchNextBehaviour::ContinueDecoding => {
self.reader_state = ReaderState::DecodingBlock {
reader,
data: data_chunk,
};
}
}
}
ReaderState::InvalidState => {
return self.finish_with_error(ArrowError::AvroError(
"AsyncAvroFileReader in invalid state".into(),
));
}
ReaderState::DecodingBlock {
mut reader,
mut data,
} => {
// Try to decode another block from the buffered reader.
let consumed = match self.block_decoder.decode(&data) {
Ok(consumed) => consumed,
Err(e) => return self.finish_with_error(e),
};
data = data.slice(consumed..);
// If we reached the end of the block, flush it, and move to read batches.
if let Some(block) = self.block_decoder.flush() {
// Successfully decoded a block.
let block_count = block.count;
// We completed (or resumed and completed) a block successfully.
self.finishing_partial_block = false;
let block_data = Bytes::from_owner(if let Some(ref codec) = self.codec {
match codec.decompress(&block.data) {
Ok(decompressed) => decompressed,
Err(e) => {
return self.finish_with_error(ArrowError::AvroError(format!(
"Error decompressing Avro block with codec {codec:?}: {e}"
)));
}
}
} else {
block.data
});
// Since we have an active block, move to reading batches
self.reader_state = ReaderState::ReadingBatches {
reader,
data,
block_data,
remaining_in_block: block_count,
};
continue;
}
// data should always be consumed unless Finished, if it wasn't, something went wrong
if !data.is_empty() {
return self.finish_with_error(ArrowError::AvroError(
"Unable to make progress decoding Avro block, data may be corrupted"
.into(),
));
}
if matches!(self.block_decoder.state(), BlockDecoderState::Finished) {
// We've already flushed, so if no batch was produced, we are simply done.
self.finishing_partial_block = false;
self.start_flushing();
continue;
}
// If we've tried the following stage before, and still can't decode,
// this means the file is truncated or corrupted.
if self.finishing_partial_block {
return self.finish_with_error(ArrowError::AvroError(
"Unexpected EOF while reading last Avro block".into(),
));
}
// Avro splitting case: block is incomplete, we need to:
// 1. Parse the length so we know how much to read
// 2. Fetch more data from the object store
// 3. Create a new block data from the remaining slice and the newly fetched data
// 4. Continue decoding until end of block
self.finishing_partial_block = true;
// Mid-block, but we don't know how many bytes are missing yet
if matches!(
self.block_decoder.state(),
BlockDecoderState::Count | BlockDecoderState::Size
) {
// Max VLQ header is 20 bytes (10 bytes each for count and size).
// Fetch just enough to complete it.
const MAX_VLQ_HEADER_SIZE: u64 = 20;
let fetch_end = (self.range.end + MAX_VLQ_HEADER_SIZE).min(self.file_size);
// If there is nothing more to fetch, error out
if fetch_end == self.range.end {
return self.finish_with_error(ArrowError::AvroError(
"Unexpected EOF while reading Avro block header".into(),
));
}
let range_to_fetch = self.range.end..fetch_end;
self.range.end = fetch_end; // Track that we've fetched these bytes
let future = async move {
let data = reader.get_bytes(range_to_fetch).await?;
Ok((reader, data))
}
.boxed();
self.reader_state = ReaderState::FetchingData {
future,
next_behaviour: FetchNextBehaviour::DecodeVLQHeader,
};
continue;
}
// Otherwise, we're mid-block but know how many bytes are remaining to fetch.
let range_to_fetch = match self.remaining_block_range() {
Ok(range) => range,
Err(e) => return self.finish_with_error(e),
};
let future = async move {
let data = reader.get_bytes(range_to_fetch).await?;
Ok((reader, data))
}
.boxed();
self.reader_state = ReaderState::FetchingData {
future,
next_behaviour: FetchNextBehaviour::ContinueDecoding,
};
continue;
}
ReaderState::ReadingBatches {
reader,
data,
mut block_data,
mut remaining_in_block,
} => {
let (consumed, records_decoded) =
match self.decoder.decode_block(&block_data, remaining_in_block) {
Ok((consumed, records_decoded)) => (consumed, records_decoded),
Err(e) => return self.finish_with_error(e),
};
remaining_in_block -= records_decoded;
if remaining_in_block == 0 {
if data.is_empty() {
// No more data to read, drain remaining buffered records
self.start_flushing();
} else {
// Finished this block, move to decode next block in the next iteration
self.reader_state = ReaderState::DecodingBlock { reader, data };
}
} else {
// Still more records to decode in this block, slice the already-read data and stay in this state
block_data = block_data.slice(consumed..);
self.reader_state = ReaderState::ReadingBatches {
reader,
data,
block_data,
remaining_in_block,
};
}
// We have a full batch ready, emit it
// (This is not mutually exclusive with the block being finished, so the state change is valid)
if self.decoder.batch_is_full() {
match self.decoder.flush() {
Ok(Some(batch)) => return Poll::Ready(Some(Ok(batch))),
Ok(None) => {
return self.finish_with_error(ArrowError::AvroError(
"Decoder reported a full batch, but flush returned None".into(),
));
}
Err(e) => return self.finish_with_error(e),
}
}
}
ReaderState::Flushing => {
return self.poll_flush();
}
ReaderState::Finished => {
// Terminal: once finished (including after an error), always yield None
self.reader_state = ReaderState::Finished;
return Poll::Ready(None);
}
}
}
}
Addressed this; I also added a lint to prevent any accidental "?" in that function.
I will take a look at coverage later today, as it requires setting up some infra to test locally.
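One way to enforce that restriction (an assumption about the mechanism; the PR may use a different lint or attribute) is to deny Clippy's question_mark_used restriction lint on the function:

```rust
// Deny the `?` operator inside this function so an accidental early return
// cannot leave the state machine stuck in the sentinel state.
// (Assumes Clippy runs in CI; the function name here is just a placeholder.)
#[deny(clippy::question_mark_used)]
fn read_next_sketch() -> Result<(), std::fmt::Error> {
    // Every fallible call must be matched explicitly and mapped to a state change.
    Ok(())
}
```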

Which issue does this PR close?
Rationale for this change
Allows for proper file splitting within an asynchronous context.
What changes are included in this PR?
The raw implementation, allowing for file splitting: starting mid-block (reading until a sync marker is found) and then reading until the end of the block is found.
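For example, the "start mid-block" step amounts to scanning the first fetched chunk for the 16-byte sync marker and beginning to decode just past it. A toy helper mirroring the state-machine snippet above (not the PR's actual function):

```rust
use bytes::Bytes;

/// Return the data starting at the first block boundary in the chunk, or `None`
/// if no sync marker falls inside it (i.e. this split owns no complete block start).
fn data_after_sync_marker(chunk: &Bytes, sync_marker: &[u8; 16]) -> Option<Bytes> {
    chunk
        .windows(16)
        .position(|w| w == sync_marker)
        .map(|pos| chunk.slice(pos + 16..))
}
```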
This reader currently requires that a reader_schema be provided if type promotion, schema evolution, or projection is desired.
This is because #8928 is currently blocking proper parsing from an ArrowSchema.
Are these changes tested?
Yes
Are there any user-facing changes?
Only the addition of the new async reader.
Other changes are internal to the crate (namely the way Decoder is created from parts).