Jun 24, 2026

F3 Is a Bet That Data Files Should Carry Their Own Decoders

F3 looks odd if you land on the repository first.

The README says it is a data file format for efficiency, interoperability, and extensibility. It says it fixes layout shortcomings in older columnar formats such as Parquet. It says it embeds WebAssembly decoders. Then it warns you that this is a research prototype and should not be used in production.

That warning is important. F3 is not a drop-in Parquet replacement you should roll into a lakehouse this afternoon. It is a research project attached to a SIGMOD paper, and the real argument lives in the paper rather than in the repository landing page.

The interesting idea is simple enough to state:

What if a data file could include not only its data and metadata, but also the code needed to decode any new encoding used inside it?

That is the core of F3, short for Future-proof File Format. It is an open-source columnar file format aimed at analytic workloads where Parquet and ORC are good but increasingly stretched. F3 keeps the broad shape of a modern columnar format: metadata, row groups, column-oriented storage, encoded units, and support for vectorized readers. But it changes two parts of the model that matter a lot for long-lived data systems.

First, it separates layout decisions that older formats tend to tie together. Second, it treats encodings as plug-ins and embeds a WebAssembly version of the decoder inside the file.

That combination is the whole pitch. F3 is not just asking whether a file can be smaller or faster on one benchmark. It is asking whether a file format can evolve without forcing the entire data ecosystem to upgrade in lockstep.

The Parquet Problem Is Not That Parquet Is Bad

Parquet won because it solved a practical problem. It gave data systems a shared, open, columnar format that worked across engines, languages, and storage systems. If you want a file that Spark, DuckDB, Trino, Arrow, Python, Rust, Java, and cloud object stores can all deal with, Parquet is the default answer for good reasons.

That default status is also the trap.

Parquet and ORC came from the early 2010s. They were designed around the systems, hardware, and workloads of that era: Hadoop-style analytics, batch scans, row groups, fewer columns, and a very different balance between storage, network, and CPU cost. Since then, storage and network throughput have improved dramatically, cloud object stores have become normal, machine-learning feature tables have become wide, vector embeddings have become ordinary data, and workloads have started mixing full scans with selective reads and random access.

The F3 paper argues that the old formats are now constrained by assumptions that used to be reasonable. A row group is too blunt a unit when you want to tune I/O, encoding, dictionaries, and metadata independently. Metadata layouts become expensive when a table has thousands or tens of thousands of columns and a job only needs a small subset. And new encodings are hard to deploy because file compatibility depends on readers knowing how to decode them.

The last point is the most painful one.

Open formats can add new features on paper. The ecosystem may still avoid those features for years because old readers break, some implementations lag, and large organizations cannot assume every query engine has the same library version. The F3 paper points out that many real-world Parquet files still stick close to older Parquet features even when written by modern software.

That is not irrational conservatism. It is compatibility pressure.

If a file is meant to move across systems, the safest writer targets the oldest widely supported feature set. That keeps data readable, but it also means the format evolves more slowly than compression research, hardware, workload shape, and engine design.

F3 Splits The File Into More Tunable Pieces

F3 keeps a columnar structure, but it tries to avoid one unit doing too many jobs.

In Parquet, the row group carries a lot of responsibility. It is a horizontal partition of the table. It affects buffering during writes. It influences I/O size. It bounds column chunks. It interacts with dictionaries, pages, compression, and skipping structures.

That works well enough for many analytics jobs. It becomes awkward when the ideal size for one concern is not the ideal size for another.

F3 introduces a more explicit separation:

The logical row group still exists.
An I/O unit can be sized for the storage medium and access pattern.
An encoding unit is the smallest encoded/decoded byte buffer.
Dictionary scope can be chosen independently instead of being welded to the row group.
Column metadata is stored so readers can reach only the metadata they need.
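
To make the separation concrete, here is a toy sketch in Python. It is not F3's actual layout or API: the class names, the pack_io_units helper, and the byte sizes are all assumptions made up for illustration. It only shows the idea that encoding units can be grouped into I/O units by a byte target that has nothing to do with row-group boundaries.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EncodingUnit:
    """Smallest buffer that is encoded and decoded as one piece (hypothetical model)."""
    num_rows: int
    payload: bytes


@dataclass
class IOUnit:
    """A run of encoding units that a reader would fetch with one read (hypothetical model)."""
    encoding_units: List[EncodingUnit]

    @property
    def num_bytes(self) -> int:
        return sum(len(u.payload) for u in self.encoding_units)


def pack_io_units(units: List[EncodingUnit], target_bytes: int) -> List[IOUnit]:
    """Greedily group encoding units so each I/O unit stays at or under target_bytes.

    A single unit larger than target_bytes still gets its own I/O unit.
    """
    io_units: List[IOUnit] = []
    current: List[EncodingUnit] = []
    size = 0
    for unit in units:
        # Close the current I/O unit before it would exceed the target.
        if current and size + len(unit.payload) > target_bytes:
            io_units.append(IOUnit(current))
            current, size = [], 0
        current.append(unit)
        size += len(unit.payload)
    if current:
        io_units.append(IOUnit(current))
    return io_units


# Five 40-byte encoding units packed against a 100-byte I/O target.
units = [EncodingUnit(num_rows=1000, payload=bytes(40)) for _ in range(5)]
groups = pack_io_units(units, target_bytes=100)
print([len(g.encoding_units) for g in groups])  # [2, 2, 1]
```

Changing target_bytes changes the I/O grouping without touching how any individual unit was encoded, which is the kind of independence the list above describes.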

The result is a file layout that is less monolithic. A reader should not have to deserialize the whole footer just to inspect a few columns in a very wide table. A writer should not have to accept the same boundary for I/O, encoding, and dictionary effectiveness. A format should have a place to grow indexes and filters without turning the entire file into a compatibility puzzle.
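
The metadata point can be sketched the same way. The following toy example is an assumption-heavy illustration, not F3's real metadata encoding: it stores each column's metadata as a separate JSON record and keeps a small index of offsets, so a reader can parse one column's record without parsing the other 999. The function names and the JSON representation are invented for this sketch.

```python
import json
from typing import Dict, Tuple


def build_metadata(columns: Dict[str, dict]) -> Tuple[bytes, Dict[str, Tuple[int, int]]]:
    """Serialize each column's metadata separately and record (offset, length) per column."""
    blob = bytearray()
    index: Dict[str, Tuple[int, int]] = {}
    for name, meta in columns.items():
        encoded = json.dumps(meta).encode("utf-8")
        index[name] = (len(blob), len(encoded))
        blob += encoded
    return bytes(blob), index


def read_column_metadata(blob: bytes, index: Dict[str, Tuple[int, int]], name: str) -> dict:
    """Parse only the bytes that belong to the requested column."""
    offset, length = index[name]
    return json.loads(blob[offset:offset + length])


# A wide table with 1000 columns; the reader touches one record.
columns = {f"c{i}": {"encoding": "plain", "rows": 100} for i in range(1000)}
blob, index = build_metadata(columns)
print(read_column_metadata(blob, index, "c42"))  # {'encoding': 'plain', 'rows': 100}
```

In this toy the index itself is still read in full. A real format would need a compact or lazily readable index as well, and the sketch makes no claim about how F3 handles that.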

This is the quiet part of F3. The Wasm decoder idea gets the attention because it sounds unusual, but the layout work is just as important. A self-decoding file still needs a good physical organization. Otherwise it is only a clever packaging trick around a mediocre storage format.

The Decoder-In-The-File Idea

F3’s most distinctive move is to embed decoder implementations as WebAssembly binaries.

The file still contains data and metadata. But if the file uses an encoding that a reader does not natively know, the reader can use the Wasm decoder stored in the file. The decoder implements a public API and turns encoded bytes into Arrow-style buffers that the engine can consume.

That changes the compatibility contract.

In a traditional format, a new encoding requires all relevant readers to learn that encoding before files can safely depend on it. In F3, a writer can include the decoder alongside the data. A native implementation can still be used when available, but an older reader is not automatically helpless just because the file uses a newer encoding.
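
The dispatch logic implied here can be sketched in a few lines of Python. Everything in this example is hypothetical: the registry, the pick_decoder function, and the "delta" encoding are invented for illustration, and a plain Python function stands in for the embedded Wasm module so the sketch runs without a Wasm runtime. It is not the F3 decoder API.

```python
from typing import Callable, Dict, List, Tuple

Decoder = Callable[[bytes], List[int]]

# Encodings this (hypothetical) reader implements natively.
NATIVE_DECODERS: Dict[str, Decoder] = {
    "plain": lambda data: list(data),
}


def decode_delta(data: bytes) -> List[int]:
    """Stand-in for a decoder shipped inside the file: each byte is a delta from the previous value."""
    out: List[int] = []
    total = 0
    for b in data:
        total += b
        out.append(total)
    return out


def pick_decoder(encoding: str, embedded: Dict[str, Decoder]) -> Tuple[Decoder, str]:
    """Prefer a native decoder; fall back to the one carried by the file."""
    if encoding in NATIVE_DECODERS:
        return NATIVE_DECODERS[encoding], "native"
    if encoding in embedded:
        return embedded[encoding], "embedded"
    raise LookupError(f"no decoder available for encoding {encoding!r}")


# The reader has never heard of "delta", but the file brings its own decoder.
decoder, source = pick_decoder("delta", {"delta": decode_delta})
print(source, decoder(bytes([1, 2, 3])))  # embedded [1, 3, 6]
```

The ordering is the design choice worth noticing: the native path stays the fast path, and the embedded decoder only matters when the reader would otherwise fail.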

This is the part that makes F3 feel less like “Parquet with a new layout” and more like an attempt to change the deployment model for file-format evolution.

It also explains the “future-proof” name. The claim is not that the authors know which encoding will win in ten years. The claim is that the file format should not need a new ecosystem-wide standardization round every time an encoding, compression scheme, or workload pattern changes.

If the file can carry the decoder, then old infrastructure has a fallback path.

There are real caveats. A Wasm decoder is still code. It has to be sandboxed. It can consume CPU. It can contain bugs. It may be harder to debug than a known native library. And if the main value proposition depends on running untrusted decoders from arbitrary files, production systems will need clear policies: allow lists, resource limits, deterministic execution, disabled I/O, runtime isolation, and perhaps a preference for well-known external decoder registries in sensitive environments.

The Hacker News discussion immediately focused on this point, and it is the right concern. “It is Wasm” is not a security plan by itself. Wasm gives a better sandboxing starting point than native code, but systems still need to decide what the decoder is allowed to do, how long it may run, how much memory it may use, and whether inline decoders are permitted at all.
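
Those decisions can be written down as an explicit policy object. The sketch below is an assumption, not anything F3 ships: DecoderPolicy, its field names, and admit_decoder are invented here. It checks only what can be checked before execution (whether inline decoders are allowed, module size, and a SHA-256 allow list); the memory and time limits are carried as fields that a real Wasm runtime would have to enforce.

```python
import hashlib
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class DecoderPolicy:
    """Hypothetical per-deployment policy for decoders embedded in files."""
    allow_inline: bool = False                      # are embedded decoders permitted at all?
    allowed_sha256: FrozenSet[str] = frozenset()    # allow list of module digests
    max_module_bytes: int = 1 << 20                 # reject oversized modules up front
    max_memory_bytes: int = 64 << 20                # to be enforced by the runtime, not here
    max_millis: int = 500                           # to be enforced by the runtime, not here


def admit_decoder(module: bytes, policy: DecoderPolicy) -> bool:
    """Decide whether an embedded decoder may be handed to the sandbox."""
    if not policy.allow_inline:
        return False
    if len(module) > policy.max_module_bytes:
        return False
    digest = hashlib.sha256(module).hexdigest()
    return digest in policy.allowed_sha256


# Sample bytes only: the 8-byte Wasm preamble (magic plus version 1), not a working decoder.
trusted = b"\x00asm\x01\x00\x00\x00"
policy = DecoderPolicy(
    allow_inline=True,
    allowed_sha256=frozenset({hashlib.sha256(trusted).hexdigest()}),
)
print(admit_decoder(trusted, policy))          # True
print(admit_decoder(b"other", policy))         # False
print(admit_decoder(trusted, DecoderPolicy())) # False
```

The default policy rejects everything, which matches the cautious stance a production system would likely start from.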

The analogy is less “download and run an executable” and more “open a file using a constrained virtual machine.” That can be reasonable. Fonts have contained executable hinting programs. Browsers run Wasm. Databases run user-defined code in some settings. But each of those examples comes with years of operational scars.

F3’s idea is promising precisely because it moves a hard compatibility problem into a runtime boundary. That boundary then has to be engineered like a serious boundary.

Why Compatibility Beats Elegance

The biggest obstacle for F3 is not whether the paper has good ideas. It is whether any new analytic file format can beat the gravitational pull of Parquet.

Compatibility is a feature. In data infrastructure, it is often the feature.

A format that is 20 percent better but unsupported by the tools already in a company’s stack may be worse than a format that is boring and everywhere. Teams do not choose file formats in a vacuum. They choose them because of query engines, notebooks, cloud services, catalogs, validators, storage policies, language libraries, monitoring tools, and the people who will have to debug incidents at 2 a.m.

That is why “just build a better format” is hard. New formats lose by default because they begin with no installed base. Every unsupported engine is a migration blocker. Every missing library is a support ticket. Every ambiguous semantic edge is a data correctness risk.

F3’s embedded-decoder strategy is a direct response to that adoption wall. It says: do not require every reader to know every future encoding. Give every file a way to explain itself.

But that does not solve everything.

A query engine still needs an F3 reader. The reader still needs to understand the container, metadata, schemas, offsets, buffers, and decoder API. The ecosystem still needs test suites, compatibility matrices, language implementations, fuzzing, security reviews, and stable semantics. The files still need to work well with object stores, catalogs, versioned datasets, permission systems, and data lake table formats.

The embedded decoder is a clever bridge, not a full civilization.

That is why the project status matters. The repository describes F3 as a proof-of-concept package and a research prototype. The code is useful for validating ideas and reproducing experiments. It is not a mature data platform contract.

The Archival Angle

There is another way to read F3 that may be more compelling than “replace Parquet now.”

F3 may be a useful way to think about archival analytic data.

In archives, the question is not only “can my current engine scan this quickly?” It is also “can someone read this later when the original writer, library version, and engine are gone?” A self-describing file that carries metadata, data, and decoder logic has an obvious appeal.

This does not mean F3 automatically beats simpler archival formats. CSV and JSON have an enormous advantage: humans can inspect them directly, and future readers can reconstruct a lot with minimal machinery if the schema is documented. SQLite has a different advantage: one file, stable public format, broad tools, schema included, and decades of compatibility.

F3 is aiming at a different region: large analytic data where plain text is too expensive, columnar layout matters, and future decoders may be needed to preserve efficient encodings.

The tradeoff is that future readability now depends on the container spec, the availability of a Wasm runtime, and the decoder API. That may still be a good trade for large columnar data, but it is not free. A future archivist may prefer a slower, simpler, more widely documented representation over a clever self-decoding one.

That tension is healthy. “Future-proof” should invite skepticism. Real future-proofing is not a slogan; it is boring compatibility work repeated over years.

Where F3 Is Strong

The best thing about F3 is that it identifies a real failure mode in modern data formats.

Data formats age. Encodings age. Hardware assumptions age. Workloads change. But files last, and large organizations hate breaking readers. That means successful formats often become conservative by necessity. They preserve interoperability by freezing themselves around the lowest common denominator.

F3 attacks that exact problem.

Its strongest claims are:

File layout should let metadata, I/O, dictionary scope, and encoding units evolve separately.
Wide tables and selective feature access deserve better metadata access patterns.
New encodings should not require every reader in the world to upgrade first.
A file should have a compatibility fallback for decoding its own data.
The extension path should be designed into the format rather than bolted on later.

Those are good instincts. They line up with how data systems actually fail: not as isolated algorithms, but as messy ecosystems where feature rollout, tool support, and operational caution dominate.

Where F3 Needs Proof

The hard part is no longer the idea. It is proof at ecosystem scale.

F3 needs more than benchmark wins. It needs examples that make the value obvious to working engineers. It needs a README that explains what problem a user has, why Parquet is insufficient for that problem, and what a minimal F3 write/read flow looks like. It needs clear threat modeling for embedded Wasm. It needs reference decoders, compatibility tests, and integration stories for Arrow, DuckDB, DataFusion, Spark, object stores, and table formats.

It also needs to show where the Wasm fallback is actually used.

If every serious deployment disables embedded decoders and only allows known native decoders, the design becomes less radical. It may still be useful as a plug-in architecture, but the universal self-decoding story weakens. If deployments allow embedded decoders, then sandboxing, resource limits, and trust policy become first-class format concerns.

Neither path is disqualifying. They are just different products.

The research prototype can defer those choices. A production format cannot.

The Real Lesson

F3 is worth paying attention to because it reframes a familiar problem.

Most file-format debates compare current performance: this scans faster, that compresses better, this has better random access, that has better support. F3 asks a longer-term question: how does a file format survive the next decade of changes without becoming another compatibility cage?

That question matters.

Parquet is still the practical default. It has the ecosystem. It has the tooling. It has the inertia that makes data formats useful in the first place. F3 does not change that today.

But F3 points at a design pressure that will not go away. Analytic data is getting wider, stranger, more multimodal, and more long-lived. New encodings will keep appearing. Workloads will keep mixing scans, random access, feature selection, vectors, blobs, and cloud-storage latency. The old assumption that a file format can evolve mostly through spec updates and library upgrades is increasingly expensive.

Maybe F3 itself becomes a production format. Maybe it remains a research artifact. Maybe the important idea gets absorbed into another format or a future Parquet generation. The useful part is the challenge it poses:

A modern data file should not only store bytes efficiently. It should also explain how those bytes can continue to be read when the world around the file has changed.