Common File Formats in Apache Spark: CSV, JSON, Avro, and Parquet
Understanding file formats commonly used in Apache Spark
Introduction
Apache Spark supports multiple file formats for reading and writing data, each optimized for different use cases. Choosing the right format can dramatically impact storage costs, query performance, and overall system efficiency.
In modern data engineering, the file format decision isn’t just about compatibility. It’s about optimizing for your workload patterns: Are you writing data once and reading it many times? Do you need schema evolution? Are you processing entire rows or just specific columns?
This guide explores the four most common file formats in Spark ecosystems: CSV, JSON, Avro, and Parquet. Understanding their strengths and trade-offs will help you make informed decisions that can reduce query times from minutes to seconds and cut storage costs by over 95%.
Core Concepts
CSV (Comma-Separated Values)
Row-based plain text format where each line represents a record, and commas separate field values.
Key Characteristics
Human-readable and universally supported
No built-in schema definition (schema must be inferred or specified)
Minimal compression capabilities
Fast to write but slow to read at scale
No support for nested data structures
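Because CSV carries no schema, every value round-trips as text unless types are supplied some other way. A minimal stdlib sketch (no Spark required) showing the type loss:

```python
import csv
import io

# Write typed values out as CSV, then read them back.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "amount"])  # header row
writer.writerow([1, 19.99])        # an int and a float
buf.seek(0)

rows = list(csv.DictReader(buf))
print(rows[0])  # {'id': '1', 'amount': '19.99'} -- both came back as strings
```

This is the gap that inferSchema=True (or an explicit schema) has to fill when Spark reads CSV.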
Performance
Fastest write times among all formats
Poor compression ratio (minimal size reduction)
Inefficient for large-scale analytics due to lack of optimization
Example
# Reading CSV in Spark
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.write.mode("overwrite").csv("output.csv")
JSON (JavaScript Object Notation)
Text-based format that represents data as key-value pairs, supporting nested and hierarchical structures.
Key Characteristics
Self-describing schema (schema stored with each record)
Excellent for semi-structured and nested data
Human-readable and web-friendly
Schema flexibility without predefined structure
Largest storage footprint due to schema repetition
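The schema-repetition overhead is easy to see without Spark. A stdlib-only sketch comparing JSON Lines against CSV for the same records:

```python
import csv
import io
import json

records = [{"id": i, "amount": i * 1.5} for i in range(1000)]

# JSON Lines: the field names "id" and "amount" repeat in every record.
json_bytes = "\n".join(json.dumps(r) for r in records).encode()

# CSV: the field names appear exactly once, in the header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "amount"])
writer.writeheader()
writer.writerows(records)
csv_bytes = buf.getvalue().encode()

print(len(json_bytes), len(csv_bytes))  # JSON is noticeably larger
```

The gap grows with the number of fields and the length of their names, which is why JSON works well on the wire but poorly as a long-term storage format.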
Performance
Stores schema attributes with every row, leading to significant overhead
Poor compression efficiency
Slower read/write compared to binary formats
Ideal for data exchange, not long-term storage
Example
from pyspark.sql.functions import col
# Reading JSON in Spark
df = spark.read.json("data.json")
# Writing JSON
df.write.mode("overwrite").json("output.json")
# Handling nested structures
df.select(col("user.name"), col("user.email")).show()
Avro (Row-Based Binary Format)
Row-oriented binary format maintained as an Apache project, designed for data serialization with embedded schema support.
Characteristics
Row-based storage (entire records stored sequentially)
Schema stored separately in header (not with each row)
Excellent schema evolution support
Efficient serialization/deserialization
Compression ratio: ~91% size reduction vs. raw CSV (benchmark-dependent)
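Schema evolution in Avro typically works by adding new fields with default values, so readers using an older schema and writers using a newer one stay compatible. A hypothetical v2 schema, written as a plain Python dict mirroring Avro's JSON schema syntax:

```python
import json

# Hypothetical "User" record schema, version 2.
# The "email" field is new: because it is nullable and has a default,
# records written with the v1 schema (id + name only) can still be read.
schema_v2 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},  # added in v2
    ],
}

print(json.dumps(schema_v2, indent=2))
```

The field names and types here are illustrative; the key point is the pattern of evolving a schema by appending defaulted fields rather than changing existing ones.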
Performance Profile
Superior write performance (best for streaming)
Efficient for operations requiring entire row reads
Optimized for append-heavy workloads
Less efficient for column-specific queries
Use Cases
Kafka streaming pipelines
Real-time data ingestion
Systems requiring frequent schema changes
Cross-platform data exchange
Example
# Reading Avro in Spark (requires spark-avro package)
df = spark.read.format("avro").load("data.avro")
# Writing Avro
df.write.format("avro").mode("overwrite").save("output.avro")
Parquet (Columnar Binary Format)
Columnar storage format where data is organized by columns rather than rows, optimized for analytical queries.
Characteristics
Columnar storage (data from same column stored together)
Advanced compression techniques (dictionary encoding, run-length encoding, bit-packing)
Predicate pushdown and column pruning support
Compression ratio: ~97.5% size reduction vs. raw CSV (benchmark-dependent)
Spark’s default format
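The columnar advantage can be sketched without Parquet itself: storing a column's values together removes the per-record key repetition and hands generic compressors long runs of similar bytes. A stdlib-only illustration (not actual Parquet encoding):

```python
import json
import zlib

rows = [{"region": "EU", "amount": i % 10} for i in range(1000)]

# Row layout: whole records serialized one after another.
row_bytes = json.dumps(rows).encode()

# Columnar layout: all values of each column stored together, keys stored once.
columns = {key: [r[key] for r in rows] for key in rows[0]}
col_bytes = json.dumps(columns).encode()

print(len(row_bytes), len(col_bytes))  # columnar is smaller before compression
print(len(zlib.compress(row_bytes)), len(zlib.compress(col_bytes)))
```

Parquet goes much further than this sketch, applying dictionary encoding, run-length encoding, and bit-packing per column chunk, but the underlying intuition is the same.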
Performance
Read times up to ~3x faster than Avro for analytical queries (workload-dependent)
Slower writes compared to row-based formats
Exceptional compression (194GB CSV → 4.7GB Parquet in benchmarks)
Efficient I/O when accessing subset of columns
Optimization
Only required columns are read from disk
Statistical metadata enables partition pruning
Compatible with most big data frameworks (Spark, Hive, Presto, Snowflake)
Example
# Reading Parquet in Spark
df = spark.read.parquet("data.parquet")
# Writing Parquet with compression
df.write.mode("overwrite").option("compression", "snappy").parquet("output.parquet")
# Column pruning - only reads required columns
df_subset = spark.read.parquet("data.parquet").select("id", "amount")
# Partitioning for better query performance
df.write.partitionBy("date", "region").parquet("partitioned_output.parquet")
Key Takeaways
CSV: Fast writes, human-readable, poor compression. Use for debugging only.
JSON: Handles nested data, terrible storage efficiency. Use for API ingestion, then convert.
Avro: Row-based, fast writes, 91% compression. Best for streaming and real-time ingestion.
Parquet: Columnar, 97.5% compression, 3x faster reads. Default choice for analytics.
Hybrid approach: Ingest with Avro, store and analyze with Parquet for optimal performance.
File format impacts: Storage costs, query performance, and system scalability.



