Performance Analysis: Read-In, Write-Out with Native IO, Pandas, Polars, and Modin
Evaluating Time and Memory Efficiency Across Methods
Created on: Nov 18, 2023
Last updated on: May 14, 2024
Background
When working with tabular data, data science practitioners typically kick off the process quickly with their go-to libraries. During data crunching, however, one might hit roadblocks such as:
High memory consumption, which causes the program to crash on a lower-spec machine.
Long processing times, which hinder the progress of the project.
This article compares data analysis options with a particular focus on time and memory consumption. The goal is to show the trade-offs between the Native IO method and the Polars, Pandas, and Modin libraries, and to encourage you to take a step back and choose the tool that best fits your next project.
TLDR:
For small datasets (<5 MB), opt for Polars instead of Pandas.
For large datasets (>5 MB), use Modin to leverage distributed processing across cores.
Choose the Native IO method for the maximum gain in both time and memory, at the cost of convenience when performing data analysis.
One-Line Explanation of Each Approach/Library
Before proceeding further, a compact description of each method/library is presented below.
Native IO: Python’s built-in read-write operations, e.g.
with open("sample.csv", "r") as f:
    for line in f:  # process the file line by line
        ...
Polars: Rust-based data analysis library designed for high performance.
Pandas: Single-threaded comprehensive data analysis library.
Modin: Data analysis library with distributed computing capabilities.
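To make the three libraries concrete, the snippet below is a minimal sketch (not taken from the benchmark code) of the same CSV read in each library; the file name is an assumption.

import pandas as pd
import polars as pl
import modin.pandas as mpd  # Modin mirrors the pandas API

df_pandas = pd.read_csv("sample.csv")  # single-threaded Pandas
df_polars = pl.read_csv("sample.csv")  # Rust-backed Polars
df_modin = mpd.read_csv("sample.csv")  # distributed across cores (e.g. via Ray)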
For a more thorough understanding, kindly check out the link and other resources. This article assumes a foundational understanding of each method.
Use Case
To compare on the same baseline, the use case in this article focuses on aggregating data from n input files into a single output file, as shown in the diagram below.
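As a rough illustration of this use case in code (a sketch, not the benchmarked implementation; the folder and file names are assumptions), a Pandas-based version might look like this:

import glob
import pandas as pd

# Read every input CSV in a hypothetical data/ folder and stack them
frames = [pd.read_csv(path) for path in glob.glob("data/*.csv")]
aggregated = pd.concat(frames, ignore_index=True)

# Write the aggregated result to a single output file
aggregated.to_csv("output.csv", index=False)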
Get Your Hands Dirty
Code is hosted on GitHub while data can be retrieved from the resources below.
Small-sized data: winequality-red.csv (~0.1 MB)
Large-sized data: train_essays_7_prompts_v2.csv (~36.5 MB)
Dependencies
Dependencies are listed in the requirements.txt.
memory_profiler
line_profiler
pandas
polars
modin[ray]
cchardet
click
distributed
With Python 3.10 or above, install the dependencies with
python -m pip install -r requirements.txt
How to Run
The two scripts below are dedicated to measuring time and memory consumption separately.
Retrieve the guidelines for running the scripts with the following command:
python compare_time.py --help
python compare_memory.py --help
The parameters that can be passed to each script are listed below.
Required:
engine: Engine to process data. Supported options: [io, pandas, polars, modin].
Optional:
datapath: Directory where the CSV file is located (without the filename).
csvfilename: CSV file name (without extension). Supported options: [train_essays_7_prompts_v2, winequality-red].
duplicate: Number of times to duplicate the dataframe. Default: 10.
Kindly note that the datapath and csvfilename parameters determine where the file will be read from. For example, with datapath set to data/ and csvfilename set to winequality-red, the data file is expected to be found at <current-path>/data/winequality-red.csv.
The parameter duplicate decides how many times the file will be retrieved and subsequently aggregated.
With everything set, run with
python compare_time.py --engine polars --datapath data/ --csvfilename winequality-red --duplicate 10
The simplest way to run the command is to specify only the required parameter, engine:
python compare_time.py --engine polars
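Since click is listed as a dependency, a CLI with these parameters could be wired up roughly as shown below. This is only a sketch under that assumption (including the default values for datapath and csvfilename), not the repository's actual code.

import click

@click.command()
@click.option("--engine", required=True,
              type=click.Choice(["io", "pandas", "polars", "modin"]),
              help="Engine to process data.")
@click.option("--datapath", default="data/", help="Directory containing the CSV file.")  # hypothetical default
@click.option("--csvfilename", default="winequality-red", help="CSV file name without extension.")  # hypothetical default
@click.option("--duplicate", default=10, type=int, help="Number of times to duplicate the dataframe.")
def main(engine, datapath, csvfilename, duplicate):
    # Dispatch to the engine-specific read/aggregate/write routine here
    click.echo(f"engine={engine}, file={datapath}{csvfilename}.csv, duplicate={duplicate}")

if __name__ == "__main__":
    main()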
Performance Evaluation
The following tests are performed on a server with specifications:
CPU: Intel i9 with 18 cores and 36 threads
Memory: 62 GB
The parameters are fixed at their default values; only the file and the engine are varied.
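For context, a minimal sketch of the kind of measurement harness involved is shown below; it assumes simple wall-clock timing plus memory sampling with memory_profiler and may differ from what the actual scripts do.

import time
from memory_profiler import memory_usage

def measure(func, *args):
    # Wall-clock time around the call
    start = time.perf_counter()
    # Sample memory (in MiB) while func runs and capture its return value
    mem_samples, result = memory_usage((func, args), retval=True, interval=0.1)
    elapsed = time.perf_counter() - start
    return elapsed, max(mem_samples), result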
Time Consumption
Takeaways from the table above:
Polars is faster than Pandas when processing both files.
Pandas’ processing time increases significantly with file size, while Polars shows only a fairly small increase.
Modin does not show an advantage for small files due to its relatively long initialization time (shown in the diagram below). However, it is worth noting that the time consumed does not increase with file size.
The Native IO method tops the chart with the shortest time, even though the content is read line by line during aggregation (shown in the diagram below and sketched after this list).
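As a rough sketch of what such line-by-line aggregation with native IO can look like (the file names are assumptions, not the repository's exact code):

input_files = ["data/part1.csv", "data/part2.csv"]  # hypothetical inputs

with open("output.csv", "w") as out:
    for i, path in enumerate(input_files):
        with open(path, "r") as f:
            header = f.readline()
            if i == 0:
                out.write(header)  # keep the header only once
            for line in f:        # stream the remaining rows line by line
                out.write(line)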
Memory Consumption
Takeaways from the table above:
Polars performs slightly worse than Pandas in memory consumption as file size increases.
As with time consumption, Modin consumes relatively more memory for small files. However, its memory consumption does not grow with file size, and it comes in as the runner-up with the second-smallest footprint for large files (which makes it suitable for managing large files).
The Native IO method thrives with the smallest memory consumption, unaffected by file size.
With both time and memory consumption factored in, the considerations for each approach can be summarized as follows.
Here are the key observations derived from the assessments.
For small datasets (<5 MB), opt for Polars instead of Pandas. While there is a small learning curve due to differences in syntax, one gets familiar in no time.
For large datasets (>5 MB), use Modin to leverage distributed processing across cores. Similarly, one has to get used to the Modin library. Yet, since it is designed with Pandas’ existing user base in mind, the switch should be fairly painless while the gains are significant (see the sketch after this list).
Choose the Native IO method for maximum efficiency in time and memory. Certain use cases rely heavily on read-in and write-out operations without any need for data crunching, and in those cases it is possible to fall back to this basic approach.
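To illustrate how small the switch from Pandas to Modin typically is, here is a minimal sketch; the file names are assumptions.

# import pandas as pd           # before
import modin.pandas as pd       # after: the rest of the code stays the same

df = pd.read_csv("large_file.csv")            # hypothetical file, read in parallel
df.to_csv("large_file_out.csv", index=False)  # write-out also goes through Modin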
This is not the end of it.
As the tests were executed within the scope outlined in this article, kindly consider using the same strategy to obtain numeric insights into which approach suits your own use case. There is no one-size-fits-all approach; prioritize whichever metric is most crucial (whether it is time, memory, or a combination of both).
This is especially important if there is a need to build something once and run it repeatedly with minimal changes. The building blocks determine how much time and memory (metrics that eventually translate into money and resources) will be saved across the entire practice.
Thanks for reading.