Author/s: Ramiro Checa-Garcia
Language: en | Translations:
computing
ocaml

Last week, there was a discussion on OCaml-discuss about the continuity of the Owl project, which is the most important scientific numerical computing library in OCaml.

This got me thinking more about Owl and motivated me to perform a quick performance evaluation comparing OCaml/Owl with Python’s NumPy and Julia. I’m not an expert in benchmarking, but I wanted to share my two cents on the speed of matrix operations in OCaml. Although Owl already has automated performance tests comparing itself with equivalent Python and Julia libraries, I am interested in a broader perspective—particularly when a programmer might need to combine optimized numerical methods with other non-optimized operations.

Multidimensional arrays benchmarks

One key aspect of scientific numerical libraries is how they handle multidimensional arrays and how fast typical operations on these arrays can be performed. Inspired by the blog post ocaml-bigarray-vs-array, I decided to benchmark matrix operations across OCaml, Python, and Julia. The code in both languages is quite straightforward.

The core of the comparison is:

  • Create a square matrix filled with random floating-point numbers uniformly distributed between 0 and 1.
  • Measure the time to sum all elements of the matrix.
  • Perform this summation using optimized library functions and then using explicit loops.
  • For the loop-based summations, test both possible loop orders (commonly referred to as C-style (row-major) and Fortran-style (column-major) layouts).

For reference, I used:

  • Python 3.11.2
  • Julia 1.9.0
  • OCaml 4.12.1 compiler

Python and NumPy

Code

import time 
import numpy as np

def np_sum(arr, size):
    a = time.time_ns()
    fsum = np.sum(arr)
    b = time.time_ns()
    print("  Method np-sum  : %f (time %fs)" %(fsum,(b-a)/1000000000.0))
    return

def loop_sum(arr,size):
    lsum=0.0
    a = time.time_ns()
    for i in range(0,size):
        for j in range(0,size):
            lsum = lsum + arr[i,j]
    b = time.time_ns()
    print("  Method loop    : %f (time %fs)" %(lsum,(b-a)/1000000000.0)) 
    return

def loopt_sum(arr,size):
    tsum=0.0
    a = time.time_ns()
    for i in range(0,size):
        for j in range(0,size):
            tsum = tsum + arr[j,i]
    b = time.time_ns()
    print("  Method loop-t  : %f (time %fs)" %(tsum,(b-a)/1000000000.0))
    return


for asize in [1234,12345,23456]:  
    print("\n ==== Size A[n,n] ====== %d \n" %(asize))
    arr2  = np.random.rand(asize, asize)
    np_sum(arr2,asize)
    loop_sum(arr2,asize)
    loopt_sum(arr2,asize)

Result (in my desktop computer)

> ==== Size A[n,n] ====== n=1234  ====
>  Method np-sum  : 761566.315611 (time 0.000391s)
>  Method loop    : 761566.315611 (time 0.154743s)
>  Method loop-t  : 761566.315611 (time 0.151696s)
>
> ==== Size A[n,n] ====== n=12345 ====
>  Method np-sum  : 76199182.226933 (time 0.038816s)
>  Method loop    : 76199182.226942 (time 14.834441s)
>  Method loop-t  : 76199182.226923 (time 14.894826s)
>
> ==== Size A[n,n] ====== n=23456 ====
>  Method np-sum  : 275090911.896293 (time 0.133710s)
>  Method loop    : 275090911.896455 (time 52.886030s)
>  Method loop-t  : 275090911.896464 (time 53.198568s)

Julia

Code

using Printf

function jl_sum(arr) 
    a = time()
    fsum = sum(arr)
    b = time()
    @printf "  Method jl-sum  : %f (time %.6fs)\n" fsum (b-a)
end

function loop(arr,alen)
    global lsum=0
    a = time()
    for i = 1:alen
       for j = 1:alen
          global lsum += arr[i,j]
       end
    end
    b = time()
    @printf "  Method loop    : %f (time %.6fs)\n" lsum (b-a)
end

function loopt(arr,alen)
    global tsum=0
    a = time()
    for i = 1:alen
       for j = 1:alen
          global tsum += arr[j,i]
       end
    end
    b = time()
    @printf "  Method loop-t  : %f (time %.6fs)\n" tsum (b-a)
end

for asize in [1234,12345,23456,34567] 
    @printf "\n ==== Size A[n,n] ====== %d \n" asize
    arr  = rand(asize, asize)
    jl_sum(arr)
    loop(arr,asize)
    loopt(arr,asize)
end

Result (in my desktop computer)

> ==== Size A[n,n] ====== n=1234  ====
>  Method jl-sum  : 760961.820085 (time 0.000152s)
>  Method loop    : 760961.820085 (time 0.034048s)
>  Method loop-t  : 760961.820085 (time 0.031716s)
>
> ==== Size A[n,n] ====== n=12345 ====
>  Method jl-sum  : 76203592.302483 (time 0.034457s)
>  Method loop    : 76203592.302516 (time 2.140357s)
>  Method loop-t  : 76203592.302498 (time 2.344866s)
>
> ==== Size A[n,n] ====== n=23456====
>  Method jl-sum  : 275096914.914670 (time 0.129691s)
>  Method loop    : 275096914.914723 (time 7.598340s)
>  Method loop-t  : 275096914.914827 (time 8.538193s)

Comments on Python and Julia

From the results with Julia and Python, we observe that NumPy’s np.sum and the equivalent method in Julia (jl-sum) produce similar results in terms of performance. However, when performing the same calculation using explicit loops, Julia is noticeably faster than Python.

It is a common sentiment in the Julia community to say: don’t be afraid to use loops and nested loops in Julia, as they tend to be fast. This is especially true for users coming from Python.

That said, when comparing the highly optimized sum functions (jl-sum and np-sum), both languages show a significant performance advantage over the loop-based versions.

OCaml

In OCaml following the reference ocaml-bigarray-vs-array I have tried more methods for different libraries of arrays.

Code

open Owl

let array_at m i j = Array.unsafe_get (Array.unsafe_get m i ) j

let sum_array_no_inline n m = let a = ref 0. in
    for i = 0 to (n-1) do
      for j = 0 to (n-1) do
        a := !a +. array_at m i j 
      done;
    done;
    !a

let sum_array n m = let a = ref 0. in
  for i = 0 to (n-1) do
    for j = 0 to (n-1) do
      a := !a +. (Array.unsafe_get (Array.unsafe_get m i ) j )
    done;
  done;
  !a

let sum_arrayt n m = let a = ref 0. in
  for i = 0 to (n-1) do
    for j = 0 to (n-1) do
      a := !a +. (Array.unsafe_get (Array.unsafe_get m j ) i )
    done;
  done;
  !a

let sum_big_array n m = let a = ref 0. in
  for i = 0 to (n-1) do
    for j = 0 to (n-1) do
      a := !a +. (Bigarray.Array2.unsafe_get m i j);
    done;
  done;
  !a

let sum_big_arrayt n m = let a = ref 0. in
  for i = 0 to (n-1) do
    for j = 0 to (n-1) do
      a := !a +. (Bigarray.Array2.unsafe_get m j i);
    done;
  done;
  !a

let sum_owl_array n m = let a = ref 0. in
  for i = 0 to (n-1) do
    for j = 0 to (n-1) do
      a := !a +. Mat.get m i j;
    done;
  done;
  !a

let sum_owl_array_lib n m = Mat.sum' m

let time f x =
  let start = Unix.gettimeofday () in 
  let res = f x in 
  let stop = Unix.gettimeofday () in 
  Printf.printf ":  %f (time %fs)\n%!" res (stop -. start);
  res

Results with ocamlc (bytecode)

> ==== Size A[n,n] ======= n=1234  ====
> [ArrayNoInline] :  761505.037418 (time 0.038757s)
> [Array        ] :  761505.037418 (time 0.029140s)
> [Arrayt       ] :  761505.037418 (time 0.029490s)
> [BigArray     ] :  761505.037421 (time 0.035729s)
> [BigArrayt    ] :  761505.037421 (time 0.036163s)
> [OwlArray     ] :  761505.037418 (time 0.062719s)
> [OwlArrayLib  ] :  761505.037418 (time 0.000493s)
>
> ==== Size A[n,n] ======= n=12345 ====
> [ArrayNoInline] :  76201442.335659 (time 3.410532s)
> [Array        ] :  76201442.335659 (time 2.576479s)
> [Arrayt       ] :  76201442.335612 (time 2.721879s)
> [BigArray     ] :  76201442.335401 (time 3.146339s)
> [BigArrayt    ] :  76201442.335364 (time 3.787680s)
> [OwlArray     ] :  76201442.335659 (time 5.330613s)
> [OwlArrayLib  ] :  76201442.335678 (time 0.031934s)
>
> ==== Size A[n,n] ======= n=23456 ====
> [ArrayNoInline] :  275097306.304532 (time 12.392087s)
> [Array        ] :  275097306.304532 (time 9.196658s)
> [Arrayt       ] :  275097306.304435 (time 10.553999s)
> [BigArray     ] :  275097306.304219 (time 11.405834s)
> [BigArrayt    ] :  275097306.304346 (time 10.811333s)
> [OwlArray     ] :  275097306.304532 (time 16.714575s)
> [OwlArrayLib  ] :  275097306.304379 (time 0.113167s)

Results with ocamlopt

> ==== Size A[n,n] ======= n=1234  ====
> [ArrayNoInline] :  761056.305873 (time 0.005656s)
> [Array        ] :  761056.305873 (time 0.001035s)
> [Arrayt       ] :  761056.305873 (time 0.001341s)
> [BigArray     ] :  761056.305869 (time 0.016464s)
> [BigArrayt    ] :  761056.305869 (time 0.015194s)
> [OwlArray     ] :  761056.305873 (time 0.009826s)
> [OwlArrayLib  ] :  761056.305873 (time 0.000259s)
>
> ==== Size A[n,n] ======= n=12345 ====
> [ArrayNoInline] :  76202418.923891 (time 0.407937s)
> [Array        ] :  76202418.923891 (time 0.101955s)
> [Arrayt       ] :  76202418.923849 (time 0.218512s)
> [BigArray     ] :  76202418.923890 (time 1.473963s)
> [BigArrayt    ] :  76202418.923885 (time 1.932328s)
> [OwlArray     ] :  76202418.923891 (time 0.980752s)
> [OwlArrayLib  ] :  76202418.923860 (time 0.032272s)
>
> ==== Size A[n,n] ======= n=23456 ====
> [ArrayNoInline] :  275097850.636376 (time 1.404471s)
> [Array        ] :  275097850.636376 (time 0.351583s)
> [Arrayt       ] :  275097850.636556 (time 1.019869s)
> [BigArray     ] :  275097850.637126 (time 5.146998s)
> [BigArrayt    ] :  275097850.637142 (time 7.019139s)
> [OwlArray     ] :  275097850.636376 (time 3.568812s)
> [OwlArrayLib  ] :  275097850.636496 (time 0.112816s)

Final comparison

When comparing OCaml/Owl with Python and Julia, we observe that:

  1. Owl’s specific methods are as fast as Julia’s or NumPy’s optimized methods, probably because all rely on similar underlying OpenBLAS libraries (or equivalents).
  2. OCaml loops are consistently faster than Python loops.
  3. OCaml loops are competitive with Julia loops—slightly faster when compiled with ocamlopt, and a bit slower with ocamlc.
  4. OCaml compiled with ocamlopt seems to be the fastest option overall.

IMPORTANT: For Julia, testing with and without the @inbounds macro showed no clear performance impact, only a slight improvement for the largest matrix sizes.
For OCaml, replacing Array.unsafe_get with Array.get (similar to Julia’s bounds-checked access) also showed only a very slight performance impact, even for large matrices, both with ocamlc and ocamlopt.
No special optimizations were applied in any language (e.g., no flambda in OCaml).

Discussion

We could conclude that the choice of language may not matter much since, with proper libraries, performance is similar across the three. However, in real-world scenarios, many programmers mix library-optimized methods with language-native functions and loops that are not part of these optimized bindings.

In Python, many everyday users (though maybe not professional developers) still rely heavily on loops, which can be a bottleneck. This has been one motivation to transition to Julia, where loops generally perform better.

From this perspective, using OCaml—where almost all methods provide good performance compared to Python—seems to be a preferred option. Julia is also reasonable, offering a pretty large standard library for numerical methods.

On the other hand, at least for Julia 1.9.0, there remains a challenge often called time-to-first-plot—the Just-In-Time compiler incurs overhead that can make simple scripts slower to start compared to Python. Also, the JIT compiler does not provide significant speed advantages beyond what OCaml’s ahead-of-time compiler already offers. Furthermore, neither Julia nor Python (to my knowledge) provide straightforward ways to compile code into standalone binaries.

Personally, I am happy with my choice to learn OCaml and Owl for scientific numerical computing (but looking forward also to raven-ml).