Last week, there was a discussion on OCaml-discuss about the continuity of the Owl project, which is the most important scientific numerical computing library in OCaml.
This got me thinking more about Owl and motivated me to perform a quick performance evaluation comparing OCaml/Owl with Python’s NumPy and Julia. I’m not an expert in benchmarking, but I wanted to share my two cents on the speed of matrix operations in OCaml. Although Owl already has automated performance tests comparing itself with equivalent Python and Julia libraries, I am interested in a broader perspective—particularly when a programmer might need to combine optimized numerical methods with other non-optimized operations.
Multidimensional arrays benchmarks
One key aspect of scientific numerical libraries is how they handle multidimensional arrays and how fast typical operations on these arrays can be performed. Inspired by the blog post ocaml-bigarray-vs-array, I decided to benchmark matrix operations across OCaml, Python, and Julia. The code in both languages is quite straightforward.
The core of the comparison is:
- Create a square matrix filled with random floating-point numbers uniformly distributed between 0 and 1.
- Measure the time to sum all elements of the matrix.
- Perform this summation using optimized library functions and then using explicit loops.
- For the loop-based summations, test both possible loop orders (commonly referred to as C-style (row-major) and Fortran-style (column-major) layouts).
For reference, I used:
- Python 3.11.2
- Julia 1.9.0
- OCaml 4.12.1 compiler
Python and NumPy
Code
import time
import numpy as np
def np_sum(arr, size):
a = time.time_ns()
fsum = np.sum(arr)
b = time.time_ns()
print(" Method np-sum : %f (time %fs)" %(fsum,(b-a)/1000000000.0))
return
def loop_sum(arr,size):
lsum=0.0
a = time.time_ns()
for i in range(0,size):
for j in range(0,size):
lsum = lsum + arr[i,j]
b = time.time_ns()
print(" Method loop : %f (time %fs)" %(lsum,(b-a)/1000000000.0))
return
def loopt_sum(arr,size):
tsum=0.0
a = time.time_ns()
for i in range(0,size):
for j in range(0,size):
tsum = tsum + arr[j,i]
b = time.time_ns()
print(" Method loop-t : %f (time %fs)" %(tsum,(b-a)/1000000000.0))
return
for asize in [1234,12345,23456]:
print("\n ==== Size A[n,n] ====== %d \n" %(asize))
arr2 = np.random.rand(asize, asize)
np_sum(arr2,asize)
loop_sum(arr2,asize)
loopt_sum(arr2,asize)
Result (in my desktop computer)
> ==== Size A[n,n] ====== n=1234 ====
> Method np-sum : 761566.315611 (time 0.000391s)
> Method loop : 761566.315611 (time 0.154743s)
> Method loop-t : 761566.315611 (time 0.151696s)
>
> ==== Size A[n,n] ====== n=12345 ====
> Method np-sum : 76199182.226933 (time 0.038816s)
> Method loop : 76199182.226942 (time 14.834441s)
> Method loop-t : 76199182.226923 (time 14.894826s)
>
> ==== Size A[n,n] ====== n=23456 ====
> Method np-sum : 275090911.896293 (time 0.133710s)
> Method loop : 275090911.896455 (time 52.886030s)
> Method loop-t : 275090911.896464 (time 53.198568s)
Julia
Code
using Printf
function jl_sum(arr)
a = time()
fsum = sum(arr)
b = time()
@printf " Method jl-sum : %f (time %.6fs)\n" fsum (b-a)
end
function loop(arr,alen)
global lsum=0
a = time()
for i = 1:alen
for j = 1:alen
global lsum += arr[i,j]
end
end
b = time()
@printf " Method loop : %f (time %.6fs)\n" lsum (b-a)
end
function loopt(arr,alen)
global tsum=0
a = time()
for i = 1:alen
for j = 1:alen
global tsum += arr[j,i]
end
end
b = time()
@printf " Method loop-t : %f (time %.6fs)\n" tsum (b-a)
end
for asize in [1234,12345,23456,34567]
@printf "\n ==== Size A[n,n] ====== %d \n" asize
arr = rand(asize, asize)
jl_sum(arr)
loop(arr,asize)
loopt(arr,asize)
end
Result (in my desktop computer)
> ==== Size A[n,n] ====== n=1234 ====
> Method jl-sum : 760961.820085 (time 0.000152s)
> Method loop : 760961.820085 (time 0.034048s)
> Method loop-t : 760961.820085 (time 0.031716s)
>
> ==== Size A[n,n] ====== n=12345 ====
> Method jl-sum : 76203592.302483 (time 0.034457s)
> Method loop : 76203592.302516 (time 2.140357s)
> Method loop-t : 76203592.302498 (time 2.344866s)
>
> ==== Size A[n,n] ====== n=23456====
> Method jl-sum : 275096914.914670 (time 0.129691s)
> Method loop : 275096914.914723 (time 7.598340s)
> Method loop-t : 275096914.914827 (time 8.538193s)
Comments on Python and Julia
From the results with Julia and Python, we observe that NumPy’s np.sum
and the equivalent method in Julia (jl-sum
) produce similar results in terms of performance. However, when performing the same calculation using explicit loops, Julia is noticeably faster than Python.
It is a common sentiment in the Julia community to say: don’t be afraid to use loops and nested loops in Julia, as they tend to be fast. This is especially true for users coming from Python.
That said, when comparing the highly optimized sum functions (jl-sum
and np-sum
), both languages show a significant performance advantage over the loop-based versions.
OCaml
In OCaml following the reference ocaml-bigarray-vs-array I have tried more methods for different libraries of arrays.
Code
open Owl
let array_at m i j = Array.unsafe_get (Array.unsafe_get m i ) j
let sum_array_no_inline n m = let a = ref 0. in
for i = 0 to (n-1) do
for j = 0 to (n-1) do
a := !a +. array_at m i j
done;
done;
!a
let sum_array n m = let a = ref 0. in
for i = 0 to (n-1) do
for j = 0 to (n-1) do
a := !a +. (Array.unsafe_get (Array.unsafe_get m i ) j )
done;
done;
!a
let sum_arrayt n m = let a = ref 0. in
for i = 0 to (n-1) do
for j = 0 to (n-1) do
a := !a +. (Array.unsafe_get (Array.unsafe_get m j ) i )
done;
done;
!a
let sum_big_array n m = let a = ref 0. in
for i = 0 to (n-1) do
for j = 0 to (n-1) do
a := !a +. (Bigarray.Array2.unsafe_get m i j);
done;
done;
!a
let sum_big_arrayt n m = let a = ref 0. in
for i = 0 to (n-1) do
for j = 0 to (n-1) do
a := !a +. (Bigarray.Array2.unsafe_get m j i);
done;
done;
!a
let sum_owl_array n m = let a = ref 0. in
for i = 0 to (n-1) do
for j = 0 to (n-1) do
a := !a +. Mat.get m i j;
done;
done;
!a
let sum_owl_array_lib n m = Mat.sum' m
let time f x =
let start = Unix.gettimeofday () in
let res = f x in
let stop = Unix.gettimeofday () in
Printf.printf ": %f (time %fs)\n%!" res (stop -. start);
res
Results with ocamlc (bytecode)
> ==== Size A[n,n] ======= n=1234 ====
> [ArrayNoInline] : 761505.037418 (time 0.038757s)
> [Array ] : 761505.037418 (time 0.029140s)
> [Arrayt ] : 761505.037418 (time 0.029490s)
> [BigArray ] : 761505.037421 (time 0.035729s)
> [BigArrayt ] : 761505.037421 (time 0.036163s)
> [OwlArray ] : 761505.037418 (time 0.062719s)
> [OwlArrayLib ] : 761505.037418 (time 0.000493s)
>
> ==== Size A[n,n] ======= n=12345 ====
> [ArrayNoInline] : 76201442.335659 (time 3.410532s)
> [Array ] : 76201442.335659 (time 2.576479s)
> [Arrayt ] : 76201442.335612 (time 2.721879s)
> [BigArray ] : 76201442.335401 (time 3.146339s)
> [BigArrayt ] : 76201442.335364 (time 3.787680s)
> [OwlArray ] : 76201442.335659 (time 5.330613s)
> [OwlArrayLib ] : 76201442.335678 (time 0.031934s)
>
> ==== Size A[n,n] ======= n=23456 ====
> [ArrayNoInline] : 275097306.304532 (time 12.392087s)
> [Array ] : 275097306.304532 (time 9.196658s)
> [Arrayt ] : 275097306.304435 (time 10.553999s)
> [BigArray ] : 275097306.304219 (time 11.405834s)
> [BigArrayt ] : 275097306.304346 (time 10.811333s)
> [OwlArray ] : 275097306.304532 (time 16.714575s)
> [OwlArrayLib ] : 275097306.304379 (time 0.113167s)
Results with ocamlopt
> ==== Size A[n,n] ======= n=1234 ====
> [ArrayNoInline] : 761056.305873 (time 0.005656s)
> [Array ] : 761056.305873 (time 0.001035s)
> [Arrayt ] : 761056.305873 (time 0.001341s)
> [BigArray ] : 761056.305869 (time 0.016464s)
> [BigArrayt ] : 761056.305869 (time 0.015194s)
> [OwlArray ] : 761056.305873 (time 0.009826s)
> [OwlArrayLib ] : 761056.305873 (time 0.000259s)
>
> ==== Size A[n,n] ======= n=12345 ====
> [ArrayNoInline] : 76202418.923891 (time 0.407937s)
> [Array ] : 76202418.923891 (time 0.101955s)
> [Arrayt ] : 76202418.923849 (time 0.218512s)
> [BigArray ] : 76202418.923890 (time 1.473963s)
> [BigArrayt ] : 76202418.923885 (time 1.932328s)
> [OwlArray ] : 76202418.923891 (time 0.980752s)
> [OwlArrayLib ] : 76202418.923860 (time 0.032272s)
>
> ==== Size A[n,n] ======= n=23456 ====
> [ArrayNoInline] : 275097850.636376 (time 1.404471s)
> [Array ] : 275097850.636376 (time 0.351583s)
> [Arrayt ] : 275097850.636556 (time 1.019869s)
> [BigArray ] : 275097850.637126 (time 5.146998s)
> [BigArrayt ] : 275097850.637142 (time 7.019139s)
> [OwlArray ] : 275097850.636376 (time 3.568812s)
> [OwlArrayLib ] : 275097850.636496 (time 0.112816s)
Final comparison
When comparing OCaml/Owl with Python and Julia, we observe that:
- Owl’s specific methods are as fast as Julia’s or NumPy’s optimized methods, probably because all rely on similar underlying OpenBLAS libraries (or equivalents).
- OCaml loops are consistently faster than Python loops.
- OCaml loops are competitive with Julia loops—slightly faster when compiled with
ocamlopt
, and a bit slower withocamlc
. - OCaml compiled with
ocamlopt
seems to be the fastest option overall.
IMPORTANT: For Julia, testing with and without the
@inbounds
macro showed no clear performance impact, only a slight improvement for the largest matrix sizes.
For OCaml, replacingArray.unsafe_get
withArray.get
(similar to Julia’s bounds-checked access) also showed only a very slight performance impact, even for large matrices, both withocamlc
andocamlopt
.
No special optimizations were applied in any language (e.g., noflambda
in OCaml).
Discussion
We could conclude that the choice of language may not matter much since, with proper libraries, performance is similar across the three. However, in real-world scenarios, many programmers mix library-optimized methods with language-native functions and loops that are not part of these optimized bindings.
In Python, many everyday users (though maybe not professional developers) still rely heavily on loops, which can be a bottleneck. This has been one motivation to transition to Julia, where loops generally perform better.
From this perspective, using OCaml—where almost all methods provide good performance compared to Python—seems to be a preferred option. Julia is also reasonable, offering a pretty large standard library for numerical methods.
On the other hand, at least for Julia 1.9.0, there remains a challenge often called time-to-first-plot—the Just-In-Time compiler incurs overhead that can make simple scripts slower to start compared to Python. Also, the JIT compiler does not provide significant speed advantages beyond what OCaml’s ahead-of-time compiler already offers. Furthermore, neither Julia nor Python (to my knowledge) provide straightforward ways to compile code into standalone binaries.
Personally, I am happy with my choice to learn OCaml and Owl for scientific numerical computing (but looking forward also to raven-ml).