Perl vs Python – Regex

tl;dr

Perl does not outperform Python when it comes to regexes. When the term to match is preceded by “.*”, Perl’s speed drops significantly.

I am constantly told that Perl has much better regex performance than Python. When I ask people how they know, they answer with “everybody knows that” or “because it’s native”, or I am shown some obscure benchmarks which seem to test anything but regex performance (hardcoded regex vs. interpolated, etc.). I wanted to know, and I wanted to fiddle around with performance analysis since I have been dealing with Big-O lately. So, not to put an end to the discussion but rather as a basis for discussions with colleagues and friends, here is what I did:

1. I took a large text (Moby Dick at archive.org)

2. I wrote very small programs in Perl and Python

3. I read in the whole file and measured the time (to be able to see whether one program takes longer to read the file than the other)

4. I ran the code with regex

5. I changed the regex and ran them again

6. I measured with the time command
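The steps above can also be sketched inside a single program, which keeps interpreter start-up out of the numbers. This is only a minimal sketch in Python 3, not the setup used for the figures in this post: the helper names best_time and count_matches and the stand-in sample text are assumptions made for illustration.

```python
import re
import time


def best_time(func, repeat=5):
    """Call func() `repeat` times; return (best wall-clock seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func()
        best = min(best, time.perf_counter() - start)
    return best, result


def count_matches(pattern, text):
    """Count the non-overlapping matches of `pattern` in `text`."""
    return len(re.findall(pattern, text))


if __name__ == "__main__":
    # Hypothetical stand-in for mobydick.txt so the sketch runs anywhere.
    sample = "Call me Ishmael.\nThe Pequod sailed.\n" * 1000
    # Baseline without a regex, mirroring the read-only first run.
    baseline, _ = best_time(lambda: len(sample))
    elapsed, hits = best_time(lambda: count_matches(r"Pequod", sample))
    print("baseline %.6fs, regex %.6fs, matches %d" % (baseline, elapsed, hits))
```

Taking the best of several runs reduces noise from other processes; a single run under time, as used here, includes start-up and output as well.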

I am, however, not interested in absolute performance (which is machine dependent) but in relative performance.
The versions were
perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level
and
Python 2.7.6 (default, Jan 17 2014, 15:43:59) [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin

The first two scripts were these

import re

count = 0
with open('mobydick.txt', 'r') as f:
    data = f.read()


#!/usr/sbin/perl -w
use utf8;
use strict;
use warnings;

my $string;

open FILE, "<", "mobydick.txt";
$string = join("", <FILE>);
close FILE;

Ran them both and got

python py_regex.py 0,02s user 0,02s system 53% cpu 0,069 total

perl pl_regex.pl 0,01s user 0,02s system 70% cpu 0,047 total

Pretty close. So, I don’t have to concern myself with reading speed in the next measurements.

Then I changed the code to include some regexes. I just counted how many times the word “Pequod” was used.

import re

count = 0
with open('mobydick.txt', 'r') as f:
    data = f.read()

m = re.findall('(Pequod)', data)

# Python 2 print statements, matching the 2.7.6 interpreter used here
for find in m:
    print find
    count += 1

print "%d" % count


#!/usr/sbin/perl -w
use utf8;
use strict;
use warnings;

my $count = 0;
my $string;

open FILE, "<", "mobydick.txt";
$string = join("", <FILE>);
close FILE;

my @m = $string =~ /(Pequod)/g;

foreach (@m) {
    print "$_\n";
    $count++;
}

print $count."\n";

Ran them again and got:

Pequod
[...]
Pequod
66
python py_regex.py 0,02s user 0,01s system 89% cpu 0,033 total

And

Pequod
[...]
Pequod
66
perl pl_regex.pl 0,01s user 0,01s system 89% cpu 0,021 total

Okay, that was a little surprising, since in the discussions I had before, “outperforms” was a term used quite often.
Maybe it was just that the regex was simply not complex enough or something…

So I changed the regex and kept everything else the same (Python first, then Perl).

m = re.findall('(.*Pequod.*)\s', data);

my @m = $string =~ /(.*Pequod.*)\s/g;

And ran them again

the Pequod. Devil-Dam, I do not know the origin of ;
[...]
SLOWLY wading through the meadows of brit, the Pequod
66
python py_regex.py 0,07s user 0,01s system 95% cpu 0,082 total

Not too bad an increase.

the Pequod. Devil-Dam, I do not know the origin of ;
[...]
SLOWLY wading through the meadows of brit, the Pequod
66
perl pl_regex.pl 18,16s user 0,09s system 99% cpu 18,347 total

GOODNESS ME!!
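Since only relative numbers matter here, the slowdown per language can be worked out from the wall-clock totals reported above. A small sketch in Python 3; the keys "simple" and "greedy" are just labels invented for the two patterns, and the figures come from single runs, so they are indicative at best.

```python
def parse_seconds(text):
    """Convert a comma-decimal figure such as '18,347' to a float."""
    return float(text.replace(",", "."))


# Wall-clock totals copied from the runs reported in this post.
totals = {
    "python": {"simple": "0,033", "greedy": "0,082"},
    "perl": {"simple": "0,021", "greedy": "18,347"},
}


def slowdown(lang):
    """Ratio of the '.*' run to the plain 'Pequod' run for one language."""
    t = totals[lang]
    return parse_seconds(t["greedy"]) / parse_seconds(t["simple"])


for lang in ("python", "perl"):
    print("%s: x%.1f" % (lang, slowdown(lang)))
```

That works out to a factor of roughly 2.5 for Python and roughly 870 for Perl.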

This drop in speed seems to occur in Perl when the term to match is preceded by “.*”. This might be connected to the lack of variable-length look-behind, but that is just me speculating.
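To make concrete what the second pattern asks of the engine, here is a minimal sketch in Python 3 on a made-up four-line sample (the sample text is invented, not taken from the book). Because “.” does not match a newline by default, “.*Pequod.*” can only ever span a single line, so each match is the whole line containing the word. On lines without the word, a backtracking engine that applies no shortcut retries from each starting position, running “.*” to the end of the line and backing off again. Whether that explains the Perl timing above is not something this sketch can show.

```python
import re

# Invented sample: two lines mention the ship, two do not.
sample = (
    "Call me Ishmael.\n"
    "The Pequod sailed at dawn.\n"
    "Nothing to see on this line.\n"
    "Aboard the Pequod again.\n"
)

# First pattern: one match per occurrence of the word.
simple = re.findall(r"(Pequod)", sample)

# Second pattern: the greedy ".*" on both sides captures the whole line,
# and the trailing "\s" consumes the newline that ends it.
greedy = re.findall(r"(.*Pequod.*)\s", sample)

print(simple)
print(greedy)
```

The second pattern therefore does the job of a line filter, which is why its output above consists of full lines from the book.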

Nonetheless, I wouldn’t choose Perl for applications dealing with text, as I could never be sure that I would not end up with a regex that causes performance issues in the system.

 
