I am constantly told that Perl has much better regex performance than Python. When I ask people how they know, they answer with “everybody knows that” or “because it’s native”, or I am shown some obscure benchmarks which seem to test anything but regex performance (hardcoded regex vs. interpolated, etc.). I wanted to know, and I wanted to fiddle around with performance analysis since I have been dealing with Big-O lately. So, not to put an end to the discussion but rather as a basis for discussions with colleagues and friends, here is what I did:
1. I took a large text (Moby Dick at archive.org)
2. I wrote two very small programs, one in Perl and one in Python
3. I read in the whole file and measured the time (to be able to see whether one program takes longer to read or not)
4. I ran the code with regex
5. I changed the regex and ran it again
6. I measured with the Unix “time” command
I am, however, not interested in absolute performance (which is machine-dependent) but in relative performance.
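A complementary way to get relative numbers is to time only the regex call inside the process, so that interpreter start-up and file reading stay out of the figure. The following is a minimal sketch of that idea, not what was actually run for this post. The sample text and the helper name `best_time` are made up for illustration, and the code is written to run on both Python 2.7 and Python 3.

```python
import re
import timeit

# Hypothetical stand-in for the novel so the sketch runs without mobydick.txt.
# For a real measurement, replace `text` with open('mobydick.txt').read().
SAMPLE_LINES = [
    "Call me Ishmael.",
    "the Pequod. Devil-Dam, I do not know the origin of ;",
    "SLOWLY wading through the meadows of brit, the Pequod",
    "It was a clear steel-blue day.",
]
text = "\n".join(SAMPLE_LINES * 500) + "\n"


def best_time(pattern, data, repeat=5):
    """Return the fastest of `repeat` single runs of findall, in seconds."""
    compiled = re.compile(pattern)
    timer = timeit.Timer(lambda: compiled.findall(data))
    return min(timer.repeat(repeat=repeat, number=1))


simple = best_time(r"(Pequod)", text)
greedy = best_time(r"(.*Pequod.*)\s", text)

print("simple: %.6f s" % simple)
print("greedy: %.6f s" % greedy)
if simple > 0:
    # Only the ratio is interesting; absolute numbers are machine-dependent.
    print("ratio greedy/simple: %.1f" % (greedy / simple))
```

Taking the minimum of several runs is a common way to reduce noise from other processes. It does not replace the whole-program measurement with “time”, it only isolates the regex part.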
The versions were
perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level
and
Python 2.7.6 (default, Jan 17 2014, 15:43:59) [GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
The first two scripts were these:
# py_regex.py
import re

count = 0
with open('mobydick.txt', 'r') as f:
    data = f.read()
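
#!/usr/sbin/perl -w
# pl_regex.pl
use utf8;
use strict;
use warnings;
my $string;
open FILE, "<", "mobydick.txt" or die "Cannot open mobydick.txt: $!";
# Read all lines and join them into one string
$string = join("", <FILE>);
close FILE;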
Ran them both and got:
python py_regex.py 0,02s user 0,02s system 53% cpu 0,069 total
perl pl_regex.pl 0,01s user 0,02s system 70% cpu 0,047 total
Pretty close. So, I don’t have to concern myself with reading speed in the next measurements.
Then I changed the code to include some regexes. I just counted how many times the word “Pequod” was used.
# py_regex.py (Python 2)
import re

count = 0
with open('mobydick.txt', 'r') as f:
    data = f.read()
m = re.findall('(Pequod)', data)
for find in m:
    print find
    count += 1
print "%d" % count
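
#!/usr/sbin/perl -w
# pl_regex.pl
use utf8;
use strict;
use warnings;
my $count = 0;
my $string;
open FILE, "<", "mobydick.txt" or die "Cannot open mobydick.txt: $!";
# Read all lines and join them into one string
$string = join("", <FILE>);
close FILE;
# Collect every match of the capture group
my @m = $string =~ /(Pequod)/g;
foreach (@m) {
    print "$_\n";
    $count++;
}
print $count."\n";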
Ran them again and got:
Pequod
[...]
Pequod
66
python py_regex.py 0,02s user 0,01s system 89% cpu 0,033 total
And
Pequod
[...]
Pequod
66
perl pl_regex.pl 0,01s user 0,01s system 89% cpu 0,021 total
Okay, that was a little surprising, since in the discussions I had before, “outperforms” was a term used quite often.
Maybe the regex was simply not complex enough or something…
So I changed the regex and kept everything else:
m = re.findall('(.*Pequod.*)\s', data);
my @m = $string =~ /(.*Pequod.*)\s/g;
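To make clear what this changed pattern captures, here is a small sketch on a hypothetical four-line sample (not the real file). Because `.` does not match a newline by default, each match stays within one line, so the pattern effectively returns every whole line that contains “Pequod”. The comparison with a plain line filter assumes that every line ends in `\n`; with other line endings or a missing final newline the two can differ.

```python
import re

# Hypothetical sample text; every line ends in "\n", as in a typical text file.
text = (
    "Call me Ishmael.\n"
    "the Pequod. Devil-Dam, I do not know the origin of ;\n"
    "It was a clear steel-blue day.\n"
    "SLOWLY wading through the meadows of brit, the Pequod\n"
)

# Greedy '.*' on both sides expands to the full line; '\s' then eats the newline.
regex_hits = re.findall(r"(.*Pequod.*)\s", text)

# Plain-string counterpart for this input: keep the lines that mention the word.
line_hits = [line for line in text.splitlines() if "Pequod" in line]

print(regex_hits == line_hits)
for hit in regex_hits:
    print(hit)
```

On this sample both approaches return the same two lines, which is also why the count stays the same as long as no line mentions the ship twice.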
And run it again
the Pequod. Devil-Dam, I do not know the origin of ;
[...]
SLOWLY wading through the meadows of brit, the Pequod
66
python py_regex.py 0,07s user 0,01s system 95% cpu 0,082 total
Not too bad an increase.
the Pequod. Devil-Dam, I do not know the origin of ;
[...]
SLOWLY wading through the meadows of brit, the Pequod
66
perl pl_regex.pl 18,16s user 0,09s system 99% cpu 18,347 total
GOODNESS ME!!
I still don’t know what happened, but I will ask around…
You should use a newer Perl interpreter, I think.
I can reproduce the phenomenon with Perl 5.18, but starting with Perl 5.20, the execution is _much_ faster.
You may investigate further with
use re 'Debug';
See re(3) for details.
Regards
fany
Thanks for pointing that out, fany. I still think that generalizations like “is faster/better/…” about any language are problematic, so testing things seems to be the only way.
Cheers,
Caspar