This is the script for my talk at YAPC::EU::2002. It's not a transcript of what I actually said; rather it collects together the notes that were in the pod source to the slides, the notes scribbled on various bits of paper, the notes that were only in my head, and tries to make a coherent text. I've also tried to add in the useful feedback I got - sometimes I can even remember who said it and so give them credit.
The slides are here, and hopefully it will be obvious where the slide changes are.
So you have a perl script. And it's too slow. And you want to do something about it. This is a talk about what you can do to speed it up, and also how you try to avoid the problem in the first place.
But these may not be practical or politically acceptable solutions.
So you can compromise.
Here's my perl script making a call to a perl function rot32. And here's a
C function rot32 that takes 2 integers, rotates the first by the second,
and returns an integer result. That's all you need! And you run it and it
works.
#!/usr/local/bin/perl -w
use strict;
printf "$_:\t%08X\t%08X\n", rot32 (0xdead, $_), rot32 (0xbeef, -$_)
foreach (0..31);
use Inline C => <<'EOC';
unsigned rot32 (unsigned val, int by) {
if (by >= 0)
return (val >> by) | (val << (32 - by));
return (val << -by) | (val >> (32 + by));
}
EOC
__END__
0: 0000DEAD 0000BEEF
1: 80006F56 00017DDE
2: 400037AB 0002FBBC
3: A0001BD5 0005F778
4: D0000DEA 000BEEF0
...Are you using the best features of the language?
I trust you're not doing that. But are you keeping your arrays nicely sorted so that you can do a binary search? That's fast. But using a hash should be faster.
\G anchor and the /gc flags may
still be faster.
if ( /\G.../gc ) {
...
} elsif ( /\G.../gc ) {
...
} elsif ( /\G.../gc ) {pack and unpackundefAre you calculating something only to throw it away?
For example the script in the Encode module that compiles character conversion tables would print out a warning if it saw the same character twice. If you or I build perl we'll just let those build warnings scroll off the screen - we don't care - we can't do anything about it. And it turned out that keeping track of everything needed to generate those warnings was slowing things down considerably. So I added a flag to disable that code, and perl 5.8 defaults to use it, so it builds more quickly.
Various helpful hecklers (most of London.pm who saw the talk (and I'm counting David Adler as part of London.pm as he's subscribed to the list)) wanted me to remind people that you really really don't want to be optimising unless you absolutely have to. You're making your code harder to maintain, harder to extend, and easier to introduce new bugs into. Probably you've done something wrong to get to the point where you need to optimise in the first place.
I agree.
Also, I'm not going to change the running order of the slides. There isn't a good order to try to describe things in, and some of the ideas that follow are actually more "good practice" than optimisation techniques, so possibly ought to come before the slides on finding slowness. I'll mark what I think are good habits to get into, and once you understand the techniques then I'd hope that you'd use them automatically when you first write code. That way (hopefully) your code will never be so slow that you actually want to do some of the brute force optimising I describe here.
[At this point the audience is supposed to laugh nervously, because I'm betting that very few people are in this desirable situation of having comprehensive tests written]
You can never have enough memory, and it's never fast enough.
Computer memory is like a pyramid. At the point you have the CPU and its registers, which are very small and very fast to access. Then you have 1 or more levels of cache, which is larger, close by and fast to access. Then you have main memory, which is quite large, but further away so slower to access. Then at the base you have disk acting as virtual memory, which is huge, but very slow.
Now, if your program is swapping out to disk, you'll realise, because the OS can tell you that it only took 10 seconds of CPU, but 60 seconds elapsed, so you know it spent 50 seconds waiting for disk and that's your speed problem. But if your data is big enough to fit in main RAM, but doesn't all sit in the cache, then the CPU will keep having to wait for data from main RAM. And the OS timers I described count that in the CPU time, so it may not be obvious that memory use is actually your problem.
This is the original code for the part of the Encode compiler (enc2xs) that
generates the warnings on duplicate characters:
if (exists $seen{$uch}) {
warn sprintf("U%04X is %02X%02X and %02X%02X\n",
$val,$page,$ch,@{$seen{$uch}});
}
else {
$seen{$uch} = [$page,$ch];
}
It uses the hash %seen to remember all the Unicode characters that it has
processed. The first time that it meets a character it won't be in the hash,
the exists is false, so the else block executes. It stores an arrayref
containing the code page and character number in that page. That's three
things per character, and there are a lot of characters in Chinese.
If it ever sees the same Unicode character again, it prints a warning
message. The warning message is just a string, and this is the only place
that uses the data in %seen. So I changed the code - I pre-formatted that
bit of the error message, and stored a single scalar rather than the three:
if (exists $seen{$uch}) {
warn sprintf("U%04X is %02X%02X and %04X\n",
$val,$page,$ch,$seen{$uch});
}
else {
$seen{$uch} = $page << 8 | $ch;
}
That reduced the memory usage by a third, and it runs more quickly.
How do you make things faster? Well, this is something of a black art, down to trial and error. I'll expand on aspects of these 4 points in the next slides.
By having commented out slower code near the faster code you can look back and get ideas for other places you might optimise in the same way.
These are things that I would consider good practice, so you ought to be doing them as a matter of routine.
AutoSplit and AutoLoaderOne potential problem is that the way AutoLoader brings in subroutines makes
debugging confusing, which can be a problem. While developing, you can disable
AutoLoader by commenting out the __END__ statement marking the start of
your AutoLoaded subroutines. That way, they are loaded, compiled and debugged
in the normal fashion.
... 1; # While debugging, disable AutoLoader like this: # __END__ ...
Of course, to do this you'll need another 1; at the end of the AutoLoaded
section to keep use happy, and possibly another __END__.
Schwern notes that commenting out __END__ can cause surprises if the main
body of your module is running under use strict; because now your
AutoLoaded subroutines will suddenly find themselves being run under use
strict. This is arguably a bug in the current AutoSplit - when it runs at
install time to generate the files for AutoLoader to use it doesn't add
lines such as use strict; or use warnings; to ensure that the split out
subroutines are in the same environment as was current at the __END__
statement. This may be fixed in 5.10.
Elizabeth Mattijsen notes that there are different memory use versus memory
shared issues when running under mod_perl, with different optimal solutions
depending on whether your apache is forking or threaded.
=pod @ __END__#!perl -w use strict;
=head1 You don't want to do that
big block of pod
=cut
... 1; __END__
=head1 You want to do this
If you put your pod after an __END__ statement then the perl parser will
never even see it. This will save a small amount of CPU, but if you have a
lot of pod (>4K) then it might also mean that the last disk block(s) of a
file are never even read in to RAM. This may gain you some speed. [A helpful
heckler observed that modern raid systems may well be reading in 64K chunks,
and modern OSes are getting good at read ahead, so not reading a block as
a result of =pod @ __END__ may actually be quite rare.]
If you are putting your pod (and tests) next to their functions' code (which is probably a better approach anyway) then this advice is not relevant to you.
Exporter is written in perl. It's fast, but not instant.
Most modules are able to export lots of their functions and other symbols into your namespace to save you typing. If you have only one argument to use, such as
use POSIX; # Exports all the defaults
then POSIX will helpfully export its default list of symbols into your
namespace. If you have a list after the module name, then that is taken as a
list of symbols to export. If the list is empty, no symbols are exported:
use POSIX (); # Exports nothing.
You can still use all the functions and other symbols - you just have to
use their full name, by typing POSIX:: at the front. Some people argue
that this actually makes your code clearer, as it is now obvious where each
subroutine is defined. Independent of that, it's faster:
| use POSIX; | use POSIX (); |
| 0.516s | 0.355s |
| use Socket; | use Socket (); |
| 0.270s | 0.231s |
POSIX exports a lot of symbols by default. If you tell it to export none,
it starts in 30% less time. Socket starts in 15% less time.
$& variable returns the last text successfully matched in any regular
expression. It's not lexically scoped, so unlike the match variables $1 etc
it isn't reset when you leave a block. This means that to be correct perl has
to keep track of it from any match, as perl has no idea when it might be
needed. As it involves taking a copy of the matched string, it's expensive for
perl to keep track of. If you never mention $&, then perl knows it can
cheat and never store it. But if you (or any module) mentions $&
anywhere then perl has to keep track of it throughout the script, which slows
things down. So it's a good idea to capture the whole match explicitly if
that's what you need.
$text =~ /.* rules/;
$line = $&; # Now every match will copy $& - slow
$text =~ /(.* rules)/;
$line = $1; # Didn't mention $& - fastuse English gives helpful long names to all the punctuation variables.
Unfortunately that includes aliasing $& to $MATCH which makes perl think
that it needs to copy every match into $&, even if you script never
actually uses it. In perl 5.8 you can say use English '-no_match_vars'; to
avoid mentioning the naughty "word", but this isn't available in earlier
versions of perl.$1 etc, so it all you need
is grouping use a the non-capturing (?:...) instead of the capturing
(...)./o flag to tell perl, and it will never waste time checking or
recompiling it.qr// operator to pre-compile your regexps. It often is the
easiest way to write regexp components to build up more complex regexps.
Using it to build your regexps once is a good idea. But don't screw up (like
parrot's assemble.pl did) by telling perl to recompile the same regexp
every time you enter a subroutine:
sub foo {
my $reg1 = qr/.../;
my $reg2 = qr/... $reg1 .../;
You should pull those two regexp definitions out of the subroutine into package variables, or file scoped lexicals.
You find what is slow by using a profiler. People often guess where they think their program is slow, and get it hopelessly wrong. Use a profiler.
Devel::DProf is in the perl core from version 5.6. If you're using an
earlier perl you can get it from CPAN.
You run your program with -d:DProf
perl5.8.0 -d:DProf enc2xs.orig -Q -O -o /dev/null ...
which times things and stores the data in a file named tmon.out. Then you
run dprofpp to process the tmon.out file, and produce meaningful
summary information. This excerpt is the default length and format, but you
can use options to change things - see the man page. It also seems to show up
a minor bug in dprofpp, because it manages to total things up to get 106%.
While that's not right, it doesn't affect the explanation.
Total Elapsed Time = 66.85123 Seconds
User+System Time = 62.35543 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
106. 66.70 102.59 218881 0.0003 0.0005 main::enter
49.5 30.86 91.767 6 5.1443 15.294 main::compile_ucm
19.2 12.01 8.333 45242 0.0003 0.0002 main::encode_U
4.74 2.953 1.078 45242 0.0001 0.0000 utf8::unicode_to_native
4.16 2.595 0.718 45242 0.0001 0.0000 utf8::encode
0.09 0.055 0.054 5 0.0109 0.0108 main::BEGIN
0.01 0.008 0.008 1 0.0078 0.0078 Getopt::Std::getopts
0.00 0.000 -0.000 1 0.0000 - Exporter::import
0.00 0.000 -0.000 3 0.0000 - strict::bits
0.00 0.000 -0.000 1 0.0000 - strict::import
0.00 0.000 -0.000 2 0.0000 - strict::unimport
At the top of the list, the subroutine enter takes about half the total CPU
time, with 200,000 calls, each very fast. That makes it a good candidate to
optimise, because all you have to do is make a slight change that gives a
small speedup, and that gain will be magnified 200,000 times. [It turned out
that enter was tail recursive, and part of the speed gain I got was by
making it loop instead]
Third on the list is encode_U, which with 45,000 calls is similar, and worth
looking at. [Actually, it was trivial code and in the real enc2xs I
inlined it]
utf8::unicode_to_native and utf8::encode are built-ins, so you won't be
able to change that.
Don't bother below there, as you've accounted for 90% of total program time, so even if you did a perfect job on everything else, you could only make the program run 10% faster.
compile_ucm is trickier - it's only called 6 times, so it's not obvious
where to look for what's slow. Maybe there's a loop with many iterations.
But now you're guessing, which isn't good.
One trick is to break it into several subroutines, just for benchmarking, so that DProf gives you times for different bits. That way you can see where the juicy bits to optimise are.
Devel::SmallProf should do line by line profiling, but every time I use
it it seems to crash.
Now you've identified the slow spots, you need to try alternative code to
see if you can find something faster. The Benchmark module makes this easy.
A particularly good subroutine is cmpthese, which takes code snippets and
plots a chart. cmpthese was added to Benchmark with perl 5.6.
So to compare two code snippets orig and new by running each for 10000
times you'd do this:
use Benchmark ':all';
sub orig {
...
}
sub new {
...
}
cmpthese (10000, { orig => \&orig, new => \&new } );
Benchmark runs both, times them, and then prints out a helpful comparison chart:
Benchmark: timing 10000 iterations of new, orig...
new: 1 wallclock secs ( 0.70 usr + 0.00 sys = 0.70 CPU) @ 14222.22/s (n=10000)
orig: 4 wallclock secs ( 3.94 usr + 0.00 sys = 3.94 CPU) @ 2539.68/s (n=10000)
Rate orig new
orig 2540/s -- -82%
new 14222/s 460% --
and it's plain to see that my new code is over 4 times as fast as my original code.
Actually, I didn't tell the whole truth earlier about what causes slowness in perl. [And astute hecklers such as Philip Newton had already told me this]
When perl compilers your program it breaks it down into a sequence of
operations it must perform, which are usually referred to as ops.
So when you ask perl to compute $a = $b + $c it actually breaks it down
into these ops:
$b onto the stack$c onto the stack$aComputers are fast at simple things like addition. But there is quite a lot of overhead involved in keeping track of "which op am I currently performing" and "where is the next op", and this book-keeping often swamps the time taken to actually run the ops. So often in perl it's the number of ops your program takes to perform its task that is more important than the CPU they use or the RAM it needs. The hit list is
So what were my example code snippets that I Benchmarked?
It was code to split a line of hex (54726164696e67207374796c652f6d61) into
groups of 4 digits (5472 6164 696e ...) , and convert each to a number
sub orig {
map {hex $_} $line =~ /(....)/g;
}
sub new {
unpack "n*", pack "H*", $line;
}
The two produce the same results:
| orig | new |
|---|---|
| 21618, 24932, 26990, 26400, 29556, 31084, 25903, 28001, 26990, 29793, 26990, 24930, 26988, 26996, 31008, 26223, 29216, 29552, 25957, 25646 | 21618, 24932, 26990, 26400, 29556, 31084, 25903, 28001, 26990, 29793, 26990, 24930, 26988, 26996, 31008, 26223, 29216, 29552, 25957, 25646 |
but the first one is much slower. Why? Following the data path from right to
left, it starts well with a global regexp, which is only one op and therefore
a fast way to generate a list of the 4 digit groups. But that map block is
actually an implicit loop, so for each 4 digit block it iterates round and
repeatedly calls hex. Thats at least one op for every list item.
Whereas the second one has no loops in it, implicit or explicit. It uses one
pack to convert the hex temporarily into a binary string, and then one
unpack to convert that string into a list of numbers. n is big endian
16 bit quantities. I didn't know that - I had to look it up. But when the
profiler told me that this part of the original code was a performance
bottleneck, the first think that I did was to look at the the pack docs to
see if I could use some sort of pack/unpack as a speedier replacement.
You can ask perl to tell you the ops that it generates for particular code
with the Terse backend to the compiler. For example, here's a 1 liner
to show the ops in the original code:
$ perl -MO=Terse -e'map {hex $_} $line =~ /(....)/g;'
LISTOP (0x16d9c8) leave [1]
OP (0x16d9f0) enter
COP (0x16d988) nextstate
LOGOP (0x16d940) mapwhile [2]
LISTOP (0x16d8f8) mapstart
OP (0x16d920) pushmark
UNOP (0x16d968) null
UNOP (0x16d7e0) null
LISTOP (0x115370) scope
OP (0x16bb40) null [174]
UNOP (0x16d6e0) hex [1]
UNOP (0x16d6c0) null [15]
SVOP (0x10e6b8) gvsv GV (0xf4224) *_
PMOP (0x114b28) match /(....)/
UNOP (0x16d7b0) null [15]
SVOP (0x16d700) gvsv GV (0x111f10) *line
At the bottom you can see how the match /(....)/ is just one op. But the
next diagonal line of ops from mapwhile down to the match are all the
ops that make up the map. Lots of them. And they get run each time
round map's loop. [Note also that the {}s mean that map enters scope
each time round the loop. That not a trivially cheap op either]
Whereas my replacement code looks like this:
$ perl -MO=Terse -e'unpack "n*", pack "H*", $line;'
LISTOP (0x16d818) leave [1]
OP (0x16d840) enter
COP (0x16bb40) nextstate
LISTOP (0x16d7d0) unpack
OP (0x16d7f8) null [3]
SVOP (0x10e6b8) const PV (0x111f94) "n*"
LISTOP (0x115370) pack [1]
OP (0x16d7b0) pushmark
SVOP (0x16d6c0) const PV (0x111f10) "H*"
UNOP (0x16d790) null [15]
SVOP (0x16d6e0) gvsv GV (0x111f34) *line
There are less ops in total. And no loops, so all the ops you see execute only once. :-)
[My helpful hecklers pointed out that it's hard to work out what an op is. Good call. There's roughly one op per symbol (function, operator, variable name, and any other bit of perl syntax). So if you golf down the number of functions and operators your program runs, then you'll be reducing the number of ops.]
[These were supposed to be the bonus slides. I talked to fast (quelle surprise) and so manage to actually get through the lot with time for questions]
Memoize follows the grand perl tradition by trading memory for
speed. You tell Memoize the name(s) of functions you'd like to speed up,
and it does symbol table games to transparently intercept calls to them. It
looks at the parameters the function was called with, and uses them to decide
what to do next. If it hasn't seen a particular set of parameters before, it
calls the original function with the parameters. However, before returning the
result, it stores it in a hash for that function, keyed by the function's
parameters. If it has seen the parameters before, then it just returns the
result direct from the hash, without even bothering to call the function.Memoize uses is a regular perl hash. This means that you can tie
the hash to a disk file. This allows Memoize to remember things across runs of
your program. That way, you could use Memoize in a CGI to cache static
content that you only generate on demand (but remember you'll need file
locking). The first person who requests something has to wait for the
generation routine, but everyone else gets it straight from the cache. You can
also arrange for another program to periodically expire results from the
cache.
As of 5.8 Memoize module has been assimilated into the core. Users of
earlier perl can get it from CPAN.
These are quite general ideas for optimisation that aren't particularly perl specific.
enc2xs was calling a function each time round a loop based on a hash
lookup using $type as the key. The value of $type didn't change, so I
pulled the lookup out above the loop into a lexical variable:
my $type_func = $encode_types{$type};
and doing it only once was faster.
enc2xs was calling a function which took
several arguments from a small number of places. The function contained code
to set defaults if some of the arguments were not supplied. I found that the
way the program ran, most of the calls passed in all the values and didn't
need the defaults. Changing the function to not set defaults, and writing
those defaults out explicitly where needed bought me a speed up.perl can't spot that it could just throw away the old lexicals and re-use their space, but you can, so you can save CPU and RAM by re-writing your tail recursive subroutines with loops. In general, trying to reduce recursion by replacing it with iterative algorithms should speed things up.
y, or tr, is the transliteration operator. It's not as powerful as the
general purpose regular expression engine, but for the things it can do it is
often faster.
tr doesn't delete characters unless you use the /d flag. If you don't
even have any replacement characters then it treats its target as read only.
In scalar context it returns the number of characters that matched. It's the
fastest way to count the number of occurrences of single characters and
character ranges. (ie it's faster than counting the elements returned by
m/.../g in list context. But if you just want to see whether one
or more of a character is present use m/.../, because it will stop
at the u first, whereas tr/// has to go to the end)tr is also faster than the regexp engine for doing character-for-character
substitutions.tr is faster than the regexp engines for doing character range deletions.
[When writing the slide I assumed that it would be faster for single character
deletions, but I Benchmarked things and found that s///g was faster for
them. So never guess timings; always test things. You'll be surprised, but
that's better than being wrong]
Another example lifted straight from enc2xs of something that I managed to
accelerate quite a bit by reducing the number of ops run. The code takes a
scalar, and prints out each byte as \x followed by 2 digits of hex, as it's
generating C source code:
#foreach my $c (split(//,$out_bytes)) {
# $s .= sprintf "\\x%02X",ord($c);
#}
# 9.5% faster changing that loop to this:
$s .= sprintf +("\\x%02X" x length $out_bytes), unpack "C*", $out_bytes;
The original makes a temporary list with split [not bad in itself - ops are
more important than CPU or RAM] and then loops over it. Each time round the
loop it executes several ops, including using ord to convert the byte to its
numeric value, and then using sprintf with the format "\\x%02X" to convert
that number to the C source.
The new code effectively merges the split and looped ord into one op,
using unpack's C format to generate the list of numeric values directly.
The more interesting (arguably sick) part is the format to sprintf, which
is inside +(...). You can see from the .= in the original that the
code is just concatenating the converted form of each byte together. So
instead of making sprintf convert each value in turn, only for perl ops to
stick them together, I use x to replicate the per-byte format string once
for each byte I'm about to convert. There's now one "\\x%02X" for each
of the numbers in the list passed from unpack to sprintf, so sprintf
just does what it's told. And sprintf is faster than perl ops.
pack, unpack and
sprintf. So why not use them?
All the pack and unpack code is implemented in pure C, so doesn't have
any of the book-keeping overhead of perl ops. sprintf too is pure C, so
it's fast. The regexp engine uses its own private bytecode, but it's
specially tuned for regexps, so it runs much faster than general perl code.
And the implementation of tr has less to do than the regexp engine, so it's
faster.
For maximum power, remember that you can generate regexps and the formats for
pack, unpack and sprintf at run time, based on your data.
$&, use
(?:...) when you don't need capturing, and put the /o flag on constant
regexps.© 2002 Nicholas Clark