Google Performance Tools
Download and Install Google Performance Tools
Version 0.5 of google-perftools (last updated Fri Mar 11 05:58:27 PST 2005)
includes the following four tools:
* thread-caching malloc
* heap-checking using tcmalloc
* heap-profiling using tcmalloc
* CPU profiler
The project describes itself as "The fastest malloc we've seen; works particularly well with
threads and STL. Also: thread-friendly heap-checker, heap-profiler, and cpu-profiler."
The tools run on Linux 9 and target C++ programs. The README says that Google is porting them to
Windows, but gives no date for when that will happen. The performance of the
tcmalloc library is impressive, and the CPU profiler works very well; its
graphical output is especially useful.
To download, go to
http://sourceforge.net/projects/goog-perftools. If you are running a
Linux 2.6.x kernel, there are RPMs ready to install without having to
compile anything. Otherwise, build from source as described in the INSTALL
file in the unpacked directory. Its main section on installation reads:
1. `cd' to the directory containing the package's source code and type
`./configure' to configure the package for your system. If you're
using `csh' on an old version of System V, you might need to type
`sh ./configure' instead to prevent `csh' from trying to execute
`configure' itself.
Running `configure' takes awhile. While running, it prints some
messages telling which features it is checking for.
2. Type `make' to compile the package.
3. Optionally, type `make check' to run any self-tests that come with
the package.
4. Type `make install' to install the programs and any data files and
documentation.
CPU Profiler
Your code can only scale if it uses CPU cycles efficiently: cycles should be
spent doing the real work, not on overhead. This is where CPU profiling comes
in. The google-perftools CPU profiler lets you profile in two ways: 1) link the
profiling library into your program and run it; 2) set environment variables
if you cannot rebuild the program because you don't have the source. The
example graphical output is a call graph in which the largest box is the
biggest CPU consumer, the next largest box is the next biggest, and so on.
This graph is generated by the 'pprof' command, part of google-perftools,
with the '--gv' option, which displays the graph in gv (Ghostview). You must
have 'dot' installed. Dot is distributed by AT&T Bell Labs and is
available for restricted use, along with the related tools dotty, neato and tcldot.
You can obtain a non-commercial license for dot/dotty/neato/tcldot and download
the software from this web page:
http://www.research.att.com/sw/tools/graphviz.
First run a binary that has the CPU profiler built in, for example:
# ./profiler4_unittest 200 10 /tmp/cpuprofile
Then generate an analysis report with the 'pprof' tool, like this:
# profiler4_unittest 200 10 /tmp/cpu.prof
# pprof "profiler4_unittest" "/tmp/cpu.prof"
   460  25.2%  25.2%   460  25.2%  __pthread_mutex_lock_internal
   399  21.9%  47.1%   399  21.9%  __pthread_mutex_unlock_usercnt
   196  10.8%  57.9%   196  10.8%  vfprintf
   156   8.6%  66.5%   156   8.6%  __lll_mutex_lock_wait
   141   7.7%  74.2%   141   7.7%  __lll_mutex_unlock_wake
   110   6.0%  80.2%   240  13.2%  __vsnprintf
    60   3.3%  83.5%    60   3.3%  _IO_default_xsputn_internal
    50   2.7%  86.3%    50   2.7%  _IO_str_init_static_internal
    45   2.5%  88.7%    45   2.5%  _IO_old_init
    35   1.9%  90.7%   462  25.4%  __snprintf
    34   1.9%  92.5%    34   1.9%  __find_specmb
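As a reading aid for the listing above: the pprof documentation describes the six text columns as the samples taken in the function itself, that count as a percentage of all samples, the running sum of that percentage over the rows printed so far, the samples in the function plus its callees, that count as a percentage, and the function name. The following is a minimal C++ sketch of that arithmetic only. The names `Row`, `pct`, `running_pct` and `example_rows`, and the sample counts, are made up for illustration and are not pprof internals.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One row of a flat profile. Hypothetical type, for illustration only.
struct Row {
    std::string name;
    long flat;  // samples taken while this function itself was running
    long cum;   // samples in this function plus everything it called
};

// Share of all samples, as a percentage (columns 2 and 5).
double pct(long samples, long total) {
    assert(total > 0);
    return 100.0 * static_cast<double>(samples) / static_cast<double>(total);
}

// Column 3: running sum of the flat percentages down to row i, inclusive.
double running_pct(const std::vector<Row>& rows, std::size_t i, long total) {
    long sum = 0;
    for (std::size_t k = 0; k <= i && k < rows.size(); ++k) {
        sum += rows[k].flat;
    }
    return pct(sum, total);
}

// Made-up profile of 200 samples in total, rows sorted by flat count.
// The remaining 10 samples belong to functions not listed here.
std::vector<Row> example_rows() {
    return {{"lock", 100, 100}, {"format", 60, 90}, {"copy", 30, 30}};
}
```

Read this way, a row whose last percentage is much larger than its first (like __snprintf in the listing) spends most of its time in the functions it calls rather than in its own body.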
Here are some other 'pprof' commands, many of which
generate graphical output:
% pprof --gv "program" "profile"
Generates annotated call-graph and displays via "gv"
% pprof --gv --focus=Mutex "program" "profile"
Restrict to code paths that involve an entry that matches "Mutex"
% pprof --gv --focus=Mutex --ignore=string "program" "profile"
Restrict to code paths that involve an entry that matches "Mutex"
and does not match "string"
% pprof --list=IBF_CheckDocid "program" "profile"
Generates disassembly listing of all routines with at least one
sample that match the --list= pattern. The listing is
annotated with the flat and cumulative sample counts at each line.
Check out details for Google CPU Profiler.
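To try the first approach (linking the profiler in) on something small, it helps to have a workload whose hot spot is known in advance. The sketch below is one such workload. The function names `hot_path`, `cold_path` and `run_workload` are hypothetical, and the code does not call the profiler itself. It assumes, following the google-perftools documentation, that the profiling library is linked with `-lprofiler` and that the output file is named through the `CPUPROFILE` environment variable.

```cpp
#include <cassert>
#include <cstdint>

// Deliberately expensive: a tight multiply/xor loop that should show up
// as the largest box in the call graph.
std::uint64_t hot_path(std::uint64_t iterations) {
    std::uint64_t acc = 1469598103934665603ULL;  // arbitrary starting value
    for (std::uint64_t i = 0; i < iterations; ++i) {
        acc ^= i;
        acc *= 1099511628211ULL;  // unsigned overflow wraps; that is well defined
    }
    return acc;
}

// Deliberately cheap: should be a small box, if it is sampled at all.
std::uint64_t cold_path(std::uint64_t iterations) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        sum += i;
    }
    return sum;
}

// Entry point for the experiment: call this from main() with a scale large
// enough that the program runs for at least a few seconds.
// hot_path gets 1000 times as many iterations as cold_path.
std::uint64_t run_workload(std::uint64_t scale) {
    assert(scale > 0);
    return hot_path(1000 * scale) ^ cold_path(scale);
}
```

With a `main()` that calls `run_workload` and prints the result (so the compiler cannot discard the work), the resulting profile should attribute nearly all samples to `hot_path`, which makes it easy to check that the profiler and pprof are set up correctly before pointing them at real code.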
Heap Profile & TCMalloc
The most interesting thing google-perftools brings to memory allocation is
TCMalloc, or Thread-Caching malloc:
"Sanjay Ghemawat, Paul Menage <opensource@google.com>
Motivation
TCMalloc is faster than the glibc 2.3 malloc (available as a separate library
called ptmalloc2) and other mallocs that I have tested. ptmalloc2 takes
approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for
small objects). The TCMalloc implementation takes approximately 50 nanoseconds
for the same operation pair. Speed is important for a malloc implementation
because if malloc is not fast enough, application writers are inclined to write
their own custom free lists on top of malloc. This can lead to extra complexity,
and more memory usage unless the application writer is very careful to
appropriately size the free lists and scavenge idle objects out of the free
list.
TCMalloc also reduces lock contention for multi-threaded programs. For small
objects, there is virtually zero contention. For large objects, TCMalloc tries
to use fine grained and efficient spinlocks. ptmalloc2 also reduces lock
contention by using per-thread arenas but there is a big problem with
ptmalloc2's use of per-thread arenas. In ptmalloc2 memory can never move from
one arena to another. This can lead to huge amounts of wasted space. For
example, in one Google application, the first phase would allocate approximately
300MB of memory for its data structures. When the first phase finished, a second
phase would be started in the same address space. If this second phase was
assigned a different arena than the one used by the first phase, this phase
would not reuse any of the memory left after the first phase and would add
another 300MB to the address space. Similar memory blowup problems were also
noticed in other applications.
Another benefit of TCMalloc is space-efficient representation of small
objects. For example, N 8-byte objects can be allocated while using space
approximately 8N * 1.01 bytes. I.e., a one-percent space overhead. ptmalloc2
uses a four-byte header for each object and (I think) rounds up the size to a
multiple of 8 bytes and ends up using 16N bytes."
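The per-pair timings quoted above can be approximated with a small microbenchmark. The following is a minimal single-threaded sketch, not the benchmark the TCMalloc authors used. The function name `ns_per_malloc_free_pair` and the global `g_sink` are hypothetical. It measures whichever malloc the program is linked against, so comparing allocators means building it twice, once with the default malloc and once with tcmalloc linked in (assumed here to be done with `-ltcmalloc`).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>

// Volatile sink: storing each pointer here makes it escape, so the compiler
// cannot optimize the malloc/free pair away.
static void* volatile g_sink = nullptr;

// Average wall-clock nanoseconds per malloc/free pair for objects of `size` bytes.
double ns_per_malloc_free_pair(std::size_t size, long pairs) {
    assert(size > 0 && pairs > 0);
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < pairs; ++i) {
        void* p = std::malloc(size);
        g_sink = p;
        std::free(p);
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(pairs);
}
```

Calling `ns_per_malloc_free_pair(8, 10000000)` gives a rough figure for small objects. Because each block is freed immediately and only one thread runs, this exercises the allocator's fastest path and says nothing about lock contention or memory blowup across arenas; the absolute numbers will also differ from the 2.8 GHz P4 figures in the quote.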
See also:
Automatic Leaks Checking Support
Profiling heap usage