PAPI FAQ

General Questions (FAQ)
      I have a question that I think should be added here. Where should I send it?
      How do I install the PAPI library?
      Where do I go for help?
      What are the mailing lists and how do I subscribe?
      Where are the archives for the mailing lists?
      What is needed to use PAPI?
      What tools are available for PAPI?

The PAPI Library
      I downloaded the PAPI 3 tarball last week and keep getting a segmentation fault in gcc. What's up?
      When I make PAPI, I always get a warning message when compiling fmultiplex2. Why?
      How do I convert my code from PAPI 2 to PAPI 3?
      How do I compile PAPI with debugging support?
      How do I use the debugging features of the PAPI library?
      Why do PAPI_overflow, PAPI_profil and PAPI_sprofil behave strangely with a small threshold?
      How do I stop PAPI_overflow, PAPI_profil or PAPI_sprofil?
      What events does PAPI track?
      How does PAPI handle threads?
      How does PAPI handle fork/exec?
      Does PAPI support unbound or non-kernel threads?
      How do I encode a native event?
      Why is there more than one patch for Linux?
      The numbers are funky for event 0xabc on platform XYZ, help me!
      My program runs fine when measuring 1 or 2 events, but when I add more I get a -8, PAPI_ECNFLCT error code. The error text says, "Event exists, but cannot be counted due to hardware resource limitations". What does this mean?
      What's multiplexing?
      Why am I still getting PAPI_ECNFLCT when using multiplexing?
      What's a derived event?
      When I compile and run the example program (PAPI_flops.c) on X platform I get the following error message: Error in PAPI_flops: Event exists, but cannot be counted due to hardware resource limits, what is the problem?
      Why can't I get my Fortran programs to compile with PAPI on a Cray T3E?
      What's wrong with PAPI_LST_INS (hex code 0x43) on my Pentium?
      PAPI_create_eventset always returns an error now.
      What's this GCC error about "thread local storage not supported for this target"?

The PAPI CVS Source Repository
      How do I access the PAPI CVS tree remotely?
      How do I add a user to CVS so he or she can check in files?
      How do I merge branches of the PAPI project back to the main 'HEAD' trunk?
      How do I add external code (like perfctr) to the papi cvs project?
      How do I remove a directory in the CVS repository without hosing it?
      How do I remove a branch or a tag from a cvs repository?
      How do I branch a branch of a cvs repository?

PAPI on AIX POWER Processors
      General Comments
      Installation notes
      Test case notes
      Counter notes
      Things go haywire on my Power/AIX box with threaded programs?

Any-Null
      General Comments

DADD-Alpha
      What's DADD and where is it installed?
      How does DADD work?

Irix-Mips
      32 and 64 bit libraries
      Native Events for IRIX v6.x on MIPS R10K and R12K processors
      Hardware Multiplexing
      Bugs
      Why when I compile PAPI 3.0, do I get Dl_info undefined in irix-mips.c?
      My code forks and my counts are high, what's the problem?

Linux-Alpha
      General Comments

Linux-IA64
      Floating Point
      Notes on PAPI->Native event mappings
      Why am I getting errors from perfmon and PAPI on my Redhat kernels?
      How do/Should I recompile my (Redhat 2.4.x) IA64 kernel for PAPI?
      Counter interrupts seem to have stopped on my threaded programs?

Linux-Perfctr
      PAPI and the Linux Kernel
      Before you compile
      If you have already patched your kernel
      Hardware interrupt driven counters
      Why do PAPI_LD_INS and PAPI_SR_INS give identical results on Pentium 4?
      Floating point counts on the Pentium 4 series
      Vector instruction counts on the Pentium 4 series
      The memory test sometimes fails on Athlon Processors.
      How do I patch my Linux/Pentium I, II, III, IV, AMD K7, K8 box to work with PAPI?
      Floating Point counts on AMD Opteron

Solaris-Ultra
      General Comments
      Bugs
      My Sun box doesn't have libcpc.h. What should I do?

Tru64-Alpha
      General Comments
      EV6 Native Events
      Known problems

Cray X1
      Counters
      Overhead
      Overflow
      Profiling
      Multiplexing
      Native Events
      Shared Objects
      Cache Information
      Timer latencies
      Other Issues
      Known Bugs

General Questions (FAQ)

I have a question that I think should be added here. Where should I send it?

Send it to ptools-perfapi@ptools.org.


How do I install the PAPI library?

Please see INSTALL.txt in the papi root directory.


Where do I go for help?

First, read this document thoroughly. Then consult the PAPI Home Page at http://icl.cs.utk.edu/projects/papi. If that doesn't help, search the mailing list archives as described below. If that fails, send mail to one of the two mailing lists, ptools-perfapi@ptools.org or perfapi-devel@ptools.org. The former is a group for general announcements, questions and miscellaneous topics. The latter is a discussion group for the developers of PAPI; it also receives all CVS update messages, which can be a significant amount of mail!


What are the mailing lists and how do I subscribe?

There are currently two mailing lists. ptools-perfapi is a group for general announcements, questions and miscellaneous topics. perfapi-devel is a discussion group for the developers of PAPI; it also receives all CVS update messages, which can be a significant amount of mail!

To subscribe to or maintain your subscription to either of the above groups, go to:
lists.cs.utk.edu/listinfo/ptools-perfapi or lists.cs.utk.edu/listinfo/perfapi-devel.


Where are the archives for the mailing lists?

The archives for the general PAPI mailing list are located at lists.cs.utk.edu/private/ptools-perfapi/. The archives for the developers list are located at lists.cs.utk.edu/private/perfapi-devel/.


What is needed to use PAPI?

See the Platform section at http://icl.cs.utk.edu/papi/custom/index.html?lid=62&slid=96.


What tools are available for PAPI?

Some of the more popular tools using PAPI can be found under the Tools link on the PAPI web page at http://icl.cs.utk.edu/papi/. You can also see the latest list of third-party tools and related software at http://icl.cs.utk.edu/papi/links/index.html. If you have a tool to be posted, send it to the mailing list.



The PAPI Library

I downloaded the PAPI 3 tarball last week and keep getting a segmentation fault in gcc. What's up?

Some versions of GCC have a bug that is triggered by a statement in PAPI 3.0. The one-character fix is in the current tarball, but may not be in the one you downloaded.

If you see an INTERNAL ERROR from GCC when compiling multiplex.c, do 2 things.

1) Edit multiplex.c, line 1021, to have 2 equal signs instead of 1.

2) (Optional) Send a message to your local gcc maintainer and complain.

The actual culprit is:

assert(retval = PAPI_OK) and it should be assert(retval == PAPI_OK)

Of course, both are legal C and nothing should trigger an internal compiler error, but hey...

P.S. If your current release compiled with GCC, you're still OK, because the statement above NEVER gets triggered. It is there as an artifact from the original multiplex.c implementation, so you don't need to change or upgrade your PAPI or gcc.
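
To see why the single equal sign matters, consider the minimal sketch below. It is not taken from multiplex.c: DEMO_PAPI_OK is a stand-in for PAPI_OK (which is 0), and the two function names are made up for illustration.

```c
#include <assert.h>

#define DEMO_PAPI_OK 0   /* stand-in for PAPI_OK; an assumption of this sketch */

/* Value of the buggy expression (one equal sign): the assignment
 * overwrites retval and yields DEMO_PAPI_OK, i.e. 0, whatever retval
 * held before. */
int buggy_condition(int retval)
{
    return (retval = DEMO_PAPI_OK);
}

/* Value of the corrected expression (two equal signs): 1 when retval
 * equals DEMO_PAPI_OK, 0 otherwise. */
int fixed_condition(int retval)
{
    return retval == DEMO_PAPI_OK;
}
```

With one equal sign the asserted expression is always 0, so it tells you nothing about the original return value; with two it is a genuine comparison.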


When I make PAPI, I always get a warning message when compiling fmultiplex2. Why?

The warning message here is benign, but since it occurs on the last file to be compiled, it often looks like the build has been aborted. The reason the message occurs is that the compiler thinks it is trying to stuff too many bits into an integer value. You can fix it by rearranging the code a little bit, or just download the latest copy of fmultiplex2.F from the CVS tree.


How do I convert my code from PAPI 2 to PAPI 3?

PAPI 3 represents a major upgrade to the PAPI library. Because of this, there have been a number of interface changes. The process to upgrade from PAPI 2 to PAPI 3 is straightforward, and documented in the PAPI Conversion Cookbook. You can read it online, or download it in a number of different formats.


How do I compile PAPI with debugging support?

To compile with debugging, define CFLAGS to include -DDEBUG in the corresponding Makefile or Rules. file.


How do I use the debugging features of the PAPI library?

To enable debugging messages at run time, set the PAPI_DEBUG environment variable to one or more of the following with any character as a separator.

SUBSTRATE
API
INTERNAL
THREADS
MULTIPLEX
OVERFLOW
PROFILE
ALL

Also, see the man page for PAPI_set_debug().
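
As a rough illustration of why any separator works, the sketch below scans a PAPI_DEBUG-style string for the keywords listed above using a plain substring search. The function name parse_debug_levels and the DBG_* bit values are hypothetical and are not taken from the PAPI source; treat this as a sketch of the idea, not as the library's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bit flags, one per keyword from the list above. */
enum {
    DBG_SUBSTRATE = 1 << 0,
    DBG_API       = 1 << 1,
    DBG_INTERNAL  = 1 << 2,
    DBG_THREADS   = 1 << 3,
    DBG_MULTIPLEX = 1 << 4,
    DBG_OVERFLOW  = 1 << 5,
    DBG_PROFILE   = 1 << 6,
    DBG_ALL       = (1 << 7) - 1   /* every flag above */
};

/* Return a bitmask of the keywords found anywhere in `value`, which
 * must not be NULL. Because this is a substring search, the separator
 * between keywords can be any character. */
unsigned parse_debug_levels(const char *value)
{
    static const struct { const char *name; unsigned bit; } table[] = {
        { "SUBSTRATE", DBG_SUBSTRATE },
        { "API",       DBG_API },
        { "INTERNAL",  DBG_INTERNAL },
        { "THREADS",   DBG_THREADS },
        { "MULTIPLEX", DBG_MULTIPLEX },
        { "OVERFLOW",  DBG_OVERFLOW },
        { "PROFILE",   DBG_PROFILE },
        { "ALL",       DBG_ALL }
    };
    unsigned mask = 0;
    size_t i;

    assert(value != NULL);
    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strstr(value, table[i].name) != NULL)
            mask |= table[i].bit;
    return mask;
}
```

A caller would typically pass the result of getenv("PAPI_DEBUG") after checking that it is not NULL.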


Why do PAPI_overflow, PAPI_profil and PAPI_sprofil behave strangely with a small threshold?

On most systems, overflow must be emulated in software by PAPI. Only on the UltraSparc III, Itanium and IRIX does the operating system support true interrupt on overflow. Therefore the user is advised on most platforms to make sure the overflow value is no more than 1/1000th the clock rate. The emulation handler in PAPI runs every millisecond, so the goal of the tool designer should be to pick a value that will overflow frequently but not too frequently. Not following these guidelines could result in either overflows never occurring, or overflows occurring on every interrupt and thus producing a flat profile.


How do I stop PAPI_overflow, PAPI_profil or PAPI_sprofil?

Call PAPI_stop, and then call PAPI_overflow, PAPI_profil or PAPI_sprofil with a threshold value of 0. Since PAPI 3 can overflow and profile on multiple events, you must call the above routines for EACH event that had been previously enabled for overflow or profile.


What events does PAPI track?

PAPI only tracks 'hardware events', the occurrence of signals onboard the microprocessor. It does not count system calls, software interrupts or other software events. The user should remember that by default, PAPI only measures events that occur in User Space.


How does PAPI handle threads?

Currently, PAPI only supports thread level measurements with kernel or bound threads. Each thread must create, manipulate and read its own counters. When a thread is created, it inherits no PAPI events or information from the calling thread.


How does PAPI handle fork/exec?

When a process is created, it inherits no PAPI information from the calling thread.


Does PAPI support unbound or non-kernel threads?

Yes, but the counts will reflect the total events for the process. Measurements done in other threads will all get the same values, namely the counts for the whole process. For non-bound threads, it is not necessary to call PAPI_thread_init. In most scenarios, however, such as SMP or OpenMP compiler directives, bound threads are the default. Those using Pthreads should take care to set the scope of each thread with the PTHREAD_SCOPE_SYSTEM attribute, unless the system is known to have a non-hybrid thread library implementation, like Linux.
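
For Pthreads users, requesting system scope might look like the sketch below. It uses only standard POSIX calls; the helper name spawn_bound_thread and the worker routine are hypothetical, and a real worker would still have to create and read its own PAPI counters.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical thread body; real code would set up and read its own
 * PAPI event set here. It simply hands its argument back. */
static void *worker(void *arg)
{
    return arg;
}

/* Create a thread with system contention scope (a "bound" thread) and
 * wait for it to finish. Returns 0 on success, or the error code of
 * the pthread call that failed. `result` may be NULL. */
int spawn_bound_thread(void *arg, void **result)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    /* Ask for a kernel-scheduled thread instead of a process-scope one. */
    rc = pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, worker, arg);
    pthread_attr_destroy(&attr);
    if (rc != 0)
        return rc;

    return pthread_join(tid, result);
}
```

On a system whose thread library cannot provide system scope, pthread_attr_setscope reports an error and the helper returns it instead of silently creating an unbound thread.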


How do I encode a native event?

In PAPI2.0: Unless otherwise stated in the FAQ section for your platform, the encoding is as follows:

event = ((reg_code & 0xffffff) << 8 | (reg_num & 0xff))

In PAPI3.0: Just find the native event name and then call PAPI_event_name_to_code. The code returned can be added directly to an event set. The native events can be listed with the test case 'native_avail' in the ctests directory.
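
For PAPI 2, the encoding formula above can be wrapped in a small helper, sketched here. The function names are hypothetical and not part of PAPI; check the FAQ section for your platform first, since it may define a different encoding.

```c
#include <assert.h>

/* Pack a register code and a register number as in the PAPI 2 formula
 * above: the low 24 bits of reg_code are shifted up by 8, and the low
 * 8 bits of reg_num occupy the bottom byte. */
unsigned int encode_native_event(unsigned int reg_code, unsigned int reg_num)
{
    return ((reg_code & 0xffffffu) << 8) | (reg_num & 0xffu);
}

/* Hypothetical inverse helpers, useful for checking an encoded value. */
unsigned int native_event_reg_code(unsigned int event)
{
    return (event >> 8) & 0xffffffu;
}

unsigned int native_event_reg_num(unsigned int event)
{
    return event & 0xffu;
}
```

Bits outside the two masks are discarded, so an out-of-range register code or number is silently truncated rather than rejected.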


Why is there more than one patch for Linux?

There are numerous patches designed to provide access to the Intel CPU performance counters. As PAPI began, we used the original Beowulf patch (perf) by David Hendriks. However, as PAPI progressed, we needed some additional features, which I graciously added. This patch used a system call approach and has proven to be exceedingly stable. Yes, no crashes reported. I knew that there was a better way to design a performance counter kernel patch, one that used mmap() to provide direct access to the virtual counts. Mikael Pettersson provided me with exactly that in the form of the perfctr patch. It is also very, very stable. It can be found at http://www.docs.uu.se/~mikpe/linux/perfctr. If you're starting with PAPI for the first time, we recommend the perfctr patch as included in the papi source distribution.


The numbers are funky for event 0xabc on platform XYZ, help me!

This is not a question, but I'll help you. We the PAPI developers cannot be experts on the thousands of events found across all supported platforms. However, if you are using a PAPI preset, the first thing to do is to look up the corresponding native event code using the test case 'avail'. Then the best bet is always to go to the vendor's technical documentation site and check the processor reference manual. If you're convinced everything is kosher, then please feel free to send a message to the mailing list and one of the members may be able to help you.


My program runs fine when measuring 1 or 2 events, but when I add more I get a -8, PAPI_ECNFLCT error code. The error text says, "Event exists, but cannot be counted due to hardware resource limitations". What does this mean?

You have either exceeded the number of available hardware counters or two or more of the events you want to count need the same resources. This can be particularly annoying on machines like the Pentium 4. Although the P4 has 18 nominal counter registers, many events require resources that are restricted to 2 or 3 of these counters. In practice it is often difficult to count more than 4 or 5 simultaneous events on this platform. One way around limited counter resources is to use multiplexing.


What's multiplexing?

Many systems have only a few hardware performance counter registers; thus you can only measure a few metrics at once. Some platforms may support counter multiplexing, which gives the user the illusion of a larger number of registers by time sharing the performance registers. On the MIPS R10K series, the IRIX kernel supports multiplexing, allowing up to 32 events to be counted at once. On other platforms PAPI does the multiplexing itself, swapping events in and out of the counters based on a timer interrupt. Don't take fine grained measurements when multiplexing, unless you know what you're doing.
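
The time sharing described above can be pictured with a toy model, sketched below. It is not PAPI's multiplexing code: the function name, the round-robin schedule and the fixed per-slice event rates are all assumptions made for illustration.

```c
#include <assert.h>

#define MAX_EVENTS 32

/* Toy model: n_events share n_counters physical counters. In every time
 * slice the next n_counters events (round robin) are live and gain
 * rate[i] counts; the others see nothing. After n_slices, each raw
 * count is scaled by n_slices / slices_live to estimate what a
 * dedicated counter would have read. An event that was never live gets
 * an estimate of 0. */
void simulate_multiplex(const long long *rate, int n_events, int n_counters,
                        int n_slices, long long *estimate)
{
    long long raw[MAX_EVENTS] = {0};
    int live[MAX_EVENTS] = {0};
    int next = 0, s, c, i;

    assert(n_events > 0 && n_events <= MAX_EVENTS);
    assert(n_counters > 0 && n_counters <= n_events);

    for (s = 0; s < n_slices; s++) {
        for (c = 0; c < n_counters; c++) {
            raw[next] += rate[next];
            live[next]++;
            next = (next + 1) % n_events;
        }
    }
    for (i = 0; i < n_events; i++)
        estimate[i] = live[i] ? raw[i] * n_slices / live[i] : 0;
}
```

With steady rates and many slices the scaled estimates match the true totals. Over a single slice, some events are never live at all, which is one way to see why fine-grained measurements and multiplexing do not mix well.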


Why am I still getting PAPI_ECNFLCT when using multiplexing?

PAPI multiplexing currently always uses one hardware counter for Total Cycles. If you are trying to multiplex a derived event on hardware with only two physical counters then you will get a PAPI_ECNFLCT error. This happens on the Intel Pentium IIIs for example.

Also, enabling multiplexing is a two-step process. You must call PAPI_multiplex_init() to initialize multiplexing system-wide. You must also call PAPI_set_multiplex() for *each* event set that you want to count in multiplexed mode. If you try to add too many events to an event set where multiplexing has not been set, a PAPI_ECNFLCT error will result.


What's a derived event?

Hardware counters count low level events that can be directly measured in the hardware. Often these low level events must be combined to form meaningful PAPI preset events. This linear combination of low level events is called a derived PAPI event. Derived events are usually formed by adding or subtracting 2 'native' events, but occasionally derived events can contain 4 or more terms.
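
As a minimal sketch of that linear combination, a derived value can be computed as a signed sum of native counts. The struct and function names below are hypothetical and do not reflect how PAPI stores derived events internally.

```c
#include <assert.h>

/* One term of a derived event: a native counter reading and the sign
 * (+1 or -1) with which it enters the combination. */
struct derived_term {
    long long native_count;
    int sign;
};

/* Evaluate a derived event as the signed sum of its native terms. */
long long derived_event_value(const struct derived_term *terms, int n_terms)
{
    long long total = 0;
    int i;

    for (i = 0; i < n_terms; i++) {
        assert(terms[i].sign == 1 || terms[i].sign == -1);
        total += terms[i].sign * terms[i].native_count;
    }
    return total;
}
```

Each term occupies a hardware counter of its own, which is why a derived event can exhaust the counters faster than its single name suggests.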


When I compile and run the example program (PAPI_flops.c) on X platform I get the following error message: Error in PAPI_flops: Event exists, but cannot be counted due to hardware resource limits, what is the problem?

Hardware counters are a limited resource. Some PAPI preset events are derived, and require the use of more than one hardware counter. For example, Solaris has 2 counters, both of which are needed to count Floating point instructions. Flops also uses total cycles to measure time. On Solaris this would mean using 3 counters, and those resources aren't available.
If you get this error on any platform, run the avail program in the ctests directory and see how many native events have to be monitored. PAPI_num_counters() can be used to determine how many counters exist on your platform. If there are more native events than counters, then this is the reason you are getting the error.


Why can't I get my Fortran programs to compile with PAPI on a Cray T3E?

The Fortran header file you include has to be preprocessed before the Fortran file can use it. To have the cpp process the file before sending the file to the compiler, add the -F flag. For example:

f90 -F test.F -o test


What's wrong with PAPI_LST_INS (hex code 0x43) on my Pentium?

According to the Intel documentation, the counts from this event do not intuitively match its description. Older releases of PAPI had this preset available in the Intel ports, but it has since been removed. It does appear to work on the AMD Athlon.


PAPI_create_eventset always returns an error now.

The EventSet MUST be set to PAPI_NULL before it is passed into PAPI_create_eventset.


What's this GCC error about "thread local storage not supported for this target"?

TLS is thread local storage, a high performance mechanism in later versions of GCC, GLIBC and pthreads that provides constant time access to per-thread data. PAPI uses this if available.

However, many systems (especially IA64 running Debian or SuSE) provide very poor/buggy/non-existent support for this. If you're getting an error during compile (or seg faults on every program during the run), then please rebuild using ./configure.

Other systems don't bother to ship a gcc with this turned on, so you'll get the above error.

./configure has a test to make sure that the thread support is working on your platform.

If you find a case where configure did not detect a broken __thread implementation, please report it to us.



The PAPI CVS Source Repository

How do I access the PAPI CVS tree remotely?

If you would like to interactively browse the PAPI CVS Repository by the WWW, go to the Web based PAPI CVS Viewer.

To access cvs directly, do the following:

The first time, the Checkout phase:

> setenv CVSROOT :pserver:anonymous@cvs.cs.utk.edu:/cvs/homes/papi
> cvs login
> Password:
> cvs co all|papi|src|man|spec|tools

The next time, the Update phase:

> cd
> cvs update

The last time: :-(

> cd
> cvs logout


How do I add a user to CVS so he or she can check in files?

This question is meant for the CVS development team.

0) You must be in the group papi.
1) Login to nala.cs.utk.edu.
2) Make a temporary directory called foo.
3) Do the following:
 
> cd foo;
> cvs -d /cvs/homes/papi co CVSROOT
.
.
.
> cd CVSROOT
>
 
4) Add 'username' to the 'writers' file.
5) Add a password to the 'passwd' file with htpasswd program.
 
> /usr/local/apache/bin/htpasswd ~/tmp/CVSROOT/passwd 'username'
New password: <enter password>
Re-type new password: <again>
Added password for user 'username'
>
 
6) Commit the changes back to CVS.
                                                                               
> cvs commit -m "Added user 'username' to the cvs list"
.
.
.
cvs commit: Rebuilding administrative file database                                                


How do I merge branches of the PAPI project back to the main 'HEAD' trunk?

1) Check out the main trunk of the PAPI project:
        > cvs checkout -P papi
2) Update the main trunk by joining with the branch:
        > cvs update -j papi-2-3-3
3) Tag the branch for future updates:
        > cvs tag papi-2-3-3m1
4) Resolve any outstanding conflicts
                                                                               
5) Commit the modified files back to the main trunk:
        > cvs commit ...
...later...
6) To merge from the same branch again, repeat the steps above,
        but update from the tag created in step 3 rather than
        the branch point. This guarantees that only the changes
        made since the last merge are merged.


How do I add external code (like perfctr) to the papi cvs project?

The correct way to do it is:

Let's say we're adding a version of PerfCtr 2.6. PAPI keeps only release and major versions of the PerfCtr patch externally. Internally we track all PerfCtr versions as you will see from the CVS command line.
1) Untar the new distribution somewhere OUTSIDE of your CVS working dir and cd to it.

% cd /tmp; tar xfz perfctr-2.6.17.tar.gz
% cd perfctr-2.6.17

2) Import the sources. These will always appear in the HEAD branch of CVS. To make them appear in other branches, see below. Please make sure you use -ko to avoid substitution of keywords.

% cvs import -ko -m "Import of perfctr 2.6.17" papi/src/perfctr-2.6.x PERFCTR_DIST perfctr-2-6-17

Here PERFCTR_DIST is the "vendor tag" and perfctr-2-6-17 is the release version. ANYONE NOT FOLLOWING THE ABOVE SCHEME WILL BE IN TROUBLE. Then IF we have locally modified our copy (made changes to the distro) CVS will warn us to merge in the local changes with the newly imported sources. CVS will tell you how to resolve the conflicts after the import if necessary.


cd /anywhere/clean
cvs checkout -jperfctr:yesterday -jperfctr-2-6-17 papi/src/perfctr-2.6.x
cd papi/src/perfctr-2.6.x
...edit the conflicting files...
cvs commit


CVS will do the merge for us during the checkout. If there are any conflicts, we have to fix them manually. If either occurred, then we must commit the changes.


*** Note about NON-HEAD branches ***


CVS imports sources only into the HEAD branch. This means that all the other branches only have the vendor branch from the time the branch was created. To merge in new vendor import trees into other branches, go to your working copy of the branch you are working on, merge the two releases and check in the results as below.


cd papi-3-0-8-1+/src/perfctr-2.6.x
cvs update -jperfctr-2-6-13 -jperfctr-2-6-17
cvs commit

We should follow this model for all 3rd party sources in the source tree.

This is all available in further detail in the CVS Manual, under 'import'.

Good information is also available at: efault.net/npat/docs_and_postings/cvs-tracking/cvs-tracking.txt


How do I remove a directory in the CVS repository without hosing it?

First you remove all the files in the directory (and all subdirectories).
This can be tedious, so do it like this. This assumes GNU grep.
                                                                               
> cd <dir_to_be_removed>
> find . -type f | grep --invert-match CVS | xargs > /tmp/removeme
                                                                               
Now you have a list of all files in all subdirectories to be removed,
not including anything under CVS. Now edit the file and put 'rm' at the
start of the file.
                                                                               
> vi /tmp/removeme
 
So the start of the file looks like:
"rm ./file1 ./dir2/file2 ...."
 
Now execute the remove.
 
> sh /tmp/removeme
 
Good. Now tell CVS to remove the files. We have to edit the /tmp/removeme
again to replace 'rm' with 'cvs delete'.
 
> vi /tmp/removeme
 
So the start of the file looks like:
"cvs delete ./file1 ./dir2/file2 ...."
 
> sh /tmp/removeme
 
[ lots of CVS log messages ]
 
Now commit the changes with an informative log message.
 
> cvs commit


How do I remove a branch or a tag from a cvs repository?

This is not something that needs to happen very often, and because of that, it's easy to forget how. We've done it often enough to want to record what we've learned. The Cederqvist manual describes the procedure, but it isn't completely clear. Here's what you do:

> cvs rtag -d <tag> <module>

- or -

> cvs rtag -d -B <branch> <module>

where <tag> or <branch> is the name of the label you want to remove, and <module> is usually 'papi'.

CAUTION: Removing tags is risky, but removing branches is downright dangerous. Do it only if you're really sure you need to, and as close to the creation date as possible!


How do I branch a branch of a cvs repository?

Usually, branches or tags are added to the main trunk of a cvs repository. Occasionally it is desirable to add a branch to an existing branch. Here's how to do it:

> cvs rtag -b -r <oldtag> -- <newtag> <module>

Where <oldtag> is the name of the existing branch, <newtag> is the name of the new branch, and <module> is typically 'papi'.

Here's an example:

> cvs rtag -b -r papi-3-1-0 -- papi-3-2-0 papi



PAPI on AIX POWER Processors

General Comments

If you are running papi-3.0 on the AIX 5.2 and POWER4 combination and seeing
failures, they are most likely caused by a bug in the kernel. Look for the efix
for APAR IY57280, or contact the PAPI team at papi@cs.utk.edu for the fix.
More precise information from IBM: the problem was introduced in 5.2 ML3, and
fixed in 5.2 ML4 and 5.3.
 
To use PAPI in 64-bit mode on power4:
    make -f Makefile.aix-power4-64bit
        link your program with libpapi64.a or libpapi64.so
         
See: /usr/lpp/pmtoolkit/lib/<arch>.evs for POWER3;
     /usr/pmapi/lib/POWER4.evs and /POWER4.gps for POWER4
 
For threaded programs, you should set:
 
setenv AIXTHREAD_SCOPE S


Installation notes

AIX 4.3.x:
The current source and Makefile are for pmtoolkit 1.3.
If you have pmtoolkit 1.2 the test cases will fail. For example:
 
      ./tests/avail
      IOT trap
 
This can be remedied by recompiling the PAPI library with the option
-DPMTOOLKIT_1_2 set.
 
AIX 5.x:
The current source is for pmapi 1.4
 
The aix-power substrate is contained in a single source file, but targets
three different configurations.
Conditional compilation directed by three different make files determines
which configuration is targeted. Make sure you select the Makefile that
matches your configuration:
- Makefile.aix-power    for AIX 4.3.x on POWER3
- Makefile.aix5-power3  for AIX 5.x   on POWER3
- Makefile.aix-power4   for AIX 5.x   on POWER4


Test case notes

The POWER3 and POWER4 have an FMADD instruction. Although this instruction
performs two Floating Point operations, it is counted as one Floating Point
instruction. Because of this, there are situations where PAPI_FP_INS may
produce fewer Floating Point counts than expected.
Further, the Floating Point Instruction event on POWER3 and POWER4 also
counts Floating Point Stores, leading to higher Floating Point counts than
expected. There are occasions where these two effects can cancel each other
out, to produce the right result for the wrong reason!
Note that POWER3 and POWER4 also support an FMA counter (PAPI_FMA_INS).
Thus, a more accurate count of Floating Point Operations can be obtained
by PAPI_FP_INS + PAPI_FMA_INS.
Correcting for the overcount by Floating Point Stores is more difficult,
requiring the use of the native events: PM_FPU_LD_ST_ISSUES and PM_FPU_LD.
The complete expression for Floating Point Operations then becomes:
PAPI_FP_INS + PAPI_FMA_INS - (PM_FPU_LD_ST_ISSUES - PM_FPU_LD)
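
The expression above can be written as a small helper, sketched here. The function name is hypothetical, and the four arguments are assumed to be counter readings for the same code region, obtained however your counter groups allow.

```c
#include <assert.h>

/* Estimate floating point operations on POWER3/POWER4 from four counter
 * readings, following the expression above:
 *   PAPI_FP_INS + PAPI_FMA_INS - (PM_FPU_LD_ST_ISSUES - PM_FPU_LD)
 * The parenthesised difference is the number of floating point stores
 * that were counted as floating point instructions. */
long long power_fp_ops(long long fp_ins, long long fma_ins,
                       long long fpu_ld_st_issues, long long fpu_ld)
{
    long long fp_stores = fpu_ld_st_issues - fpu_ld;

    return fp_ins + fma_ins - fp_stores;
}
```

For example, 1000 FP instructions, 200 FMAs, 500 load/store issues and 300 loads give 1000 + 200 - (500 - 300) = 1000 operations.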


Counter notes

The POWER architecture supports up to 8 counters. However, in many cases
events are mutually exclusive and can't be counted simultaneously.
 
On POWER4, events are available only as members of predefined groups.
For more on these groups, see /usr/pmapi/lib/POWER4.gps.
 
The following table, submitted by Joel Malard, indicates
events that cannot be counted simultaneously on POWER3:


Things go haywire on my Power/AIX box with threaded programs?

It is very important that you set the environment variable AIXTHREAD_SCOPE to "S", which disables user level threads.



Any-Null

General Comments

This substrate works on all platforms. It is for testing purposes only. This substrate emulates hardware that returns a 2 counter history in hardware registers 0 and 1, 1 being the most recent. The values returned represent the number of register reads performed by the substrate on the counter hardware.



DADD-Alpha

What's DADD and where is it installed?

One of the PAPI substrates for HP Alpha Tru64 UNIX uses the Dynamic Access to DCPI Data (DADD) API from Hewlett-Packard.
DADD is assumed to be installed in
  /usr/lib/dcpi  (dadd.a)
  /usr/include/dcpi (dadd.h, virtual_counters.h)
  /usr/local/dadd  (dcpid)


How does DADD work?

DCPI uses instruction sampling to collect performance data, and the DADD counts for PAPI events are estimated from these samples.
For this reason, very short running programs instrumented with PAPI may show undercounts as low as zero. Best results are obtained on programs with long execution times.



Irix-Mips

32 and 64 bit libraries

Both n32-bit and 64-bit libraries are built for IRIX. To use the shared libraries, you will need to set LD_LIBRARYN32_PATH or LD_LIBRARY64_PATH to the location of the n32-bit libpapi.so or 64-bit libpapi64.so.


Native Events for IRIX v6.x on MIPS R10K and R12K processors

For *all* the native event names, run native_avail in the ctests subdirectory. For an example of how to use the native event names, see native.c


Hardware Multiplexing

Hardware multiplexing is implemented and in use for IRIX-MIPS. You can count all 32 events simultaneously, but this must be enabled explicitly as with the other substrates.


Bugs

There may be two bugs in IRIX64 multiplexing:

First, it seems that when multiplexing is in use, the counter result you get by calling ioctl is not multiplied by the number of events counted by the hardware counter.

Second, only two hardware counters are available on the R10K/R12K. If you overflow on multiple events and those events happen to be counted by the same hardware counter, the result is reasonable. However, if the overflow events are counted by different hardware counters, then one of the results from hardware counter 1 will be abnormally high. See: man r10k_counters


Why when I compile PAPI 3.0, do I get Dl_info undefined in irix-mips.c?

You have an older run-time system than what PAPI was developed on. The fix is to change the definition to 'struct Dl_info'.


My code forks and my counts are high, what's the problem?

IRIX is brain-dead, and subprocesses automatically inherit the counter state of the parent process. Solution: call PAPI_stop() just before you fork and PAPI_start() after you fork.



Linux-Alpha

General Comments

Make sure your /usr/src/linux points to the source tree.

The PAPI linux-alpha substrate uses the linux iprobe_4.3 driver and library from Compaq (now HP). This needs to be patched, because not all of the EV6 functions are in the library. Unpack the iprobe code in the usual way, and copy the patch in papi/src/alpha-linux/iprobe.patch to the top of the iprobe directory. Then
 
patch -Np1 < iprobe.patch
 
After that, build and install iprobe according to directions, and set IPROBE_HOME in papi/src/Makefile.linux-alpha.

back to top



Linux-IA64

Floating Point

This version of the substrate always scales PME_FP_OPS_RETIRED_HI, hex code 0xa, even if you are using it as a NATIVE event. Previous versions of PAPI did not scale this event and could produce erroneously low counts for PAPI_FP_OPS or PAPI_FP_INS.

back to top


Notes on PAPI->Native event mappings

PAPI_CA_SNP
PAPI_CA_INV
 Only counts snoops and invalidations from the local processor.
PAPI_TLB_TL
 Counts "real" TLB misses, i.e. misses that cause a VHPT walk or a TLB
 miss trap to the OS. Misses in the L1 TLBs are not counted.
PAPI_FP_STAL
 Counts stalls due to register dependencies and load latencies.
 If the FP pipeline can stall for some other reason (I don't know)
 then those stall cycles won't be counted.

back to top


Why am I getting errors from perfmon and PAPI on my Redhat kernels?

Redhat broke the perfmon kernel interface in their kernels and thus only enabled it for root. In some kernels, it's disabled entirely. You can test this by running your PAPI program as root. If it then works, guess what, you have a broken kernel.

The fix is supposed to be in the latest update to RHEL3 and RHEL4. The best thing to do would be to download a kernel.org kernel, rebuild and go.

back to top


How do/Should I recompile my (Redhat 2.4.x) IA64 kernel for PAPI?

The following applies to 2.4 kernels:

Rebuilding the IA64 kernel is only advised if you're using Redhat kernels, particularly RedHat Enterprise, which seems to lag in bug fixes for the perfmon subsystem. We highly advise you to:

1) Download a stock 2.4 kernel.
2) Apply the IA64 specific patches from http://www.kernel.org/pub/linux/kernel/ports/ia64/v2.4 .

The latest information about 2.4 perfmon support can be found at: http://www.hpl.hp.com/research/linux/perfmon/download.php4

You'll then need to reconfigure your kernel with PERFMON support enabled and rebuild it. On IA-64 you need to use "make compressed" instead of "make bzImage". You can thank HP for that change.

back to top


Counter interrupts seem to have stopped on my threaded programs?

You are probably on an Altix or a system with a Redhat kernel. The solution for the latter is to replace the kernel you have with a patched kernel.org kernel, as discussed in this section.

Please send us the kernel version if this happens to you. You'll notice it by running the profile_pthreads test case.

If you're an Altix user, then it's best to complain to SGI. But please let us know also.

back to top



Linux-Perfctr

PAPI and the Linux Kernel

PAPI requires your Linux kernel to be patched with the PerfCtr patch. For compatibility reasons, we have included this patch here. You should patch your kernel using the PerfCtr distribution found in the papi/src/perfctr directory. The latest distribution can always be obtained from Mikael Pettersson's web site, although it is not guaranteed to work.
www.csd.uu.se/~mikpe/linux/perfctr/
If you're not sure how to patch, recompile and reinstall your Linux kernel, check the Linux HOWTOs on the web.
www.linuxhq.com

back to top


Before you compile

cd perfctr
more INSTALL
If you're getting compilation errors regarding not being able to find include files, then you're probably running a broken redhat installation.

Edit the path to your kernel include files at the top of either Makefile.linux-perfctr

back to top


If you have already patched your kernel

If you have a properly functioning Perfctr patch from a previous release of PAPI, you will obviously not want to repatch your kernel. PAPI is compatible with PerfCtr 2.4.x and Perfctr 2.6.x.

The x86 Makefiles:
Makefile.linux-perfctr-p3
Makefile.linux-perfctr-p4
Makefile.linux-athlon
Makefile.linux-opteron

To recompile PAPI *not* using the included PerfCtr distribution, you simply pass the PERFCTR argument to the appropriate Makefile.

make -f Makefile.linux-perfctr-p3 PERFCTR=/usr/src/perfctr-2.4.x

To use Perfctr 2.6.x, simply type:
make -f Makefile.linux-perfctr-p3

To use the older version:
make -f Makefile.linux-perfctr-p3 VERSION=2.4.x

Easy huh?

back to top


Hardware interrupt driven counters

YOU MUST COMPILE YOUR KERNEL WITH APIC SUPPORT IF YOU WANT INTERRUPT SUPPORT!
With Perfctr 2.3.3 or later it is possible to make the performance counters generate an interrupt when the counter reaches a certain count. This requires support in the Linux kernel, Perfctr, PAPI and the CPU to work properly.
The necessary kernel support is available if your kernel is compiled with SMP APIC support or uni-processor APIC support. This is true for 2.4-ac kernels and kernels 2.4.10 or later. This topic is discussed in more detail in Mikael Pettersson's installation instructions for PerfCtr.
Your CPU must be a Pentium 686/AMD K7 or similar which can generate APIC interrupts for performance counter events. This is _not_ true for some mobile Pentiums and early revisions of the AMD K7 or Athlon.
You can verify that all is working by running the perfctr/examples/perfex program with the -i flag. If you do not see "pcint" as one of the flags, you need to recompile your kernel or buy a real CPU. ;-)

back to top


Why do PAPI_LD_INS and PAPI_SR_INS give identical results on Pentium 4?

Counting memory load and store instructions on the Pentium 4 is a two step process. First the desired events are tagged at the front of the pipeline. Then tagged events are counted as they graduate from the end of the pipeline. Unfortunately, the tags are all the same 'color' and can't be differentiated as they exit the pipe. Thus, you can correctly measure LD instructions, or correctly measure SR instructions, but if you try to measure them both at once, you will always get the sum of both operations in both counters. The same applies to PAPI_LST_INS.

This behavior is demonstrated in the test program ctests/p4_lst_ins.c.

The moral of the story is to always use these three events one-at-a-time on Pentium 4 machines.
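
One way to guard against this in your own code is to check an event list before adding it to an event set. The sketch below is a hypothetical helper, not part of PAPI; it only compares preset names as strings and reports how many of the three tag-sharing events are present. A result greater than 1 means the counts would be contaminated on a Pentium 4.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical guard, not part of PAPI: count how many of the three
 * tag-sharing presets appear in a list of event names. */
int p4_tagged_event_count(const char *const names[], int n)
{
    static const char *const tagged[] = {
        "PAPI_LD_INS", "PAPI_SR_INS", "PAPI_LST_INS"
    };
    int count = 0;

    assert(names != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < 3; j++) {
            if (strcmp(names[i], tagged[j]) == 0) {
                count++;
            }
        }
    }
    return count;
}
```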

back to top


Floating point counts on the Pentium 4 series

The Pentium 4 can generate floating point instructions either through the x87 floating point unit or with SSE instructions.
Furthermore, SSE can generate either packed (multiple operands in one 128-bit register) or unpacked (single operand in one 128-bit register) instructions.
Depending on your compiler and settings you will get different instruction mixes.
 
PAPI provides 2 preset events to count floating point operations:
- PAPI_FP_INS counts instructions passing through the floating point unit;
- PAPI_FP_OPS counts something closer to theoretical floating point operations.
 
To minimize the overlap and maximize the usefulness of these two events on Pentium 4, we have made the following choices:
- PAPI_FP_INS always counts only x87 floating point operations.
- PAPI_FP_OPS counts can be customized as discussed below.
 
Further complicating things is that the Pentium 4 hardware is too restrictive to count all these modes at once, so a decision must be made about what to count.
In order to enable PAPI to count these various mixes, we support 2 methods.
 
1) The PAPI_PENTIUM4_FP_xxx defines.
 
   Set these in the EVENTFLAGS of either the Makefile.linux-perfctr-p4 or
   Makefile.linux-perfctr-em64t.
 
   -DPAPI_PENTIUM4_FP_X87
   -DPAPI_PENTIUM4_FP_X87_SSE_SP
   -DPAPI_PENTIUM4_FP_X87_SSE_DP
   -DPAPI_PENTIUM4_FP_SSE_SP_DP
 
   The predefined value for Nocona/EM64T/Pentium 4 Model 3 is:
 
         -DPAPI_PENTIUM4_FP_X87_SSE_DP.
 
   The predefined value for anything else is:
 
         -DPAPI_PENTIUM4_FP_X87.
 
   If nothing is defined, the substrate defaults to:
 
         -DPAPI_PENTIUM4_FP_X87_SSE_DP.
 
2) The PAPI_PENTIUM4_FP environment variable.
 
   Set this to one or two of the following, and it will change the
   behavior of PAPI_FP_OPS.
 
   X87: count all x87 instructions
   SSE_SP: count all unpacked SSE single precision instructions
   SSE_DP: count all unpacked SSE double precision instructions
 
   Due to the design of the register set, only 2 of the three are countable
   at one time. Sorry folks.
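
The "only 2 of the three" rule can be checked before a run. The sketch below is a hypothetical pre-flight check, not part of PAPI. This FAQ does not say how PAPI separates multiple keywords in the variable, so the check matches by substring rather than assuming a separator.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical pre-flight check, not part of PAPI: count how many of the
 * three documented keywords appear in a PAPI_PENTIUM4_FP value. Matching
 * by substring avoids assuming a particular separator. */
int p4_fp_modes_requested(const char *value)
{
    static const char *const keywords[] = { "X87", "SSE_SP", "SSE_DP" };
    int found = 0;

    assert(value != NULL);
    for (int i = 0; i < 3; i++) {
        if (strstr(value, keywords[i]) != NULL) {
            found++;
        }
    }
    return found;
}

/* One or two modes can be counted at once; zero or three cannot. */
int p4_fp_value_is_countable(const char *value)
{
    int n = p4_fp_modes_requested(value);
    return n >= 1 && n <= 2;
}
```

A caller would typically pass the result of getenv("PAPI_PENTIUM4_FP") after checking it for NULL.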

back to top


Vector instruction counts on the Pentium 4 series

PAPI can count 2 different types of vector instructions on the Pentium 4: either MMX instructions or packed SSE floating point instructions. These are supported with 2 methods, in a similar fashion to the floating point events described above.
 
1) The PAPI_PENTIUM4_VEC_xxx defines.
 
   Set these in the EVENTFLAGS of either the Makefile.linux-perfctr-p4 or
   Makefile.linux-perfctr-em64t.
 
   -DPAPI_PENTIUM4_VEC_MMX
   -DPAPI_PENTIUM4_VEC_SSE
 
   The current default for all platforms is:
 
         -DPAPI_PENTIUM4_VEC_SSE.
 
   If nothing is defined, the substrate defaults to:
 
         -DPAPI_PENTIUM4_VEC_SSE.
 
2) The PAPI_PENTIUM4_VEC environment variable.
 
   Set this to either of the following, and it will change the
   behavior of PAPI_VEC_INS.
 
   SSE: count all packed SSE SP and DP instructions
   MMX: count all 64 and 128 bit MMX instructions

back to top


The memory test sometimes fails on Athlon Processors.

This is a known issue and we are looking into the cause. Currently, we have no fix or workaround.

back to top


How do I patch my Linux/Pentium I, II, III, IV, AMD K7, K8 box to work with PAPI?

See the INSTALL file in papi/src/perfctr-2.6.x. The instructions are very, very simple. Do not use perfctr-2.4.x unless you have to. The perfctr version number is not tied to the Linux kernel version number!

back to top


Floating Point counts on AMD Opteron

The AMD Opteron is the first chip series from AMD that can measure and report floating point operations. Two native events measure floating point activity. One measures speculative operations that enter the FP units; the other measures operations that retire from the FP units.

The retired event generates precise event counts that scale with the amount of work done. However, it measures data movement as well as floating point operations, resulting in counts that are consistently significantly higher than the expected theoretical counts, often by factors of 2 or more.

The speculative event can be configured to generate counts of only the operations typically of interest. Since these counts are speculative, they tend to be higher by often widely variable amounts than expected theoretical counts, especially on complex production codes.

PAPI provides 2 preset events to count floating point operations:

- PAPI_FP_INS counts instructions passing through the floating point unit;
- PAPI_FP_OPS is intended to count something closer to theoretical floating point operations.

To minimize the overlap and maximize the usefulness of these two events on AMD Opteron, we have made the following choices:

- PAPI_FP_INS always counts retired floating point operations. This value will be precise and accurate, but will include FP loads and stores as well as computations.

- PAPI_FP_OPS counts speculative computation operations by default, but can be customized as discussed below.

As an alternative to counting speculative computations, PAPI_FP_OPS can be configured to count retired operations corrected for data movement. Unfortunately, the correction factors themselves are speculative, and can lead to undercounting errors similar in magnitude to those seen in the pure speculative counts.

Two methods are provided to allow customization of PAPI_FP_OPS:

1) The PAPI_OPTERON_FP_xxx defines.

Set these in the CFLAGS variable of Makefile.linux-perfctr-opteron.

-DPAPI_OPTERON_FP_RETIRED
-DPAPI_OPTERON_FP_SSE_SP
-DPAPI_OPTERON_FP_SSE_DP
-DPAPI_OPTERON_FP_SPECULATIVE

The default value is equivalent to:

-DPAPI_OPTERON_FP_SPECULATIVE.

2) The PAPI_OPTERON_FP environment variable.

Set this to one of the following, and it will change the behavior of PAPI_FP_OPS.

RETIRED: count all retired FP instructions
SSE_SP: correct retired counts optimized for single precision
SSE_DP: correct retired counts optimized for double precision
SPECULATIVE: count speculative computations (default)

back to top



Solaris-Ultra

General Comments

Assembler stubs for get_tick() and cpu_sync() as well as the following defines have been blatantly stolen from the perfmon code. The author of the package "perfmon" is Richard J. Enbody and the home page for "perfmon" is www.cps.msu.edu/~enbody/perfmon

For *all* the native event names, run native_avail in the ctests subdirectory. For how to use the native event names, see native.c

back to top


Bugs

1) Ultra I/II/III/III+ are currently supported;

2) Some of the cache events have documented bugs, see the Sun UltraSparc hardware reference manual.

3) WARNING FOR PEOPLE USING MULTITHREADED LIBRARIES ON SOLARIS 2.8: There is a bug that prevents setitimer() from being called once the process has called pthread_create() at any point. Therefore, if you suspect your communication library is multithreaded, you had better start the instrumentation before initializing it. See multiplex3_pthreads for details.

back to top


My Sun box doesn't have libcpc.h. What should I do?

You didn't check the PAPI Supported Platform Matrix. The hardware counters on SunOS with UltraSparc are only available on SunOS 5.8 and above. That's Solaris 2.8 for you SVR4 people.

back to top



Tru64-Alpha

General Comments

The PAPI alpha-tru64 substrate requires OSF 5.1 and a pfm device driver patch available from Bill Gray at Compaq (bgray@zk3.dec.com).

back to top


EV6 Native Events

The EV6 has a very small number of countable events.  Only the following PAPI preset events are available:
 
 EV6
   - PAPI_TOT_CYC
   - PAPI_TOT_INS
   - PAPI_RES_STL
   - PAPI_BR_CN
 
The native events on the EV6 are:
 {"cycles"},
 {"retinst"},
 {"retcondbranch"},
 {"retdtb1miss"},
 {"retdtb2miss"},
 {"retitbmiss"},
 {"retunaltrap"},
 {"replay"}

back to top


Known problems

- OSF 5.1 and the pfm device driver do not support saving and restoring the counters on context switch. Whoever opens the driver owns the counters.

- Currently multiplexing is not working and as such the tests are skipped.

back to top



Cray X1

Counters

The Cray X-1 has 4 P-Chip counters per MSP (1 for each SSP); in addition, there are 4 E-Chip counters and 16 M-Chip counters. Currently, PAPI only supports 1 value for each counter, and thus we send back an aggregate of all the counters. For P-Chips it may be desirable to see each SSP counter separately, as is the case if an MPI job runs 4 processes on a single MSP. We are looking into ways of providing this for the user, but with the current implementation you can only obtain an aggregate. This of course only affects MSP applications.

It is unclear which E-Chip and M-Chip counters should be read depending on the MSP the process is running on, or whether all should be aggregated. We are still examining the correlation between the process, the MSP/SSP it is running on, and the E-Chip/M-Chip counters. Thus the interface may end up changing to aggregate certain counters, or to return all the values, as we wish to do for the P-Chip. But the user should be aware that all counters are currently aggregated.

back to top


Overhead

Unlike most other platforms, the overhead on the Cray X1 is dependent on which events you are monitoring. Each chip type (P/M/E) requires its own ioctl call to start/stop and read the hardware counters. Since most of the overhead comes from the ioctl calls, your overhead will double if you monitor events from 2 chip types, or triple if you monitor events from 3 chip types.

back to top


Overflow

The Cray X-1 has hardware overflow support for the P-Chip; however, the M-Chip and E-Chip do not support hardware overflow. Because of this, if only P-Chip events are being overflowed on, we use hardware overflow. But if we are overflowing on M-Chip or E-Chip events (alone, together, or in combination with P-Chip events), we use software overflow.

P-Chips: In software mode the threshold is compared to the aggregate of all the SSP counters, but in hardware mode the threshold is compared against each individual SSP. This means that if you have an MSP application with 4 streams and you want a threshold of 100,000, you may want to set the threshold to 25,000, since with a threshold of 100,000 you could possibly reach an aggregate of 399,997 before an overflow happens. But this depends on your application.
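
The 399,997 figure follows from three streams each sitting one event short of the threshold (3 x 99,999) while the fourth reaches it (100,000). The sketch below works through that arithmetic; both functions are illustrative helpers and not part of PAPI, and the even split across streams is an assumption.

```c
#include <assert.h>

/* Illustrative arithmetic only, not part of PAPI. In hardware mode each SSP
 * overflows on its own count, so all but one stream can sit one event short
 * of the threshold when the last stream finally reaches it. */
long long worst_case_aggregate(int streams, long long per_ssp_threshold)
{
    assert(streams > 0 && per_ssp_threshold > 0);
    return (long long)(streams - 1) * (per_ssp_threshold - 1)
           + per_ssp_threshold;
}

/* Per-SSP threshold that approximates a desired aggregate threshold when
 * the work is spread evenly across the streams (integer division). */
long long per_ssp_threshold_for(long long aggregate_threshold, int streams)
{
    assert(streams > 0 && aggregate_threshold > 0);
    return aggregate_threshold / streams;
}
```

Here worst_case_aggregate(4, 100000) gives 399,997 and per_ssp_threshold_for(100000, 4) gives 25,000, matching the numbers quoted above.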

back to top


Profiling

Hardware profiling is not available on the Cray X-1, so we use overflows and return the PC to do profiling. Profiling therefore relies on either hardware or software overflow. Read the Overflow section to determine which will be used.

back to top


Multiplexing

Only software multiplexing is available. Currently, software multiplexing monitors only 1 event per time slice, so multiplexing is discouraged on the X1, since there are 64 possible counters.

back to top


Native Events

For *all* available native event names, run native_avail under the ctests subdirectory. For more information on the native events, see: man counters

back to top


Shared Objects

Shared objects are not supported on the Cray X1. A shared library therefore cannot be built, and we don't support any shared library information.

back to top


Cache Information

Currently this is not supported, but it will be added in future releases.

back to top


Timer latencies

PAPI_get_real_usec uses the highest resolution timer available on the Cray X1.

back to top


Other Issues

The user should be aware that we do not stop the counters when reading them, as this would require 2 additional ioctl calls per chip (i.e., a worst case of 9 ioctl calls to read all chips, versus 3 ioctl calls if we don't stop/start the counters). Because of this, the PAPI library itself will affect some of the counts. Since everything is aggregated, we loop through the 32 P-Chip counters, add them up, and then copy that information to the user's array. Each value is a long long, and thus this will affect the instruction count. The E-Chip and M-Chip counts are already aggregated by the hardware, so the PAPI interface only has to copy the results.
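
The ioctl counts above can be reproduced with a small model: one read call per chip type, plus a stop and a start call per chip type if the counters were paused around the read. This is an illustrative sketch with hypothetical names, not PAPI's implementation.

```c
#include <assert.h>

/* Illustrative model, not part of PAPI: ioctl calls needed for one read of
 * the counters. Each chip type (P, M, E) needs its own read call; stopping
 * and restarting the counters around the read would add two more per chip. */
int ioctl_calls_per_read(int chip_types, int stop_while_reading)
{
    assert(chip_types >= 1 && chip_types <= 3);
    return chip_types * (stop_while_reading ? 3 : 1);
}
```

Reading all three chip types costs ioctl_calls_per_read(3, 0) = 3 calls as implemented, against ioctl_calls_per_read(3, 1) = 9 if the counters were stopped first.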

back to top


Known Bugs

The test pthreads_zero can possibly crash the X1 system if it is not running the most recent kernel.

back to top