How to interpret the Erlang crash dumps

Sh**t happens, even when working with Erlang… I know, it is a hard truth, but you can survive it, believe me!

Some time ago I was testing, in my own environment, a small server I was developing for fun, but every time I ran the test I got an abnormal exit of the runtime system. Luckily, Erlang helped me tackle the problem with its crash dump file (erl_crash.dump), a text file that can be read with either a normal text editor or (suggested) the Crashdump Viewer tool in the observer application.

This file is written to the current directory of the emulator, or to the file named by the environment variable ERL_CRASH_DUMP, any time the Erlang runtime system exits abnormally (for a crash dump to be written, there has to be a writable file system mounted).
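If you want a script to find the dump for you, the lookup rule above can be sketched in a few lines of Python (my choice of language for poking at dumps, nothing Erlang requires). The function name crash_dump_path and its arguments are made up for illustration, and the sketch assumes you pass in the working directory of the emulator, not that of the script:

```python
import os
from pathlib import Path


def crash_dump_path(env=None, emulator_cwd=None):
    """Guess where the dump was written.

    Mirrors the rule in the text: ERL_CRASH_DUMP wins if it is set,
    otherwise the file is erl_crash.dump in the emulator's directory.
    """
    env = os.environ if env is None else env
    base = Path.cwd() if emulator_cwd is None else Path(emulator_cwd)
    configured = env.get("ERL_CRASH_DUMP")
    return Path(configured) if configured else base / "erl_crash.dump"
```

Calling crash_dump_path() with no arguments uses the environment and directory of the script itself, which is only right if the node was started from the same place.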

erl_crash.dump overview

The exit is mainly due to one of the following reasons:

  1. the function erlang:halt/1 is called explicitly with a string argument from running Erlang code
  2. the runtime system has detected an error that cannot be handled. Usually this happens because of external limitations, such as running out of memory or having a huge number of open sockets or ports. A crash dump due to an internal error may instead be caused by the system reaching limits in the emulator itself (like the number of atoms in the system, or too many simultaneous ETS tables).

Let’s take a look at the structure of erl_crash.dump (please note that I won’t describe all the sections of the file, just the ones I find most interesting):

The first part of the dump shows the creation time for the dump, its reason (also called slogan), the system version, the node from which the dump originates, the compile time of the emulator running the originating node and the number of atoms in the atom table.

The reason for the dump is reported at the beginning of the file as Slogan: <reason>. If the system is halted for the first of the two reasons above (erlang:halt/1), the slogan is the string parameter passed to the BIF, otherwise it is a description generated by the emulator or the (Erlang) kernel. This message should provide enough information to understand the problem.
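Since the header is just a handful of "Key: value" lines, pulling the slogan out with a script is easy. Here is a minimal Python sketch. The function read_header is hypothetical, and so is the sample text: the exact header layout is an assumption based on the fields described above, and it may differ between OTP releases.

```python
def read_header(dump_text):
    """Collect the 'Key: value' lines found before the first section tag."""
    header = {}
    for line in dump_text.splitlines():
        if line.startswith("="):
            if header:
                break  # a new "=tag" line ends the header
            continue   # skip a tag line that comes before the header fields
        key, sep, value = line.partition(": ")
        if sep:
            header[key] = value
    return header


# Made-up sample, shaped like the header described in the text.
SAMPLE_HEADER = """\
Tue Dec  7 15:43:40 2010
Slogan: eheap_alloc: Cannot allocate 1234567 bytes of memory (of type "heap").
Atoms: 9876
=memory
total: 10917216
"""

print(read_header(SAMPLE_HEADER)["Slogan"])
```

Only the first ": " on a line is used as the separator, so a slogan that itself contains a colon is kept whole.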

Here is a brief list of some possible reasons for termination:

  • <A>: Cannot allocate <N> bytes of memory (of type “<T>”). – The system has run out of memory. <A> is the allocator that failed to allocate memory, <N> is the number of bytes that <A> tried to allocate, and <T> is the memory block type that the memory was needed for. The most common case is that a process stores huge amounts of data.
  • <A>: Cannot reallocate <N> bytes of memory (of type “<T>”). – Same as above with the exception that memory was being reallocated instead of being allocated when the system ran out of memory.
  • Unexpected op code N – Error in compiled code, beam file damaged or error in the compiler.
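The two memory slogans in the list follow a fixed pattern, so a regular expression can split them into their parts. This is only a sketch: the name decode_alloc_slogan is mine, and I am assuming the dump uses plain ASCII double quotes around the type (the curly quotes above are just blog typography).

```python
import re

# Pattern taken from the slogan templates in the list above.
ALLOC_SLOGAN = re.compile(
    r'^(?P<allocator>\S+): Cannot (?P<operation>allocate|reallocate) '
    r'(?P<bytes>\d+) bytes of memory \(of type "(?P<type>[^"]+)"\)'
)


def decode_alloc_slogan(slogan):
    """Return the parts of an out-of-memory slogan, or None for other slogans."""
    match = ALLOC_SLOGAN.match(slogan)
    if match is None:
        return None
    parts = match.groupdict()
    parts["bytes"] = int(parts["bytes"])
    return parts
```

Any slogan that is not one of the two memory messages (for example an "Unexpected op code" one) simply gives None, so you can fall back to reading it by eye.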

Many other errors may occur, so please check out the online guide!

You should already know that atoms are not garbage collected in Erlang, so sometimes the system can crash because of a huge (very huge) number of atoms in the system. This number is shown as Atoms: <number> in the dump file. The usual cause is code that dynamically generates a lot of different atoms using the BIF erlang:list_to_atom/1.
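To judge whether the atom count is suspicious, compare it with the size of the atom table. The sketch below assumes the default limit of 1,048,576 atoms, which is only right if the node was not started with a different +t setting; the function name atom_table_usage is hypothetical.

```python
# Default size of the atom table; an assumption, since +t can change it.
DEFAULT_ATOM_LIMIT = 1_048_576


def atom_table_usage(dump_text, limit=DEFAULT_ATOM_LIMIT):
    """Return the fraction of the atom table in use, or None if no Atoms line."""
    for line in dump_text.splitlines():
        if line.startswith("="):
            # Tag lines are skipped; the Atoms line is expected in the header.
            continue
        if line.startswith("Atoms: "):
            return int(line[len("Atoms: "):]) / limit
    return None
```

A value close to 1.0 is a strong hint that something is creating atoms dynamically.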

Under the tag =memory you will find information similar to what you can obtain on a living node with erlang:memory(), so this part will look something like:

total: 10917216
processes: 1133244
processes_used: 1124692
system: 9783972
atom: 211717
atom_used: 191008
binary: 21288
code: 1203662
ets: 31908
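These "name: bytes" lines are easy to load into a dictionary, which makes it simple to see where the memory went. A small Python sketch follows, using the numbers from the listing above; the function name parse_memory is made up.

```python
def parse_memory(section_text):
    """Turn 'name: bytes' lines into a dict of integers."""
    memory = {}
    for line in section_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            memory[name.strip()] = int(value)
    return memory


# The values shown in the text above.
MEMORY_SECTION = """\
total: 10917216
processes: 1133244
processes_used: 1124692
system: 9783972
atom: 211717
atom_used: 191008
binary: 21288
code: 1203662
ets: 31908
"""

memory = parse_memory(MEMORY_SECTION)
# In this listing, total is exactly processes + system.
print(memory["total"] == memory["processes"] + memory["system"])
```

In this particular dump most of the memory sits under system rather than processes, so a single process hoarding data is not the first thing I would suspect.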

In my opinion the Erlang crash dump is very interesting because it contains a listing of each living Erlang process in the system. The process information for one process may look like this:

State: Running
Name: init
Spawned as: otp_ring0:start/2
Spawned by: []
Started: Tue Dec  7 15:43:40 2010
Message queue length: 1
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.1.0>, <0.5.0>, <0.3.0>]
Reductions: 3154
Stack+heap: 6765
OldHeap: 6765
Heap unused: 5714
OldHeap unused: 6765
Program counter: 0xb7805160 (init:sleep/1 + 32)
CP: 0x00000000 (invalid)
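With many processes in a dump, scrolling through these blocks gets tedious, so I find it handy to let a script rank them, for example by message queue length. The Python sketch below assumes every section starts with a line like =tag:id followed by "Key: value" lines. The function names and the two-process sample (including the second pid and its numbers) are invented for illustration.

```python
def parse_sections(dump_text):
    """Split a dump into sections, each with its tag, id and fields."""
    sections = []
    current = None
    for line in dump_text.splitlines():
        if line.startswith("="):
            tag, _, ident = line[1:].partition(":")
            current = {"tag": tag, "id": ident, "fields": {}}
            sections.append(current)
        elif current is not None:
            key, sep, value = line.partition(": ")
            if sep:
                current["fields"][key] = value
    return sections


def busiest_mailboxes(dump_text, top=3):
    """Return (pid, queue length) pairs for the longest message queues."""
    procs = [s for s in parse_sections(dump_text) if s["tag"] == "proc"]
    sized = [(s["id"], int(s["fields"].get("Message queue length", 0)))
             for s in procs]
    return sorted(sized, key=lambda pair: pair[1], reverse=True)[:top]


# Invented sample: the first process reuses fields from the listing above.
SAMPLE_PROCS = """\
=proc:<0.0.0>
State: Running
Name: init
Message queue length: 1
=proc:<0.42.0>
State: Waiting
Message queue length: 5000
"""

print(busiest_mailboxes(SAMPLE_PROCS, top=1))
```

A process with thousands of queued messages is a common suspect when the slogan talks about running out of memory.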

The tag =proc:<pid> identifies the pid of the process, but much more information is available in every process section. For example, you can get the state of the process, which will be one of the following:

  • Scheduled – The process was scheduled to run but was not currently running.
  • Waiting – The process was waiting for something (in receive).
  • Running – The process was currently running. If the BIF erlang:halt/1 was called, this was the process calling it.
  • Exiting – The process was on its way to exit.
  • Garbing – This is bad luck: the process was garbage collecting when the crash dump was written, so the rest of the information for this process is limited.
  • Suspended – The process is suspended, either by the BIF erlang:suspend_process/1 or because it is trying to write to a busy port.

Moreover, you can get the registered name of the process (if there is one), which process spawned it, its starting time, the message queue length and the number of reductions it consumed.
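A quick way to get a feeling for what the node was doing is to count how many processes were in each state. Here is a minimal Python sketch, again assuming that process sections start with =proc:<pid> and contain a State: line; the name count_states and the sample text are made up.

```python
from collections import Counter


def count_states(dump_text):
    """Count the State: values found inside =proc sections only."""
    states = Counter()
    in_proc = False
    for line in dump_text.splitlines():
        if line.startswith("="):
            in_proc = line.startswith("=proc:")
        elif in_proc and line.startswith("State: "):
            states[line[len("State: "):].strip()] += 1
    return states


# Invented sample with three processes and one unrelated section.
SAMPLE_STATES = """\
=proc:<0.0.0>
State: Running
=proc:<0.1.0>
State: Waiting
=proc:<0.2.0>
State: Waiting
=memory
total: 10917216
"""

print(count_states(SAMPLE_STATES))
```

Lots of Waiting processes is normal on an idle node, while the Running one is where I would start reading when the dump comes from erlang:halt/1.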

Similar information is given about the ETS tables in the system, loaded modules, functions and (if the Erlang node was set up for communicating with other nodes) active connections.

Note that the crash dump file structure may change between releases of OTP, so this guide should be seen as a very simple example. For more information, take a look at the official web site.
