How to interpret Erlang crash dumps
Sh**t happens, even when working with Erlang... I know, it is a hard truth, but you can survive this, believe me!
Some time ago I was testing a small server I was developing for fun in my own environment, but every time I ran the test I got an abnormal exit of the runtime system. Luckily, Erlang helped me tackle the problem with the Erlang crash dump file (erl_crash.dump), a text file that can be read using either a normal text editor or (suggested) the Crashdump Viewer tool in the observer application.
This file is written to the current directory of the emulator, or to the file pointed to by the environment variable ERL_CRASH_DUMP, any time an abnormal exit of the Erlang runtime system happens (for a crash dump to be written, there has to be a writable file system mounted).
The exit is mainly due to one of the following reasons:
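To check where a dump would end up for the node you are running, the rule above can be sketched in a few lines. This is only an illustration: the module name dump_location is made up for this post, and it simply mirrors the rule described here (ERL_CRASH_DUMP if set, otherwise erl_crash.dump in the current directory) rather than asking the emulator itself.

```erlang
-module(dump_location).
-export([path/0]).

%% Best guess at where a crash dump would be written:
%% the value of ERL_CRASH_DUMP if the variable is set, otherwise
%% "erl_crash.dump" in the emulator's current working directory.
path() ->
    case os:getenv("ERL_CRASH_DUMP") of
        false ->
            {ok, Cwd} = file:get_cwd(),
            filename:join(Cwd, "erl_crash.dump");
        Path ->
            Path
    end.
```

Calling dump_location:path() in the shell of the node under test tells you where to look after the next crash.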
- the function erlang:halt/1 is called explicitly with a string argument from running Erlang code
- the runtime system has detected an error that cannot be handled. Usually this is because of external limitations, such as running out of memory or having a huge number of open sockets or ports. It may also be caused by the system reaching limits in the emulator itself (like the number of atoms in the system, or too many simultaneous ETS tables).
Let’s take a look at the structure of erl_crash.dump (please notice that I won’t describe all the sections of the file, just the ones I find most interesting):
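The first case is easy to reproduce on a throwaway node. Here is a minimal sketch; the module name halt_demo and the function stop/1 are invented for this example, while erlang:halt/1 is the real BIF. Be careful: calling it really does stop the node and write a dump.

```erlang
-module(halt_demo).
-export([stop/1]).

%% Halts the runtime system and asks it to write a crash dump.
%% Because the argument is a string, the emulator produces
%% erl_crash.dump and uses the string as the slogan.
%% WARNING: this terminates the whole node, not just one process.
stop(Reason) when is_list(Reason) ->
    erlang:halt(Reason).
```

For instance, halt_demo:stop("testing crash dumps") on a scratch node should leave a dump behind whose slogan is that same string.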
The first part of the dump shows the creation time for the dump, its reason (also called slogan), the system version, the node from which the dump originates, the compile time of the emulator running the originating node and the number of atoms in the atom table.
The reason for the dump is reported at the beginning of the file as Slogan: <reason>. If the system is halted for the first of the two reasons above (erlang:halt/1), the slogan is the string parameter passed to the BIF, otherwise it is a description generated by the emulator or the (Erlang) kernel. This message should provide enough information to understand the problem.
Here is a brief list of some possible reasons for termination:
- <A>: Cannot allocate <N> bytes of memory (of type “<T>”). – The system has run out of memory. <A> is the allocator that failed to allocate memory, <N> is the number of bytes that <A> tried to allocate, and <T> is the memory block type that the memory was needed for. The most common case is that a process stores huge amounts of data.
- <A>: Cannot reallocate <N> bytes of memory (of type “<T>”). – Same as above with the exception that memory was being reallocated instead of being allocated when the system ran out of memory.
- Unexpected op code N – Error in compiled code, beam file damaged or error in the compiler.
Many other errors may occur, so please check out the online guide!
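Since the slogan is usually the first thing you want to know, here is a small sketch that pulls it out of a dump. The module name dump_slogan is made up, and it assumes the slogan sits on a line starting with "Slogan: " as described above. It reads the whole file into memory, which may not be a good idea for very large dumps.

```erlang
-module(dump_slogan).
-export([from_file/1, from_binary/1]).

%% Reads the dump at Path and returns {ok, Slogan} or not_found.
from_file(Path) ->
    {ok, Bin} = file:read_file(Path),
    from_binary(Bin).

%% Same as from_file/1, but works on dump contents already in memory.
from_binary(Bin) when is_binary(Bin) ->
    find(binary:split(Bin, <<"\n">>, [global])).

%% Returns the text after the first "Slogan: " prefix, trimmed.
find([<<"Slogan: ", Rest/binary>> | _]) -> {ok, string:trim(Rest)};
find([_ | Lines]) -> find(Lines);
find([]) -> not_found.
```

Typical use would be dump_slogan:from_file("erl_crash.dump") from the directory where the dump was written.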
You should already know that atoms are not garbage collected in Erlang, so sometimes the system can crash because of a huge (very huge) number of atoms. This number is shown as Atoms: <number> in the dump file. The error may be due to the dynamic generation of a lot of different atoms using the BIF erlang:list_to_atom/1.
Under the tag =memory you will find information similar to what you can obtain on a living node with erlang:memory(), so this part will look something like:
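On a live node you can keep an eye on the atom table and avoid creating atoms from untrusted input. The sketch below is just one possible approach: the module name atom_guard and its functions are invented, and erlang:system_info(atom_count) and erlang:system_info(atom_limit) are assumed to be available (they were added in OTP 20, as far as I know).

```erlang
-module(atom_guard).
-export([usage/0, safe_to_atom/1]).

%% Returns {AtomsInUse, MaximumAtoms} for the running node.
usage() ->
    {erlang:system_info(atom_count), erlang:system_info(atom_limit)}.

%% Converts a string to an atom only if that atom already exists,
%% so arbitrary input cannot grow the atom table.
safe_to_atom(String) when is_list(String) ->
    try
        {ok, list_to_existing_atom(String)}
    catch
        error:badarg -> {error, unknown_atom}
    end.
```

The idea is to use something like safe_to_atom/1 wherever the input comes from outside, and to leave list_to_atom/1 for strings you fully control.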
=memory
total: 10917216
processes: 1133244
processes_used: 1124692
system: 9783972
atom: 211717
atom_used: 191008
binary: 21288
code: 1203662
ets: 31908
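To compare these numbers with a healthy node, you can print erlang:memory() in roughly the same shape. This is only a sketch with an invented module name (mem_report); the exact set of fields depends on the OTP release and may differ from the ones in the dump.

```erlang
-module(mem_report).
-export([print/0]).

%% Prints the live node's memory usage, one "type: bytes" line per
%% entry, loosely imitating the =memory section of a crash dump.
print() ->
    io:format("=memory~n"),
    lists:foreach(
      fun({Type, Bytes}) -> io:format("~p: ~p~n", [Type, Bytes]) end,
      erlang:memory()).
```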
In my opinion the Erlang crash dump is very interesting because it contains a listing of each living Erlang process in the system. The process information for one process may look like this:
=proc:<0.0.0>
State: Running
Name: init
Spawned as: otp_ring0:start/2
Spawned by:
Started: Tue Dec 7 15:43:40 2010
Message queue length: 1
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.1.0>, <0.5.0>, <0.3.0>]
Reductions: 3154
Stack+heap: 6765
OldHeap: 6765
Heap unused: 5714
OldHeap unused: 6765
Program counter: 0xb7805160 (init:sleep/1 + 32)
CP: 0x00000000 (invalid)
The tag =proc:<pid> identifies the pid of the process, but much more information is available in every process section. For example, you can get the state of the process, which will be one of the following:
- Scheduled – The process was scheduled to run but was not currently running.
- Waiting – The process was waiting for something (in receive).
- Running – The process was currently running. If the BIF erlang:halt/1 was called, this was the process calling it.
- Exiting – The process was on its way to exit.
- Garbing – This is bad luck: the process was garbage collecting when the crash dump was written, so the rest of the information for this process is limited.
- Suspended – The process is suspended, either by the BIF erlang:suspend_process/1 or because it is trying to write to a busy port.
Moreover, you can get the registered name of the process (if there is one), which process spawned it, its starting time, the message queue length and the number of reductions it consumed.
Similar information is given about the ETS tables in the system, loaded modules, functions and (if the Erlang node was set up for communicating with other nodes) active connections.
Note that the crash dump file structure may change between releases of OTP, so this guide should be seen as a very simple example. For more information take a look at the official web site.
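When a dump contains thousands of processes, a quick overview of who was doing what can help. The sketch below lists the pid and state of every =proc section. The module name dump_procs is invented, and the code assumes the layout shown above (a "=proc:<pid>" line followed by a "State: ..." line), which may change between OTP releases.

```erlang
-module(dump_procs).
-export([states/1]).

%% Takes the contents of a crash dump as a binary and returns a list
%% of {Pid, State} pairs (both binaries), one per =proc section.
states(Bin) when is_binary(Bin) ->
    collect(binary:split(Bin, <<"\n">>, [global]), undefined, []).

%% A new process section starts: remember its pid.
collect([<<"=proc:", Pid/binary>> | T], _Cur, Acc) ->
    collect(T, string:trim(Pid), Acc);
%% Any other section tag ends the current process section.
collect([<<"=", _/binary>> | T], _Cur, Acc) ->
    collect(T, undefined, Acc);
%% The State line inside a process section: record it.
collect([<<"State: ", State/binary>> | T], Cur, Acc) when Cur =/= undefined ->
    collect(T, undefined, [{Cur, string:trim(State)} | Acc]);
%% Everything else is skipped.
collect([_ | T], Cur, Acc) ->
    collect(T, Cur, Acc);
collect([], _Cur, Acc) ->
    lists:reverse(Acc).
```

You could feed it the result of file:read_file("erl_crash.dump") and then, for example, filter the list for processes that were Running when the node went down.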