Getting started with Reverse Engineering binaries Part 1: Introduction

Motivation

Ever wondered how that binary gets created when you finish compiling? I did too. So I decided to find out…

What is a binary anyway?

On the face of it, a binary is just a file laid out in the format that its target platform expects. On GNU/Linux, for example, this is the Executable and Linkable Format (ELF). On the Microsoft Windows family of operating systems it is the Portable Executable (PE) format for executable image files and the Common Object File Format (COFF) for object files.

Executable and Linkable Format (ELF)

The ELF specification describes three main types of object file:

  1. Relocatable file
    • Holds code and data suitable for linking with other object files to create an executable or a shared object file
  2. Shared object file 
    • Contains data and code used for processing by the link editor and the dynamic linker.
    • The link editor takes the file, and with other relocatable and shared object files creates another object file.
    • The dynamic linker combines the file with an executable file and other shared object files to create a process image.
  3. Executable file
    • Comprises the code and data intended for execution. The file specifies the creation of a program’s process image.
Figure: Side by side view of ELF files; linking view and execution view.
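To make the three file types concrete, here is a minimal sketch that reads the e_type field from an ELF header and maps it to one of the categories above. The function names elf_file_type and elf_file_type_of are made up for this post; the field offsets and the ET_REL, ET_EXEC and ET_DYN values come from the ELF specification.

```python
import struct

# e_type values defined by the ELF specification
ELF_TYPES = {
    1: "relocatable file (ET_REL)",
    2: "executable file (ET_EXEC)",
    3: "shared object file (ET_DYN)",
}


def elf_file_type(header: bytes) -> str:
    """Classify an ELF file from the first 18 bytes of its header."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[EI_DATA] (byte 5): 1 = little-endian, 2 = big-endian
    byte_order = "<" if header[5] == 1 else ">"
    # e_type is the 16-bit field right after the 16-byte e_ident array
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return ELF_TYPES.get(e_type, f"other (e_type={e_type})")


def elf_file_type_of(path: str) -> str:
    """Hypothetical convenience wrapper: classify the file at `path`."""
    with open(path, "rb") as f:
        return elf_file_type(f.read(18))
```

Calling elf_file_type_of("/bin/ls") on a Linux system is one way to try it. Note that position-independent executables are marked ET_DYN, so many modern distributions will report /bin/ls as a shared object file.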

Portable Executable Format (PE)

The Portable Executable (PE) format was derived from the Common Object File Format (COFF), which itself was introduced with Unix System V. The optional part of the PE header (the Image Optional Header) differentiates between an executable and an object file: it is required for executable images, while object files normally do not carry one.

Figure: Format of PE file.
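As a rough illustration of that layout, the sketch below walks from the DOS header to the PE signature, reads the COFF file header, and checks whether an optional header follows. The function name describe_pe is invented for this post, and the code assumes the whole file has been read into memory; the offsets (0x3C for the PE header pointer, a 20-byte COFF header, and the 0x10B and 0x20B magic values) follow Microsoft's published PE format documentation.

```python
import struct


def describe_pe(data: bytes) -> dict:
    """Summarise the headers of a PE image held in `data`."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ (DOS header) signature")
    # The 32-bit value at 0x3C holds the file offset of the PE signature
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header follows the 4-byte signature
    (machine, sections, _timestamp, _symtab, _nsyms,
     opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    info = {
        "machine": machine,
        "sections": sections,
        "characteristics": characteristics,
        "optional_header_size": opt_size,
    }
    if opt_size >= 2:
        # The optional header starts with a magic number naming the variant
        (magic,) = struct.unpack_from("<H", data, pe_offset + 24)
        info["format"] = {0x10B: "PE32", 0x20B: "PE32+"}.get(magic, "unknown")
    return info
```

An optional_header_size of zero is what you would expect from an object file, whereas an executable image reports a non-zero size and a format of PE32 or PE32+.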

How do I view it?

On a typical Linux desktop system, you can run programs like xxd, readelf or objdump on a binary to view the above information. These programs have many options, though, so it is easy to get overwhelmed by the output! Below is an example of running readelf against the program /bin/ls on an Ubuntu system. The -l option lists the program headers.

Figure: Output of readelf command with the -l flag
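If you are curious where readelf -l gets its data, the following sketch pulls the same program header table out of a file by hand. It is a simplification that only handles 64-bit little-endian ELF files, and the function name program_headers and the "R-E" style flag string are my own choices rather than readelf's output format. The field offsets and sizes are those of the ELF64 header and program header structures.

```python
import struct

# A few common p_type values from the ELF specification
PT_NAMES = {0: "NULL", 1: "LOAD", 2: "DYNAMIC", 3: "INTERP",
            4: "NOTE", 5: "SHLIB", 6: "PHDR", 7: "TLS"}


def program_headers(data: bytes) -> list:
    """Return (type, flags, offset, vaddr, filesz, memsz) per program header."""
    # Assumption: only ELFCLASS64 (byte 4 == 2), little-endian (byte 5 == 1)
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("expected a 64-bit little-endian ELF file")
    (e_phoff,) = struct.unpack_from("<Q", data, 32)          # table offset
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)
    headers = []
    for i in range(e_phnum):
        (p_type, p_flags, p_offset, p_vaddr, _p_paddr,
         p_filesz, p_memsz, _p_align) = struct.unpack_from(
            "<IIQQQQQQ", data, e_phoff + i * e_phentsize)
        # PF_R = 4, PF_W = 2, PF_X = 1
        flags = (("R" if p_flags & 4 else "-")
                 + ("W" if p_flags & 2 else "-")
                 + ("E" if p_flags & 1 else "-"))
        headers.append((PT_NAMES.get(p_type, hex(p_type)), flags,
                        p_offset, p_vaddr, p_filesz, p_memsz))
    return headers
```

Reading /bin/ls into a bytes object and passing it to program_headers should give entries that line up with the LOAD, DYNAMIC and INTERP rows that readelf prints, with OS-specific types shown as raw hexadecimal values.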

On a Windows desktop, you can install one of the many PE file browsers: PE Explorer, PEview, PEBrowse Professional and PEBrowse Professional Interactive, just to name a few. Below is a screenshot of PEview on one of the internal binaries in Windows, C:\Windows\System32\where.exe

Figure: PE file view from PEview

The Story of Memory

The above information is nice and shiny; however, it is hard to make sense of without first understanding how a binary actually gets loaded and executed by the operating system. A process is a binary in execution: it is created when the user runs the program and the binary is mapped into memory. Typically, processes are mapped into memory in some variation of the diagram below.

Figure: 32bit process in memory

On a 32-bit machine, a process has 2^32 bytes (4 GiB) of virtual address space, with addresses running from 0 to 2^32 − 1. On a 64-bit machine the theoretical limit is 2^64 bytes, with addresses up to 2^64 − 1. Virtual addressing is a concept that allows a process to ‘think’ it has more space than is actually available. Translation from virtual memory addresses to physical addresses is handled by the Memory Management Unit (MMU). Thus, for each binary loaded, the memory space always seems to start from 0x00000000. For the diagram above, the anatomy of the process is further broken down as follows:
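The address space arithmetic is easy to check for yourself. The sketch below computes the size and highest address for a given pointer width, and splits a virtual address into a page number and an offset, which is the first step of the translation the MMU performs. The helper names and the 4 KiB page size are assumptions made for illustration; real page sizes depend on the platform.

```python
GIB = 2**30
PAGE_SHIFT = 12  # assumption: 4 KiB pages (2**12 bytes)


def address_space(bits: int) -> tuple:
    """Return (size in bytes, highest address) for a `bits`-wide address."""
    size = 2**bits
    return size, size - 1


def split_address(vaddr: int) -> tuple:
    """Split a virtual address into (virtual page number, offset in page)."""
    return vaddr >> PAGE_SHIFT, vaddr & ((1 << PAGE_SHIFT) - 1)


size32, top32 = address_space(32)
print(size32 // GIB, hex(top32))  # 4 0xffffffff
```

The page number is what gets looked up in the page tables to find a physical frame, while the offset is carried over unchanged.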

  • text
    • Contains executable instructions and is read-only to prevent instructions from being modified. It is also shareable, so that only a single copy needs to be in memory for programs that are frequently executed
  • data
    • Further broken down into initialised (.data) and uninitialised (.bss) data
    • .data contains global or static variables which have a pre-defined value and can be modified
    • .bss contains all global variables and static variables that are initialised to zero or do not have explicit initialisation in source code
  • heap
    • Is memory set aside for dynamic allocation. In C and C++ this holds data created with the malloc function or the new operator
    • The heap grows upwards, towards higher addresses, as more dynamic memory is used
    • Shared by all threads, shared libraries and dynamically loaded modules called in a given process
  • stack
    • Contains data local to a function.
    • In contrast to the heap, the stack grows downwards: each new function call places its data at a lower address than the one before it
    • Data goes out of scope as soon as the function returns

Depending on the platform, there are many other sections and/or segments involved. The ones above, however, are common across most platforms.
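On Linux you can see this layout for a live process by reading /proc/<pid>/maps. The sketch below parses that text format and picks out the heap and stack regions. The function names are invented for this post, and it assumes a Linux system where the kernel labels those regions [heap] and [stack].

```python
def parse_maps(text: str) -> list:
    """Parse /proc/<pid>/maps text into (start, end, perms, name) tuples."""
    regions = []
    for line in text.splitlines():
        # Columns: address range, perms, offset, device, inode, optional path
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        name = fields[5].strip() if len(fields) > 5 else ""
        regions.append((start, end, fields[1], name))
    return regions


def named_regions(regions: list, names=("[heap]", "[stack]")) -> dict:
    """Pick out the regions whose label is in `names`."""
    return {name: (start, end)
            for start, end, _perms, name in regions if name in names}


def current_process_regions() -> list:
    """Linux only: the memory map of the running Python interpreter."""
    with open("/proc/self/maps") as f:
        return parse_maps(f.read())
```

Printing named_regions(current_process_regions()) should show the heap sitting at a lower address than the stack, matching the diagram. The lines with r-xp permissions correspond to text, and the rw-p lines backed by the same file correspond to data.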

Part 1 Summary

Now it should be somewhat clear how a binary gets to be in its usable form. Next, we look a little closer at how the various parts of the binary get manipulated. With that information in hand, we should be ready to tackle a typical binary. On to Part 2.
