Categories
Reverse Engineering

Getting started with Reverse Engineering binaries Part 2: A closer look

Part 1 introduced the basic idea of what a binary is and how it is mapped in memory. This part continues on that path, by looking at registers and how data is represented at the processor level. The following focuses on Intel architecture.

A Note On Registers

For a program to execute, the processor needs somewhere to hold the data and addresses it is working on at each step. This is where registers come in.

A register is the quickest available location for the CPU to store and retrieve information.

In 32-bit mode, Intel x86 processors provide 16 basic program execution registers for general system and application programming.

They are grouped into 4 categories:

General Purpose Registers

These are 8 registers for storing operands and pointers. Their special uses can be listed as:

  • eax – accumulator for operands and results data
  • ebx – pointer to data in the DS segment
  • ecx – counter for string and loop operations
  • edx – I/O pointer
  • esi – pointer to data in the segment pointed to by the DS register; source pointer for string operations
  • edi – pointer to data in the segment pointed to by the ES register; destination pointer for string operations
  • esp – stack pointer in the SS segment
  • ebp – pointer to data on the stack in the SS segment

Each of these 32-bit registers can also be addressed through its lower 16 bits (ax, bx, cx, dx, si, di, sp and bp). For eax, ebx, ecx and edx, the two bytes of that lower word can be addressed individually as well. For example, the lower 16 bits of eax are addressed as ax, the higher 8 bits of ax as ah and the lower 8 bits of ax as al.

Figure: General-purpose register alternate names

In 64-bit mode, there are 16 general purpose registers with a default operand size of 32 bits. These registers are able to work with either 32-bit or 64-bit operands.

In addition to the 8 listed above, the additional registers available in 64-bit mode are R8D, R9D, up to R15D. If a 64-bit operand size is specified, the general purpose registers are prefixed by R instead of E (eax becomes rax, and so on) and the additional registers are named R8 to R15. They can then be accessed at the byte, word, dword and qword level.

Segment Registers

The six segment registers each hold a 16-bit segment selector, a special pointer that identifies a segment in memory.

  • CS – Contains segment selector for the code segment, where the current instructions being executed are stored.
  • DS, ES, FS and GS – Point to four data segments. There are separate data segments to access different types of data structures in an efficient and secure manner. For example, there might be one segment for the data structures of the current module, another for data exported from a higher-level module and another for data shared with another program.
  • SS – Contains the segment selector for the stack segment, where the procedure stack of the currently executing program or task is stored. All stack operations use the SS register to find the stack segment.

Figure: Use of segment registers in memory

EFLAGS Register

In 32-bit mode, this register contains a group of status flags, a control flag, and a group of system flags. When the processor is initialised, the state of the EFLAGS register is 00000002H. There are no instructions that allow the whole register to be examined or modified directly.

Instead, some flags can be modified directly using special-purpose instructions. Some of the bits (1, 3, 5, 15, and 22 to 31) are reserved and their state should not be depended upon.

To move groups of flags to and from the procedure stack or the EAX register, the LAHF, SAHF, PUSHF, PUSHFD, POPF, and POPFD instructions can be used. Using the processor’s bit manipulation instructions (BT, BTS, BTR, BTC) the flags can be examined and modified after they have been moved to the EAX register or to the procedure stack.

When suspending a task, for example during multitasking, the processor automatically saves the state of the EFLAGS register in the task state segment (TSS) of that task.

When binding itself to a new task, the processor loads the EFLAGS register with data from the new task’s TSS. The state of this register is also saved automatically in the event of an interrupt or an exception.

Figure: EFLAGS Register

The flags register has the following categories for the various flags:

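  • Status Flags
    • indicate the results of arithmetic instructions
    • only the CF flag can be modified directly (using the STC, CLC and CMC instructions)
  • Control Flag (DF)
    • controls the direction in which string instructions process data
  • System Flags and IOPL field
    • control operating-system or executive operations
    • should not be modified by application programs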

The table below lists in detail the flags, their bit positions and a brief description of their functions.

Status Flags | Bit Position | Description
Carry Flag (CF) | 0 | Set if the arithmetic operation generates a carry or a borrow out of the most significant bit of the result; cleared otherwise. The flag indicates an overflow condition for unsigned-integer arithmetic.
Parity Flag (PF) | 2 | Set if the least-significant byte of the result contains an even number of 1 bits; cleared otherwise.
Auxiliary Carry Flag (AF) | 4 | Set if the arithmetic operation generates a carry or a borrow out of bit 3 of the result; cleared otherwise. Used in binary-coded decimal (BCD) arithmetic.
Zero Flag (ZF) | 6 | Set if the result is zero; cleared otherwise.
Sign Flag (SF) | 7 | Set equal to the most significant bit of the result, which is the sign bit of a signed integer. 0 indicates a positive value; 1 a negative value.
Overflow Flag (OF) | 11 | Set if the integer result is too large a positive number or too small a negative number (excluding the sign bit) to fit in the destination operand; cleared otherwise. The flag indicates an overflow condition for signed-integer arithmetic.

Control Flag | Bit Position | Description
Direction Flag (DF) | 10 | Controls string instructions (MOVS, CMPS, SCAS, LODS and STOS). When set, string instructions auto-decrement and process strings from high addresses to low addresses; when clear, they auto-increment. Set and cleared by STD and CLD respectively.

System Flags | Bit Position | Description
Trap Flag (TF) | 8 | Set to enable single-step mode for debugging; clear to disable single-step mode.
Interrupt Enable Flag (IF) | 9 | Controls the response of the processor to maskable interrupt requests. Set to respond to maskable interrupts; cleared to inhibit maskable interrupts.
I/O Privilege Level field (IOPL) | 12 and 13 | Indicates the I/O privilege level of the currently running program or task. The current privilege level (CPL) of the currently running program or task must be less than or equal to the I/O privilege level to access the I/O address space. The POPF and IRET instructions can modify this field only when operating at a CPL of 0.
Nested Task Flag (NT) | 14 | Controls the chaining of interrupted and called tasks. Set when the current task is linked to the previously executed task; cleared when the current task is not linked to another task.
Resume Flag (RF) | 16 | Controls the processor’s response to debug exceptions.
Virtual-8086 Mode Flag (VM) | 17 | Set to enable virtual-8086 mode; clear to return to protected mode without virtual-8086 mode semantics.
Alignment Check / Access Control Flag (AC) | 18 | If the AM bit is set in the CR0 register, alignment checking of user-mode data accesses is enabled if and only if this flag is 1.
Virtual Interrupt Flag (VIF) | 19 | Virtual image of the IF flag. Used in conjunction with the VIP flag.
Virtual Interrupt Pending Flag (VIP) | 20 | Set to indicate that an interrupt is pending; clear when no interrupt is pending. Set and cleared by software; the processor only reads it. Used in conjunction with the VIF flag.
Identification Flag (ID) | 21 | The ability of a program to set or clear this flag indicates support for the CPUID instruction. CPUID returns processor identification and feature information in the EAX, EBX, ECX and EDX registers. The instruction’s output depends on the contents of the EAX register upon execution.

EIP Register

Also known as the instruction pointer, the EIP register contains the offset in the current code segment for the next instruction to be executed.

It is advanced from one instruction boundary to the next in straight-line code or it is moved ahead or backwards by a number of instructions when executing JMP, Jcc, CALL, RET, and IRET instructions.

The EIP register cannot be accessed directly by software. It is controlled implicitly by control-transfer instructions (JMP, Jcc, CALL and RET), interrupts and exceptions.

The only way to read it is to execute a CALL instruction and then read the value of the return instruction pointer from the procedure stack.

Loading the register can only be done indirectly by modifying the value of a return instruction pointer on the procedure stack and executing a return instruction (RET or IRET).

In 64-bit Mode

The EFLAGS register is extended to 64 bits and is called RFLAGS. The upper 32 bits of RFLAGS are reserved. The lower 32 bits are the same as EFLAGS.

The instruction pointer register is RIP. This register holds the 64-bit offset of the next instruction to be executed.

Processor Instructions

Instructions on the Intel platform have the following format:

label: mnemonic argument1, argument2, argument3

  • label is an identifier followed by a colon
  • mnemonic is a reserved name for a class of instruction opcodes which have the same function
  • arguments 1, 2 and 3 are optional; there may be from zero to three operands, depending on the opcode
  • operands can be reserved names of registers or are assumed to be assigned to data items declared in another part of the program
  • when two operands are present in an arithmetic or logical instruction, the right operand is the source and the left operand is the destination, e.g.

loadreg: mov eax, subtotal

Here loadreg is the label, mov is the mnemonic identifier of an opcode, eax is the destination operand and subtotal is the source operand. In AT&T syntax, the source and destination appear in the reverse order.

Data Representation

Hexadecimal (base 16) numbers are written as a string of hexadecimal digits followed by the character H, for example 0F8EH. Binary (base 2) numbers are represented by a string of 1s and 0s, optionally followed by the character B. The B is added only where it is needed to avoid ambiguity about the type of number being represented.

The processor uses byte addressing, therefore the memory is organised and accessed as a sequence of bytes. The range of memory that can be addressed is referred to as an address space.

A program may have many independent address spaces, called segments; a processor that can handle such a program is said to support segmented addressing. For example, a program can keep its code (instructions) and stack in separate segments, as illustrated in the segment register section of this article. To specify a byte address within a segment, the following notation is used:

Segment-register:Byte-address

For example, to identify the byte at address FF79H in the segment pointed to by the DS register:

DS:FF79H

To identify an instruction address in the code segment:

CS:EIP

where the CS register points to the code segment and the EIP register contains the address of the instruction.

Summary

With this information, it is now less intimidating to read the output given by the various reverse engineering tools introduced in Part 1. Part 3 goes on to show how the process of loading and analysing a binary to understand its inner workings may be carried out.

Categories
C++ C++17

String Views

Starting in C++17, string view classes have been added to the standard library. Available with inclusion of the header <string_view>, string views introduce a way of referring to a constant contiguous sequence of character-like objects. The first element of any such sequence is at position zero.

As the name suggests, a string view provides a way to view a string without having to incur the overhead that normally comes with typical string usage. That is, a string view is only meant to point to an existing string.

The character storage belongs to the pointed-to string, not to the view.

Usage Example

Use std::string_view whenever you need to query a string without modifying it. For example:


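#include <cstddef>
#include <iostream>
#include <string>

// Taking a const std::string& means a temporary std::string is constructed
// when the caller passes a string literal.
auto get_string_length(const std::string &str) -> std::size_t
{
    return str.length();
}

int main()
{
    std::cout << "String length is: " << get_string_length("001239283954ABC");
    return 0;
}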

This should instead be written as:

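#include <cstddef>
#include <iostream>
#include <string_view>

// A std::string_view is cheap to copy, so it is passed by value; it refers
// to the caller's characters instead of copying them.
auto get_string_length(const std::string_view str) -> std::size_t
{
    return str.length();
}

int main()
{
    std::cout << "String length is: " << get_string_length("001239283954ABC");
    return 0;
}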

Note the difference in the function parameter of get_string_length().
The difference seems trivial, but taking a closer look shows one benefit of the string_view version:

Figure: Debug view of the function call using std::string

Notice that a temporary std::string is constructed, copying the characters and possibly allocating memory, when the parameter is a std::string.

Figure: Debug view of the function call using std::string_view

With std::string_view there is no allocation: the view only stores a pointer to the literal and its length, and the compiler can typically compute the length of a string literal at compile time.

Benefits

std::string_view supports most of the common read-only std::string operations, such as substr, find and compare, amongst others.

Most std::string_view operations are O(1). For example, substr on a view takes constant time because it only adjusts a pointer and a length, whereas std::string::substr is O(n) because it copies the characters.

In addition to the common char type, string views are also provided for the wchar_t, char16_t and char32_t types.

Caveats

One drawback of string_view is the question of lifetime: a string_view is only valid while the string it points to is still alive and unmodified.

This means that great care has to be taken to avoid ending up with a dangling pointer.

Categories
Reverse Engineering

Getting started with Reverse Engineering binaries Part 1: Introduction

Motivation

Ever wondered how that binary gets created when you finish compiling? I did too. So I decided to find out…

What is a binary anyway?

On the face of it, a binary is just a file that adheres to the format of whatever platform it is built for. On GNU/Linux for example, this would be the Executable and Linking Format (ELF). On the Microsoft Windows family of operating systems this would be the Portable Executable (PE) format for executable image files and the Common Object File Format (COFF) for object files.

Executable and Linking Format (ELF)

There are three main types of ELF file:

  1. Relocatable file
    • Has code and data suitable for linking with other object files to create an executable or a shared object file
  2. Shared object file 
    • Contains data and code used for processing by the link editor and the dynamic linker.
    • The link editor takes the file, and with other relocatable and shared object files creates another object file.
    • The dynamic linker combines the file with an executable file and other shared object files to create a process image.
  3. Executable file
    • Comprises the code and data intended for execution. The file specifies the creation of a program’s process image.

Figure: Side-by-side view of ELF files: linking view and execution view.

Portable Executable Format (PE)

The Portable Executable (PE) format was derived from the Common Object File Format (COFF), which itself was introduced with Unix System V. The optional part of the PE header (the Image Optional Header) differentiates between an executable and an object file.

Figure: Format of a PE file.

How do I view it?

On a typical Linux desktop system, you can run programs like xxd, readelf or objdump on a binary to view the above information. These programs have many options, though; it is easy to get overwhelmed by the output! Below is an example of running readelf against the program /bin/ls on an Ubuntu system. The -l option lists the program headers.

Figure: Output of the readelf command with the -l flag

On a Windows desktop, you can install one of the many PE file browsers: PE Explorer, PEview, PEBrowse Professional and PEBrowse Professional Interactive, just to name a few. Below is a screenshot of PEview on one of the internal binaries in Windows, C:\Windows\System32\where.exe

Figure: PE file view from PEview

The Story of Memory

The above information is nice and shiny, but it is hard to make sense of without first understanding how a binary actually gets read and executed by the operating system. A process, which is a binary in execution, is created when the user runs the program. Typically, processes are mapped into memory in some variation of the diagram below.

Figure: A 32-bit process in memory

On a 32-bit machine, a process has 2^32 bytes (4 GiB) of virtual address space, with addresses running from 0 to 2^32 − 1. On a 64-bit machine, this is in principle 2^64 bytes. Virtual addressing is a concept that allows a process to ‘think’ it has more space than is actually available. Translation from virtual memory addresses to physical addresses is handled by a Memory Management Unit. Thus each process sees its own address space, which always appears to start at 0x00000000. For the diagram above, the anatomy of the process is further broken down as follows:

  • text
    • Contains executable instructions and is read-only to avoid instruction modification. It is also sharable, so that only a single copy needs to be in memory for programs that are frequently executed
  • data
    • Further broken down into initialised (.data) and uninitialised (.bss) data
    • .data contains global or static variables which have a pre-defined value and can be modified
    • .bss contains all global variables and static variables that are initialised to zero or do not have explicit initialisation in source code
  • heap
    • Is memory set aside for dynamic allocation. In C and C++ this consists of data created with the malloc function or the new operator
    • The heap grows upwards, towards higher addresses, as more dynamic memory is used
    • Shared by all threads, shared libraries and dynamically loaded modules in a given process
  • stack
    • Contains data local to a function.
    • In contrast to the heap, the stack grows downwards; the more data is pushed onto the stack, the lower the address of the top of the stack
    • Data goes out of scope as soon as the function terminates

Depending on the platform, there are many other sections and/or segments involved. The ones above, however, are common across most platforms.

Part 1 Summary

Now it should be somewhat clear how a binary gets to be in its usable form. Next, we look a little closer at how the various parts of the binary get manipulated. With this information, it should then be possible to start tackling a typical binary. On to Part 2.