2008-05-10 22:04:42 -06:00
|
|
|
The Graphics Execution Manager
|
|
|
|
Part of the Direct Rendering Manager
|
|
|
|
==============================
|
|
|
|
|
|
|
|
Keith Packard <keithp@keithp.com>
|
|
|
|
Eric Anholt <eric@anholt.net>
|
|
|
|
2008-5-9
|
|
|
|
|
|
|
|
Contents:
|
|
|
|
|
|
|
|
1. GEM Overview
|
|
|
|
2. API overview and conventions
|
|
|
|
3. Object Creation/Destruction
|
|
|
|
4. Reading/writing contents
|
|
|
|
5. Mapping objects to userspace
|
|
|
|
6. Memory Domains
|
|
|
|
7. Execution (Intel specific)
|
|
|
|
8. Other misc Intel-specific functions
|
|
|
|
|
|
|
|
1. Graphics Execution Manager Overview
|
|
|
|
|
|
|
|
Gem is designed to manage graphics memory, control access to the graphics
|
|
|
|
device execution context and handle the essentially NUMA environment unique
|
|
|
|
to modern graphics hardware. Gem allows multiple applications to share
|
|
|
|
graphics device resources without the need to constantly reload the entire
|
|
|
|
graphics card. Data may be shared between multiple applications with gem
|
|
|
|
ensuring that the correct memory synchronization occurs.
|
|
|
|
|
|
|
|
Graphics data can consume arbitrary amounts of memory, with 3D applications
|
|
|
|
constructing ever larger sets of textures and vertices. With graphics cards
|
|
|
|
memory space growing larger every year, and graphics APIs growing more
|
|
|
|
complex, we can no longer insist that each application save a complete copy
|
|
|
|
of their graphics state so that the card can be re-initialized from user
|
|
|
|
space at each context switch. Ensuring that graphics data remains persistent
|
|
|
|
across context switches allows applications significant new functionality
|
|
|
|
while also improving performance for existing APIs.
|
|
|
|
|
|
|
|
Modern linux desktops include significant 3D rendering as a fundemental
|
|
|
|
component of the desktop image construction process. 2D and 3D applications
|
|
|
|
paint their content to offscreen storage and the central 'compositing
|
|
|
|
manager' constructs the final screen image from those window contents. This
|
|
|
|
means that pixel image data from these applications must move within reach
|
|
|
|
of the compositing manager and used as source operands for screen image
|
|
|
|
rendering operations.
|
|
|
|
|
|
|
|
Gem provides simple mechanisms to manage graphics data and control execution
|
|
|
|
flow within the linux operating system. Using many existing kernel
|
|
|
|
subsystems, it does this with a modest amount of code.
|
|
|
|
|
|
|
|
2. API Overview and Conventions
|
|
|
|
|
|
|
|
All APIs here are defined in terms of ioctls appplied to the DRM file
|
2008-05-12 13:55:36 -06:00
|
|
|
descriptor. To create and manipulate objects, an application must be
|
2008-05-10 22:04:42 -06:00
|
|
|
'authorized' using the DRI or DRI2 protocols with the X server. To relax
|
|
|
|
that, we will need to implement some better access control mechanisms within
|
|
|
|
the hardware portion of the driver to prevent inappropriate
|
|
|
|
cross-application data access.
|
|
|
|
|
|
|
|
Any DRM driver which does not support GEM will return -ENODEV for all of
|
|
|
|
these ioctls. Invalid object handles return -EINVAL. Invalid object names
|
|
|
|
return -ENOENT. Other errors are as documented in the specific API below.
|
|
|
|
|
|
|
|
To avoid the need to translate ioctl contents on mixed-size systems (with
|
|
|
|
32-bit user space running on a 64-bit kernel), the ioctl data structures
|
|
|
|
contain explicitly sized objects, using 64-bits for all size and pointer
|
|
|
|
data and 32-bits for identifiers. In addition, the 64-bit objects are all
|
|
|
|
carefully aligned on 64-bit boundaries. Because of this, all pointers in the
|
|
|
|
ioctl data structures are passed as uint64_t values. Suitable casts will
|
|
|
|
be necessary.
|
|
|
|
|
|
|
|
One significant operation which is explicitly left out of this API is object
|
|
|
|
locking. Applications are expected to perform locking of shared objects
|
|
|
|
outside of the GEM api. This kind of locking is not necessary to safely
|
|
|
|
manipulate the graphics engine, and with multiple objects interacting in
|
|
|
|
unknown ways, per-object locking would likely introduce all kinds of
|
|
|
|
lock-order issues. Punting this to the application seems like the only
|
|
|
|
sensible plan. Given that DRM already offers a global lock on the hardware,
|
|
|
|
this doesn't change the current situation.
|
|
|
|
|
|
|
|
3. Object Creation and Destruction
|
|
|
|
|
|
|
|
Gem provides explicit memory management primitives. System pages are
|
|
|
|
allocated when the object is created, either as the fundemental storage for
|
|
|
|
hardware where system memory is used by the graphics processor directly, or
|
|
|
|
as backing store for graphics-processor resident memory.
|
|
|
|
|
|
|
|
Objects are referenced from user space using handles. These are, for all
|
|
|
|
intents and purposes, equivalent to file descriptors. We could simply use
|
|
|
|
file descriptors were it not for the small limit (1024) of file descriptors
|
|
|
|
available to applications, and for the fact that the X server (a rather
|
|
|
|
significant user of this API) uses 'select' and has a limited maximum file
|
|
|
|
descriptor for that operation. Given the ability to allocate more file
|
|
|
|
descriptors, and given the ability to place these 'higher' in the file
|
|
|
|
descriptor space, we'd love to simply use file descriptors.
|
|
|
|
|
|
|
|
Objects may be published with a name so that other applications can access
|
|
|
|
them. The name remains valid as long as the object exists. Right now, our
|
|
|
|
DRI APIs use 32-bit integer names, so that's what we expose here
|
|
|
|
|
|
|
|
A. Creation
|
|
|
|
|
|
|
|
struct drm_gem_create {
|
|
|
|
/**
|
|
|
|
* Requested size for the object.
|
|
|
|
*
|
|
|
|
* The (page-aligned) allocated size for the object
|
|
|
|
* will be returned.
|
|
|
|
*/
|
|
|
|
uint64_t size;
|
|
|
|
/**
|
|
|
|
* Returned handle for the object.
|
|
|
|
*
|
|
|
|
* Object handles are nonzero.
|
|
|
|
*/
|
|
|
|
uint32_t handle;
|
|
|
|
uint32_t pad;
|
|
|
|
};
|
|
|
|
|
|
|
|
/* usage */
|
|
|
|
create.size = 16384;
|
|
|
|
ret = ioctl (fd, DRM_IOCTL_GEM_CREATE, &create);
|
|
|
|
if (ret == 0)
|
|
|
|
return create.handle;
|
|
|
|
|
|
|
|
Note that the size is rounded up to a page boundary, and that
|
|
|
|
the rounded-up size is returned in 'size'. No name is assigned to
|
|
|
|
this object, making it local to this process.
|
|
|
|
|
|
|
|
If insufficient memory is availabe, -ENOMEM will be returned.
|
|
|
|
|
|
|
|
B. Closing
|
|
|
|
|
|
|
|
struct drm_gem_close {
|
|
|
|
/** Handle of the object to be closed. */
|
|
|
|
uint32_t handle;
|
|
|
|
uint32_t pad;
|
|
|
|
};
|
|
|
|
|
|
|
|
|
|
|
|
/* usage */
|
|
|
|
close.handle = <handle>;
|
|
|
|
ret = ioctl (fd, DRM_IOCTL_GEM_CLOSE, &close);
|
|
|
|
|
|
|
|
This call makes the specified handle invalid, and if no other
|
|
|
|
applications are using the object, any necessary graphics hardware
|
|
|
|
synchronization is performed and the resources used by the object
|
|
|
|
released.
|
|
|
|
|
|
|
|
C. Naming
|
|
|
|
|
|
|
|
struct drm_gem_flink {
|
|
|
|
/** Handle for the object being named */
|
|
|
|
uint32_t handle;
|
|
|
|
|
|
|
|
/** Returned global name */
|
|
|
|
uint32_t name;
|
|
|
|
};
|
|
|
|
|
|
|
|
/* usage */
|
|
|
|
flink.handle = <handle>;
|
|
|
|
ret = ioctl (fd, DRM_IOCTL_GEM_FLINK, &flink);
|
|
|
|
if (ret == 0)
|
|
|
|
return flink.name;
|
|
|
|
|
|
|
|
Flink creates a name for the object and returns it to the
|
|
|
|
application. This name can be used by other applications to gain
|
|
|
|
access to the same object.
|
|
|
|
|
|
|
|
D. Opening by name
|
|
|
|
|
|
|
|
struct drm_gem_open {
|
|
|
|
/** Name of object being opened */
|
|
|
|
uint32_t name;
|
|
|
|
|
|
|
|
/** Returned handle for the object */
|
|
|
|
uint32_t handle;
|
|
|
|
|
|
|
|
/** Returned size of the object */
|
|
|
|
uint64_t size;
|
|
|
|
};
|
|
|
|
|
|
|
|
/* usage */
|
|
|
|
open.name = <name>;
|
|
|
|
ret = ioctl (fd, DRM_IOCTL_GEM_OPEN, &open);
|
|
|
|
if (ret == 0) {
|
|
|
|
*sizep = open.size;
|
|
|
|
return open.handle;
|
|
|
|
}
|
|
|
|
|
|
|
|
Open accesses an existing object and returns a handle for it. If the
|
|
|
|
object doesn't exist, -ENOENT is returned. The size of the object is
|
|
|
|
also returned. This handle has all the same capabilities as the
|
|
|
|
handle used to create the object. In particular, the object is not
|
|
|
|
destroyed until all handles are closed.
|
|
|
|
|
|
|
|
4. Basic read/write operations
|
|
|
|
|
|
|
|
By default, gem objects are not mapped to the applications address space,
|
|
|
|
getting data in and out of them is done with I/O operations instead. This
|
|
|
|
allows the data to reside in otherwise unmapped pages, including pages in
|
|
|
|
video memory on an attached discrete graphics card. In addition, using
|
|
|
|
explicit I/O operations allows better control over cache contents, as
|
|
|
|
graphics devices are generally not cache coherent with the CPU, mapping
|
|
|
|
pages used for graphics into an application address space requires the use
|
|
|
|
of expensive cache flushing operations. Providing direct control over
|
|
|
|
graphics data access ensures that data are handled in the most efficient
|
|
|
|
possible fashion.
|
|
|
|
|
|
|
|
A. Reading
|
|
|
|
|
|
|
|
struct drm_gem_pread {
|
|
|
|
/** Handle for the object being read. */
|
|
|
|
uint32_t handle;
|
|
|
|
uint32_t pad;
|
|
|
|
/** Offset into the object to read from */
|
|
|
|
uint64_t offset;
|
|
|
|
/** Length of data to read */
|
|
|
|
uint64_t size;
|
|
|
|
/** Pointer to write the data into. */
|
|
|
|
uint64_t data_ptr; /* void * */
|
|
|
|
};
|
|
|
|
|
|
|
|
This copies data into the specified object at the specified
|
|
|
|
position. Any necessary graphics device synchronization and
|
|
|
|
flushing will be done automatically.
|
|
|
|
|
|
|
|
struct drm_gem_pwrite {
|
|
|
|
/** Handle for the object being written to. */
|
|
|
|
uint32_t handle;
|
|
|
|
uint32_t pad;
|
|
|
|
/** Offset into the object to write to */
|
|
|
|
uint64_t offset;
|
|
|
|
/** Length of data to write */
|
|
|
|
uint64_t size;
|
|
|
|
/** Pointer to read the data from. */
|
|
|
|
uint64_t data_ptr; /* void * */
|
|
|
|
};
|
|
|
|
|
|
|
|
This copies data out of the specified object into the
|
|
|
|
waiting user memory. Again, device synchronization will
|
|
|
|
be handled by the kernel to ensure user space sees a
|
|
|
|
consistent view of the graphics device.
|
|
|
|
|
|
|
|
5. Mapping objects to user space
|
|
|
|
|
|
|
|
For most objects, reading/writing is the preferred interaction mode.
|
|
|
|
However, when the CPU is involved in rendering to cover deficiencies in
|
|
|
|
hardware support for particular operations, the CPU will want to directly
|
|
|
|
access the relevant objects.
|
|
|
|
|
|
|
|
Because mmap is fairly heavyweight, we allow applications to retain maps to
|
|
|
|
objects persistently and then update how they're using the memory through a
|
|
|
|
separate interface. Applications which fail to use this separate interface
|
|
|
|
may exhibit unpredictable behaviour as memory consistency will not be
|
|
|
|
preserved.
|
|
|
|
|
|
|
|
A. Mapping
|
|
|
|
|
|
|
|
struct drm_gem_mmap {
|
|
|
|
/** Handle for the object being mapped. */
|
|
|
|
uint32_t handle;
|
|
|
|
uint32_t pad;
|
|
|
|
/** Offset in the object to map. */
|
|
|
|
uint64_t offset;
|
|
|
|
/**
|
|
|
|
* Length of data to map.
|
|
|
|
*
|
|
|
|
* The value will be page-aligned.
|
|
|
|
*/
|
|
|
|
uint64_t size;
|
|
|
|
/** Returned pointer the data was mapped at */
|
|
|
|
uint64_t addr_ptr; /* void * */
|
|
|
|
};
|
|
|
|
|
|
|
|
/* usage */
|
|
|
|
mmap.handle = <handle>;
|
|
|
|
mmap.offset = <offset>;
|
|
|
|
mmap.size = <size>;
|
|
|
|
ret = ioctl (fd, DRM_IOCTL_GEM_MMAP, &mmap);
|
|
|
|
if (ret == 0)
|
|
|
|
return (void *) (uintptr_t) mmap.addr_ptr;
|
|
|
|
|
|
|
|
|
|
|
|
B. Unmapping
|
|
|
|
|
|
|
|
munmap (addr, length);
|
|
|
|
|
|
|
|
Nothing strange here, just use the normal munmap syscall.
|
|
|
|
|
|
|
|
6. Memory Domains
|
|
|
|
|
|
|
|
Graphics devices remain a strong bastion of non cache-coherent memory. As a
|
|
|
|
result, accessing data through one functional unit will end up loading that
|
|
|
|
cache with data which then needs to be manually synchronized when that data
|
|
|
|
is used with another functional unit.
|
|
|
|
|
|
|
|
Tracking where data are resident is done by identifying how functional units
|
|
|
|
deal with caches. Each cache is labeled as a separate memory domain. Then,
|
|
|
|
each sequence of operations is expected to load data into various read
|
|
|
|
domains and leave data in at most one write domain. Gem tracks the read and
|
|
|
|
write memory domains of each object and performs the necessary
|
|
|
|
synchronization operations when objects move from one domain set to another.
|
|
|
|
|
|
|
|
For example, if operation 'A' constructs an image that is immediately used
|
|
|
|
by operation 'B', then when the read domain for 'B' is not the same as the
|
|
|
|
write domain for 'A', then the write domain must be flushed, and the read
|
|
|
|
domain invalidated. If these two operations are both executed in the same
|
|
|
|
command queue, then the flush operation can go inbetween them in the same
|
|
|
|
queue, avoiding any kind of CPU-based synchronization and leaving the GPU to
|
|
|
|
do the work itself.
|
|
|
|
|
|
|
|
6.1 Memory Domains (GPU-independent)
|
|
|
|
|
|
|
|
* DRM_GEM_DOMAIN_CPU.
|
|
|
|
|
|
|
|
Objects in this domain are using caches which are connected to the CPU.
|
|
|
|
Moving objects from non-CPU domains into the CPU domain can involve waiting
|
|
|
|
for the GPU to finish with operations using this object. Moving objects
|
|
|
|
from this domain to a GPU domain can involve flushing CPU caches and chipset
|
|
|
|
buffers.
|
|
|
|
|
|
|
|
6.1 GPU-independent memory domain ioctl
|
|
|
|
|
|
|
|
This ioctl is independent of the GPU in use. So far, no use other than
|
|
|
|
synchronizing objects to the CPU domain have been found; if that turns out
|
|
|
|
to be generally true, this ioctl may be simplified further.
|
|
|
|
|
|
|
|
A. Explicit domain control
|
|
|
|
|
|
|
|
struct drm_gem_set_domain {
|
|
|
|
/** Handle for the object */
|
|
|
|
uint32_t handle;
|
|
|
|
|
|
|
|
/** New read domains */
|
|
|
|
uint32_t read_domains;
|
|
|
|
|
|
|
|
/** New write domain */
|
|
|
|
uint32_t write_domain;
|
|
|
|
};
|
|
|
|
|
|
|
|
/* usage */
|
|
|
|
set_domain.handle = <handle>;
|
|
|
|
set_domain.read_domains = <read_domains>;
|
|
|
|
set_domain.write_domain = <write_domain>;
|
|
|
|
ret = ioctl (fd, DRM_IOCTL_GEM_SET_DOMAIN, &set_domain);
|
|
|
|
|
|
|
|
When the application wants to explicitly manage memory domains for
|
|
|
|
an object, it can use this function. Usually, this is only used
|
|
|
|
when the application wants to synchronize object contents between
|
|
|
|
the GPU and CPU-based application rendering. In that case,
|
|
|
|
the <read_domains> would be set to DRM_GEM_DOMAIN_CPU, and if the
|
|
|
|
application were going to write to the object, the <write_domain>
|
|
|
|
would also be set to DRM_GEM_DOMAIN_CPU. After the call, gem
|
|
|
|
guarantees that all previous rendering operations involving this
|
|
|
|
object are complete. The application is then free to access the
|
|
|
|
object through the address returned by the mmap call. Afterwards,
|
|
|
|
when the application again uses the object through the GPU, any
|
|
|
|
necessary CPU flushing will occur and the object will be correctly
|
|
|
|
synchronized with the GPU.
|
|
|
|
|
2008-05-12 14:00:55 -06:00
|
|
|
Note that this synchronization is not required for any accesses
|
|
|
|
going through the driver itself. The pread, pwrite and execbuffer
|
|
|
|
ioctls all perform the necessary domain management internally.
|
|
|
|
Explicit synchronization is only necessary when accessing the object
|
|
|
|
through the mmap'd address.
|
|
|
|
|
2008-05-10 22:04:42 -06:00
|
|
|
7. Execution (Intel specific)
|
|
|
|
|
|
|
|
Managing the command buffers is inherently chip-specific, so the core of gem
|
|
|
|
doesn't have any intrinsic functions. Rather, execution is left to the
|
|
|
|
device-specific portions of the driver.
|
|
|
|
|
|
|
|
The Intel DRM_I915_GEM_EXECBUFFER ioctl takes a list of gem objects, all of
|
|
|
|
which are mapped to the graphics device. The last object in the list is the
|
|
|
|
command buffer.
|
|
|
|
|
|
|
|
7.1. Relocations
|
|
|
|
|
|
|
|
Command buffers often refer to other objects, and to allow the kernel driver
|
|
|
|
to move objects around, a sequence of relocations is associated with each
|
|
|
|
object. Device-specific relocation operations are used to place the
|
|
|
|
target-object relative value into the object.
|
|
|
|
|
|
|
|
The Intel driver has a single relocation type:
|
|
|
|
|
|
|
|
struct drm_i915_gem_relocation_entry {
|
2008-05-12 13:55:36 -06:00
|
|
|
/**
|
2008-05-10 22:04:42 -06:00
|
|
|
* Handle of the buffer being pointed to by this
|
|
|
|
* relocation entry.
|
2008-05-12 13:55:36 -06:00
|
|
|
*
|
2008-05-10 22:04:42 -06:00
|
|
|
* It's appealing to make this be an index into the
|
2008-05-12 13:55:36 -06:00
|
|
|
* mm_validate_entry list to refer to the buffer,
|
|
|
|
* but this allows the driver to create a relocation
|
|
|
|
* list for state buffers and not re-write it per
|
|
|
|
* exec using the buffer.
|
2008-05-10 22:04:42 -06:00
|
|
|
*/
|
|
|
|
uint32_t target_handle;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Value to be added to the offset of the target
|
|
|
|
* buffer to make up the relocation entry.
|
|
|
|
*/
|
|
|
|
uint32_t delta;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Offset in the buffer the relocation entry will be
|
|
|
|
* written into
|
|
|
|
*/
|
|
|
|
uint64_t offset;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Offset value of the target buffer that the
|
|
|
|
* relocation entry was last written as.
|
|
|
|
*
|
|
|
|
* If the buffer has the same offset as last time, we
|
|
|
|
* can skip syncing and writing the relocation. This
|
|
|
|
* value is written back out by the execbuffer ioctl
|
|
|
|
* when the relocation is written.
|
|
|
|
*/
|
|
|
|
uint64_t presumed_offset;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Target memory domains read by this operation.
|
|
|
|
*/
|
|
|
|
uint32_t read_domains;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Target memory domains written by this operation.
|
|
|
|
*
|
|
|
|
* Note that only one domain may be written by the
|
|
|
|
* whole execbuffer operation, so that where there are
|
|
|
|
* conflicts, the application will get -EINVAL back.
|
|
|
|
*/
|
|
|
|
uint32_t write_domain;
|
|
|
|
};
|
|
|
|
|
|
|
|
'target_handle', the handle to the target object. This object must
|
|
|
|
be one of the objects listed in the execbuffer request or
|
|
|
|
bad things will happen. The kernel doesn't check for this.
|
|
|
|
|
|
|
|
'offset' is where, in the source object, the relocation data
|
|
|
|
are written. Each relocation value is a 32-bit value consisting
|
|
|
|
of the location of the target object in the GPU memory space plus
|
|
|
|
the 'delta' value included in the relocation.
|
|
|
|
|
|
|
|
'presumed_offset' is where user-space believes the target object
|
|
|
|
lies in GPU memory space. If this value matches where the object
|
|
|
|
actually is, then no relocation data are written, the kernel
|
|
|
|
assumes that user space has set up data in the source object
|
|
|
|
using this presumption. This offers a fairly important optimization
|
|
|
|
as writing relocation data requires mapping of the source object
|
|
|
|
into the kernel memory space.
|
|
|
|
|
|
|
|
'read_domains' and 'write_domains' list the usage by the source
|
|
|
|
object of the target object. The kernel unions all of the domain
|
|
|
|
information from all relocations in the execbuffer request. No more
|
|
|
|
than one write_domain is allowed, otherwise an EINVAL error is
|
|
|
|
returned. read_domains must contain write_domain. This domain
|
|
|
|
information is used to synchronize buffer contents as described
|
|
|
|
above in the section on domains.
|
|
|
|
|
|
|
|
7.1.1 Memory Domains (Intel specific)
|
|
|
|
|
|
|
|
The Intel GPU has several internal caches which are not coherent and hence
|
|
|
|
require explicit synchronization. Memory domains provide the necessary data
|
|
|
|
to synchronize what is needed while leaving other cache contents intact.
|
|
|
|
|
|
|
|
* DRM_GEM_DOMAIN_I915_RENDER.
|
|
|
|
The GPU 3D and 2D rendering operations use a unified rendering cache, so
|
|
|
|
operations doing 3D painting and 2D blts will use this domain
|
|
|
|
|
|
|
|
* DRM_GEM_DOMAIN_I915_SAMPLER
|
|
|
|
Textures are loaded by the sampler through a separate cache, so
|
|
|
|
any texture reading will use this domain. Note that the sampler
|
|
|
|
and renderer use different caches, so moving an object from render target
|
|
|
|
to texture source will require a domain transfer.
|
|
|
|
|
|
|
|
* DRM_GEM_DOMAIN_I915_COMMAND
|
|
|
|
The command buffer doesn't have an explicit cache (although it does
|
|
|
|
read ahead quite a bit), so this domain just indicates that the object
|
|
|
|
needs to be flushed to the GPU.
|
|
|
|
|
|
|
|
* DRM_GEM_DOMAIN_I915_INSTRUCTION
|
2008-05-12 14:00:55 -06:00
|
|
|
All of the programs on Gen4 and later chips use an instruction cache to
|
|
|
|
speed program execution. It must be explicitly flushed when new programs
|
|
|
|
are written to memory by the CPU.
|
2008-05-10 22:04:42 -06:00
|
|
|
|
|
|
|
* DRM_GEM_DOMAIN_I915_VERTEX
|
|
|
|
Vertex data uses two different vertex caches, but they're
|
|
|
|
both flushed with the same instruction.
|
|
|
|
|
|
|
|
7.2 Execution object list (Intel specific)
|
|
|
|
|
|
|
|
struct drm_i915_gem_exec_object {
|
|
|
|
/**
|
|
|
|
* User's handle for a buffer to be bound into the GTT
|
|
|
|
* for this operation.
|
|
|
|
*/
|
|
|
|
uint32_t handle;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* List of relocations to be performed on this buffer
|
|
|
|
*/
|
|
|
|
uint32_t relocation_count;
|
|
|
|
/* struct drm_i915_gem_relocation_entry *relocs */
|
|
|
|
uint64_t relocs_ptr;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Required alignment in graphics aperture
|
|
|
|
*/
|
|
|
|
uint64_t alignment;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Returned value of the updated offset of the object,
|
|
|
|
* for future presumed_offset writes.
|
|
|
|
*/
|
|
|
|
uint64_t offset;
|
|
|
|
};
|
|
|
|
|
|
|
|
Each object involved in a particular execution operation must be
|
|
|
|
listed using one of these structures.
|
|
|
|
|
|
|
|
'handle' references the object.
|
|
|
|
|
|
|
|
'relocs_ptr' is a user-mode pointer to a array of 'relocation_count'
|
|
|
|
drm_i915_gem_relocation_entry structs (see above) that
|
|
|
|
define the relocations necessary in this buffer. Note that all
|
|
|
|
relocations must reference other exec_object structures in the same
|
|
|
|
execbuffer ioctl and that those other buffers must come earlier in
|
|
|
|
the exec_object array. In other words, the dependencies mapped by the
|
|
|
|
exec_object relocations must form a directed acyclic graph.
|
|
|
|
|
|
|
|
'alignment' is the byte alignment necessary for this buffer. Each
|
|
|
|
object has specific alignment requirements, as the kernel doesn't
|
|
|
|
know what each object is being used for, those requirements must be
|
|
|
|
provided by user mode. If an object is used in two different ways,
|
|
|
|
it's quite possible that the alignment requirements will differ.
|
|
|
|
|
|
|
|
'offset' is a return value, receiving the location of the object
|
|
|
|
during this execbuffer operation. The application should use this
|
|
|
|
as the presumed offset in future operations; if the object does not
|
|
|
|
move, then kernel need not write relocation data.
|
|
|
|
|
|
|
|
7.3 Execbuffer ioctl (Intel specific)
|
|
|
|
|
|
|
|
struct drm_i915_gem_execbuffer {
|
|
|
|
/**
|
2008-05-12 13:55:36 -06:00
|
|
|
* List of buffers to be validated with their
|
2008-05-10 22:04:42 -06:00
|
|
|
* relocations to be performend on them.
|
|
|
|
*
|
|
|
|
* These buffers must be listed in an order such that
|
|
|
|
* all relocations a buffer is performing refer to
|
|
|
|
* buffers that have already appeared in the validate
|
|
|
|
* list.
|
|
|
|
*/
|
|
|
|
/* struct drm_i915_gem_validate_entry *buffers */
|
|
|
|
uint64_t buffers_ptr;
|
|
|
|
uint32_t buffer_count;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Offset in the batchbuffer to start execution from.
|
|
|
|
*/
|
|
|
|
uint32_t batch_start_offset;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* Bytes used in batchbuffer from batch_start_offset
|
|
|
|
*/
|
|
|
|
uint32_t batch_len;
|
|
|
|
uint32_t DR1;
|
|
|
|
uint32_t DR4;
|
|
|
|
uint32_t num_cliprects;
|
|
|
|
uint64_t cliprects_ptr; /* struct drm_clip_rect *cliprects */
|
|
|
|
};
|
|
|
|
|
|
|
|
|
|
|
|
'buffers_ptr' is a user-mode pointer to an array of 'buffer_count'
|
|
|
|
drm_i915_gem_exec_object structures which contains the complete set
|
|
|
|
of objects required for this execbuffer operation. The last entry in
|
|
|
|
this array, the 'batch buffer', is the buffer of commands which will
|
|
|
|
be linked to the ring and executed.
|
|
|
|
|
|
|
|
'batch_start_offset' is the byte offset within the batch buffer which
|
|
|
|
contains the first command to execute. So far, we haven't found a
|
|
|
|
reason to use anything other than '0' here, but the thought was that
|
|
|
|
some space might be allocated for additional initialization which
|
|
|
|
could be skipped in some cases. This must be a multiple of 4.
|
|
|
|
|
|
|
|
'batch_len' is the length, in bytes, of the data to be executed
|
|
|
|
(i.e., the amount of data after batch_start_offset). This must
|
|
|
|
be a multiple of 4.
|
|
|
|
|
|
|
|
'num_cliprects' and 'cliprects_ptr' reference an array of
|
|
|
|
drm_clip_rect structures that is num_cliprects long. The entire
|
|
|
|
batch buffer will be executed multiple times, once for each
|
|
|
|
rectangle in this list. If num_cliprects is 0, then no clipping
|
|
|
|
rectangle will be set.
|
|
|
|
|
|
|
|
'DR1' and 'DR4' are portions of the 3DSTATE_DRAWING_RECTANGLE
|
|
|
|
command which will be queued when this operation is clipped
|
|
|
|
(num_cliprects != 0).
|
|
|
|
|
|
|
|
DR1 bit definition
|
|
|
|
31 Fast Scissor Clip Disable (debug only).
|
|
|
|
Disables a hardware optimization that
|
|
|
|
improves performance. This should have
|
|
|
|
no visible effect, other than reducing
|
|
|
|
performance
|
|
|
|
|
|
|
|
30 Depth Buffer Coordinate Offset Disable.
|
|
|
|
This disables the addition of the
|
|
|
|
depth buffer offset bits which are used
|
|
|
|
to change the location of the depth buffer
|
|
|
|
relative to the front buffer.
|
|
|
|
|
|
|
|
27:26 X Dither Offset. Specifies the X pixel
|
|
|
|
offset to use when accessing the dither table
|
|
|
|
|
|
|
|
25:24 Y Dither Offset. Specifies the Y pixel
|
|
|
|
offset to use when accessing the dither
|
|
|
|
table.
|
|
|
|
|
|
|
|
DR4 bit definition
|
|
|
|
31:16 Drawing Rectangle Origin Y. Specifies the Y
|
|
|
|
origin of coordinates relative to the
|
|
|
|
draw buffer.
|
|
|
|
|
|
|
|
15:0 Drawing Rectangle Origin X. Specifies the X
|
|
|
|
origin of coordinates relative to the
|
|
|
|
draw buffer.
|
|
|
|
|
|
|
|
As you can see, these two fields are necessary for correctly
|
|
|
|
offsetting drawing within a buffer which contains multiple surfaces.
|
|
|
|
Note that DR1 is only used on Gen3 and earlier hardware and that
|
|
|
|
newer hardware sticks the dither offset elsewhere.
|
|
|
|
|
|
|
|
7.3.1 Detailed Execution Description
|
|
|
|
|
|
|
|
Execution of a single batch buffer requires several preparatory
|
|
|
|
steps to make the objects visible to the graphics engine and resolve
|
|
|
|
relocations to account for their current addresses.
|
|
|
|
|
|
|
|
A. Mapping and Relocation
|
|
|
|
|
|
|
|
Each exec_object structure in the array is examined in turn.
|
|
|
|
|
|
|
|
If the object is not already bound to the GTT, it is assigned a
|
|
|
|
location in the graphics address space. If no space is available in
|
|
|
|
the GTT, some other object will be evicted. This may require waiting
|
|
|
|
for previous execbuffer requests to complete before that object can
|
|
|
|
be unmapped. With the location assigned, the pages for the object
|
|
|
|
are pinned in memory using find_or_create_page and the GTT entries
|
|
|
|
updated to point at the relevant pages using drm_agp_bind_pages.
|
|
|
|
|
|
|
|
Then the array of relocations is traversed. Each relocation record
|
|
|
|
looks up the target object and, if the presumed offset does not
|
|
|
|
match the current offset (remember that this buffer has already been
|
|
|
|
assigned an address as it must have been mapped earlier), the
|
|
|
|
relocation value is computed using the current offset. If the
|
|
|
|
object is currently in use by the graphics engine, writing the data
|
|
|
|
out must be preceeded by a delay while the object is still busy.
|
|
|
|
Once it is idle, then the page containing the relocation is mapped
|
|
|
|
by the CPU and the updated relocation data written out.
|
|
|
|
|
|
|
|
The read_domains and write_domain entries in each relocation are
|
|
|
|
used to compute the new read_domains and write_domain values for the
|
|
|
|
target buffers. The actual execution of the domain changes must wait
|
|
|
|
until all of the exec_object entries have been evaluated as the
|
|
|
|
complete set of domain information will not be available until then.
|
|
|
|
|
|
|
|
B. Memory Domain Resolution
|
|
|
|
|
|
|
|
After all of the new memory domain data has been pulled out of the
|
|
|
|
relocations and computed for each object, the list of objects is
|
|
|
|
again traversed and the new memory domains compared against the
|
|
|
|
current memory domains. There are two basic operations involved here:
|
|
|
|
|
|
|
|
* Flushing the current write domain. If the new read domains
|
|
|
|
are not equal to the current write domain, then the current
|
|
|
|
write domain must be flushed. Otherwise, reads will not see data
|
|
|
|
present in the write domain cache. In addition, any new read domains
|
|
|
|
other than the current write domain must be invalidated to ensure
|
|
|
|
that the flushed data are re-read into their caches.
|
|
|
|
|
|
|
|
* Invaliding new read domains. Any domains which were not currently
|
|
|
|
used for this object must be invalidated as old objects which
|
|
|
|
were mapped at the same location may have stale data in the new
|
|
|
|
domain caches.
|
|
|
|
|
|
|
|
If the CPU cache is being invalidated and some GPU cache is being
|
|
|
|
flushed, then we'll have to wait for rendering to complete so that
|
|
|
|
any pending GPU writes will be complete before we flush the GPU
|
|
|
|
cache.
|
|
|
|
|
|
|
|
If the CPU cache is being flushed, then we use 'clflush' to get data
|
|
|
|
written from the CPU.
|
|
|
|
|
|
|
|
Because the GPU caches cannot be partially flushed or invalidated,
|
|
|
|
we don't actually flush them during this traversal stage. Rather, we
|
|
|
|
gather the invalidate and flush bits up in the device structure.
|
|
|
|
|
|
|
|
Once all of the object domain changes have been evaluated, then the
|
|
|
|
gathered invalidate and flush bits are examined. For any GPU flush
|
|
|
|
operations, we emit a single MI_FLUSH command that performs all of
|
|
|
|
the necessary flushes. We then look to see if the CPU cache was
|
|
|
|
flushed. If so, we use the chipset flush magic (writing to a special
|
|
|
|
page) to get the data out of the chipset and into memory.
|
|
|
|
|
|
|
|
C. Queuing Batch Buffer to the Ring
|
|
|
|
|
|
|
|
With all of the objects resident in graphics memory space, and all
|
|
|
|
of the caches prepared with appropriate data, the batch buffer
|
|
|
|
object can be queued to the ring. If there are clip rectangles, then
|
|
|
|
the buffer is queued once per rectangle, with suitable clipping
|
|
|
|
inserted into the ring just before the batch buffer.
|
|
|
|
|
|
|
|
D. Creating an IRQ Cookie
|
|
|
|
|
|
|
|
Right after the batch buffer is placed in the ring, a request to
|
|
|
|
generate an IRQ is added to the ring along with a command to write a
|
|
|
|
marker into memory. When the IRQ fires, the driver can look at the
|
|
|
|
memory location to see where in the ring the GPU has passed. This
|
|
|
|
magic cookie value is stored in each object used in this execbuffer
|
|
|
|
command; it is used whereever you saw 'wait for rendering' above in
|
|
|
|
this document.
|
|
|
|
|
|
|
|
E. Writing back the new object offsets
|
|
|
|
|
|
|
|
So that the application has a better idea what to use for
|
|
|
|
'presumed_offset' values later, the current object offsets are
|
|
|
|
written back to the exec_object structures.
|
|
|
|
|
|
|
|
|
|
|
|
8. Other misc Intel-specific functions.
|
|
|
|
|
|
|
|
To complete the driver, a few other functions were necessary.
|
|
|
|
|
|
|
|
8.1 Initialization from the X server
|
|
|
|
|
|
|
|
As the X server is currently responsible for apportioning memory between 2D
|
|
|
|
and 3D, it must tell the kernel which region of the GTT aperture is
|
|
|
|
available for 3D objects to be mapped into.
|
|
|
|
|
|
|
|
struct drm_i915_gem_init {
|
|
|
|
/**
|
|
|
|
* Beginning offset in the GTT to be managed by the
|
|
|
|
* DRM memory manager.
|
|
|
|
*/
|
|
|
|
uint64_t gtt_start;
|
|
|
|
/**
|
|
|
|
* Ending offset in the GTT to be managed by the DRM
|
|
|
|
* memory manager.
|
|
|
|
*/
|
|
|
|
uint64_t gtt_end;
|
|
|
|
};
|
|
|
|
/* usage */
|
|
|
|
init.gtt_start = <gtt_start>;
|
|
|
|
init.gtt_end = <gtt_end>;
|
|
|
|
ret = ioctl (fd, DRM_IOCTL_I915_GEM_INIT, &init);
|
|
|
|
|
|
|
|
The GTT aperture between gtt_start and gtt_end will be used to map
|
|
|
|
objects. This also tells the kernel that the ring can be used,
|
|
|
|
pulling the ring addresses from the device registers.
|
|
|
|
|
|
|
|
8.2 Pinning objects in the GTT
|
|
|
|
|
|
|
|
For scan-out buffers and the current shared depth and back buffers, we need
|
|
|
|
to have them always available in the GTT, at least for now. Pinning means to
|
|
|
|
lock their pages in memory along with keeping them at a fixed offset in the
|
|
|
|
graphics aperture. These operations are available only to root.
|
|
|
|
|
|
|
|
struct drm_i915_gem_pin {
|
|
|
|
/** Handle of the buffer to be pinned. */
|
|
|
|
uint32_t handle;
|
|
|
|
uint32_t pad;
|
|
|
|
|
|
|
|
/** alignment required within the aperture */
|
|
|
|
uint64_t alignment;
|
|
|
|
|
|
|
|
/** Returned GTT offset of the buffer. */
|
|
|
|
uint64_t offset;
|
|
|
|
};
|
|
|
|
|
|
|
|
/* usage */
|
|
|
|
pin.handle = <handle>;
|
|
|
|
pin.alignment = <alignment>;
|
|
|
|
ret = ioctl (fd, DRM_IOCTL_I915_GEM_PIN, &pin);
|
|
|
|
if (ret == 0)
|
|
|
|
return pin.offset;
|
|
|
|
|
|
|
|
Pinning an object ensures that it will not be evicted from the GTT
|
|
|
|
or moved. It will stay resident until destroyed or unpinned.
|
|
|
|
|
|
|
|
struct drm_i915_gem_unpin {
|
|
|
|
/** Handle of the buffer to be unpinned. */
|
|
|
|
uint32_t handle;
|
|
|
|
uint32_t pad;
|
|
|
|
};
|
|
|
|
|
|
|
|
/* usage */
|
|
|
|
unpin.handle = <handle>;
|
|
|
|
ret = ioctl (fd, DRM_IOCTL_I915_GEM_UNPIN, &unpin);
|
|
|
|
|
|
|
|
Unpinning an object makes it possible to evict this object from the
|
|
|
|
GTT. It doesn't ensure that it will be evicted, just that it may.
|
|
|
|
|