spirv-std functions to query compute builtins #535
fluffysquirrels wants to merge 6 commits into Rust-GPU:move_mod from …
Conversation
Give me a little bit more time to work on my version. Don't get me wrong, I think your version is fine, I'm just wondering if we can extract more information from the intrinsics. Like …
I was also wondering about safe collection patterns without data races, the equivalent of … That is still possible later even if …
Here are some thoughts on safe GPU collections. Is there already a discussion thread or issue for these, or should we make one?

In my compute project I'm playing with implementing some standard algorithms, e.g. reduce, scan, filter, map. For coalesced access I'm using the normal pattern: each workgroup takes a slice of the total data, and each thread operates on 1 value at a time at index … I was thinking a workgroup-level data-race-free API could expose this as a …

For simple compute grids with multiple workgroups, the collection …

For prefix scan I'm implementing approximately StreamScan (best described IMO in NVIDIA's "Single-pass Parallel Prefix Scan with Decoupled Look-back" paper, if you don't know it). I use a pattern where each workgroup increments a global atomic to take the next sub-slice of data to work on, and increments another when complete. Once all blocks are assigned, workgroups spin-wait on the completed-blocks count to reach the expected final count; this acts as a global barrier before proceeding to the next phase of the algorithm. This seemed sufficiently useful and general to extract into a re-usable algorithm. It would be fantastic for the API to mutably borrow the output collections during such a phase, hopefully guaranteeing data-race-freedom. I was thinking a phase could be modelled like a …
There's a bug in the new linker dedupe pass; I'm halfway through fixing it.
Linker builtin de-dupe bug fixed!
I dumped my thoughts here: Rust-GPU/rust-gpu.github.io#96. TL;DR: …
Force-pushed 94216a1 to 2a9576c
Force-pushed 2a9576c to b62cd33
Force-pushed b62cd33 to 08ccd9d
I'm gonna repurpose this PR to add just the compute and subgroup builtin getters, and worry about the graphics stuff in another PR. Which also means I'm merging #539 into this one, to have one complete working implementation, instead of splitting it up into two PRs.
Force-pushed 08ccd9d to 286d803
Force-pushed 0f1b95b to 708a6d6
Force-pushed 708a6d6 to adcba8d
@fluffysquirrels I may have found an entirely new way to implement the linker fixup... The existing type dedup pass added the name (as in …). I think this should also be fine with other usages of OpVariable. The key includes all other decorations on the variable, so buffers should work just fine since they have …
Force-pushed adcba8d to f382cfb
```rust
/// Query SPIR-V (read-only global) built-in values
///
/// See [module level documentation] on how to use these.
#[macro_export]
```
I deliberately kept `load_builtin!` private as an implementation detail. As is, it's very unsafe. It's also easy to produce something that won't compile but gives an inscrutable error message, or that won't validate.
I think this should remain an internal implementation detail.
Separating out all the builtins into modules in the root makes it a little harder to find them all. I think (especially for compute) someone using the subgroup builtins is also likely to use the compute builtins, so keeping these modules nested under crate::builtin makes sense to me.
In std (and other crates) I've seen people keep all the low level primitives / intrinsics in std::arch or similar. I think that pattern makes sense.
Any higher-level abstraction on top such as safe collections (alloc::Vec in std or whatever parallel collection for spirv here) could then have its own module in the root.
This won't work for the graphics builtins. Here's a message I sent internally last week about this:

- Easy case: Some built-ins can be "inherited", e.g. `local_invocation_id` from compute shaders is used in mesh shaders and ray gen shaders, cause they fundamentally function like augmented compute shaders. Will likely just define it once for compute shaders and you can just reuse them.
- Medium case: `gl_PrimitiveID`
  - vertex, tessellation & fragment shaders: an input
  - geometry shaders: `gl_PrimitiveIDIn`: an input; `gl_PrimitiveID`: an output
  - mesh shaders: `gl_MeshPerPrimitiveEXT[].gl_PrimitiveID`: an unsized per-primitive output array
- Hard case: `gl_Position`
  - vertex shader: an output variable that is written to via `gl_Position`
  - tessellation control / evaluation shaders: `gl_in[gl_MaxPatchVertices].gl_Position`: a sized arrayed input variable you can read to get the Nth vertex position; `gl_Position`: an output variable you can write
  - geometry shader: an unsized arrayed input variable you can read to get the Nth vertex position via `gl_in[].gl_Position`; `gl_Position`: an output variable you can write

If you split them into one module per shader type (like in #540), you can define the builtin for each shader separately and exactly like is required.
Yes I meant 1 module per shader type, so crate::builtins::compute::*, crate::builtins::vertex::*, etc. Just as you had in your first pull request and first commit.
As I mentioned elsewhere (you may have missed it), I think output builtin setters would be confusing, as it will be difficult to see in code when and where they are set.
For compute shaders, I think you always have all the input builtins, whereas e.g. for vertex what is available depends more on what has been configured, is that correct? I also think it would be confusing to not be able to see all the inputs params in one place.
Moving just the subgroup builtins from subgroup to compute would be confusing, as these builtins are also hidden behind subgroup capabilities. I wonder if #540 should move spirv_std::arch::subgroup to spirv_std::compute::subgroup instead of spirv_std::subgroup? But I feel like that's creating extra module depth without being particularly useful.
On output setters: I also think they need more design work, which is why I separated out the compute parts and we'll do graphics later. But have you looked at the current function declaration? Cause it's very much not a setter: fn decl_position_out() -> &'static mut Vec4. Calling it declares the position output variable, and returns a mutable reference to write the position to, thus making it unique. This design still has the issue that you can call it multiple times to declare multiple output positions, which is illegal per shader. And incredibly difficult to validate, since we're sorta forced to merge all variables of the same type, as one crate's function could be inlined in two other crates even though it's only one function.
So I'm currently thinking of having these functions marked as unsafe and hide them a little, so they're only used via the new proc macro (to be designed).
Another idea I just had: We may also be better off with a wrapper type like struct Position(&'static mut Vec4) that represents the OpVariable and is passed as an arg to the entry point, which would prevent duplication problems.
crate::builtins::subgroup::* for the subgroup ones I think is fine if you want them separate from compute, e.g. if they are re-used for other shader types.
Vulkan 1.1 (released 2018) included the subgroup capabilities I believe, so I doubt they are rare enough to be packaged away separately. From a quick check on vulkan.gpuinfo.org reports (from the last 2 years, excluding CPU implementations), 0.4% of reports supported none of the subgroup features.
Output fns: whether it is a direct setter or returns an &mut, the key problem to me is it is not clear if and where the output value is set. If it's an entry point parameter, you can grep or just read what writes that parameter, and if the parameter is passed as an argument to other fns you must inspect those also. If it's written with an output fn, I think you would need to search the whole codebase for use of that output fn, then figure out which paths are actually called.
Fine in small examples, potentially very nasty in a framework. It's the difference between finding usages of a local variable and a global variable across your code base.
> having these functions marked as unsafe and hide them a little, so they're only used via the new proc macro
Sounds fine if they're #[doc(hidden)].
> wrapper type like `struct Position(&'static mut Vec4)` that represents the `OpVariable` and is passed as an arg to the entry point
Like my first PR? Sounds good to me for a typed unique output variable :)
For setting a variable at most once, the output wrapper can be opaque and consumed when you set the output: fn set(self: Position, v: Vec4).
Wait, `crate::builtins::subgroup::*`? There are no modules in builtins; they're all in the crate root. So crate::compute contains the compute builtins, crate::subgroup contains all the subgroup intrinsics and the subgroup builtins, and similarly crate::fragment contains kill(), demote_to_helper_invocation() and will contain fragment builtins such as frag_coord() -> Vec4 in future PRs.
I know you have that now, I was proposing low level stuff is grouped somewhere (as crate::arch is now).
But thinking again I prefer grouping by shader type and use case, as you describe.
As you probably know, ray tracing shaders also have subgroup builtins but with different semantics as the subgroup IDs can change during the invocation.
I think I would like the module docs for crate::compute and crate::raytrace to include "see also: crate::subgroup" or similar, if they are both using the same subgroup builtins there.
Requires #540
This PR adds new API to query compute and subgroup builtins via getter functions:
- `spirv_std::compute` with builtins:
  - `local_invocation_index() -> u32`
  - `local_invocation_id() -> UVec3`
  - `global_invocation_id() -> UVec3`
  - `num_workgroups() -> UVec3`
  - `workgroup_id() -> UVec3`
  - a `gl_WorkgroupSize` equivalent, see below
- `spirv_std::subgroup`:
  - `num_subgroups() -> u32`
  - `subgroup_id() -> u32`
  - `subgroup_size() -> u32`
  - `subgroup_invocation_id() -> u32`
  - `subgroup_*_mask() -> SubgroupMask`

About `gl_WorkgroupSize`

In SPIR-V, there's a `WorkgroupSize` built-in which you have to apply not to an `OpVariable` like every other built-in, but to an `OpConstant`, and applying built-ins to constants has been deprecated. Instead, they recommend that shader compilers look at the `ExecutionMode LocalSize` of their compute shader and just return that when `gl_WorkGroupSize` is read. glslc literally compiles a `gl_WorkGroupSize` read into `%gl_WorkGroupSize = OpConstantComposite %v3uint %uint_1 %uint_2 %uint_3` without even decorating it with `WorkgroupSize`, since that's been deprecated.

But that's not trivial in rust-gpu, since we may have a single module with two compute shaders, and each compute shader may have a different workgroup size. So an `fn workgroup_size() -> UVec3` intrinsic would need to be somehow specialized per shader, even if it is called through some non-generic function both shaders use. Possible solutions:

- … `workgroup_size()` to specialize it, but since there's nothing passed to the function by reference, this will likely be annoying to implement.
- An `OpVariable` that is filled with the value of `WorkGroupSize` by the entry point, so that the intrinsic just reads it. Would require support for `static`s, which I'd honestly like to have for other custom use-cases as well.

I don't feel like we need to support this right away and can delay its implementation.
Old description

This is a continuation of Firestar99's work in #459. I created a new PR only to have somewhere to show my changes.
A small problem: if an entry-point accepts a builtin as a parameter and code is generated for the get function for that parameter, then 2 globals are emitted decorated with the same builtin, and validation fails.
See the failing test `compute_duplicates`. SPIR-V output: … Compiletest error message: …
For ease of use and porting to the new builtin getters, I'd like to have this compile successfully by de-duplicating the global variable for the builtin. I think I see similar code in `crates/rustc_codegen_spirv/src/linker/duplicates.rs`; is that the right way to handle it? I will try to write this.

I also considered building a map between builtin names and result IDs in the assembler and the entry-point codegen. I think that might run faster, but it would have higher code complexity and need to touch more places.