[CUDA] Implement __ldg using intrinsics.

Summary:
Previously it was implemented as inline asm in the CUDA headers.

This change allows us to use the [addr+imm] addressing mode when executing ld.global.nc instructions. This translates into a 1.3x speedup on some benchmarks that call this instruction from within an unrolled loop.

Reviewers: tra, rsmith

Subscribers: jhen, cfe-commits, jholewinski

Differential Revision: http://reviews.llvm.org/D19990

git-svn-id: https://llvm.org/svn/llvm-project/cfe/trunk@270150 91177308-0d34-0410-b5e6-96231b3b80d8
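
The access pattern the summary refers to (calling __ldg from within an unrolled loop) might look roughly like the sketch below. This is an illustration only and is not taken from the patch: `SKETCH_HD`, `load_ro`, and `sum4` are hypothetical names, and the host fallback exists only so the same file also builds as plain C++. The assumption is that, once the loop is unrolled, the four loads sit at constant byte offsets 0, 4, 8, and 12 from `base`, which is the kind of offset an [addr+imm] operand can carry.

```cpp
#include <cstddef>

// Hypothetical macro (not from the patch): lets this file build both as
// CUDA and as ordinary host C++.
#if defined(__CUDACC__)
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
#endif

// Hypothetical wrapper: use __ldg for the read-only load when compiling
// device code for sm_35 or newer, and a plain load everywhere else.
SKETCH_HD inline float load_ro(const float *p) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
  return __ldg(p);
#else
  return *p;
#endif
}

// Sums four consecutive floats starting at base. The trip count is a
// compile-time constant, so the loop can be fully unrolled; each load
// then differs from base only by a constant offset.
SKETCH_HD inline float sum4(const float *base) {
  float acc = 0.0f;
#if defined(__CUDACC__)
#pragma unroll
#endif
  for (int i = 0; i < 4; ++i)
    acc += load_ro(base + i);
  return acc;
}
```

From a kernel, `sum4` would be called with a pointer into global memory that is not written during the kernel's lifetime, which is the usual requirement for read-only loads of this kind.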

Changed files:
- include/clang/Basic/BuiltinsNVPTX.def: 36 additions, 0 deletions
- lib/CodeGen/CGBuiltin.cpp: 45 additions, 0 deletions
- lib/Headers/CMakeLists.txt: 1 addition, 0 deletions
- lib/Headers/__clang_cuda_intrinsics.h: 256 additions, 0 deletions
- lib/Headers/__clang_cuda_runtime_wrapper.h: 5 additions, 1 deletion
- test/CodeGen/builtins-nvptx.c: 104 additions, 2 deletions