Beginner's task for the IR library
==================================

Welcome to the beginner's task. By working through it, you will become familiar with the LLVM tools. The goal of this task is to find the implementation of the inverse transposition.

## Setup

1. Clone the [Parabix-LLVM](http://parabix.costar.sfu.ca/svn/parabix-LLVM) repository and follow the instructions in its README.md.

2. Build `lib_ir`:

    cd build
    make check

## Find the optimized IR library

The LLVM IR source files (`s2p.ll`, `p2s.ll` and `s2p_ideal.ll`) are linked together into `ir_impl.bc` by CMake. You can find this file in the `build` directory. LLVM IR can live in two kinds of files: `.ll` and `.bc`. `.ll` is the human-readable text format and `.bc` is the bitcode format for compact storage.

To have a look at `ir_impl.bc`, disassemble it with:

    llvm-dis ir_impl.bc -o ir_impl.ll

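If you ever want to go the other way, `llvm-as` assembles a `.ll` text file back into bitcode; this is not needed for the task, it is simply the counterpart of `llvm-dis` (the output name here is arbitrary):

    llvm-as ir_impl.ll -o ir_impl_roundtrip.bc
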
Now open `ir_impl.ll` in any text editor and search for `p2s_bytemerge_ir`. This is the main function for the inverse transposition. Note all the `call` instructions there.

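Each of those calls looks roughly like the line below; in this unoptimized module every such call is still a real function call, and the next step shows what the optimizer does with them (the function name is real, but the vector types and operand names here are illustrative rather than copied from `ir_impl.ll`):

    %merged = call <8 x i16> @mergeh_8(<8 x i16> %lo, <8 x i16> %hi)
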
Our Makefile also runs the optimizer on `ir_impl.bc`. The result is `ir_impl_opt.bc`. Have a look at its content:

    llvm-dis ir_impl_opt.bc -o ir_impl_opt.ll

Now open `ir_impl_opt.ll`, find `p2s_bytemerge_ir`, and you will see that all the function calls have been inlined.

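If you want to reproduce this optimization step by hand, running LLVM's `opt` tool over the unoptimized bitcode should give a comparable result (the exact pass list our Makefile uses may differ from plain `-O3`, and on your install the tool may be named `opt-svn`, matching `llc-svn` below):

    opt -O3 ir_impl.bc -o ir_impl_opt.bc
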
## Find the assembly code for `mergeh_8` and `mergel_8`

To get good performance for the inverse transposition, you need to generate the right machine code for `mergeh_8` and `mergel_8`.

Looking at the assembly code LLVM generates is important. Let's do this by typing:

    llc-svn -O3 -mattr=+sse2 ir_impl_opt.bc

By the way, if you are curious about the Haswell assembly, you can type:

    llc-svn -O3 -mattr=+avx2,+sse2,+bmi2 ir_impl_opt.bc

Using `llc`, we compile the LLVM IR file `ir_impl_opt.bc` into native machine assembly. Note how we explicitly tell `llc` to build with AVX2, SSE2 and BMI2.

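Both invocations write their output to `ir_impl_opt.s` by default, so the second run overwrites the first. If you want to keep both versions around, name the output explicitly with `-o` (the file name here is only a suggestion):

    llc-svn -O3 -mattr=+avx2,+sse2,+bmi2 ir_impl_opt.bc -o ir_impl_opt_avx2.s
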
Now open `ir_impl_opt.s` and search for `mergeh_8`. You will find the following piece of code:

    mergeh_8:                               # @mergeh_8
    # BB#0:                                 # %entry
        punpckhbw   %xmm0, %xmm1
        movdqa  %xmm1, %xmm0
        retl
    .Ltmp14:
        .size   mergeh_8, .Ltmp14-mergeh_8

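A single `punpckhbw` plus a register copy is exactly the instruction we want. For reference, the IR pattern that lowers to `punpckhbw` is an interleaving `shufflevector` over the high bytes of two vectors; the sketch below is only illustrative, and the real `mergeh_8` in `ir_impl.ll` may use different names and types:

    define <16 x i8> @mergeh_sketch(<16 x i8> %a, <16 x i8> %b) {
    entry:
      ; interleave the high eight bytes of %a and %b;
      ; with SSE2 this selects a single punpckhbw
      %r = shufflevector <16 x i8> %a, <16 x i8> %b,
           <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27,
                       i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
      ret <16 x i8> %r
    }
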
So `mergeh_8` seems good enough. How about `p2s_step_ir`? Oh, `pextrw` spotted!

    pextrw  $7, %xmm1, %edx
    pextrw  $7, %xmm3, %eax
    movl    %eax, 28(%esp)          # 4-byte Spill
    pextrw  $3, %xmm1, %eax
    movl    %eax, 24(%esp)          # 4-byte Spill
    pextrw  $3, %xmm3, %eax
    movl    %eax, 44(%esp)          # 4-byte Spill
    ...

`pextrw` extracts one field from a SIMD register, and it is always a sign of scalarization. It comes from this line of code in `ir_impl_opt.ll`:

    %r0.i = lshr <8 x i16> %aa.i, %shift_mask

But is this the real performance bottleneck?

In `p2s_step_ir`, `shift_mask` is a variable whose value is only known at run time, so LLVM cannot assume anything about the shift amount. It decides to scalarize this `<8 x i16>` vector to handle the hardest case: an arbitrary amount for each field.

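You can reproduce this effect in isolation. The tiny module below is a hypothetical stand-alone file (not part of `lib_ir`); feed it to `llc-svn` with the same SSE2 flags as above and you should see the same scalarization:

    ; var_shift.ll -- compile with: llc-svn -O3 -mattr=+sse2 var_shift.ll
    define <8 x i16> @var_shift(<8 x i16> %x, <8 x i16> %amount) {
    entry:
      ; the per-field shift amounts are unknown at compile time;
      ; SSE2 has no single instruction for this, so LLVM scalarizes
      %r = lshr <8 x i16> %x, %amount
      ret <8 x i16> %r
    }
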
However, every time we call `p2s_step_ir`, we call it with a constant `shift_mask`. These constants are propagated into the inlined copies of `p2s_step_ir`. The clue to this propagation is in `p2s_bytemerge_ir`. Find this function in `ir_impl_opt.ll` and you will see the following code:

    %aa.i.i181 = bitcast <4 x i32> %p4 to <8 x i16>
    %r0.i.i182 = lshr <8 x i16> %aa.i.i181, <i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4>

The `shift_mask` is replaced with a constant vector. LLVM then recognizes this `lshr` as an immediate shift (a shift by the same amount for every field). An immediate shift can be compiled into much better assembly: this logical right shift becomes a single `psrlw`. This explains why we can't find `pextrw` in the assembly of `p2s_bytemerge_ir`.

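As a final check, you can verify the immediate-shift lowering with another hypothetical stand-alone module (again not part of `lib_ir`); compiled with the same SSE2 flags, the function below should come out as little more than a single `psrlw $4`:

    ; const_shift.ll -- compile with: llc-svn -O3 -mattr=+sse2 const_shift.ll
    define <8 x i16> @const_shift(<8 x i16> %x) {
    entry:
      ; constant, uniform shift amount -> an immediate shift on SSE2
      %r = lshr <8 x i16> %x, <i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4, i16 4>
      ret <8 x i16> %r
    }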