日録: 1月 2016

2016年1月18日月曜日

[yacc] %left, %right, %prec

中間演算子等の再帰方向を指定すると共に、優先度付も行う。

%left ID_OPERATOR_LOGICAL_OR
%left ID_OPERATOR_LOGICAL_AND
%left '|'
%left '^'
%left '&'
%left ID_OPERATOR_COMPARE_EQ ID_OPERATOR_COMPARE_NE
%left '>' '<' ID_OPERATOR_COMPARE_LE ID_OPERATOR_COMPARE_GE
%left ID_OPERATOR_LEFT_SHIFT ID_OPERATOR_RIGHT_SHIFT
%left '+' '-'
%left '*' '/' '%'
%left UMINUS

上記は中間オペレータと前置オペレータの優先度を表現している。一番下が優先度が一番高い。

また、中間演算子は全て左再帰になる。

primary_expression : identifier { $$ = new VEAccess(VECode::opAccess, $1); }
| '&' identifier { $$ = new VEAccess(VECode::opAddress, $2); }
| '-' expression %prec UMINUS { $$ = new VEUnaryExpression(VECode::opUnaryMinus, $2); }
| constant
| ID_SIZEOF '(' type_declaration ')' { $$ = new VESizeOf($3); }
| ID_SIZEOF '(' identifier ')' { $$ = new VESizeOf(); }
| '{' expressions '}' { $$ = $2; }
| '(' expression ')' { $$ = $2; }

;

また、単項演算子の'-'に関しては%precにより意味付けをする。

[ruby] 標準出力、標準エラー出力

標準出力、標準エラー出力への出力は以下の通り

STDOUT.puts("A")
STDERR.puts("A")

2016年1月14日木曜日

概要

git同様、分散型リポジトリ

操作

初期化

% hg init
% hg init work

登録

% hg add index.html

コミット

% hg commit

履歴参照

% hg log

取得

% hg update 3

まとめて取り出す

% hg archive -r 3 /tmp/mywork.%h

不要ファイルの削除

登録の必要がないファイルを削除する。

% hg remove

変更の取り消し

hg commit以外の取り消しはrevirtで行う。revirt自体の取り消しは行えない。

% hg revirt

hg commitの取り消しはrollbackで行う

ファイル名付け直し

copyで名前変更

% hg copy

全体の複製

%hg clone

外部との連携

pullは自リポジトリへの反映、pushは外部に反映

%hg pull
%hg push

マージ

複数のバージョンをマージ。

% hg merge

マージの状況を表示

% hg resolve -l

テキストベースでの成果送付

% hg export -o

バイナリベースでの成果送付

% hg bundle
% hg unbundle

[C] C言語BNF

translation_unit : external_decl
| translation_unit external_decl
;
external_decl : function_definition
| decl
;
function_definition : decl_specs declarator decl_list compound_stat
| declarator decl_list compound_stat
| decl_specs declarator compound_stat
| declarator compound_stat
;
decl : decl_specs init_declarator_list ';'
| decl_specs ';'
;
decl_list : decl
| decl_list decl
;
decl_specs : storage_class_spec decl_specs
| storage_class_spec
| type_spec decl_specs
| type_spec
| type_qualifier decl_specs
| type_qualifier
;
storage_class_spec : 'auto' | 'register' | 'static' | 'extern' | 'typedef'
;
type_spec : 'void' | 'char' | 'short' | 'int' | 'long' | 'float'
| 'double' | 'signed' | 'unsigned'
| struct_or_union_spec
| enum_spec
| typedef_name
;
type_qualifier : 'const' | 'volatile'
;
struct_or_union_spec : struct_or_union id '{' struct_decl_list '}'
| struct_or_union '{' struct_decl_list '}'
| struct_or_union id
;
struct_or_union : 'struct' | 'union'
;
struct_decl_list : struct_decl
| struct_decl_list struct_decl
;
init_declarator_list : init_declarator
| init_declarator_list ',' init_declarator
;
init_declarator : declarator
| declarator '=' initializer
;
struct_decl : spec_qualifier_list struct_declarator_list ';'
;
spec_qualifier_list : type_spec spec_qualifier_list
| type_spec
| type_qualifier spec_qualifier_list
| type_qualifier
;
struct_declarator_list : struct_declarator
| struct_declarator_list ',' struct_declarator
;
struct_declarator : declarator
| declarator ':' const_exp
| ':' const_exp
;
enum_spec : 'enum' id '{' enumerator_list '}'
| 'enum' '{' enumerator_list '}'
| 'enum' id
;
enumerator_list : enumerator
| enumerator_list ',' enumerator
;
enumerator : id
| id '=' const_exp
;
declarator : pointer direct_declarator
| direct_declarator
;
direct_declarator : id
| '(' declarator ')'
| direct_declarator '[' const_exp ']'
| direct_declarator '[' ']'
| direct_declarator '(' param_type_list ')'
| direct_declarator '(' id_list ')'
| direct_declarator '(' ')'
;
pointer : '*' type_qualifier_list
| '*'
| '*' type_qualifier_list pointer
| '*' pointer
;
type_qualifier_list : type_qualifier
| type_qualifier_list type_qualifier
;
param_type_list : param_list
| param_list ',' '...'
;
param_list : param_decl
| param_list ',' param_decl
;
param_decl : decl_specs declarator
| decl_specs abstract_declarator
| decl_specs
;
id_list : id
| id_list ',' id
;
initializer : assignment_exp
| '{' initializer_list '}'
| '{' initializer_list ',' '}'
;
initializer_list : initializer
| initializer_list ',' initializer
;
type_name : spec_qualifier_list abstract_declarator
| spec_qualifier_list
;
abstract_declarator : pointer
| pointer direct_abstract_declarator
| direct_abstract_declarator
;
direct_abstract_declarator: '(' abstract_declarator ')'
| direct_abstract_declarator '[' const_exp ']'
| '[' const_exp ']'
| direct_abstract_declarator '[' ']'
| '[' ']'
| direct_abstract_declarator '(' param_type_list ')'
| '(' param_type_list ')'
| direct_abstract_declarator '(' ')'
| '(' ')'
;
typedef_name : id
;
stat : labeled_stat
| exp_stat
| compound_stat
| selection_stat
| iteration_stat
| jump_stat
;
labeled_stat : id ':' stat
| 'case' const_exp ':' stat
| 'default' ':' stat
;
exp_stat : exp ';'
| ';'
;
compound_stat : '{' decl_list stat_list '}'
| '{' stat_list '}'
| '{' decl_list '}'
| '{' '}'
;
stat_list : stat
| stat_list stat
;
selection_stat : 'if' '(' exp ')' stat
| 'if' '(' exp ')' stat 'else' stat
| 'switch' '(' exp ')' stat
;
iteration_stat : 'while' '(' exp ')' stat
| 'do' stat 'while' '(' exp ')' ';'
| 'for' '(' exp ';' exp ';' exp ')' stat
| 'for' '(' exp ';' exp ';' ')' stat
| 'for' '(' exp ';' ';' exp ')' stat
| 'for' '(' exp ';' ';' ')' stat
| 'for' '(' ';' exp ';' exp ')' stat
| 'for' '(' ';' exp ';' ')' stat
| 'for' '(' ';' ';' exp ')' stat
| 'for' '(' ';' ';' ')' stat
;
jump_stat : 'goto' id ';'
| 'continue' ';'
| 'break' ';'
| 'return' exp ';'
| 'return' ';'
;
exp : assignment_exp
| exp ',' assignment_exp
;
assignment_exp : conditional_exp
| unary_exp assignment_operator assignment_exp
;
assignment_operator : '=' | '*=' | '/=' | '%=' | '+=' | '-=' | '<<='
| '>>=' | '&=' | '^=' | '|='
;
conditional_exp : logical_or_exp
| logical_or_exp '?' exp ':' conditional_exp
;
const_exp : conditional_exp
;
logical_or_exp : logical_and_exp
| logical_or_exp '||' logical_and_exp
;
logical_and_exp : inclusive_or_exp
| logical_and_exp '&&' inclusive_or_exp
;
inclusive_or_exp : exclusive_or_exp
| inclusive_or_exp '|' exclusive_or_exp
;
exclusive_or_exp : and_exp
| exclusive_or_exp '^' and_exp
;
and_exp : equality_exp
| and_exp '&' equality_exp
;
equality_exp : relational_exp
| equality_exp '==' relational_exp
| equality_exp '!=' relational_exp
;
relational_exp : shift_expression
| relational_exp '<' shift_expression
| relational_exp '>' shift_expression
| relational_exp '<=' shift_expression
| relational_exp '>=' shift_expression
;
shift_expression : additive_exp
| shift_expression '<<' additive_exp
| shift_expression '>>' additive_exp
;
additive_exp : mult_exp
| additive_exp '+' mult_exp
| additive_exp '-' mult_exp
;
mult_exp : cast_exp
| mult_exp '*' cast_exp
| mult_exp '/' cast_exp
| mult_exp '%' cast_exp
;
cast_exp : unary_exp
| '(' type_name ')' cast_exp
;
unary_exp : postfix_exp
| '++' unary_exp
| '--' unary_exp
| unary_operator cast_exp
| 'sizeof' unary_exp
| 'sizeof' '(' type_name ')'
;
unary_operator : '&' | '*' | '+' | '-' | '~' | '!'
;
postfix_exp : primary_exp
| postfix_exp '[' exp ']'
| postfix_exp '(' argument_exp_list ')'
| postfix_exp '(' ')'
| postfix_exp '.' id
| postfix_exp '->' id
| postfix_exp '++'
| postfix_exp '--'
;
primary_exp : id
| const
| string
| '(' exp ')'
;
argument_exp_list : assignment_exp
| argument_exp_list ',' assignment_exp
;
const : int_const
| char_const
| float_const
| enumeration_const
;

2016年1月12日火曜日

セレクトした領域の背景色に応じて値を設定

Selectionを使って選択領域にアクセスできる
Selectionの要素はfor each文により個別アクセス可能
背景食はInterior.Colorでアクセス
&Hffffffは白
.Valueによってセル値のアクセスが可能

Sub テスト欄に〇を付ける()
Dim c As Range

For Each c In Selection
If c.Interior.Color = &HFFFFFF Then
c.Value = "〇"
End If
Next c
End Sub

[ruby, win32ole] rubyからExcel起動

既存のExcelファイルを開く

1. requireでwin32oleを取り込む
2. WIN32OLEをExcel.Applicationで開く。Excel対応
3. Visible = trueはExcelを表示するかどうか
4. Workbooks.Open()で既存のExcelファイルを開く
5. Workbooksの下にはシートが配列形式で格納されている。
6. Workdsheets()内のセルをCell()で取り出す。値はValueで取得
7. Excel.Quit()でExcel終了

require 'win32ole'
excel = WIN32OLE.new('Excel.Application')
excel.Visible = true
workbook = excel.Workbooks.Open('C:\Users\hisa\Canon\V4Drv\doc\design\GPDコンフリクトエンジン評価計画書・評価仕様書.xlsx')
sheet = workbook.Worksheets[3]
title = 6
id = sheet.Cells(title, 2).Value
category = sheet.Cells(title, 3).Value
normal = sheet.Cells(title, 4).Value
detail = sheet.Cells(title, 5).Value
file = sheet.Cells(title, 6).Value
expect = sheet.Cells(title, 7).Value
result = sheet.Cells(title, 8).Value
resultRT = sheet.Cells(title, 9).Value
excel.Quit

2016年1月8日金曜日

[ruby] lambda式

lambda式

rubyにもlambda式がある。

lambda式の評価はcallメソッドを使用する

def getInsn(json, predicate)
json.find{|e| e != nil && predicate.call(e)}
end

定義はlambda+ブロックで行う

def noError(json)
(insn = getInsn(json, lambda{|e| e["Name"] == "GetParseErrors" || e["Name"] == "GetParseError"})) != nil &&
(!insn.include?("Error") || insn["Error"] == [])
end

2016年1月7日木曜日

[gcc, spu] spu版gcc最適化問題(6)

2004年作成

スタックの使用での問題

(a) 使用するレジスタ数が多くなってくるとスタックに一旦ストアしますが、その読み込んだ後のレジスタ効率が悪化します。

lqd $21,32($sp)
absdb $28,$108,$78 lqd $2,32($sp)
lqd $15,160($sp)
shufb $107,$17,$105,$101
absdb $105,$58,$76 stqd $94,1648($sp)
absdb $17,$55,$81 stqd $102,2544($sp)
absdb $81,$50,$81 shufb $94,$7,$35,$101
absdb $40,$21,$107 shufb $102,$31,$12,$101
absdb $33,$2,$11 stqd $162,2560($sp)

$21も$2も同じメモリから読み込んでいるので、別のレジスタに割り当てる必要はありません。
以下のように$21に統合可能です。

lqd $21,32($sp)
absdb $28,$108,$78 lqd $15,160($sp)
shufb $107,$17,$105,$101
absdb $105,$58,$76 stqd $94,1648($sp)
absdb $17,$55,$81 stqd $102,2544($sp)
absdb $81,$50,$81 shufb $94,$7,$35,$101
absdb $40,$21,$107 shufb $102,$31,$12,$101
absdb $33,$21,$11 stqd $162,2560($sp)

(b) ストア-ロード処理ではなくコピーで済ませることができる命令がある。

absdb $124,$71,$116 stqd $6,1856($sp)
absdb $59,$36,$82 stqd $72,1536($sp)
shufb $72,$61,$75,$84
lqd $12,240($sp)
lqd $20,32($sp)
lqd $21,1856($sp) ← (*)
absdb $87,$44,$72 stqd $74,592($sp)
nop shufb $74,$31,$102,$111

(*)のロード命令は以下のようにコピー命令に置換できます。

absdb $124,$71,$116 stqd $6,1856($sp)
absdb $59,$36,$82 stqd $72,1536($sp)
ori $21,$72,0 shufb $72,$61,$75,$84
lqd $12,240($sp)
lqd $20,32($sp)
absdb $87,$44,$72 stqd $74,592($sp)
nop shufb $74,$31,$102,$111

il $43, 0 lqd $56,0($88)
:
stqd $43,112($sp) ← 意味なし
:
ah $119,$120,$12 stqd $63,112($sp)
andi $57,$115,15 lqd $18,112($sp)

(d)一つのレジスタで済ませられるものがある。

stqd $32,4816($sp)
nop lqd $27,4816($sp)
clgth $20,$24,$25 lqd $23,4816($sp)
:
shlqbyi $22,$27,2
shufb $14,$15,$79,$16
selb $17,$23,$22,$20 shlqbyi $19,$26,2

$27,$23共に$32に置き換えることが可能です。

clgth $20,$24,$25
:
shlqbyi $22,$32,2
shufb $14,$15,$79,$16
selb $17,$32,$22,$20 shlqbyi $19,$26,2

[ruby] 配列

sort

配列のsortメソッドにて行う。

ブロックを引数にすることも可能。

-

中間演算子。

左辺の配列要素から右辺の配列要素を削除する

検索

要素の存在を確認する場合はinclude?を使用。

合致する情報の検索を行う場合はfindを使用する。

findはブロックを引数にとることが可能

selectは渡されたブロックの条件が成立した要素の配列を返す

map

lispのmapと同じで配列の各要素にブロックの処理を行い、その結果の配列を返す。

collectも同じ

[ruby] json

準備

require 'json'

構文解析

文字列を引数にしてJSONの構文解析を行う

JSON.parse(open(file, &:read))

返された情報は配列はrubyの配列、オブジェクトは連想配列に変換される。

[ruby] ファイル読み込み

ファイルを変数に読み込む

fileで示されたファイルの内容全てを返す。
この方式だと処理後勝手にファイルをクローズする

open(file, &:read)

2016年1月6日水曜日

[gcc, spu] spu版gcc最適化問題(5)

2004年作成

問題を回避する手段

今まで提出した問題の回避方法について記述します。
番号は[SPUコンパイラ最適化項目]と同じものです。

回避可能な問題に関してのみ記述します。

変数のstore処理の無駄

vector 以外のデータ構造のアクセスではどうしてもメモリからデータを読み込み、ch 系の命令でシャッフル用の情報を作り、シャッフルで値を設定しストアします。
vector型を代わりに使用すれば、回避できます。
既にfor文のindex変数等はやっています。

ループ最適化

vector型を代わりに使用すれば、回避できます。

除算

除数が定数の場合、最適化しないという問題ですが、除算を使用せず、シフト命令等で処理すれば回避できます。

整数演算の順序

複数の加算、乗算を行う場合、乗算を前から順に行うと前の演算の結果を待つため、処理が遅れる場合があります。
処理の順序を括弧で指定することで回避することが可能です。

インライン展開時のスタックポインタ処理

インライン関数ではなくマクロで処理する。

変数のstore処理の無駄

vector型を代わりに使用すれば、回避できます。

シャフル用のベクタに関して

const vec_uchar16 vec = (vec_uchar16)
{
0x0c, 0x0d, 0x0e, 0x0f,
0x0c, 0x0d, 0x0e, 0x0f,
0x0c, 0x0d, 0x0e, 0x0f,
0x0c, 0x0d, 0x0e, 0x0f
}

上記のようなベクタはメモリから読み込まないで、コード上で作成されます。
よって、定義していると無駄なデータ領域になります。

コードでは

ilhu $109, 0x0c0d 上位16ビット指定
...
iohl $109, 0x0e0f 下位16ビットをor

で作成されます。

vec_uchar16 l = { 0x00, 0x01, 0x02, 0x03, 0x08, 0x09, 0x0a, 0x0b,
0x10, 0x11, 0x12, 0x13, 0x18, 0x19, 0x1a, 0x1b };
vec_uchar16 r = { 0x04, 0x05, 0x06, 0x07, 0x0c, 0x0d, 0x0e, 0x0f,
0x14, 0x15, 0x16, 0x17, 0x1c, 0x1d, 0x1e, 0x1f };

のような計算で他のベクタが求まるような情報がある場合、

il $4, 0x0404
add $4, $5, $4

にしても命令は増えますが、サイクル数は同じです。さらに定数部の 16 バイトが削れます。
しかし、現在は定数計算をしてしまうようで、結局同じようなコードになってしまいます。

シャッフル用のベクタテーブルは結構多いようなので、一つのモジュールでまとめた方が良いと思います。

[gcc, spu] spu版gcc最適化問題(４)

2004年作成

ループ最適化

L11で始まるブロックはループの中にあるブロックです。

以下の情報をループの外で設定することで高速化が可能です。
1. 161行目にあるdestのようなシンボルのアドレス
2. 140,142,148,150 にある chd 等で取り出す、レジスタのバイト位置情報。(sp はこのブロック内で更新されません)

132 .L18:
133 brz $23,.L6
134 il $30,0
135 hbra .L17,.L11
136 il $29,0
137 nop $127
138 .L11:
139 ai $28,$sp,32
140 lqd $24,64($sp)
141 ai $4,$29,32
142 chd $25,0($sp)
143 a $7,$29,$28
144 chd $20,2($sp)
145 ai $10,$29,14
146 lqx $22,$4,$sp
147 ai $9,$7,6
148 chd $26,4($sp)
149 ai $8,$7,4
150 chd $28,6($sp)
151 ai $16,$9,14
152 lqd $27,0($9)
153 ai $4,$7,2
154 ai $18,$8,14
155 ai $17,$4,14
156 rotqby $13,$22,$10
157 a $11,$30,$30
158 ori $21,$17,0
159 ori $22,$18,0
160 rotqby $14,$27,$16
161 ila $23,dest
162 lqx $12,$11,$31
:
222 .L17:
223 brz $16,.L11
224 ai $sp,$sp,352

[gcc, spu] spu版gcc最適化問題(３)

2004年作成

インライン展開時のスタックポインタ処理

-O2でコンパイルするとinline定義されたものはinline展開します。
展開するにもかかわらずスタックポインタ操作の処理がそのまま残っています。
inline の場合は元になる関数の部分で全てを済ませればいいのですから、全て無駄なコードになっています。

1 _start:
2 ila $7,66051
3 stqd $17,-48($sp)
4 lqa $4,Test1In+16
:
39 stqd $lr,16($sp) # リターンレジスタも積む必要無し
38 a $7,$5,$4
39 stqd $sp,-112($sp) # スタックポインタのセーブ処理
40 sfh $6,$11,$6
41 lnop
42 ah $2,$2,$14
43 lqd $15,0($7)
44 ila $4,q4x4+16
45 lnop
46 sfi $8,$8,-15
47 shufb $10,$2,$6,$10
48 ai $sp,$sp,-112 # スタック拡張
49 shufb $2,$2,$6,$17
50 lqx $16,$5,$4
51 ila $4,66051
52 ai $sp,$sp,112 # スタック戻し
53 shufb $8,$8,$8,$4
54 sfh $4,$10,$2

レジスタ割り当て

callerレジスタの使用方法で無駄な箇所があります。

:

3 stqd $17,-48($sp) # $17 store
4 lqa $4,Test1In+16
5 lqa $17,.LC0 # $17 def
6 lqa $2,Test1In
:
13 lqa $5,mod_six+20
14 shufb $6,$2,$4,$17 # $17 use
15 shufb $2,$2,$4,$10
:
22 ah $6,$6,$2
23 stqd $16,-32($sp) # $16 store
24 stqd $15,-16($sp)
:
49 shufb $2,$2,$6,$17 #
50 lqx $16,$5,$4 # $16 def
51 ila $4,66051
52 ai $sp,$sp,112
53 shufb $8,$8,$8,$4
54 sfh $4,$10,$2
55 lqd $17,-48($sp) # $17 reload
56 ah $2,$2,$10
:
74 mpya $5,$9,$15,$5
75 mpya $6,$2,$16,$6 # $16 use
76 mpyhha $4,$9,$15
77 lqd $15,-16($sp)
78 mpyhha $7,$2,$16 # $16 use
79 lqd $16,-32($sp) # $16 reload

:

$16と$17が実際は一つのレジスタにまとめることができます。
$17 は 3-55, $16 は 23-79 でライブレンジが重なっているように見えますが、store-reload の部分が重なっているので、これは同一レジスタに割り当てることが可能です。

定数

同じ定数を別のレジスタに割り当てている部分があります。
同じレジスタを使用することによって、レジスタのスピル処理を減らすことができます。

下記の例はMEのものです。定数515に関して一つのレジスタにまとめることが出来ます。

1 SubSampleFi4x4:
2 nop $127
3 stqd $42,-448($sp)
4 il $42,-1152
5 stqd $15,-16($sp)
6 stqd $16,-32($sp)
7 stqd $17,-48($sp)
8 stqd $18,-64($sp)
9 stqd $19,-80($sp)
10 stqd $20,-96($sp)
11 stqd $21,-112($sp)
12 stqd $22,-128($sp)
13 stqd $23,-144($sp)
14 stqd $24,-160($sp)
15 stqd $25,-176($sp)
16 ilh $25,515
17 stqd $26,-192($sp)
18 # iprefetch
19 lnop
20 stqd $27,-208($sp)
21 stqd $28,-224($sp)
22 stqd $29,-240($sp)
23 ilh $29,515
24 stqd $30,-256($sp)
25 stqd $31,-272($sp)
26 stqd $32,-288($sp)
27 nop $127
28 stqd $33,-304($sp)
29 ilh $33,515

無駄なレジスタを使用してしまうのは上記の2と同じ問題です。
inline展開を多用しますと、どうしても使用するレジスタ数が増えます。
無駄なレジスタ使用が 1 つあるとスタックへのロード・セーブでパイプラインの P1 が二つ埋まってしまいます。

無駄なnop

下記のトレース結果はme実行時のものです。
85,86にnopがありますが、これは無駄ではないでしょうか。
ただ、nop を削るとスケジュールが崩れる場合があるので、どういう準拠によってnop が挿入されるかわかりません。
89-92 で処理が空いているのは 87 で $26 の準備をしているのが間に合っていないためです。
80 で $10 の設定処理をしている部分と交換すれば、この無駄なサイクルを削ることが出来ます。

D Cycle 80: PC:00178 [P0:il $9,0x0000] PC:0017c [P1:lqa $10,0x015c0]
S Cycle 81: PC:00180 [P1:stqd $0,0x001($1)]
S Cycle 82: PC:00184 [P1:stqd $1,0x3f2($1)]
S Cycle 83: PC:00188 [P0:ai $1,$1,0x320] PC:0018c [P1:lnop ]
S Cycle 84: PC:00190 [P0:il $8,0x0000] PC:00194 [P1:hbra 0x01c,0x0068]
N Cycle 85: PC:00198 [P1:lnop ]
N Cycle 86: PC:0019c [P0:nopi ,0x00007f]
D Cycle 87: PC:001a0 [P0:ila $20,0x01600] PC:001a4 [P1:lqa $26,0x02000]
S Cycle 88: PC:001a8 [P1:lqa $25,0x02010]
R Cycle 89:
R Cycle 90:
R Cycle 91:
R Cycle 92:
S Cycle 93: PC:001ac [P0:mpyh $24,$9,$26]

[gcc, spu] spu版gcc最適化問題(2)

2004年作成

変数を増やした場合、使用するレジスタが少なくなる

以下のようなサンプルプログラムを作って使用するレジスタ数を比較してみました。
プログラムはperlスクリプトで自動生成されます。

% perl t22.pl <変数の数>
struct s {
int a;
int b;
};
struct s a000;
struct s a001;
struct s a002;
:
int f()
{
int a = 0;
int b = 0;
a += (a000.a ^ a001.b);
a += (a001.a ^ a002.b);
:
a += (a145.a ^ a146.b);
a += (a146.a ^ a000.b);
b += (a000.b ^ a146.a);
b += (a146.b ^ a145.a);
:
b += (a002.b ^ a001.a);
b += (a001.b ^ a000.a);
return a + b;
}

このプログラムを-O3でコンパイルして使用するレジスタ数を調査しました。
結果は次のようになります。

変数の数レジスタの数
80 87
90 97
100 107
110 117
120 127
130 127
140 127
147 127
148 123
149 116
150 113
160 86
170 66
180 55
190 43
200 44

変数の数が 147 までは使用できるレジスタを最大限に使うのですが、それからはレ
ジスタを使用しなくなって最終的に 43,44個に落ち着きます。

命令の順番を代えることによってサイクル数を少なくする

1 sub(char *a, char *b, char *c, int n)
2 {
3 int i;
4 5 for (i = 0; i < n; i++) {
6 c[i] = a[i] + b[i];
7 }
8 }

このソースを-O3でコンパイルすると以下のようになります。

6 sub:
7 ori $11,$3,0
8 cgti $3,$6,0
9 ori $12,$6,0
10 ori $10,$4,0
11 ori $9,$5,0
12 hbra .L9,.L6
13 lnop
14 nop $127
15 il $8,0
16 brz $3,.L8
17 .L6:
18 a $4,$11,$8 ; FX even 2
19 lqx $5,$10,$8 ; LS odd 5
20 a $14,$10,$8 ; FX even 2
21 lqx $2,$11,$8 ; LS odd 5
22 ai $6,$4,13 ; FX even 2
23 lqx $13,$9,$8 ; LS odd 5
24 ai $3,$14,13 ; FX even 2
25 cbx $14,$9,$8 ; SH odd 4
26 rotqby $7,$2,$6 ; SH odd 4
27 rotqby $6,$5,$3 ; SH odd 4
28 ah $5,$7,$6
29 shufb $4,$5,$13,$14
30 stqx $4,$9,$8
31 ai $8,$8,1
32 cgt $3,$12,$8
33 .L9:
34 brnz $3,.L6
35 .L8:

(18,19),(20,21),(22,23),(24,25)がそれぞれ同じサイクルで実行されます。
26 で参照している $2 の準備のために 3 サイクル (22,24 で 2 サイクルですが LS では 5 サイクル必要なため) 無駄になります。
19 行目と 21 行目を交換するか、26,27 行目を交換 (この場合、レジスタ $6 の置き換えが必要) をすれば、無駄なサイクルは 2 サイクルに減ります。

上記のような命令順序の調整でサイクル数を削減することが可能です。

整数演算の順序

1 int foo2(int a, int b, int c, int d)
2 {
3 return a + b + c + d;
4 5 }

上記の関数を-O3でコンパイルすると

1 foo2:
2 a $7,$3,$4
3 a $2,$7,$5
4 a $3,$2,$6
5 bi $lr

のようになります。

これは 2 行目の結果を 3 行目で、3 行目の結果を 4 行目で使用するため、二つの無駄なサイクルが発生します。
ここでは (((a+b)+c)+d) の順序で計算していますが、整数の加算・乗算の場合、順序は関係無いので、((a+b)+(c+d)) のように計算しても問題ありません。
この変更によって無駄なサイクルを減らすことが出来ます。(以下のパターンだと無駄なサイクル数が 1 になります)

1 foo2:
2 a $7,$3,$4
3 a $2,$5,$6
4 a $3,$2,$7
5 bi $lr

乗算の場合は加算よりも効果があります。

1 long foo1(long a, long b, long c, long d)
2 {
3 return a * b * c * d;
4 }

上記の関数をコンパイルした場合：

1 foo1:
2 mpyh $14,$3,$4
3 mpyh $7,$4,$3
4 mpyu $13,$3,$4
5 a $12,$14,$7
6 a $11,$12,$13
7 hbr .L2,$lr
8 mpyh $10,$11,$5
9 nop $127
10 mpyh $2,$5,$11
11 mpyu $9,$11,$5
12 a $5,$10,$2
13 a $4,$5,$9
14 mpyh $3,$6,$4
15 mpyh $8,$4,$6
16 mpyu $7,$4,$6
17 a $5,$8,$3
18 a $3,$5,$7

変更後

1 foo1:
2 mpyh $14,$3,$4
3 mpyh $7,$4,$3
4 mpyu $13,$3,$4
5 mpyh $8,$5,$6
6 mpyh $9,$6,$5
7 mpyu $10,$5,$6
8 hbr .L2,$lr
9 a $12,$14,$7
10 a $15,$8,$9
11 a $11,$12,$13
12 a $16,$15,$10
13 mpyh $13,$11,$16
14 mpyh $14,$16,$11
15 mpyu $15,$11,$16
16 a $5,$13,$14
17 a $3,$5,$15

計算回数は代わりませんが、乗算だけを詰めることで無駄なサイクルがなくなります。
この場合、9サイクル早くなりました。括弧で順序付けすることでも最適化されます。

減算、除算等にも使用できます。

無駄な型変換

1 char foo4(char a, char b, char c, char d)
2 {
3 return a + b + c + d;
4 }

上記のソースを-O3でコンパイルすると以下のアセンブラが生成されます。

1 foo4:
2 nop $127
3 hbr .L6,$lr
4 ah $7,$3,$4
5 ah $4,$5,$7
6 ah $2,$6,$4
7 xsbh $3,$2
8 xshw $2,$3
9 ori $3,$2,0
10 nop $127
11 nop $127
12 nop $127
13 .L6:
14 bi $lr

7行目でhalf wordからbyteへの型変換をやっていますので、8行目の処理が無駄です。

除算

定数がある場合。

1 foo(int a)
2 {
3 return a / 3;
4 }

これを-O3でコンパイルすると以下のようになります。
定数部分の最適化がされていません。

1 .file "c.c"
2 .text
3 .align 3
4 .global foo
5 .type foo, @function
6 foo:
7 il $2,3
8 lnop
9 heqi $2,0 不必要
10 hbrr 3f,1f
11 sfi $10,$3,0
12 sfi $11,$2,0
13 cgti $12,$3,-1
14 cgti $13,$2,-1 事前計算可能
15 selb $10,$10,$3,$12
16 selb $11,$11,$2,$13 事前計算可能
17 xor $13,$12,$13
18 clz $9,$11 事前計算可能
19 clz $6,$10
20 il $7,1
21 sf $9,$6,$9
22 ori $4,$10,0
23 shl $6,$11,$9
24 shl $7,$7,$9
25 il $5,0
26 1: clgt $8,$6,$4
27 sf $9,$6,$4
28 or $14,$5,$7
29 rotmi $7,$7,-1
30 rotmi $6,$6,-1
31 selb $4,$9,$4,$8
32 selb $5,$14,$5,$8
33 3: brnz $7,1b
34 2: sfi $10,$4,0
35 sfi $11,$5,0
36 selb $4,$10,$4,$12
37 selb $5,$5,$11,$13
38 ori $3,$5,0
39 nop $127
40 bi $lr

switch文

switch 文は case 毎にコードブロックを作成しテーブルを使ってそこに間接ジャンプするようになっています。
以下の特殊な場合、処理自体をまとめることで、無駄なジャンプを削ることが可能です。
1. 同じ変数への定数代入が多い。
2. 定数のリターン処理が多い。

1 foo(unsigned short a)
2 {
3 switch(a) {
4 case 0: return 1;
5 case 1: return 2;
6 case 2: return 3;
7 case 3: return 4;
8 case 4: return 5;
9 default: return 0;
10 }
11 }

これをコンパイルすると以下のようになります。

6 foo:
7 ila $2,65535
8 il $5,0
9 and $4,$3,$2
10 clgti $3,$4,4
11 nop $127
12 brnz $3,.L1
13 shli $7,$4,2
14 ila $3,.L9
15 a $6,$7,$3
16 lqd $5,0($6)
17 rotqby $4,$5,$6
18 bi $4
19 .align 2
20 .align 2
21 .L9:
22 .word .L3
23 .word .L4
24 .word .L5
25 .word .L6
26 .word .L7
27 .L3:
28 il $5,1
29 br .L1
30 .L4:
31 il $5,2
32 br .L1
33 .L5:
34 il $5,3
35 br .L1
36 .L6:
37 il $5,4
38 br .L1
39 .L7:
40 il $5,5
41 lnop
42 .L1:
43 ori $3,$5,0
44 bi $lr

この場合、case ブロックへのジャンプと case ブロックから脱出するためのジャンプの 2 回ジャンプが存在しますが、以下のようにすると 1 回ですみます。

6 foo:
7 ila $2,65535
8 il $5,0
9 and $4,$3,$2
10 clgti $3,$4,4
11 nop $127
12 brnz $3,.L1
13 shli $7,$4,2
14 ila $3,.L9
15 a $6,$7,$3
16 lqd $5,0($6)
17 .L1:
18 ori $3,$5,0
19 bi $lr
20 .align 2
21 .align 2
22 .L9:
23 .word 1
24 .word 2
25 .word 3
26 .word 4
27 .word 5

無効命令が削除されない

1 foo(int a)
2 {
3 if (a > 0) {
4 return 0;
5 }
6 if (a > 100) {
7 return 100;
8 }
9 return -1;
10 }

上記の命令では 7 行目の処理が行なわれることはありませんが、アセンブラファイルにはその処理が入っています。

6 foo:
7 cgti $9,$3,100
8 hbr .L4,$lr
9 cgti $5,$3,0
10 nop $127
11 ceqi $8,$9,0
12 ceqi $2,$5,0
13 ori $7,$8,100
14 il $6,0
15 selb $5,$6,$7,$2
16 ori $3,$5,0
17 nop $127
18 .L4:
19 bi $lr

マージ処理

条件分岐でまとめられる処理を SELB で一括処理しようとしていますが、しきれていません。

1 foo(int a, int b, int c)
2 {
3 if (a > b) {
4 a++;
5 b++;
6 c++;
7 } else {
8 a++;
9 b--;
10 c++;
11 }
12 return a + b + c;
13 }

上記のソースでは 4 行目と 8 行目、6 行目と 11 行目がそれぞれ同じものになっています。
これをコンパイルすると、以下のようになります。

6 foo:
7 cgt $13,$3,$4
8 hbr .L4,$lr
9 ori $2,$5,0
10 nop $127
11 ai $7,$3,1
12 ceqi $5,$13,0
13 ai $10,$4,-1
14 ai $14,$4,1
15 selb $12,$7,$7,$5
16 selb $4,$14,$10,$5
17 ai $11,$2,1
18 ceqi $8,$13,0
19 a $5,$12,$4
20 selb $7,$11,$11,$8
21 a $3,$5,$7
22 .L4:
23 bi $lr

selb を使ってまとめて処理しようとしていますが、15 行目と 20 行目の処理でもselb を使用しています。この部分はコピー処理と等価です。

[gcc, spu] spu版gcc最適化問題

以下は2004年に解析したもの

$lrの無駄なセーブ

リーフ関数にも関わらず、$lrをセーブしている部分がある。

imm001:
il $2,1 lqa $13,v001.0 il $12,127 stqd $15,-16($sp) il $15,0 hbr .L2,$lr stqd $16,-32($sp) nop $127
il $10,-128 stqd $lr,16($sp) ← 後でロードもしないので無駄
il $9,-1

$lrがいかなる関数にも関わらずスタック上にセーブされる

変数のstore処理の無駄

例題

void f(void)
{
static char c;
c = 1;
}

コンパイル結果

nop $127
hbr .L2, $lr
lnop
lnop
il $2, 1 (1)
lqa $6, c.0 (2)
cbd $3, 0($sp) (3)
nop $127
nop $127
shufb $5, $2, $6, $3 (4)
stqa $5, c.0 (5)
.L2:
bl $lr

(1)～(5)が即値のメモリストア

(1) 即値1をレジスタに設定($2 = 00000001 00000001 00000001 00000001)

(2) $6にcの内容を設定($6 = xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx)

(3) シャッフルバイト用の情報作成($3 = 03111213 14151617 18191a1b 1c1d1e1f)

この情報でシャッフルバイトによる型変換を行ないます。

(4) シャッフルバイト処理。($5 = 01xxxxxx xxxxxxxx xxxxxxxx xxxxxxxx)

(5) $5の内容をストア

メモリのロード・ストアにはqword単位のものしかありません。

そのため、メモリ上で後続の情報を壊さないためにも、一回メモリから qword分読み込んで、その後必要な情報を修正して再度書き直さなければなりません。

但し、qwordでアラインしている場合、この保障は必要なくなります。

nop $127
hbr .L2, $lr
lnop
lnop
ilhu $2, 0x0100 (1)
stqa $2, c.0 (2)
.L2:
bl $lr

上記のようなメモリを保護する処理をしている限り変数を全て 16byte アラインにする必要はないのではないか？

変数のstore処理の無駄(2)

配列、構造体のように連続したメモリ上にある場合、メモリアクセスを一括に処理することで、コードを少なくすることが出来る。

変更前

il $3, 1
lqa $8, c
il $2,2
hbr .L2, $lr
cbd $9, 0($sp)
nop
cbd $7, 1($sp)
shufb $5, $3, $8, $9
stqa $5,c
lqa $6,c+1
shufb $4,$2,$6,$7
stqa $4,c+1
nop
.L2:
bi $lr

変更後

il $3, 1
lqa $8, c
il $2,2
hbr .L2, $lr
cbd $9, 0($sp)
nop
cbd $7, 1($sp)
shufb $5, $3, $8, $9
shufb $4, $2, $5, $7
stqa $4,c
nop
.L2:
bi $lr

レジスタ割り当て

リターンレジスタに絡んだレジスタ割り当てが甘い。

ソース

unsigned short f(unsigned short a, unsigned short b)
{
return a * b;
}

アセンブラ

nop
hbr .L2, $lr
mpy $2, $3, $4
ila $5, 65535
and $4, $2, $5 → and $3, $2, $5
ori $3, $4, 0
nop
nop
nop
nop
nop
.L2:
bi $lr

無駄なnop

乗算をする場合、nopが入ります。

このnopは速度をあげるために必要なようです。(理由は不明)

以下の関数を考えます。定数 11 を欠けることで mpyh 命令等を使うようになります。

[ソース]

int foo(int a)
{
return a * 11;
}

[アセンブラ]

foo:
nop $127 (*)
hbr .L2, $lr
il $4, 11
mpyui $6, $3, 11
mpyh $5, $3, $4
nop $127 (*)
nop $127
nop $127
nop $127
a $3, $5, $6
nop $127 (*)
.L2:
bi $lr

(*) マークの nop を削除すると呼び出し部分も含めた処理が 301 サイクルだったのが 299 サイクルになります。また、この関数のサイズも 48byte から 40byte に削減できます。

これ以上削ると、逆にサイクル数が増えます。

2016年1月18日月曜日

2016年1月14日木曜日

概要

操作

初期化

登録

コミット

履歴参照

取得

タグ

不要ファイルの削除

変更の取り消し

ファイル名付け直し

外部との連携

マージ

テキストベースでの成果送付

バイナリベースでの成果送付

2016年1月12日火曜日

セレクトした領域の背景色に応じて値を設定

既存のExcelファイルを開く

2016年1月8日金曜日

lambda式

2016年1月7日木曜日

スタックの使用での問題

sort

-

検索

map

準備

構文解析

ファイルを変数に読み込む

2016年1月6日水曜日

問題を回避する手段

変数のstore処理の無駄

ループ最適化

除算

整数演算の順序

インライン展開時のスタックポインタ処理

変数のstore処理の無駄

シャフル用のベクタに関して

ループ最適化

2004年作成

インライン展開時のスタックポインタ処理

レジスタ割り当て

定数

無駄なnop

変数を増やした場合、使用するレジスタが少なくなる

命令の順番を代えることによってサイクル数を少なくする

整数演算の順序

無駄な型変換

除算

switch文

無効命令が削除されない

マージ処理

$lrの無駄なセーブ

変数のstore処理の無駄

変数のstore処理の無駄(2)

レジスタ割り当て

無駄なnop