2015-04-09

SASで手書き文字データのクラス別主成分を描画する

SAS

はじめに

<a href="http://prunus1350.hatenablog.com/entry/2015/03/10/215627">SASで手書き文字データを可視化する - ぷる日記</a>prunus1350.hatenablog.com
以前このような記事を書きましたが、今回は「0」～「9」それぞれのクラスのデータについて主成分分析を行い、第1主成分を描画してみたいと思います。

コード

実行結果

f:id:Prunus1350:20150418174259p:plain
f:id:Prunus1350:20150418174300p:plain
f:id:Prunus1350:20150418174301p:plain
f:id:Prunus1350:20150418174302p:plain
f:id:Prunus1350:20150418174303p:plain
f:id:Prunus1350:20150418174304p:plain
f:id:Prunus1350:20150418174305p:plain
f:id:Prunus1350:20150418174306p:plain
f:id:Prunus1350:20150418174307p:plain
f:id:Prunus1350:20150418174308p:plain

2015-03-15

SASでロジスティック回帰いろいろ

SAS

はじめに

SASでロジスティック回帰を行う場合、logisticプロシジャを使うのが一般的ですが、他のプロシジャを使ってもパラメータ推定ができるのでやってみました。

データ作成

* 真のパラメータを設定(これらを推定する) ;
%let beta0 = 0.3;
%let beta1 = 0.6;
%let beta2 = 0.5;

data temp;
    call streaminit(1000);
    do x1 = -5 to 5 by 0.2;
        do x2 = -5 to 5 by 0.2;
            p = 1 / (1 + exp(-(&beta0. + &beta1. * x1 + &beta2. * x2)));
            if p <= rand("uniform") then t = 1;
            else                         t = 0;
            output;
        end;
    end;
run;

* 作成したデータを描画 ;
proc sgplot data = temp;
    scatter x = x1 y = x2 / group = t markerattrs=(symbol=circlefilled);
run;

実行結果

f:id:Prunus1350:20150315204132p:plain

logisticプロシジャ

logisticプロシジャを使った基本的なやり方です。

proc logistic data = temp;
    model t = x1 x2;
run;

実行結果(抜粋)

最尤推定値の分析
パラメータ	自由度	推定値	標準誤差	Wald カイ 2 乗	Pr > ChiSq
Intercept	1	0.2904	0.0550	27.9265	<.0001
x1	1	0.6278	0.0256	600.3999	<.0001
x2	1	0.5015	0.0237	447.6218	<.0001

catmodプロシジャ

カテゴリ変数がある場合、logisticプロシジャを使うにはダミー変数化する必要がありますが、catmodプロシジャを使えばカテゴリ変数をそのままmodelステートメントに記述することができます。*1
逆に、連続値の変数はdirectステートメントに記述しておく必要があります。

proc catmod data = temp;
    direct x1 x2;
    model t = x1 x2;
run;
quit;

実行結果(抜粋)

Analysis of Maximum Likelihood Estimates
Parameter	Estimate	Standard Error	Chi- Square	Pr > ChiSq
Intercept	0.2904	0.0550	27.93	<.0001
x1	0.6278	0.0256	600.40	<.0001
x2	0.5015	0.0237	447.62	<.0001

genmodプロシジャ

一般化線形モデル用のgenmodプロシジャを使えば、確率分布に二項分布、リンク関数にロジット関数を指定することでロジスティック回帰を行うことができます。

proc genmod data = temp;
    model t = x1 x2 / dist = binomial link = logit;
run;

実行結果(抜粋)

最大尤度パラメータ推定値の分析
パラメータ	自由度	推定値	標準誤差	Wald 95% 信頼限界		Wald カイ 2 乗	Pr > ChiSq
Intercept	1	0.2904	0.0550	0.1827	0.3981	27.93	<.0001
x1	1	0.6278	0.0256	0.5776	0.6780	600.40	<.0001
x2	1	0.5015	0.0237	0.4550	0.5480	447.62	<.0001
Scale	0	1.0000	0.0000	1.0000	1.0000

genmodプロシジャ(ベイズ推定)

genmodプロシジャでは、bayesステートメントを追記することによってベイズ推定を行うことができます。

proc genmod data = temp;
    model t = x1 x2 / dist = binomial link = logit;
    bayes;
run;

この例ではデフォルト設定で使っていますが、事前分布のパラメータなどを指定することもできます。

実行結果(抜粋)

事後要約
パラメータ	N	平均	標準偏差	パーセント点
パラメータ	N	平均	標準偏差	25%	50%	75%
Intercept	10000	0.2904	0.0552	0.2526	0.2905	0.3273
x1	10000	0.6292	0.0259	0.6113	0.6286	0.6465
x2	10000	0.5026	0.0235	0.4866	0.5024	0.5182

f:id:Prunus1350:20150315214440p:plain f:id:Prunus1350:20150315214450p:plain f:id:Prunus1350:20150315214456p:plain

nlmixedプロシジャ

非線形混合モデル用のnlmixedプロシジャを使えば、二項分布とロジスティック関数を組み合わせることでロジスティック回帰を行うことができます。

proc nlmixed data = temp;
    parms beta0 = 0 beta1 = 0 beta2 = 0;
    theta = 1 - logistic(beta0 + beta1 * x1 + beta2 * x2);
    model t ~ binomial(1, theta);
run;

実行結果(抜粋)

Parameter Estimates
Parameter	Estimate	Standard Error	DF	t Value	Pr > \|t\|	Alpha	Lower	Upper	Gradient
beta0	0.2904	0.05497	2601	5.28	<.0001	0.05	0.1826	0.3982	-0.00013
beta1	0.6278	0.02563	2601	24.50	<.0001	0.05	0.5776	0.6781	-0.00137
beta2	0.5015	0.02371	2601	21.16	<.0001	0.05	0.4550	0.5480	-0.0003

modelステートメントを

    model t ~ binary(theta);

と書いたり、対数尤度関数を記述して

proc nlmixed data = temp;
    parms beta0 = 0 beta1 = 0 beta2 = 0;
    theta = 1 - logistic(beta0 + beta1 * x1 + beta2 * x2);
    llh = t * log(theta) + (1 - t) * log(1 - theta);
    model t ~ general(llh);
run;

とすることもできます。

mcmcプロシジャ

最後にベイズ推定用のmcmcプロシジャを使ってパラメータのベイズ推定を行います。
mcmcプロシジャの書き方はnlmixedプロシジャに似ています。

proc mcmc data = temp nbi = 2000 nmc = 10000;
    parms (beta0 beta1 beta2) = 0;
    prior beta0 beta1 beta2 ~ normal(mean=0, var=10000);
    theta = 1 - logistic(beta0 + beta1 * x1 + beta2 * x2);
    model t ~ binomial(1, theta);
run;

genmodプロシジャのデフォルト設定に合わせてバーンインを2000、サンプリング数を10000に設定しています。

実行結果(抜粋)

Posterior Summaries and Intervals
Parameter	N	Mean	Standard Deviation	95% HPD Interval
beta0	10000	0.2901	0.0545	0.1820	0.3872
beta1	10000	0.6298	0.0253	0.5807	0.6768
beta2	10000	0.5028	0.0233	0.4571	0.5484

f:id:Prunus1350:20150315221910p:plain f:id:Prunus1350:20150315221916p:plain f:id:Prunus1350:20150315221923p:plain

genmodプロシジャの実行結果と合わせたかったのですが、細かいオプションの使い方がまだ分かっていないせいでサンプリングがちょっと上手く行っていないようです。*2

nlmixedプロシジャでの書き換えと同様に、modelステートメントを

    model t ~ binary(theta);

と書いたり、対数尤度関数を記述して

proc mcmc data = temp nbi = 2000 nmc = 10000;
    parms (beta0 beta1 beta2) = 0;
    prior beta0 beta1 beta2 ~ normal(mean=0, var=10000);
    theta = 1 - logistic(beta0 + beta1 * x1 + beta2 * x2);
    llh = t * log(theta) + (1 - t) * log(1 - theta);
    model t ~ general(llh);
run;

とすることもできます。

まとめ

一般化線形モデルの確率分布とリンク関数の関係はみどりぼんを、ロジスティック回帰の対数尤度関数の書き方はぱじパタを読むことで理解できたので、読んでよかったなと思います。

*1:この例では利点がないですね。

*2:分かり次第記事を更新するつもりです。

2015-03-10

SASで手書き文字データを可視化する

SAS

はじめに

Excel方眼紙は手書き文字データを可視化する時に役に立つ。 pic.twitter.com/cQLtCPgTVv
— ぷるうぬす (@Prunus1350) March 3, 2015

先日、Excel方眼紙を使ってkaggleのDigit Recognizerデータの可視化を行いましたが、「やはりこういうことはSASでやるべきなのかな」と思い、やってみました。

コード

kaggleの手書き文字データを可視化する

実行結果

f:id:Prunus1350:20150310212205p:plain

できました。

2015-02-04

SASで「データ解析のための統計モデリング入門」を読む第3章

SAS

まえおき

今回もSASを使って「データ解析のための統計モデリング入門」を読んでいきます。
実行結果の解説は特に行わないので、内容については書籍をご覧下さい。

データ解析のための統計モデリング入門――一般化線形モデル・階層ベイズモデル・MCMC (確率と情報の科学)

作者: 久保拓弥
出版社/メーカー: 岩波書店
発売日: 2012/05/19
メディア: 単行本
購入: 16人クリック: 163回
この商品を含むブログ (21件) を見る

第3章一般化線形モデル(GLM) ―ポアソン回帰―

3.2 観測されたデータの概要を調べる

データ読込

%let home_dir = /folders/myfolders/midoribon/chapter03/;

data d;
    infile "&home_dir.data3a.csv" dsd missover firstobs = 2;
    attrib y length = 8  label = "種子数"
           x length = 8  label = "体サイズ"
           f length = $1 label = "施肥処理"
    ;
    input y x f;
run;

P42 全100個体ぶんのデータを表示

proc print data = d;
run;

実行結果

OBS	y	x	f
1	6	8.31	C
2	6	9.44	C
3	6	9.50	C

...(中略)...

99	7	10.86	T
100	9	9.97	T

P42 列ごとにデータを表示：x列

proc print data = d;
    var x;
run;

実行結果

OBS	x
1	8.31
2	9.44
3	9.50

...(中略)...

99	10.86
100	9.97

P43 列ごとにデータを表示：y列

proc print data = d;
    var y;
run;

実行結果

OBS	y
1	6
2	6
3	6

...(中略)...

99	7
100	9

P43 列ごとにデータを表示：f列

proc print data = d;
    var f;
run;

実行結果

OBS	f
1	C
2	C
3	C

...(中略)...

99	T
100	T

P44 データセットの概要：数値変数

proc means data = d min q1 median mean q3 max;
run;

実行結果

変数	ラベル	最小値	下側四分位点	中央値	平均	上側四分位点	最大値
y	種子数	2.0000000	6.0000000	8.0000000	7.8300000	10.0000000	15.0000000
x	体サイズ	7.1900000	9.4250000	10.1550000	10.0891000	10.7000000	12.4000000

P44 データセットの概要：文字変数

proc freq data = d;
    tables f / missing nocol norow nopercent;
run;

実行結果

施肥処理
f	度数	累積度数
C	50	50
T	50	100

3.3 統計モデリングの前にデータを図示する

P46 図3.2 例題の架空データの図示

proc sgplot data = d;
    scatter x = x y = y / group = f markerattrs = (symbol=circlefilled);
run;

実行結果

f:id:Prunus1350:20150203003938p:plain

P46 図3.3 植物の種子数の分布を、施肥処理でグループわけした箱ひげ図

proc sgplot data = d;
    vbox y / category = f;
run;

実行結果

f:id:Prunus1350:20150203003939p:plain

3.4 ポアソン回帰の統計モデル

P48 図3.4 個体iの平均種子数 $\lambda_i$ と体サイズ $x_i$ の関係

data data3_4;
    do x = -4 to 5 by 0.1;
        lambda1 = exp(-2 - 0.8 * x);
        lambda2 = exp(-1 + 0.4 * x);
        output;
    end;
run;

proc sgplot data = data3_4;
    series x = x y = lambda1 / lineattrs = (pattern=dash);
    series x = x y = lambda2;
    refline 0 / axis = x lineattrs = (pattern=dot);
    xaxis label = "個体iの体のサイズx_i";
    yaxis label = "個体iのλ_i";
run;

実行結果

f:id:Prunus1350:20150203003940p:plain

P50 ポアソン回帰

proc genmod data = d;
    model y = x / dist = poisson
                  link = log;
    ods output parameterestimates = param_est;
run;

実行結果(抜粋)

最大尤度パラメータ推定値の分析
パラメータ	自由度	推定値	標準誤差	Wald 95% 信頼限界		Wald カイ 2 乗	Pr > ChiSq
Intercept	1	1.2917	0.3637	0.5789	2.0045	12.61	0.0004
x	1	0.0757	0.0356	0.0059	0.1454	4.52	0.0336
Scale	0	1.0000	0.0000	1.0000	1.0000

P52 図3.6 パラメーター推定値のばらつきの評価

data _null_;
    set param_est;
    if parameter eq "Intercept" then do;
        call symputx("beta1", estimate);
        call symputx("std_err1", stderr);
    end;
    if parameter eq "x" then do;
        call symputx("beta2", estimate);
        call symputx("std_err2", stderr);
    end;
run;

data param3_6;
    do i = -0.1 to 1.6 by 0.005;
        beta1 = pdf("normal", i, &beta1., &std_err1.);
        beta2 = pdf("normal", i, &beta2., &std_err2.);
        output;
    end;
run;

proc sgplot data = param3_6;
    series x = i y = beta1;
    series x = i y = beta2;
    refline 0 / axis = x;
run;

実行結果

f:id:Prunus1350:20150203003941p:plain

P54 図3.7 平均種子数λの予測

proc sql noprint;
    select min(x), max(x) into :minx, :maxx
        from d;
quit;

data line3_7;
    do x1 = &minx. to &maxx. by 0.1;
        y1 = exp(&beta1. + &beta2. * x1);
        output;
    end;
run;

data d3_7;
    set d line3_7;
run;

proc sgplot data = d3_7;
    scatter x = x y = y / group = f markerattrs = (symbol=circlefilled);
    series x = x1 y = y1 / lineattrs=(color=orange);
run;

実行結果

f:id:Prunus1350:20150203003942p:plain

3.5 説明変数が因子型の統計モデル

P56 説明変数が因子型の統計モデル

proc genmod data = d;
    class f(ref="C");
    model y = f / dist = poisson
                  link = log;
run;

実行結果(抜粋)

適合度評価の基準
基準	自由度	値	値/自由度
デビアンス	98	89.4750	0.9130
Scaled デビアンス	98	89.4750	0.9130
Pearson カイ 2 乗	98	87.0794	0.8886
Scaled Pearson カイ 2 乗	98	87.0794	0.8886
対数尤度		828.4006
完全対数尤度		-237.6273
AIC (小さいほどよい)		479.2545
AICC (小さいほどよい)		479.3782
BIC (小さいほどよい)		484.4649

最大尤度パラメータ推定値の分析
パラメータ		自由度	推定値	標準誤差	Wald 95% 信頼限界		Wald カイ 2 乗	Pr > ChiSq
Intercept		1	2.0516	0.0507	1.9522	2.1509	1637.26	<.0001
f	T	1	0.0128	0.0715	-0.1273	0.1529	0.03	0.8582
f	C	0	0.0000	0.0000	0.0000	0.0000	.	.
Scale		0	1.0000	0.0000	1.0000	1.0000

3.6 説明変数が数量型＋因子型の統計モデル

P58 説明変数が数量型＋因子型の統計モデル

proc genmod data = d;
    class f(ref="C");
    model y = x f / dist = poisson
                    link = log;
run;

実行結果(抜粋)

適合度評価の基準
基準	自由度	値	値/自由度
デビアンス	97	84.8079	0.8743
Scaled デビアンス	97	84.8079	0.8743
Pearson カイ 2 乗	97	83.8215	0.8641
Scaled Pearson カイ 2 乗	97	83.8215	0.8641
対数尤度		830.7341
完全対数尤度		-235.2937
AIC (小さいほどよい)		476.5874
AICC (小さいほどよい)		476.8374
BIC (小さいほどよい)		484.4029

最大尤度パラメータ推定値の分析
パラメータ		自由度	推定値	標準誤差	Wald 95% 信頼限界		Wald カイ 2 乗	Pr > ChiSq
Intercept		1	1.2631	0.3696	0.5386	1.9876	11.68	0.0006
x		1	0.0801	0.0370	0.0075	0.1527	4.67	0.0306
f	T	1	-0.0320	0.0744	-0.1778	0.1138	0.19	0.6670
f	C	0	0.0000	0.0000	0.0000	0.0000	.	.
Scale		0	1.0000	0.0000	1.0000	1.0000

3.7 「何でも正規分布」「何でも直線」には無理がある

データ読込

data d0;
    infile "&home_dir.d0.csv" dsd missover firstobs = 2;
    attrib x length = 8  label = ""
           y length = 8  label = ""
    ;
    input x y;
run;

P61 図3.9(A) 正規分布・恒等リンク関数の統計モデル

proc genmod data = d0;
    model y = x / dist = normal
                  link = id;
    ods output parameterestimates = paramest1;
run;

data _null_;
    set paramest1;
    if parameter eq "Intercept" then call symputx("beta3", estimate);
    if parameter eq "x"         then call symputx("beta4", estimate);
run;

data line3_9a;
    do x1 = 0 to 2 by 0.1;
        y1 = &beta3. + &beta4. * x1;
        output;
    end;
run;

data d0_3_9a;
    set d0 line3_9a;
run;

proc sgplot data = d0_3_9a;
    scatter x = x y = y;
    series x = x1 y = y1 / lineattrs = (pattern=dash);
    refline 0 / axis = y;
    yaxis min = -2;
run;

実行結果

f:id:Prunus1350:20150203003943p:plain

P61 図3.9(B) ポアソン分布・対数リンク関数の統計モデル

proc genmod data = d0;
    model y = x / dist = poisson
                  link = log;
    ods output parameterestimates = paramest2;
run;

data _null_;
    set paramest2;
    if parameter eq "Intercept" then call symputx("beta5", estimate);
    if parameter eq "x"         then call symputx("beta6", estimate);
run;

data line3_9b;
    do x1 = 0 to 2 by 0.1;
        y1 = exp(&beta5. + &beta6. * x1);
        output;
    end;
run;

data d0_3_9b;
    set d0 line3_9b;
run;

proc sgplot data = d0_3_9b;
    scatter x = x y = y;
    series x = x1 y = y1 / lineattrs = (pattern=dash);
    refline 0 / axis = y;
    yaxis min = -2;
run;

実行結果

f:id:Prunus1350:20150203003944p:plain

2015-01-29

SAS DS2 Languageで行列の積

SAS

はじめに

SASで行列演算といえばSAS/IMLがあるんですが、導入していない現場も多いんじゃないでしょうか。
SAS9.4から使えるようになったDS2 Languageを使ってもある程度の行列演算ができると知り、サンプルコードを元に行列の積を求めるマクロを作成しました。
参考にしたページはこちら

SAS(R) 9.4 DS2 Language Reference, Third Edition

マクロを作成するにあたって変更した点は以下の2点。

データセットを行列にする際に、変数名のprefixと配列名を合わせておく必要があるので、その対応
行列サイズをデータセットから取得

コード

DS2 Languageで行列の積

使ってみる

下記のコードで2つのデータセットを用意します

data temp1;
    a = 1; b = 2; c = 3; output;
    a = 4; b = 5; c = 6; output;
run;

data temp2;
    d = 1; e = 2; output;
    d = 3; e = 4; output;
    d = 5; e = 6; output;
run;

temp1.sas7bdat

OBS	a	b	c
1	1	2	3
2	4	5	6

temp2.sas7bdat

OBS	d	e
1	1	2
2	3	4
3	5	6

マクロ実行

%mult_matrix(in1=temp1, in2=temp2, out1=temp3);

実行結果

temp3.sas7bdat

OBS	z1	z2
1	22	28
2	49	64

マクロ実行

%mult_matrix(in1=temp2, in2=temp1, out1=temp4);

実行結果

temp4.sas7bdat

OBS	z1	z2	z3
1	9	12	15
2	19	26	33
3	29	40	51

さいごに

行列の積を求める際にmatrixパッケージのmultメソッドを使っていますが、他にも転置や逆行列を求めるメソッドもあるのでマクロを少し変更するだけでいろんな計算ができそうです。
それにしても行列の積をもとめるだけでこのコードの行数になるとは。。。*1

*1:もっといい書き方があれば修正していきます。

2015-01-19

モンテカルロ法で次元の呪いを体験する

SAS

MCMC講義（伊庭幸人）難易度 - YouTube

を観ていたところ、「(モンテカルロ法で円周率を求めるのは高次元になるとうまく行かなくなるので)一度は必ずやってみるべし！」と言われたのでやってみました。(4:17～)
もちろんSASで。

N次元単位超球の(超)体積

${ \displaystyle V_N=\frac{\pi^{N/2}}{\Gamma(N/2+1)} }$

超球を包む1辺の長さが2の超立方体の(超)体積

${ \displaystyle C_N=2^{N} }$

円周率を求める

コードをシンプルにするために球の中心を原点にとり、すべての次元に対して正の方向のみを考えます。すると、球内部の体積は $\frac{\pi^{N/2}}{2^{N}\cdot\Gamma{(N/2+1)}}$ 、単位立方体の体積は $1$ となります。
この立方体の中に一様ランダムに点を打っていったときに、点を打った数 $(=rep)$ と球の中に点が入ったときの数 $(=cnt)$ の比率が立方体の体積に対する球内部の体積の比率に近くなることが期待できます。
式で書くと、

$\frac{\pi^{N/2}}{2^{N}\cdot\Gamma{(N/2+1)}}:1\simeq cnt:rep$

$\pi$ について整理すると

$\pi\simeq \left( \frac{cnt}{rep}\cdot 2^{N}\cdot\Gamma{(\frac{N}{2}+1)} \right) ^{2/N}$

となります。*1

コード

%macro pi(dim=, rep=);

    data pi;
        do i = 1 to &rep.;
            %do j = 1 %to &dim.;
                x&j. = rand("uniform") ** 2;
            %end;
            dist = sqrt(sum(of x:));
            if dist <= 1 then cnt + 1;
            calc_pi = ((cnt / i) * (2 ** &dim.) * gamma(&dim. / 2 + 1)) ** (2 / &dim.);
            output;
        end;
        put calc_pi = ;
    run;
    
    %let pi = %sysfunc(constant(pi));

    title "&dim.次元";
    proc sgplot data = pi;
        series x = i y = calc_pi;
        refline &pi. / axis = y;
        yaxis min = 0;
    run;
    title;

%mend pi;

%pi(dim=2,  rep=10000);
%pi(dim=3,  rep=10000);
%pi(dim=4,  rep=10000);
%pi(dim=5,  rep=10000);
%pi(dim=6,  rep=10000);
%pi(dim=7,  rep=10000);
%pi(dim=8,  rep=10000);
%pi(dim=9,  rep=10000);
%pi(dim=10, rep=10000);
%pi(dim=11, rep=10000);
%pi(dim=12, rep=10000);
%pi(dim=13, rep=10000);
%pi(dim=14, rep=10000);
%pi(dim=15, rep=10000);

実行結果

f:id:Prunus1350:20150119020714p:plain

f:id:Prunus1350:20150119020715p:plain

f:id:Prunus1350:20150119020716p:plain

f:id:Prunus1350:20150119020717p:plain

f:id:Prunus1350:20150119020718p:plain

f:id:Prunus1350:20150119020719p:plain

f:id:Prunus1350:20150119020720p:plain

f:id:Prunus1350:20150119020721p:plain

f:id:Prunus1350:20150119020722p:plain

f:id:Prunus1350:20150119020723p:plain

f:id:Prunus1350:20150119020724p:plain

f:id:Prunus1350:20150119020725p:plain

f:id:Prunus1350:20150119020726p:plain

f:id:Prunus1350:20150119020727p:plain

どうしてこうなった？

10次元あたりで挙動がおかしくなっているのが分かるかと思います。
単位立方体に対する球内部の体積の比率を調べてみます。*2

コード

data sphere;
    do n = 2 to 15;
        v = (constant("pi") ** (n / 2)) / (2 ** n * gamma(n / 2 + 1));
        output;
    end;
run;

実行結果

次元	体積比
2	0.78540
3	0.52360
4	0.30843
5	0.16449
6	0.08075
7	0.03691
8	0.01585
9	0.00644
10	0.00249
11	0.00092
12	0.00033
13	0.00011
14	0.00004
15	0.00001

次元が高くなると、球の内部に点が打たれる確率が $0$ に近付いているのが分かりました。
次元の呪いですね。

まとめ

f:id:Prunus1350:20150119023235j:plain

*1:次元数 $N$ が奇数のときには $\Gamma{(\frac{1}{2})=\sqrt{\pi}}$ となり右辺からさらに $\pi$ が出てきますが、今回はSASのgamma functionの結果をそのまま使っています。

*2:ここでも正の部分のみを考えています

2015-01-11

SASでWald-Wolfowitz test

SAS

現場ですぐ使える時系列データ分析 ~データサイエンティストのための基礎知識~

作者: 横内大介,青木義充
出版社/メーカー: 技術評論社
発売日: 2014/02/18
メディア: 単行本（ソフトカバー）
この商品を含むブログ (6件) を見る

この本の2-3節に出てくる連の検定(runs test)は、Rではtseriesライブラリのruns.test関数を使えば実行できると書いてあります。
SASではどうやるんだろうかと調べたところ、SAS社のサポートページにコードが書いてあったので試してみました。

33092 - Wald-Wolfowitz (or Runs) test for randomness

こちらのページでは検定の名称がWald-Wolfowitz testと紹介されていたので、このエントリのタイトルもそちらを使いました。
コードはほぼコピペで、データセット名と変数名をマクロのキーワードパラメータとして渡せるようにする部分だけ変更してあります。

使用するデータ

本のサポートページからダウンロードしたデータから、4銘柄の収益率の時系列データ return4 を使います。

time	x5202	x7272	x4927	x4502
1	1.80185	-1.47713	3.31831	-0.55788
2	0.00000	-0.59702	0.03932	-0.13996
3	1.76996	0.89419	-1.74469	-1.98026
4	5.12933	5.34423	1.39029	1.27752
5	3.27898	2.08776	-0.51414	0.14094

以下略

コード

%macro Wald_Wolfowitz_Test(in1=, var1=);

    proc standard data = &in1. out = __m1 mean = 0;
        var &var1.;
    run;

    data __runcount;
        set __m1 nobs = nobs;
        if &var1. eq 0 then delete;
        if &var1. >  0 then n + 1;
        if &var1. <  0 then m + 1;
        retain runs 0 numpos 0 numneg 0;
        previous = lag(&var1.);

        if _n_ eq 1 then do;
            runs = 1;
            prevpos = .;
            currpos = .;
            prevneg = .;
            currneg = .;
        end;
        else do;
            prevpos = (previous > 0);
            currpos = (&var1. > 0 );
            prevneg = (previous < 0);
            currneg = (&var1. < 0);

            if _n_ eq 2 and (currpos and prevpos) then numpos + 1;
            else if _n_ eq 2 and (currpos and prevneg) then numneg + 1;
            else if _n_ eq 2 and (currneg and prevpos) then numpos + 1;
            else if _n_ eq 2 and (currneg and prevneg) then numneg + 1;

            if currpos and prevneg then do;
                runs + 1;
                numpos + 1;
            end;

            if currneg and prevpos then do;
                runs + 1;
                numneg + 1;
            end;
        end;
    run;

    data __runcount;
        set __runcount end = last;
        if last;
    run;

    data __waldwolf;
        label z = 'Wald-Wolfowitz Z'
              pvalue = 'Pr > |Z|';
        set __runcount;
        mu = ((2 * n * m) / (n + m)) + 1;
        sigmasq = ((2 * n * m) * (2 * n * m - (n + m))) / (((n + m) ** 2) * (n + m - 1));
        sigma = sqrt(sigmasq);
        drop sigmasq;

        if N GE 50 then Z = (Runs - mu) / sigma;
        else if Runs - mu LT 0 then Z = (Runs - mu + 0.5) / sigma;
        else Z = (Runs - mu - 0.5) / sigma;

        pvalue = 2 * (1 - probnorm(abs(Z)));
    run;

    title  'Wald-Wolfowitz Test for Randomness';
    title2 'H0: The data are random';
    proc print data = __waldwolf label noobs;
        var z pvalue;
        format pvalue pvalue.;
    run;
    title;
    title2;

%mend Wald_Wolfowitz_Test;

%Wald_Wolfowitz_Test(in1=return4, var1=x5202);
%Wald_Wolfowitz_Test(in1=return4, var1=x7272);
%Wald_Wolfowitz_Test(in1=return4, var1=x4927);
%Wald_Wolfowitz_Test(in1=return4, var1=x4502);

実行結果

Wald-Wolfowitz Z	Pr > \|Z\|
-1.37058	0.1705

Wald-Wolfowitz Z	Pr > \|Z\|
-0.17132	0.8640

Wald-Wolfowitz Z	Pr > \|Z\|
1.65012	0.0989

Wald-Wolfowitz Z	Pr > \|Z\|
-0.12194	0.9029

本と同じ結果が得られました。

はじめに

コード

実行結果

はじめに

データ作成

実行結果

logisticプロシジャ

実行結果(抜粋)

catmodプロシジャ

実行結果(抜粋)

genmodプロシジャ

実行結果(抜粋)

genmodプロシジャ(ベイズ推定)

実行結果(抜粋)

nlmixedプロシジャ

実行結果(抜粋)

mcmcプロシジャ

実行結果(抜粋)

まとめ

はじめに

コード

実行結果

まえおき

第3章 一般化線形モデル(GLM) ―ポアソン回帰―

3.2 観測されたデータの概要を調べる

データ読込

P42 全100個体ぶんのデータを表示

実行結果

P42 列ごとにデータを表示：x列

実行結果

P43 列ごとにデータを表示：y列

実行結果

P43 列ごとにデータを表示：f列

実行結果

P44 データセットの概要：数値変数

実行結果

P44 データセットの概要：文字変数

実行結果

3.3 統計モデリングの前にデータを図示する

P46 図3.2 例題の架空データの図示

実行結果

P46 図3.3 植物の種子数の分布を、施肥処理でグループわけした箱ひげ図

実行結果

3.4 ポアソン回帰の統計モデル

P48 図3.4 個体iの平均種子数と体サイズの関係

実行結果

P50 ポアソン回帰

実行結果(抜粋)

P52 図3.6 パラメーター推定値のばらつきの評価

実行結果

P54 図3.7 平均種子数λの予測

実行結果

3.5 説明変数が因子型の統計モデル

P56 説明変数が因子型の統計モデル

実行結果(抜粋)

3.6 説明変数が数量型＋因子型の統計モデル

P58 説明変数が数量型＋因子型の統計モデル

実行結果(抜粋)

3.7 「何でも正規分布」「何でも直線」には無理がある

データ読込

P61 図3.9(A) 正規分布・恒等リンク関数の統計モデル

実行結果

P61 図3.9(B) ポアソン分布・対数リンク関数の統計モデル

実行結果

はじめに

コード

使ってみる

temp1.sas7bdat

temp2.sas7bdat

マクロ実行

実行結果

temp3.sas7bdat

マクロ実行

実行結果

temp4.sas7bdat

さいごに

N次元単位超球の(超)体積

超球を包む1辺の長さが2の超立方体の(超)体積

円周率を求める

コード

第3章一般化線形モデル(GLM) ―ポアソン回帰―

P48 図3.4 個体iの平均種子数 $\lambda_i$ と体サイズ $x_i$ の関係