statistics / hellog: History of the English Language Blog

Last updated: 2026-07-15 01:27

2012-10-31 Wed

■ #1283. Methods for calculating collocation [corpus][statistics][bnc][collocation][lltest]

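　In the entry [2010-03-04-1], "#311. Which adjectives collocate most often with girl?", we saw that there are several kinds of calculation method (association measure) for measuring the co-occurrence (collocation) of words. In corpus linguistics, a method based on a significance test called the Log-Likelihood Test is used relatively often, but each method has its own characteristics, so it is best to try several of them where possible. Although this partly overlaps with [2010-03-04-1], this entry describes the characteristics of each of the seven methods implemented in BNCweb, with hints on their use, following Hoffmann et al. (149--58).
　Each method is based on one, or a combination, of (a) frequency of co-occurrence, (b) significance of co-occurrence, and (c) effect size. Note that (b) indicates how confident we can be that the co-occurrence is statistically significant; it does not indicate how strong the co-occurrence is. (c) takes the ratio of observed frequency to expected frequency as the basis of its calculation.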
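　As a rough illustration of how observed and expected frequencies relate, the following Python sketch computes an expected co-occurrence count under independence and the observed/expected ratio that effect-size measures build on. The function name, the simplified formula (which ignores the size of the collocation window) and all counts are assumptions made for illustration; they are not taken from BNCweb or from Hoffmann et al.

```python
def expected_frequency(node_freq: int, collocate_freq: int, corpus_size: int) -> float:
    """Expected co-occurrence count if node and collocate were independent.

    Simplifying assumption: one co-occurrence slot per token, no window span.
    """
    return node_freq * collocate_freq / corpus_size


# Hypothetical counts, invented purely for illustration.
observed = 30  # (a) frequency of co-occurrence
expected = expected_frequency(node_freq=1_000, collocate_freq=500, corpus_size=1_000_000)
ratio = observed / expected  # the basis of (c) effect-size measures

print(f"expected = {expected}, observed/expected = {ratio}")
```

　A ratio well above 1 means the pair co-occurs more often than chance would predict; how far that can be trusted is the separate question that (b) addresses.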

　(1) Rank by frequency
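　　The simplest and most intuitive measure, using the observed co-occurrence frequency itself. Unlike the other methods, it involves no complex statistical processing, and it is the crudest of the measures. Function words and punctuation marks often come out on top. It is not used for ordinary collocation analysis.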

　(2) Log-likelihood
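　　Uses significance of co-occurrence. This is the default method in BNCweb and is widely used in corpus research. It tends to filter out co-occurrences with extremely high-frequency items such as function words and punctuation marks and, conversely, co-occurrences with extremely low-frequency words (those occurring only once or twice, for example). However, it characteristically gives high scores to combinations with a high co-occurrence frequency, so the results need to be interpreted with care.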
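　　The log-likelihood statistic is commonly computed from a 2x2 contingency table as G2 = 2 Σ O ln(O/E). The sketch below uses that textbook form. The function name and the way the four cells are derived from marginal frequencies are assumptions for illustration, and BNCweb's own implementation (for example, how it accounts for the window span) may differ.

```python
import math


def log_likelihood(o11: int, f_node: int, f_coll: int, n: int) -> float:
    """Textbook G2 for a 2x2 table built from marginal frequencies.

    o11    : co-occurrences of node and collocate
    f_node : total frequency of the node
    f_coll : total frequency of the collocate
    n      : corpus size (simplified: no window-span adjustment)
    """
    # Observed cells: node&coll, node&not-coll, not-node&coll, neither.
    observed = [o11, f_node - o11, f_coll - o11, n - f_node - f_coll + o11]
    rows = [f_node, f_node, n - f_node, n - f_node]
    cols = [f_coll, n - f_coll, f_coll, n - f_coll]
    g2 = 0.0
    for o, r, c in zip(observed, rows, cols):
        e = r * c / n  # expected count for this cell under independence
        if o > 0:  # a cell with zero observations contributes nothing
            g2 += o * math.log(o / e)
    return 2 * g2


# Hypothetical counts: the score grows as the observed count departs
# from the expected count (here 100 * 100 / 1000 = 10).
for o11 in (10, 20, 50):
    print(o11, round(log_likelihood(o11, 100, 100, 1000), 2))
```

　　The value measures the amount of evidence against independence, which is why it rewards pairs with plenty of data rather than pairs with a large observed/expected ratio.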

　(3) Mutual information (MI)
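　　Uses effect size. This is a very commonly used method, but it calls for considerable care. It is good at eliminating commonplace co-occurrences with function words, punctuation marks and the like, but on the other hand it is heavily biased towards low-frequency collocations. To reduce the effect of this bias, setting "Freq(node, collocate) at least" to 10 or more is recommended in BNCweb. This brings out "conspicuous and intuitively appealing collocations involving words of intermediate frequency" (Hoffmann et al. 154).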
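　　A minimal sketch of MI in its usual textbook form, log2(O/E), together with a minimum co-occurrence threshold like the one described above. The candidate words and their counts are invented, and `MIN_COOCCURRENCE` is a hypothetical name standing in for the "Freq(node, collocate) at least" setting.

```python
import math


def mutual_information(observed: int, expected: float) -> float:
    """Textbook MI score: log2 of the observed/expected ratio."""
    return math.log2(observed / expected)


# Hypothetical (collocate, observed, expected) rows, invented for illustration.
candidates = [("rare_word", 2, 0.001), ("mid_word", 40, 0.5), ("the", 900, 800.0)]

MIN_COOCCURRENCE = 10  # stand-in for "Freq(node, collocate) at least"

kept = [(w, mutual_information(o, e)) for w, o, e in candidates if o >= MIN_COOCCURRENCE]
ranked = sorted(kept, key=lambda pair: pair[1], reverse=True)

for word, score in ranked:
    print(word, round(score, 2))
```

　　Without the threshold, the pair seen only twice would receive the highest score of the three, which is the low-frequency bias in question.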

　(4) T-score
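　　Takes co-occurrence frequency and significance of co-occurrence into account. For rare collocations with an expected frequency of about 1 or less, it behaves much like Rank by frequency; for high-frequency collocations, its behaviour reflects significance of co-occurrence. In addition, for the collocations it returns, the observed frequency is always higher than the expected frequency. The results are often similar to those of Log-likelihood, but the bias towards high frequency is even stronger. It can be effective when the node itself occurs far fewer than 1,000 times.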
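　　The behaviour with rare collocations can be seen from the usual textbook formula, (O - E) / sqrt(O): when E is tiny, the score is close to sqrt(O), so the ranking follows raw frequency. The sketch below assumes that formula; the counts are invented for illustration.

```python
import math


def t_score(observed: int, expected: float) -> float:
    """Textbook t-score: (O - E) / sqrt(O). Requires observed > 0."""
    return (observed - expected) / math.sqrt(observed)


# Hypothetical rare pairs with a tiny expected frequency: the score stays
# just below sqrt(O), so it orders pairs by raw co-occurrence frequency.
for observed in (4, 9, 25):
    print(observed, round(t_score(observed, 0.1), 2), math.sqrt(observed))
```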

　(5) Z-score
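　　Takes significance of co-occurrence and effect size into account. For high-frequency collocations it gives more weight to effect size, while for low-frequency collocations it leans on effect size less heavily. It is a well-balanced measure that combines features of both Log-likelihood and MI. However, like MI, it shows a bias towards low-frequency collocations, so setting "Freq(node, collocate) at least" to about 5 is considered advisable.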
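　　A short sketch using the common textbook formula for the z-score, (O - E) / sqrt(E), with a minimum co-occurrence threshold of 5. The data rows are invented, and `MIN_COOCCURRENCE` is a hypothetical name for the "Freq(node, collocate) at least" setting.

```python
import math


def z_score(observed: int, expected: float) -> float:
    """Textbook z-score: (O - E) / sqrt(E). Requires expected > 0."""
    return (observed - expected) / math.sqrt(expected)


# Hypothetical (collocate, observed, expected) rows, invented for illustration.
candidates = [("rare_word", 2, 0.001), ("mid_word", 40, 0.5), ("the", 900, 800.0)]

MIN_COOCCURRENCE = 5  # stand-in for "Freq(node, collocate) at least"

unfiltered = sorted(candidates, key=lambda row: z_score(row[1], row[2]), reverse=True)
filtered = [row for row in unfiltered if row[1] >= MIN_COOCCURRENCE]

print([w for w, _, _ in unfiltered])  # the twice-seen pair ranks first
print([w for w, _, _ in filtered])    # the threshold removes it
```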

　(6) MI3
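　　Takes co-occurrence frequency and effect size into account. It is a modification designed to remove MI's overemphasis on low-frequency expressions. Effect size is reflected relatively well for low-frequency collocations, and co-occurrence frequency for high-frequency collocations. It works well when combined with a POS restriction, and it is powerful for extracting multi-word terms and the like. Overall, however, it is strongly biased towards high-frequency collocations and is not suited to general collocation analysis.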
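　　MI3 is usually given as log2(O^3 / E), that is, MI with the observed frequency cubed. The sketch below assumes that textbook form and compares it with plain MI on invented counts, to show how cubing O shifts the ranking towards high-frequency pairs.

```python
import math


def mi(observed: int, expected: float) -> float:
    """Textbook MI: log2(O / E)."""
    return math.log2(observed / expected)


def mi3(observed: int, expected: float) -> float:
    """Textbook MI3: log2(O**3 / E)."""
    return math.log2(observed ** 3 / expected)


# Hypothetical (collocate, observed, expected) rows, invented for illustration.
candidates = [("rare_word", 2, 0.001), ("mid_word", 40, 0.5), ("the", 900, 800.0)]

by_mi = [w for w, o, e in sorted(candidates, key=lambda r: mi(r[1], r[2]), reverse=True)]
by_mi3 = [w for w, o, e in sorted(candidates, key=lambda r: mi3(r[1], r[2]), reverse=True)]

print("MI :", by_mi)
print("MI3:", by_mi3)
```

　　On these counts the two rankings are exact mirror images of each other, which illustrates why MI3 is described above as biased towards high-frequency collocations.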

　(7) Dice coefficient
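　　Like MI3, takes co-occurrence frequency and effect size into account. Unlike MI3, however, co-occurrence frequency is reflected well for low-frequency collocations and effect size for high-frequency collocations, and the switch between the two is characteristically abrupt. The switch is said to occur at around the point where the frequency of the node itself is about ten times that of the collocation. Empirically, it gives results similar to Z-score, but without as much frequency-based bias.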
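　　The Dice coefficient is commonly defined as 2 x O / (f_node + f_collocate), which always lies between 0 and 1. The sketch below assumes that definition; the function name and counts are invented for illustration, and BNCweb's exact implementation may differ.

```python
def dice(cooccurrence: int, node_freq: int, collocate_freq: int) -> float:
    """Textbook Dice coefficient: 2 * O / (f_node + f_collocate)."""
    return 2 * cooccurrence / (node_freq + collocate_freq)


# Hypothetical counts, invented for illustration.
print(dice(30, 1_000, 500))   # a pair that co-occurs only occasionally
print(dice(100, 100, 100))    # two words that only ever occur together
```

　　Because the denominator contains the total frequencies of both words, a very frequent collocate such as a function word cannot reach a high score merely by co-occurring often.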

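　With so many measures it is hard to choose among them, but in the view of Hoffmann et al., Log-likelihood and MI are recommended as single-criterion measures, and Z-score and Dice as mixed-criterion measures.
　For the various methods of calculating collocation, see Association measures.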

　・ Hoffmann, Sebastian, Stefan Evert, Nicholas Smith, David Lee, and Ylva Berglund Prytz. Corpus Linguistics with BNCweb: A Practical Guide. Frankfurt am Main: Peter Lang, 2008.

Referrer (Inside): [2019-07-10-1]

Decade	1510	1520	1530	1540	1550	1560	1570	1580	1590	1600	1610	1620	1630	1640	1650	1660	1670	1680	1690	1700
New words	409	508	1415	1400	1609	1310	1548	1876	1951	3300	2710	2281	1688	1122	1786	1973	1370	1228	974	943

	SOED (80,096 words)	ALD (27,241 words)	GSL (3,984 words)
West Germanic	22.20%	27.43%	47.08%
French	28.37%	35.89%	38.00%
Latin	28.29%	22.05%	9.59%
Greek	5.32%	1.59%	0.25%
Other Romance	1.86%	1.60%	0.20%
Celtic	0.34%	0.25%	---

	Post-positive genitive	'Periphrastic' genitive	Pre-positive genitive
c. 900	47.5%	0.5%	52.0%
c. 1000	30.5%	1.0%	68.5%
c. 1100	22.2%	1.2%	76.6%
c. 1200	11.8%	6.3%	81.9%
c. 1250	0.6%	31.4%	68.9%
c. 1300	0.0%	84.5%	15.6%

	c. 900	c. 1000	c. 1100	c. 1200	c. 1250
Genitive before its noun	52.4%	69.1%	77.4%	87.4%	99.1%
Genitive after its noun	47.6%	30.9%	22.6%	12.6%	0.9%

OE Corpus (900--1000)	Dative-object before acc-obj.	Dative-object after acc-obj.
Nouns	249 (64.0%)	140 (36.0%)
Pronouns	674 (82.8%)	141 (17.2%)
Both together	923 (76.6%)	281 (23.3%)

OE Corpus (900--1000)	Dative-object before the verb	Dative-object after the verb
Nouns	95 (27.6%)	249 (72.4%)
Pronouns	495 (48.7%)	518 (51.3%)
Both together	587 (43.4%)	767 (56.6%)
EME Corpus (c1200)	Dative-object before the verb	Dative-object after the verb
Nouns	26 (23.0%)	88 (77.0%)
Pronouns	218 (43.0%)	288 (57.0%)
Both together	244 (39.4%)	376 (60.6%)

1 mora	2 morae	3 morae	4 morae	5 morae	6 morae	7 morae	8 morae	9 morae	10 morae	Total
0.3	4.8	22.7	38.8	17.7	11.0	3.3	1.2	0.2	0.1	100

POS	FREQ	%
noun	7326	57.04%
verb	2501	19.47%
adjective	2420	18.84%
adverb	291	2.27%
preposition	68	0.53%
conjunction	21	0.16%
pronoun	15	0.12%
interjection	37	0.29%
past participle	57	0.44%
others	108	0.84%

researcher	category	result
Sereno (1986)	noun	out of 1425 nouns, 93% are trochaic
Sereno (1986)	verb	out of 523 verbs, 76% are iambic
Kelly & Bock (1988)	noun	out of 3202 nouns, 94% are trochaic
Kelly & Bock (1988)	verb	out of 1021 verbs, 69% are iambic
Amano (2009)	noun	out of 5766 nouns, 92.92% are trochaic
Amano (2009)	verb	out of 1184 verbs, 72.65% are iambic

■ #1283. Methods for calculating collocation [corpus][statistics][bnc][collocation][lltest]

■ #1277. How many languages have no writing system? [world_languages][writing][statistics]

■ #1226. Distribution by decade of vocabulary growth in the Modern English period [loan_word][lexicology][statistics][emode][renaissance][inkhorn_term][latin]

■ #1225. The peculiar distribution of French loanwords [lexicology][statistics][loan_word][french][lexical_stratification]

■ #1215. The decline of the genitive noun and the development of the of-periphrasis [word_order][syntax][genitive][lexical_diffusion][statistics][synthesis_to_analysis]

■ #1214. The history of the fixing of the position of the genitive noun [word_order][syntax][genitive][lexical_diffusion][statistics]

■ #1213. The history of the fixing of the position of the indirect object [word_order][syntax][lexical_diffusion][statistics]

■ #1211. A list of Latin loanwords in Middle English [latin][loan_word][lexicology][me][wycliffe][bible][statistics]

■ #1209. The division of French loanwords at the year 1250 [french][loan_word][me][norman_french][lexicology][statistics][bilingualism]

■ #1202. Origins and proportions of the Present-Day English vocabulary (2) [lexicology][loan_word][statistics][old_norse]

■ #1177. The ebb of French in the EU [french][global_language][elf][statistics]

■ #1161. Proportions of vocabulary by number of syllables in English and Japanese [lexicology][statistics][syllable][corpus][japanese]

■ #1160. Visualising various statistics from the MRC Psycholinguistic Database [lexicology][statistics][syllable][corpus]

■ #1159. MRC Psycholinguistic Database Search [cgi][web_service][lexicology][frequency][statistics]

■ #1158. MRC Psycholinguistic Database [web_service][lexicology][frequency][statistics]

■ #1132. Proportions of English words by part of speech [lexicology][corpus][statistics]

■ #1131. Typical stress patterns of disyllabic nouns and verbs [stress][diatone][statistics]

■ #1103. Testing Zipf's law with the GSL [lexicology][statistics][frequency][zipfs_law][corpus]

■ #1102. Zipf's law and the turnover of words [information_theory][frequency][statistics][zipfs_law][shortening][language_change]

■ #1101. Zipf's law [information_theory][frequency][statistics][language_change][zipfs_law][shortening][pragmatics]