frequency / hellog～英語史ブログ (hellog: A Blog on the History of English)

Last updated: 2026-02-06 10:29

2018-01-03 Wed

■ #3173. High-Frequency Words Have Short Spellings (1) [frequency][spelling][orthography][zipfs_law][statistics][lexicology][corpus][three-letter_rule]

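The point made in the title is nothing new; anyone who reads and writes English presumably senses it intuitively. As noted in "#1091. Linguistic Redundancy, Frequency, and Cost" ([2012-04-22-1]) and "#1102. Zipf's Law and the Turnover of Words" ([2012-05-03-1]), it is more efficient for words that are read and written often to have short spellings. Conversely, a word that is rarely read or written can be somewhat long without much inconvenience. The same principle presumably applies not only to the spelling of words but also to their phonetic form.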
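English orthography also has the "three-letter rule" (see "#2235. The Three-Letter Rule" ([2015-06-10-1])), which requires content words to be spelled with at least three letters. The rule does not apply to function words, a class of words with extremely high frequency. It can therefore be said to have a practical side that bears on the efficiency issue described above.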
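One way to show that the more frequent a word is, the shorter its spelling is on average is to take lists of the top 100, 1,000, 10,000, etc. words in a frequency ranking and count the words by number of letters. Using the frequency ranking drawn from "#2096. SUBTLEX-US Word Frequency List" ([2015-01-22-1]), I examined the top 100, 200, 500, 1,000, 2,000, 5,000, 10,000, 20,000, and 50,000 words. The top-100 list is as given in that earlier post; it includes some dubious "words" such as s and ll, which appear to stem from the design of the corpus, but these should not affect the overall picture.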
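The counting procedure described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for this post: the function names `length_distribution` and `share_up_to` and the list `toy_ranked` are hypothetical, and the toy list is invented for demonstration rather than taken from SUBTLEX-US.

```python
from collections import Counter


def length_distribution(ranked_words, top_n):
    """Count the top_n words of a frequency-ranked list by spelling length."""
    return Counter(len(word) for word in ranked_words[:top_n])


def share_up_to(ranked_words, top_n, max_len=3):
    """Percentage of the top_n words spelled with max_len letters or fewer."""
    head = ranked_words[:top_n]
    if not head:
        return 0.0
    short = sum(1 for word in head if len(word) <= max_len)
    return 100.0 * short / len(head)


# Hypothetical toy list, most frequent first; NOT the SUBTLEX-US data.
toy_ranked = [
    "you", "i", "the", "to", "a",
    "it", "that", "and", "of", "what",
    "know", "about", "think", "because", "something",
]

# Report the length distribution and the share of short words per cutoff.
for n in (5, 10, 15):
    dist = dict(sorted(length_distribution(toy_ranked, n).items()))
    print(n, dist, f"{share_up_to(toy_ranked, n):.2f}%")
```

With a real ranking, the same two functions would be called with cutoffs such as 100, 200, 500, and so on; the percentage returned for `max_len=3` corresponds to the share of words with three letters or fewer discussed in this post.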
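As the graph below shows, the result is clear-cut (see the source HTML for the numerical data). Among the ultra-high-frequency words of the top 100, as many as 62.00% are spelled with three letters or fewer. Tracking this share of words with three letters or fewer (the three orange bands at the bottom of each bar), it dwindles across the top-200 through top-50,000 lists, in order, to 41.50%, 24.60%, 17.00%, 12.65%, 8.06%, 6.01%, 4.55%, and 3.20%.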

[Figure: Length of Spelling of High-Frequency Words by SUBTLEXus]

Referrer (Inside): [2018-01-04-1]

BRITISH		a	d	f	h	j	l	m	n	p	s	x	y	TOTAL
1600--49	files	0	10	0	0	0	10	0	0	10	0	0	0	30
1600--49	words	0	32,342	0	0	0	21,026	0	0	32,741	0	0	0	86,109
1650--99	files	0	10	11	10	10	10	21	10	0	10	75	10	177
1650--99	words	0	30,328	41,667	21,818	21,186	20,466	23,811	22,304	0	21,427	38,767	20,488	262,262
1700--49	files	0	10	11	10	11	10	14	10	0	10	77	10	173
1700--49	words	0	27,862	44,057	21,511	23,265	21,315	22,066	21,612	0	20,812	33,896	20,495	256,891
1750--99	files	10	10	10	10	10	10	20	10	0	10	70	11	181
1750--99	words	25,386	27,484	45,198	21,752	21,284	20,367	21,002	23,172	0	20,599	29,589	23,043	278,876
1800--49	files	10	10	10	10	11	10	10	10	0	10	25	10	126
1800--49	words	30,804	31,211	45,107	21,777	23,249	20,531	20,286	22,951	0	21,015	12,671	20,883	270,485
1850--99	files	10	10	10	10	10	10	10	10	0	10	26	10	126
1850--99	words	30,684	34,856	43,427	21,322	21,243	20,757	22,265	23,072	0	21,810	10,819	21,789	272,044
1900--49	files	10	11	10	10	10	10	10	10	0	10	29	10	130
1900--49	words	26,717	31,391	45,408	21,123	22,208	21,160	20,213	21,977	0	21,664	12,529	22,424	266,814
1950--99	files	10	11	10	10	10	10	13	10	0	10	28	10	132
1950--99	words	23,437	32,200	45,109	21,093	22,723	20,721	20,994	22,935	0	21,385	11,361	22,060	264,018
TOTAL	files	50	82	72	70	72	80	98	70	10	70	330	71	1,075
TOTAL	words	137,028	247,674	309,973	150,396	155,158	166,343	150,637	158,023	32,741	148,712	149,632	151,182	1,957,499
AMERICAN		a	d	f	h	j	l	m	n	p	s	x	y	TOTAL
1750--99	files	3	10	10	10	10	12	9	10	0	10	58	10	152
1750--99	words	9,214	29,980	38,980	21,271	21,896	41,177	23,541	22,265	0	20,668	27,860	21,315	278,167
1800--49	files	1	10	10	0	10	12	0	10	0	10	10	10	83
1800--49	words	2,822	40,568	44,676	0	21,476	33,409	0	37,107	0	20,904	20,739	20,695	242,396
1850--99	files	8	10	11	10	10	10	10	10	0	10	28	11	128
1850--99	words	24,480	32,721	44,394	21,056	22,436	28,506	20,547	21,994	0	21,311	11,361	23,419	272,225
1900--49	files	10	10	10	0	10	11	0	15	0	10	52	10	138
1900--49	words	30,460	52,514	53,430	0	21,661	21,607	0	22,802	0	20,984	25,021	20,731	269,210
1950--99	files	10	10	10	10	10	12	10	10	0	12	30	10	134
1950--99	words	29,563	31,037	44,382	21,051	22,109	25,517	22,617	23,069	0	25,623	11,961	21,654	278,583
TOTAL	files	32	50	51	30	50	57	29	55	0	52	178	51	635
TOTAL	words	96,539	186,820	225,862	63,378	109,578	150,216	66,705	127,237	0	109,490	96,942	107,814	1,340,581

	GSL	CELEX2
1%	47.05%	69.36%
0.1%	14.60%	43.57%

■ #3173. High-Frequency Words Have Short Spellings (1) [frequency][spelling][orthography][zipfs_law][statistics][lexicology][corpus][three-letter_rule]

■ #2876. The Share of the Top 1% in the Unequal Frequency Distribution of English Vocabulary [lexicology][statistics][frequency][corpus]

■ #2875. Inequality in the Frequency Distribution of English Vocabulary Seen through the Gini Coefficient and the Lorenz Curve [lexicology][statistics][frequency][zipfs_law][corpus]

■ #2690. N-gram Tool [cgi][n-gram][statistics][corpus][web_service][frequency][cgi]

■ #2661. The 200 Words Swadesh (1952) Selected for Glottochronology [glottochronology][lexicology][frequency][statistics]

■ #2660. Glottochronology and Basic Vocabulary [glottochronology][lexicology][statistics][history_of_linguistics][frequency][anthropology]

■ #2659. Glottochronology and Lexicostatistics [glottochronology][lexicology][statistics][terminology][speed_of_change][frequency]

■ #2520. 134 Variant Spellings of "such" in Late Middle English [spelling][lme][lalme][corpus][scribe][me_dialect][frequency]

■ #2363. hapax legomenon [hapax_legomenon][terminology][lexicology][lexicography][word_formation][productivity][bible][zipfs_law][frequency][corpus][shakespeare][chaucer]

■ #2324. n-gram [corpus][information_theory][coca][bnc][google_books][statistics][n-gram][collocation][frequency][link]

■ #2176. Grammaticalisation, Semantic Change, and Frequency [frequency][grammaticalisation][semantic_change][schedule_of_language_change][language_change]

■ #2121. Examples of /t/ Insertion and Deletion in the History of English [phonetics][dialect][consonant][frequency]

■ #2110. The Usage-Based Model of Language (Change) [cognitive_linguistics][usage-based_model][language_change][frequency][collocation][speed_of_change]

■ #2096. SUBTLEX-US Word Frequency List [frequency][statistics][corpus][lexicology][zipfs_law][cgi][web_service]

■ #1970. The Correlation between Polysemy and Frequency [polysemy][zipfs_law][information_theory][frequency][statistics]

■ #1906. Does the Schedule of Language Change Differ by Linguistic Environment? [speed_of_change][schedule_of_language_change][lexical_diffusion][wave_theory][frequency]

■ #1874. The Semantic Conservatism of High-Frequency Words [semantics][semantic_change][frequency][speed_of_change]

■ #1864. Ra-nuki Kotoba ("ra-less" Verb Forms) and Frequency Effects [frequency][lexical_diffusion][japanese][ranuki][japanese]

■ #1802. ARCHER 3.2 [corpus][archer][mode][frequency]

■ #1743. ICE Frequency Comparer [corpus][web_service][cgi][frequency][new_englishes][variety][ice]