statistics / hellog～英語史ブログ

Last updated: 2026-07-15 01:27

2020-12-10 Thu

■ #4245. Frequency and the asymptotic hyperbolic curve (A-curve) [lexical_diffusion][zipfs_law][frequency][statistics][language_change][uniformitarian_principle]

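I read a paper by Kretzschmar and Tamasi that proposes a view of language (and language change) which pushes the variationist position a long way. Terms such as "A-curve", "asymptotic hyperbolic distribution", "power law", and "S-curve" come thick and fast, and the paper puts the reader on guard, but what it is saying seems to be an extended version of Zipf's Law (cf. zipfs_law): low-frequency linguistic items are many, and high-frequency linguistic items are few.
In any given English corpus, a considerable number of words occur only once. Words such as the, of, and have, on the other hand, occur with extremely high frequency, but they are mainly function words and are quite limited in number of types. Suppose, for example, that there are 1000 words ( y = 1000 ) that occur only once ( x = 1 ), while only one word, the, occurs as many as 1000 times ( x = 1000, y = 1 ). Plotted on coordinates, these give one point in the upper left and one in the lower right of the first quadrant. Taking these two points as the ends and filling in the points between them one after another, we would approach one branch of an asymptotic hyperbolic curve of the y = 1/x type (for these two points, y = 1000/x). Kretzschmar and Tamasi call this the "A-curve", and they call the law behind it the "power law". The latter amounts to "few realizations that occur very frequently and many realizations that occur infrequently" (384).
Kretzschmar and Tamasi investigated variants of dialect words and of articulation in American dialects, and took the frequency distribution of each set of variants. They showed that an "A-curve" is observed in every case.
Kretzschmar and Tamasi also discuss the relationship between their "A-curve" and the "S-curve" that is often mentioned in connection with lexical diffusion (lexical_diffusion). Plotting one and the same language change with attention to different axes yields either an "S-curve" or an "A-curve"; according to them, the two are not only free of contradiction but highly compatible.
I cannot explain this well in my own clumsy words, but I was impressed to see that looking at the language system and language change in a thoroughly variationist way leads to this kind of view of language, or linguistic theory. I quote a passage from Kretzschmar and Tamasi (394) that seems particularly important.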
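The x/y relationship described above (x = how often a word occurs, y = how many word types occur that often) can be sketched in a few lines of Python. This is only an illustrative sketch: the token list `toy_tokens` and the helper names `frequency_spectrum` and `hyperbola` are assumptions introduced here, not data or terminology from Kretzschmar and Tamasi.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency x to the number of word types y that occur exactly x times."""
    word_counts = Counter(tokens)              # word -> number of occurrences
    spectrum = Counter(word_counts.values())   # frequency x -> number of types y
    return dict(sorted(spectrum.items()))


def hyperbola(x, k=1000):
    """Idealised curve y = k / x; k = 1000 fits the two points used in the text."""
    return k / x


# Toy token list (hypothetical, for illustration only).
toy_tokens = ("the " * 6 + "of " * 3 + "have " * 3 + "curve law rank tail").split()

spectrum = frequency_spectrum(toy_tokens)
for x, y in spectrum.items():
    print(f"x = {x}: y = {y}")
```

Even in this tiny sample, many types sit at x = 1 and a single type sits at the highest x, which is the shape the A-curve describes. The two idealised points in the text, (1, 1000) and (1000, 1), both satisfy `hyperbola(x)` with k = 1000.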

Our second observation, about the distribution of variants according to Zipf's Law, has the strongest set of implications for historical study of language. If we take the A-curve as the model for the frequency distribution of variants for any linguistic feature of interest to us at any moment in time, then we should expect that any particular variant of interest to us will have a particular rank along the A-curve. Therefore, one of the things that we should try to do for any given moment in time is to determine the place of our variant of interest on the curve; we need to know whether it is the most frequent variant in the set of possible realizations (at the top of the curve), or an infrequent variant (in the tail of the curve). Then, for any subsequent moment in time, we can again try to determine the location of our variant of interest along the curve, and so try to make a statement about whether the location of the variant has changed in the intervening time (see Figure 14). Since we hypothesize that an A-curve will exist for every feature at any moment in time (i.e., that language will not suddenly become invariant), we can define the notion "linguistic change" itself as the change in the location of the target variant at different heights along the curve. If a particular variant occurs at a higher place on the curve than it did before, it has become more frequent and so we can say that the direction of change for that variant is positive; if a variant occurs at a lower place on the curve than it did before, it has become less frequent and the direction of change is negative.

A-curves at different moments in time (Kretzschmar and Tamasi 395)
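The quoted definition of "linguistic change" as a shift in a variant's place along the curve can be made concrete with a small sketch. It assumes that "place on the curve" can be approximated by frequency rank within the set of variants; the counts, the variant labels, and the function names `rank_of` and `direction_of_change` are hypothetical and do not come from the paper.

```python
def rank_of(variant, counts):
    """1-based rank of `variant` when variants are sorted by descending frequency.

    Ties are broken alphabetically, which is an arbitrary choice made for this sketch.
    """
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return ordered.index(variant) + 1


def direction_of_change(variant, counts_t1, counts_t2):
    """Return 'positive' if the variant moved up the curve, 'negative' if down, else 'none'."""
    r1 = rank_of(variant, counts_t1)
    r2 = rank_of(variant, counts_t2)
    if r2 < r1:
        return "positive"
    if r2 > r1:
        return "negative"
    return "none"


# Hypothetical counts of four variants of one feature at two moments in time.
counts_t1 = {"A": 50, "B": 20, "C": 5, "D": 1}
counts_t2 = {"A": 30, "B": 5, "C": 40, "D": 1}

for v in sorted(counts_t1):
    print(v, rank_of(v, counts_t1), rank_of(v, counts_t2),
          direction_of_change(v, counts_t1, counts_t2))
```

Note that both snapshots still show one frequent variant and a tail of infrequent ones; only the positions of individual variants have changed, which is the point of the quoted passage.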

　・ Kretzschmar, William A., Jr. and Susan Tamasi. "Distributional Foundations for a Theory of Language Change." World Englishes 22 (2003): 377--401.

Phonemes	/p/	/b/	/t/	/d/	/k/	/g/	/ʧ/	/ʤ/	/m/	/n/	/l/	/r/	/f/	/v/	/s/	/z/	/ʃ/	/w/	/eɪ/	/aɪ/	/ɔɪ/	/aʊ/	/ɪ/	/ʊ/
Letters	<p	b	t	d	c	g	ch	g(e)	m	n	l	r	f	v	s/c	s	ti	(q)u	a	i	oy	ow	i	u>
<-ant/ce>	4	1	62	26	13	9	2	4	6	40	24	62	4	14	10	13	-	-	2	11	6	1	16	6
<-ent/ce>	8	5	33	62	-	-	-	42	1	24	48	27	-	4	66	4	15	17	-	-	-	-	49	13

Commas	47
Full stops	45
Dashes	2
Parentheses	2
Semi-colons	2
Question marks	1
Colons	1
Exclamation marks	1

Corpus	Proverbs	BE06
tokens (running words) in text	6,276	1,011,020
types (distinct words)	1,616	45,298
type/token ratio (TTR)	25.75	4.48
standardised TTR	45.25	43.90
STTR std.dev.	46.42	54.62
STTR basis	1,000	1,000
mean word length (in characters)	4.09	4.69
word length std.dev.	1.92	2.58
sentences	869	53,466
mean sentence length (in words)	7.22	18.91
sentence length std.dev.	2.86	14.38
1-letter words	292	38,775
2-letter words	1,020	168,273
3-letter words	1,345	205,211
4-letter words	1,370	166,961
5-letter words	996	110,856
6-letter words	553	88,195
7-letter words	359	79,174
8-letter words	163	56,645
9-letter words	96	39,767
10-letter words	53	26,170
11-letter words	17	15,493
12-letter words	6	8,208
13-letter words	4	4,557
14-letter words	1	1,687
15-letter words	1	623
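Some of the figures in the table above can be re-derived from the others, which is a useful sanity check. The sketch below recomputes the type/token ratio for both corpora and the mean word length for the Proverbs column from the table's own counts. The function `standardised_ttr` is only an assumption about how a standardised TTR on a basis of 1,000 words is typically obtained (TTR per consecutive 1,000-token chunk, then averaged); it is not confirmed by the source and is shown on a toy list only.

```python
def type_token_ratio(types, tokens):
    """Type/token ratio as a percentage."""
    return 100.0 * types / tokens


def standardised_ttr(tokens, basis=1000):
    """Mean TTR over consecutive full chunks of `basis` tokens (assumed procedure)."""
    chunks = [tokens[i:i + basis] for i in range(0, len(tokens) - basis + 1, basis)]
    if not chunks:
        return None  # text shorter than one chunk
    return sum(100.0 * len(set(c)) / basis for c in chunks) / len(chunks)


# Figures copied from the table above.
proverbs_ttr = type_token_ratio(1_616, 6_276)
be06_ttr = type_token_ratio(45_298, 1_011_020)

# Proverbs column: number of tokens for word lengths 1..15, copied from the table.
proverbs_by_length = [292, 1020, 1345, 1370, 996, 553, 359, 163, 96, 53, 17, 6, 4, 1, 1]
total_tokens = sum(proverbs_by_length)
mean_word_length = sum(
    length * n for length, n in enumerate(proverbs_by_length, start=1)
) / total_tokens

print(f"Proverbs TTR: {proverbs_ttr:.2f}")
print(f"BE06 TTR: {be06_ttr:.2f}")
print(f"Proverbs tokens: {total_tokens}, mean word length: {mean_word_length:.2f}")
```

The recomputed values agree with the table: 25.75 and 4.48 for the TTR rows, 6,276 tokens in total, and a mean word length of 4.09 for Proverbs.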

Suffix	Number
-y	2
-ery	8
-ancy	10
-ency	10
-ence	18
-ion	20
-ance	49
-al	56
-ure	96
-ation	190
-ment	258

	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
Top_100	1.0	2.0	3.0	3.1	4.0	5.0
Top_200	1.00	3.00	4.00	3.77	4.00	10.00
Top_500	1.000	4.000	4.000	4.498	5.000	10.000
Top_1K	1.000	4.000	5.000	4.968	6.000	15.000
Top_2K	1.000	4.000	5.000	5.406	7.000	15.000
Top_5K	1.000	5.000	6.000	6.014	7.000	16.000
Top_10K	1.000	5.000	6.000	6.488	8.000	16.000
Top_20K	1.000	5.000	7.000	6.954	8.000	17.000
Top_50K	1.000	6.000	7.000	7.622	9.000	20.000
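The layout of the table above (Min., 1st Qu., Median, Mean, 3rd Qu., Max.) is that of a six-number summary. Assuming the rows summarise word lengths in letters for the N most frequent words, which the table itself does not state, a summary of this kind could be produced as follows. The word list `ranked_words` is hypothetical and is not the list behind the table; the `"inclusive"` quantile method is chosen here because it interpolates between data points in the usual way for such summaries.

```python
import statistics


def six_number_summary(values):
    """Min, quartiles, mean, and max of a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.mean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


# Hypothetical frequency-ranked word list (illustration only).
ranked_words = ["the", "of", "and", "to", "a", "in", "that", "is", "was", "it"]

top_n = 10
lengths = [len(word) for word in ranked_words[:top_n]]
summary = six_number_summary(lengths)
print(summary)
```

Repeating this for growing values of `top_n` on a real frequency list would give one row per cutoff, in the manner of the table.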


■ #4245. Frequency and the asymptotic hyperbolic curve (A-curve) [lexical_diffusion][zipfs_law][frequency][statistics][language_change][uniformitarian_principle]

■ #4144. The vocabulary explosion of the 20th century [oed][pde][lexicology][statistics][renaissance]

■ #4140. The "when" and "how much" of Japanese words borrowed into English [oed][japanese][borrowing][loan_word][lexicology][statistics]

■ #4138. Just over 40% of French loanwords were borrowed in the Middle English period, and they are important words [oed][french][latin][loan_word][statistics][lexicology][borrowing]

■ #4130. "The OED in two minutes": a visualization of the history of the diversification and expansion of English vocabulary [oed][map][lexicology][borrowing][lexicography][philology][statistics][web_service][hel_education]

■ #4030. -ant or -ent? (3) [spelling][suffix][phoneme][statistics]

■ #3891. Frequency of the various punctuation marks in Present-day English [punctuation][alphabet][diacritical_mark][net_speak][brown][corpus][frequency][statistics][exclamation_mark]

■ #3789. The proportion of Latin loanwords in the Old English vocabulary is 1.75% [latin][loan_word][borrowing][oe][lexicology][statistics]

■ #3788. The surprisingly everyday character of Latin loanwords from before the Old English period [latin][loan_word][oe][lexicology][lexical_stratification][statistics][borrowing]

■ #3756. Chronological distribution of loanwords from Irish [loan_word][irish][celtic][statistics][lexicology]

■ #3421. Statistics showing the stylistic and lexical characteristics of English proverbs [proverb][statistics][corpus][stylistics]

■ #3419. Keywords of English proverbs [proverb][keyword][statistics][corpus]

■ #3400. How far have loanwords penetrated the core vocabulary of English? [loan_word][borrowing][lexicology][semantics][oed][htoed][statistics]

■ #3372. A few notes on the limitations of Old English and Middle English sources [oe][me][philology][manuscript][statistics][representativeness][methodology][evidence]

■ #3259. Problems posed by the deverbal nouns coined in the 17th century (2) [synonym][loan_word][borrowing][renaissance][inkhorn_term][emode][lexicology][word_formation][suffix][affixation][neologism][derivation][statistics]

■ #3216. Dawkins and theories of language change (2) [glottochronology][evolution][biology][language_change][comparative_linguistics][history_of_linguistics][speed_of_change][statistics]

■ #3180. French and Latin loanwords that have gradually joined the ranks of high-frequency words [french][latin][loan_word][borrowing][frequency][statistics][lexicology][hc][bnc]

■ #3174. High-frequency words have short spellings (2) [frequency][spelling][orthography][zipfs_law][statistics][lexicology][corpus]

■ #3173. High-frequency words have short spellings (1) [frequency][spelling][orthography][zipfs_law][statistics][lexicology][corpus][three-letter_rule]

■ #3170. Distribution of word origins in contemporary Japanese (2) [japanese][lexicology][statistics][etymology][loan_word][lexical_stratification]