corpus / hellog～英語史ブログ (History of the English Language Blog)

Last updated: 2026-02-06 10:29

2012-10-28 Sun

■ #1280. Representativeness of corpora [corpus][representativeness][variety][idiolect][methodology]

Representativeness is the lifeblood of a corpus. This is clear from the very definition of a corpus ([2010-11-16-1]), and it is a point that Leech, introduced in yesterday's post "#1279. Strengths and weaknesses of the BNC" ([2012-10-28-1]), stresses in particular. Drawing on Leech's definition, McEnery et al. (13) describe representativeness as follows: "a corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety".
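Let us consider representativeness in concrete terms. Take a general corpus that aims to capture a broad variety such as present-day British English, as the BNC does. How can such a corpus be made representative? Theorizing this is difficult. Consider the ratio of spoken to written language: would allotting 50% to each guarantee that the corpus represents present-day British English? The judgment can only be "impressionistic", to borrow Leech's word, but surely the overwhelming share of the present-day British English being produced at this very moment is spoken. If so, would representativeness not be better served by setting the spoken component at, say, around 80%? Since the parent population, present-day British English as a whole, cannot be grasped directly, any discussion of how well a corpus represents it reaches an impasse.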
The representativeness of a corpus (a general corpus in particular) is sometimes broken down into two concepts, balance and sampling. McEnery et al. (13) explain: "the representativeness of most corpora is to a great extent determined by two factors: the range of genres included in a corpus (i.e. balance . . .) and how the text chunks for each genre are selected (i.e. sampling . . .)".
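Balance concerns how the categories that the BNC calls domain and genre are set up. For example, a corpus that claims to represent present-day British English but collects only the English of British newspapers falls short on representativeness. Present-day British English includes speech as well as writing, and writing includes not only newspaper English but also literary English, e-mail English, shopping-list English, and diary English. One would like to take all of these domains and genres into account, but how many text domains are there in the end? Even within newspaper English there are tabloids and broadsheets. Within a single newspaper, should we not distinguish the general news pages, the sports pages, and the editorials? Within the news pages, what about domestic versus international stories? And so on. In theory the subdivision can go on indefinitely. Pursue the same subdivision in speech and one ends up at atoms such as the idiolect, or even the register-specific manifestations within an idiolect. In actual corpus compilation one compromises at a commonsense level, but "commonsense" is probably near-synonymous with "impressionistic".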
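Sampling is the method by which representativeness is obtained. It is the theory and practice of apportioning the domains within a corpus, with attention to both quality and quantity, so that the linguistic features of the population are reproduced. It involves theoretical and practical questions such as: what to take as the sampling unit (typically a product-level unit such as a book, magazine, or newspaper); how far to extend the sampling frame, that is, the scope within which such units are listed (for example, restricting it to a particular year or to best-selling books); whether to collect samples completely at random or at random with some systematization imposed; and how to overcome copyright problems.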
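The choices listed above (sampling unit, sampling frame, and random selection with some systematization) can be illustrated with a minimal sketch of stratified random sampling. Everything in it is an assumption made for illustration: the function name `stratified_sample`, the toy frame of 200 units, and the 50/50 spoken-to-written split are hypothetical and do not describe how the BNC or any other corpus was actually compiled.

```python
import random
from collections import defaultdict


def stratified_sample(frame, proportions, n_total, seed=0):
    """Draw a stratified random sample of sampling units.

    frame:       list of (unit_id, domain) pairs, i.e. the sampling frame
    proportions: dict mapping domain -> target share of the sample
    n_total:     total number of units to draw
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group the units in the frame by domain (the strata).
    by_domain = defaultdict(list)
    for unit_id, domain in frame:
        by_domain[domain].append(unit_id)

    sample = {}
    for domain, share in proportions.items():
        k = round(n_total * share)  # number of units allotted to this domain
        pool = by_domain[domain]
        if k > len(pool):
            raise ValueError(f"frame has too few units for domain {domain!r}")
        # Simple random sampling without replacement inside each stratum.
        sample[domain] = rng.sample(pool, k)
    return sample


# Hypothetical frame: 100 spoken and 100 written units.
frame = [(f"S{i:03d}", "spoken") for i in range(100)]
frame += [(f"W{i:03d}", "written") for i in range(100)]

sample = stratified_sample(frame, {"spoken": 0.5, "written": 0.5}, n_total=20)
print({domain: len(units) for domain, units in sample.items()})
```

Changing the proportions to, say, `{"spoken": 0.8, "written": 0.2}` changes only the allotment per domain. The code cannot say which split is the representative one, which is exactly the theoretical difficulty discussed in this post.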
Another concept related to representativeness is what is called closure or saturation. McEnery et al. (16) explain: "Closure/saturation for a particular linguistic feature (e.g. size of lexicon) of a variety of language (e.g. computer manuals) means that the feature appears to be finite or is subject to very limited variation beyond a certain point." Put plainly, a corpus is considered saturated once it reaches a size beyond which enlarging it further no longer changes the makeup of its vocabulary. Some have argued that saturation is a better indicator of representativeness than balance, but saturation has been conceived mainly with the lexicon in mind, and at present it has not been applied to other linguistic features.
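One simple way to make lexical saturation concrete is to cut the corpus into equal-sized segments and count how many new word types each successive segment contributes. The sketch below assumes that approach. The function names, the segment size, and the threshold for "saturated" are hypothetical choices for illustration, not a procedure taken from McEnery et al.

```python
def lexical_growth(tokens, segment_size):
    """Return the number of new word types contributed by each segment."""
    seen = set()
    new_per_segment = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        new_types = set(segment) - seen  # types not seen in earlier segments
        new_per_segment.append(len(new_types))
        seen |= new_types
    return new_per_segment


def looks_saturated(new_per_segment, tail=2, max_new=0):
    """True if each of the last `tail` segments adds at most `max_new` types."""
    tail_counts = new_per_segment[-tail:]
    return len(tail_counts) == tail and all(c <= max_new for c in tail_counts)


# Toy "corpus" of 16 tokens drawn from a four-word vocabulary.
tokens = "a b c a b d a b c a b d a b c a".split()

growth = lexical_growth(tokens, segment_size=4)
print(growth)                           # [3, 1, 0, 0]
print(looks_saturated(growth, tail=2))  # True
```

On real data the counts rarely fall to zero, so `max_new` would have to be set to some small tolerance, and that choice is itself a matter of judgment.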
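Although representativeness is, by definition, the lifeblood of a corpus, the definition has tended to run ahead of practice. There is neither a theory for securing it nor a method for verifying it. It must be a headache facing every corpus compiler, yet corpora continue to be compiled one after another. Theoretical problems aside, we may be at a stage where we should simply keep compiling and using corpora and accumulate know-how.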

　・ McEnery, Tony, Richard Xiao, and Yukio Tono. Corpus-Based Language Studies: An Advanced Resource Book. London: Routledge, 2006.

Referrer (Inside): [2025-06-20-1] [2020-03-07-1] [2019-05-21-1] [2016-12-05-1] [2016-05-24-1] [2014-01-07-1] [2012-10-28-1]

	Statutes	Other texts
ME4 (1420--1500)	68 (60)	621 (31)
EModE1 (1500--70)	77 (65)	503 (28)
EModE2 (1570--1640)	84 (71)	461 (26)
EModE3 (1640--1710)	126 (96)	191 (12)

	C12b	C13a	C13b	C14a	Total
N	0 (0.000%)	362 (0.062)	0 (0.000)	52,883 (9.083)	53,245 (9.146)
NEM	11,342 (1.948)	0 (0.000)	3,980 (0.684)	2,344 (0.403)	17,666 (3.034)
NWM	0 (0.000)	58,332 (10.019)	16,173 (2.778)	0 (0.000)	74,505 (12.797)
SEM	40,082 (6.885)	26,722 (4.590)	21,921 (3.765)	31,408 (5.395)	120,133 (20.634)
SWM	1,030 (0.177)	90,400 (15.527)	106,981 (18.375)	108 (0.019)	198,519 (34.098)
SW	1,168 (0.201)	2,610 (0.448)	46,032 (7.907)	30,517 (5.242)	80,327 (13.797)
SE	0 (0.000)	4,043 (0.694)	3,199 (0.549)	30,561 (5.249)	37,803 (6.493)
Total	53,622 (9.210)	182,469 (31.341)	198,286 (34.058)	147,821 (25.390)	582,198 (100.000)

	C12b	C13a	C13b	C14a	Total
N	0 (0.00%)	1 (0.86)	0 (0.00)	7 (6.03)	8 (6.90)
NEM	1 (0.86)	0 (0.00)	5 (4.31)	2 (1.72)	8 (6.90)
NWM	0 (0.00)	9 (7.76)	5 (4.31)	0 (0.00)	14 (12.07)
SEM	4 (3.45)	7 (6.03)	14 (12.07)	7 (6.03)	32 (27.59)
SWM	2 (1.72)	13 (11.21)	17 (14.66)	1 (0.86)	33 (28.45)
SW	3 (2.59)	5 (4.31)	7 (6.03)	2 (1.72)	17 (14.66)
SE	0 (0.00)	2 (1.72)	1 (0.86)	1 (0.86)	4 (3.45)
Total	10 (8.62)	37 (31.90)	49 (42.24)	20 (17.24)	116 (100.00)

1 mora	2 morae	3 morae	4 morae	5 morae	6 morae	7 morae	8 morae	9 morae	10 morae	Total
0.3	4.8	22.7	38.8	17.7	11.0	3.3	1.2	0.2	0.1	100

POS	FREQ	%
noun	7326	57.04%
verb	2501	19.47%
adjective	2420	18.84%
adverb	291	2.27%
preposition	68	0.53%
conjunction	21	0.16%
pronoun	15	0.12%
interjection	37	0.29%
past participle	57	0.44%
others	108	0.84%


■ #1280. Representativeness of corpora [corpus][representativeness][variety][idiolect][methodology]

■ #1279. Strengths and weaknesses of the BNC [bnc][corpus][representativeness]

■ #1278. A collection of links on corpus research, centered on the BNC [corpus][bnc][link][web_service][lltest]

■ #1276. hereby, hereof, thereto, therewith, etc. [compounding][synthesis_to_analysis][adverb][register][corpus][bnc][hc]

■ #1264. The limits of historical linguistics and ways to overcome them [methodology][uniformitarian_principle][writing][history][sociolinguistics][laeme][corpus][representativeness][evidence]

■ #1263. The representativeness of the LAEME Corpus (2) [laeme][corpus][representativeness]

■ #1262. The representativeness of the LAEME Corpus (1) [laeme][corpus][representativeness]

■ #1243. Toward diachronic studies that take word frequency into account [frequency][corpus][representativeness]

■ #1218. The decline of whom in spoken language [pde_language_change][interrogative][relative_pronoun][corpus][ame][preposition_stranding]

■ #1165. The background to the flourishing of corpus research in Britain [corpus][history_of_linguistics][methodology]

■ #1161. Proportions of vocabulary by syllable count in English and Japanese [lexicology][statistics][syllable][corpus][japanese]

■ #1160. Visualizing various statistics from the MRC Psychological Database [lexicology][statistics][syllable][corpus]

■ #1132. Proportions of English words by part of speech [lexicology][corpus][statistics]

■ #1105. grey eyes as a description of a beautiful woman (2) [romance][adjective][collocation][bnc][corpus]

■ #1103. Testing Zipf's law with the GSL [lexicology][statistics][frequency][zipfs_law][corpus]

■ #1098. Two insights that information theory offers linguistics [information_theory][redundancy][corpus]

■ #1068. choose between war or peace [conjunction][corpus][bnc][preposition]

■ #1041. COCA の "ANALYZE TEXT" [coca][corpus][web_service][academic_word_list][text_tool]

■ #1035. The order of listed personal pronouns [personal_pronoun][corpus][bnc][honorific]

■ #987. Don't drink more pints of beer than you can help. (1) [negative][comparison][idiom][syntax][corpus][bnc]