bnc / hellog (History of the English Language Blog)

Last updated: 2026-06-21 20:34

2012-10-27 Sat

■ #1279. Strengths and Weaknesses of the BNC [bnc][corpus][representativeness]

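　Over the four days of October 8--11, the Institute for English Language Education at Rikkyo University hosted a series of public lectures by Geoffrey Leech, Professor Emeritus at Lancaster University. I attended only the second day's lecture, titled "The British National Corpus: Both a Triumph and a Failure". There were several interesting points, including behind-the-scenes stories of the corpus's creation told by one of the BNC's compilers himself.
　Under the "triumph" and "failure" of the title, Leech listed the following items.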

A triumph:
　・ It has been claimed that the BNC is the most widely used corpus in the world.
　・ It was the first text corpus of its size to be made widely available.
　・ It is available from a wide range of different sources.
　・ It is widely regarded as a 'standard reference corpus' for the English language.
　・ It has been licensed to over 1300 institutions throughout the world, over 1800 users have signed on for access to it through the BNCweb online interface, etc.

A failure:
　・ It never reached 100 million words! (98,300,000)
　・ The design criteria were never totally achieved.
　・ It hardly ever contains complete texts.
　・ The spoken materials are poorly transcribed.
　・ The metadata are incomplete and can be erroneous.
　・ The part-of-speech tagging contains many errors.
　・ It is out of date! (dating from the late 20th century)

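　Leech's remarks brimmed with a confidence grounded in achievement, as the points under "triumph" show. At the same time, it was striking how many regrets he voiced about his own corpus compilation, of the "we should have done this, we should have done that" kind. CLAWS4, the program used to tag the BNC, has an accuracy of about 97% (98--99% according to Hoffmann et al. 43), which I had considered remarkable. What I had overlooked is that, because the corpus is so large, an error rate of even a few percent amounts to about 3 million errors. On the spoken component, he noted that it could make up only about one tenth of the whole corpus, that the quality of the transcription of the audio data was poor, and that TEI, the data format adopted at the outset, was not necessarily well suited to tagging speech.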
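　The arithmetic behind the tagging-error estimate can be sketched as follows. This is only a back-of-the-envelope calculation: it assumes the 98,300,000-word figure from Leech's list as the corpus size and treats tagging accuracy as a uniform per-word rate, neither of which the lecture is reported to have spelled out. The function name is made up for illustration.

```python
def expected_tagging_errors(corpus_words: int, accuracy: float) -> int:
    """Estimate the number of mis-tagged words, assuming a uniform per-word accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(corpus_words * (1.0 - accuracy))


# Assumption: corpus size taken from the "failure" list (98,300,000 words).
BNC_WORDS = 98_300_000

# 97% is the figure quoted in the lecture report; 98% and 99% cover the
# range attributed to Hoffmann et al.
for accuracy in (0.97, 0.98, 0.99):
    errors = expected_tagging_errors(BNC_WORDS, accuracy)
    print(f"accuracy {accuracy:.0%}: about {errors:,} mis-tagged words")
```

　At 97% accuracy the estimate is a little under 3 million words, which matches the order of magnitude mentioned above.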
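　Above all, he let his regret show that the BNC had not fully achieved its aim with respect to representativeness, a concern he has held consistently from the planning stage to the present. It is well known that the balance and size of the Text Domains were debated at length from the planning stage onward. As a user, I regard securing that degree of representativeness with limited resources as a great achievement. For Leech, however, pride in having done everything possible seems to be matched by a strong sense that the ideal was not realized. At the same time, though in a mild tone, he criticized all the other large corpora that are compared with the BNC for not giving much weight to representativeness. As he himself said, although he has his own theory of corpus representativeness, he seems to regard it as ultimately a matter of "impressionistic" judgment, which hinted at how difficult the problem is. In any case, I sensed a high degree of professionalism in the tenacity of Leech's commitment to representativeness.
　As I mentioned in the article [2012-07-05-1], "#1165. The background to the flourishing of corpus research in Britain", Leech stated explicitly that, regrettably, there will probably be no sequel to the BNC.
　The period covered is very different, but I discussed the representativeness of The LAEME Corpus, a corpus of Early Middle English, in the articles [2012-10-10-1] and [2012-10-11-1]; please refer to them as well.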

　・ Hoffmann, Sebastian, Stefan Evert, Nicholas Smith, David Lee, and Ylva Berglund Prytz. Corpus Linguistics with BNCweb: A Practical Guide. Frankfurt am Main: Peter Lang, 2008.

	Statutes	Other texts
ME4 (1420--1500)	68 (60)	621 (31)
EModE1 (1500--70)	77 (65)	503 (28)
EModE2 (1570--1640)	84 (71)	461 (26)
EModE3 (1640--1710)	126 (96)	191 (12)

Rank	Word (under 35)	χ² (under 35)	Word (over 35)	χ² (over 35)
1	mum	1409.3	yes	2365.0
2	fucking	1184.6	well	1059.8
3	my	762.4	mm	895.2
4	mummy	755.2	er	773.8
5	like	745.2	they	682.2
6	na as in wanna and gonna	712.8	said	538.3
7	goes	606.6	says	443.1
8	shit	410.1	were	385.8
9	dad	403.7	the	352.2
10	daddy	380.1	of	314.6
11	me	371.9	and	224.7
12	what	357.3	to	211.2
13	fuck	330.1	mean	155.0
14	wan as in wanna	320.6	he	144.0
15	really	277.0	but	139.0
16	okay	257.0	perhaps	136.0
17	cos	254.4	that	131.3
18	just	251.8	see	122.1
19	why	240.0	had	118.3

Rank	Word (characteristically male)	χ² (male)	Word (characteristically female)	χ² (female)
1	fucking	1233.1	she	3109.7
2	er	945.4	her	965.4
3	the	698.0	said	872.0
4	year	310.3	n't	443.9
5	aye	291.8	I	357.9
6	right	276.0	and	245.3
7	hundred	251.1	to	198.6
8	fuck	239.0	cos	194.6
9	is	233.3	oh	170.2
10	of	203.6	Christmas	163.9
11	two	170.3	thought	159.7
12	three	168.2	lovely	140.3
13	a	151.6	nice	134.4
14	four	145.5	mm	133.8
15	ah	143.6	had	125.9
16	no	140.8	did	109.6
17	number	133.9	going	109.0
18	quid	124.2	because	105.0
19	one	123.6	him	99.2
20	mate	120.8	really	97.6
21	which	120.5	school	96.3
22	okay	119.9	he	90.4
23	that	114.2	think	88.8
24	guy	108.6	home	84.0
25	da	105.3	me	83.5

	though	although
Natural and pure sciences	56.3	80.13
Applied science	37.36	68.31
World affairs	45.81	68.2
Social science	48.98	63.38
Commerce and finance	46.18	57.21
Arts	74.07	52.93
Leisure	45.85	49.46
Belief and thought	70.78	46.75
Imaginative prose	80.2	26.37

	BNC_Male_Speakers	BNC_Female_Speakers
new	149	91
good	408	310
free	173	75
fresh	84	118
delicious	12	34
full	210	107
sure	532	328
clean	197	223
wonderful	270	258
special	177	82
crisp	10	16
fine	347	215
big	470	415
great	203	96
real	163	80
easy	326	157
bright	113	110
extra	347	203
safe	182	92
rich	120	45
#--------
corpus_size	4949938	3290569

Category	No. of words	No. of hits	Dispersion (over files)	Frequency per million words
Spoken	10,409,858	579	63/908	55.62
Written	87,903,571	743	172/3,140	8.45
total	98,313,429	1,322	235/4,048	13.45
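
The last column of the table above is a normalized frequency: hits divided by the number of words in the category, scaled to one million words. A minimal sketch of that calculation, using the counts from the table (the function and variable names are illustrative only):

```python
def per_million(hits: int, corpus_words: int) -> float:
    """Normalize a raw hit count to a frequency per million words."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return hits / corpus_words * 1_000_000


# (hits, number of words) for each category, copied from the table above.
rows = {
    "Spoken": (579, 10_409_858),
    "Written": (743, 87_903_571),
    "Total": (1_322, 98_313_429),
}

for label, (hits, words) in rows.items():
    print(f"{label}\t{per_million(hits, words):.2f}")
```

Normalizing in this way is what makes the spoken and written figures comparable, since the written component is more than eight times larger.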

	Corpus 1	Corpus 2	Total
Frequency of word	a	b	a+b
Frequency of other words	c-a	d-b	c+d-a-b
Total	c	d	c+d
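
Given a contingency table of this shape, a log-likelihood score can be computed from a, b, c and d. The sketch below uses the two-term form commonly attributed to Rayson and Garside, in which the expected frequencies are E1 = c(a+b)/(c+d) and E2 = d(a+b)/(c+d). Whether the blog's own Log-Likelihood Tester uses exactly this form is an assumption, and the function name is illustrative.

```python
import math


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Two-term log-likelihood for a word occurring a times in a corpus of
    c words and b times in a corpus of d words."""
    if c <= 0 or d <= 0:
        raise ValueError("corpus sizes c and d must be positive")
    if not (0 <= a <= c and 0 <= b <= d):
        raise ValueError("word frequencies must lie between 0 and the corpus size")

    total_hits = a + b
    expected_1 = c * total_hits / (c + d)  # E1: expected frequency in corpus 1
    expected_2 = d * total_hits / (c + d)  # E2: expected frequency in corpus 2

    def term(observed: int, expected: float) -> float:
        # A cell with zero observations contributes nothing to the sum.
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    return 2.0 * (term(a, expected_1) + term(b, expected_2))


# Example: the Spoken and Written rows of the hit-count table on this page.
score = log_likelihood(579, 743, 10_409_858, 87_903_571)
print(f"log-likelihood: {score:.2f}")
```

The score is 0 when the word has the same relative frequency in both corpora and grows as the two relative frequencies diverge. With one degree of freedom, values above 3.84 are conventionally read as significant at p < 0.05.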

■ #1279. Strengths and Weaknesses of the BNC [bnc][corpus][representativeness]

■ #1278. A collection of links on corpus research, centering on the BNC [corpus][bnc][link][web_service][lltest]

■ #1276. hereby, hereof, thereto, therewith, etc. [compounding][synthesis_to_analysis][adverb][register][corpus][bnc][hc]

■ #1105. grey eyes as a description of a beautiful woman (2) [romance][adjective][collocation][bnc][corpus]

■ #1088. lingua franca (3) [elf][model_of_englishes][global_language][bnc]

■ #1068. choose between war or peace [conjunction][corpus][bnc][preposition]

■ #1035. The order of personal pronouns in a list [personal_pronoun][corpus][bnc][honorific]

■ #987. Don't drink more pints of beer than you can help. (1) [negative][comparison][idiom][syntax][corpus][bnc]

■ #930. Number agreement with a large number of people [agreement][number][syntax][bnc][corpus]

■ #914. A BNC-based survey of generational differences in vocabulary [bnc][corpus][statistics][lltest][interjection]

■ #913. A BNC-based survey of gender differences in vocabulary [bnc][corpus][statistics][lltest][interjection][gender_difference]

■ #845. The origins of Present-Day English vocabulary and their proportions [lexicology][loan_word][statistics][bnc][corpus]

■ #757. decline + gerund [syntax][gerund][bnc][corpus]

■ #737. Contamination of constructions [blend][contamination][syntax][superlative][bnc][corpus]

■ #711. Log-Likelihood Tester CGI, Ver. 2 [corpus][bnc][statistics][web_service][cgi][lltest]

■ #710. Differences in usage between though and although (2) [bnc][corpus][lltest][conjunction][statistics]

■ #708. Frequency Sorter CGI [corpus][bnc][statistics][web_service][cgi][lexicology][plural]

■ #697. Log-Likelihood Tester CGI [corpus][bnc][statistics][web_service][cgi][lltest][sociolinguistics]

■ #696. Log-Likelihood Test [corpus][bnc][statistics][lltest]

■ #668. Words that rhyme with knight in Chaucer [chaucer][bnc][collocation][kyng_alisaunder]