corpus / hellog: History of the English Language Blog

Last updated: 2026-07-15 01:27

2010-10-10 Sun

■ #531. Can the OED's quotation data be used as a corpus? [oed][corpus][representativeness]

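　The idea of using the OED (2nd ed. CD-ROM) as a historical corpus of English has been widely shared, especially since the electronic edition was published, and many studies have in fact used the OED as a corpus. But the OED was never compiled as a corpus, so how valid is it to treat its collected quotations as one? Knowing one's research tools seems to me as important as the research itself, so I summarise here the main points of Hoffmann's paper on this topic. (I have tended to use the OED in my own research without properly understanding its characteristics as a tool, so this is meant as a memo to myself. I have drawn on the paper by Professor Harumi Tanabe.)
　Hoffmann approaches the question of whether the OED's quotations can be used as a corpus from four angles. Below I summarise each angle together with Hoffmann's conclusion on it.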

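　(1) Selection criteria for the quotations
　　　Measured against a strict definition of a corpus, "a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language" (19; cited from Sinclair), the OED quotations cannot be regarded as a corpus. It is certainly true that the quotations gathered under an individual headword do not form an adequate corpus for studying that headword, because rare and unusual forms and senses of the word tend to be foregrounded. But as long as one is not focusing on a particular headword, the OED quotations taken as a whole can be considered representative of the English of each period, and it is reasonable to use them as a corpus.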

　(2) Representativeness and balance of the quotations
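　　　The OED quotations are "true quotations" (20), that is, actually taken from some source. Examples invented by the editors do exist, but they are very few. The sources span a wide range of genres, with no extreme bias such as a restriction to literary works, so in terms of genre the quotations can be called "representative". However, the genres are not distributed in proportions suited to linguistic research, so the quotations cannot be called "balanced". One example is that Shakespeare alone supplies 33,000 quotations. When treating the OED as a corpus, care is needed with respect to "balance".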

　(3) Reliability of the data format
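　　　On average, about 20-25% of the quotations have part of the text omitted. In most omissions the sentence structure is left intact, but in some cases an ill-judged omission alters the structure of the sentence. Care is needed when using the OED to investigate structures at clause level or above.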

　(4) Quantification of the results
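　　　Plotting the number of quotations per year shows a small peak of over 4,000 quotations around the 17th century and a large peak of over 10,000 in the 19th century, followed by a sharp drop in the 20th century. By contrast, the number of words per quotation is roughly constant at about 13 regardless of period; the only noticeable difference is that 20th-century quotations are slightly longer. From the total of over 2.4 million quotations (the first edition had about 1.8 million) and the average length just mentioned, the total number of words in the OED quotations is estimated at 33-35 million. When using the OED as a corpus, search results should be interpreted with care, bearing in mind in particular that 19th-century quotations are especially numerous.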
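As a rough check on the arithmetic in (4), here is a minimal Python sketch. The constants are the rounded figures quoted in the post; the function name `per_thousand_quotations` and the counts passed to it in the usage lines are hypothetical, made up purely for illustration and not taken from Hoffmann.

```python
# Rounded figures quoted in the post (the variable names are illustrative).
QUOTATIONS = 2_400_000                 # "over 2.4 million" quotations
MEAN_WORDS = 13                        # "about 13" words per quotation
ESTIMATE = (33_000_000, 35_000_000)    # estimated total word count

# Total size implied by the two rounded figures alone.
rounded_total = QUOTATIONS * MEAN_WORDS

# Mean quotation length that the 33-35 million estimate would imply
# if there were exactly 2.4 million quotations.
implied_mean = tuple(total / QUOTATIONS for total in ESTIMATE)


def per_thousand_quotations(hits, quotations_in_period):
    """Hits per 1,000 quotations, so that periods of unequal size can be compared."""
    return 1000 * hits / quotations_in_period


print(f"{rounded_total:,} words from the rounded figures")
print("implied mean length:", [round(m, 2) for m in implied_mean])

# Hypothetical counts: the same 50 hits weigh very differently
# in a sparsely and a densely sampled period.
print(per_thousand_quotations(50, 100_000))
print(per_thousand_quotations(50, 600_000))
```

The rounded figures give a total somewhat below the 33-35 million estimate, which is what one would expect when "over 2.4 million" and "about 13" are both rounded down. The normalising function is only one possible choice of base; the quoted conclusion itself warns that normalized counts can suggest more precision than the data supports, so the results are best read as tendencies.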

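　Finally, I quote Hoffmann's concluding passage (26). The conclusion is a commonsense one, namely that the OED quotations are useful for the study of language change as a corpus showing tendencies of change in rough quantitative terms, but I found the concrete figures he gives helpful.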

Although the OED quotations database is not a completely balanced and representative corpus, it can nevertheless provide the linguist with a wealth of useful information. The data it contains chiefly represents naturally occurring language, and the time-span covered is unmatched by any other source of computerized data. Even though over 20 per cent of all its quotations have been shortened, the large majority of these deletions is unlikely to distort the results of many diachronic studies of linguistic features. Given the nature of the data, normalized frequency counts might suggest an inappropriate level of precision, but tendencies in the development over time can nevertheless be expressed in quantitative terms. (26)

　・ The Oxford English Dictionary. 2nd ed. CD-ROM. Version 3.1. Oxford: OUP, 2004.
　・ Hoffmann, Sebastian. "Using the OED quotations database as a Corpus --- A Linguistic Appraisal." ICAME Journal 28 (April 2004): 17--30. Available online at http://icame.uib.no/ij28/index.html .
　・ Tanabe, Harumi. "The Rivalry of give up and its Synonymous Verbs in Modern English." Language Change and Variation from Old English to Late Modern English: A Festschrift for Minoji Akimoto. Ed. Merja Kytö, John Scahill, and Harumi Tanabe. Bern: Peter Lang, 2010. 253--75.

n	word	ice-sin.freq.	ice-sin.lst %	flob.freq.	flob.lst %	keyness
1	uh	8,230	0.74	8		19,246.0
2	you	18,175	1.64	7,258	0.29	17,768.5
3	uhm	3,838	0.35	0		9,021.1
4	ya	3,580	0.32	10		8,283.9
5	i	15,166	1.37	12,230	0.49	7,051.3
6	singapore	3,041	0.27	64		6,570.0
7	word	3,490	0.32	482	0.02	5,621.8
8	know	4,768	0.43	1,534	0.06	5,345.5
9	okay	2,296	0.21	28		5,112.0
10	so	6,759	0.61	4,452	0.18	4,113.8
11	lah	1,747	0.16	2		4,074.4
12	it's	3,585	0.32	1,186	0.05	3,949.9
13	your	3,485	0.31	1,642	0.07	2,972.2
14	oh	1,952	0.18	344	0.01	2,900.2
15	think	2,761	0.25	1,208	0.05	2,501.5
16	ah	1,288	0.12	142		2,204.9
17	we	5,884	0.53	5,406	0.22	2,190.7
18	is	15,022	1.36	20,588	0.83	2,027.9
19	don't	2,372	0.21	1,196	0.05	1,904.9
20	what	4,635	0.42	4,072	0.16	1,865.8
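One statistic commonly used for keyness is the log-likelihood ratio, and the sketch below shows how such a score can be computed from two frequencies and two corpus sizes. That the table above was produced with exactly this measure is an assumption on my part. The two corpus sizes are also assumptions: they are back-calculated from the table's own frequency and percentage columns (8,230 / 0.74% and 7,258 / 0.29%), not official token counts, and the function name is illustrative.

```python
import math


def log_likelihood(freq_study, freq_ref, size_study, size_ref):
    """Two-cell log-likelihood keyness for one word in a study and a reference corpus."""
    total = freq_study + freq_ref
    combined = size_study + size_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * total / combined
    expected_ref = size_ref * total / combined
    score = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# ASSUMED corpus sizes, inferred approximately from the table's percentages.
ICE_SIN_TOKENS = 1_110_000
FLOB_TOKENS = 2_500_000

uh = log_likelihood(8_230, 8, ICE_SIN_TOKENS, FLOB_TOKENS)
uhm = log_likelihood(3_838, 0, ICE_SIN_TOKENS, FLOB_TOKENS)
print(round(uh, 1), round(uhm, 1))
```

With these approximate sizes the score for "uh" comes out in the same region as the tabulated value, but an exact match should not be expected, since the sizes are only inferred and tools differ in the details of the formula.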

	while	whilst	whiles
(1) Dracula	14 (12.61%)	95 (85.59%)	2 (1.80%)
(2) LOB	517 (88.68%)	66 (11.32%)	0 (0.00%)
(3) BNC	48,761 (89.41%)	5,773 (10.59%)	0 (0.00%)
(4) Brown	592 (100.00%)	0 (0.00%)	0 (0.00%)
(5) OANC	7,893 (100.00%)	0 (0.00%)	0 (0.00%)
(6) COCA	246,207 (99.82%)	447 (0.18%)	0 (0.00%)
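The percentages in the table above are shares within each row, that is, each form's count divided by the row total. A minimal sketch that recomputes two of the rows from the raw counts (the function name `row_shares` is illustrative):

```python
def row_shares(counts):
    """Percentage share of each form within one corpus, rounded to two decimals."""
    total = sum(counts.values())  # assumes at least one hit in the row
    return {form: round(100 * n / total, 2) for form, n in counts.items()}


dracula = row_shares({"while": 14, "whilst": 95, "whiles": 2})
lob = row_shares({"while": 517, "whilst": 66, "whiles": 0})
print(dracula)
print(lob)
```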

word	freq.
heir	1
Henri	1
herb	2
hereditary	3
Hermes	1
historian	1
historic	6
historical	1
HMO	10
homage	4
hommage	5
honest	24
honor	5
honorable	14
honorarium	1
honorary	13
honored	1
honorific	3
hour	135
hourglass	1
hourlong	3
hourly	1
hours-long	1

WORD	PERIOD	oft	often
REGEX	PERIOD	/\bofte?\b/	/\boft[ei]n\b/
OE	O1	0	0
	O2	72	0
	OX/2	4	0
	O3	45	0
	O2/3	32	0
	OX/3	106	0
	O4	9	0
	O2/4	8	0
	O3/4	37	0
	OX/4	2	0
ME	M1	67	0
	MX/1	20	0
	M2	60	4
	MX/2	9	1
	M3	63	4
	M2/3	15	0
	M4	15	7
	M2/4	3	0
	M3/4	17	1
	MX/4	20	0
EModE	E1	14	28
	E2	14	33
	E3	9	78
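The two regular expressions in the REGEX row can be applied per period roughly as follows. This is a minimal sketch: the sample sentences are invented for illustration and are not drawn from any corpus, the function name `count_forms` is hypothetical, and matching case-insensitively is my assumption, since the table shows no flags.

```python
import re

# Patterns copied from the REGEX row of the table; IGNORECASE is an assumption.
OFT = re.compile(r"\bofte?\b", re.IGNORECASE)
OFTEN = re.compile(r"\boft[ei]n\b", re.IGNORECASE)


def count_forms(texts_by_period):
    """Count oft-type and often-type tokens in each period's text."""
    return {
        period: {"oft": len(OFT.findall(text)), "often": len(OFTEN.findall(text))}
        for period, text in texts_by_period.items()
    }


# Invented sample sentences, keyed by period codes like those in the table.
sample = {
    "M3": "He cam ful ofte to that place, and oft he wepte.",
    "E2": "It is often seene, and oftin told, yet oft forgot.",
}
counts = count_forms(sample)
print(counts)
```

The word boundaries matter here: without the trailing `\b`, the first pattern would also match the first three or four letters of "often" and inflate the oft column.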

■ #531. Can the OED's quotation data be used as a corpus? [oed][corpus][representativeness]

■ #518. Extracting the keywords of Singapore English [text_tool][corpus][flob][ice][singapore_english][keyword]

■ #517. The seven regional-variety corpora provided by ICE [corpus][ice]

■ #510. The disappearance of whilst in American English [corpus][coha][ame_bre][ame]

■ #509. whilst in Dracula (2) [corpus][lob][brown][bnc][oanc][coca][lmode][conjunction]

■ #506. CoRD --- an information centre for historical corpora of English [corpus][link]

■ #493. It's raining cats and dogs. [idiom][corpus][etymology]

■ #492. Variation in the past-tense forms of strong verbs in the Early Modern English period [emode][verb][conjugation][variation][corpus][ppceme]

■ #477. That's gorgeous! (2) [coca][corpus][ame][semantic_change][americanisation]

■ #476. That's gorgeous! [bnc][corpus][bre][semantic_change][etymology][gender_difference]

■ #475. That's a whole nother story. [metanalysis][corpus][ame]

■ #462. Unpronounced word-initial <h> extracted from the BNC [corpus][bnc][oanc][ame][bre][h][spelling_pronunciation]

■ #461. Unpronounced word-initial <h> extracted from the OANC [corpus][oanc][ame][h][article]

■ #460. OANC ( Open American National Corpus ) [corpus][oanc][ame]

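■ #428. Points to note when using the Brown family of corpora [corpus][ame_bre][brown]

■ #424. The sharp decline of wh- relative pronouns in present-day American English [relative_pronoun][syntax][ame][corpus]

■ #381. Diachronic change in the distribution of oft and often [corpus][hc]

■ #368. Corpora have broadened the possibilities of research [corpus]

■ #367. Points to note when using corpora (2) [corpus]

■ #363. Three axes in the development of English corpora [corpus]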