representativeness / hellog (History of English Blog)

Last updated: 2026-07-10 19:54

2020-03-07 Sat

■ #3967. Caveats on using corpora (3) [corpus][methodology][representativeness]

  I have taken up this topic on various occasions, including in the following posts.

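  ・ "#307. Caveats on using corpora" ([2010-02-28-1])
  ・ "#367. Caveats on using corpora (2)" ([2010-04-29-1])
  ・ "#428. Caveats on using the Brown family of corpora" ([2010-06-29-1])
  ・ "#1280. The representativeness of corpora" ([2012-10-28-1])
  ・ "#2584. The representativeness of historical English corpora" ([2016-05-24-1])
  ・ "#2779. Corpora can be used for research on the history of English, and yet" ([2016-12-05-1])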

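  Corpus-based research on English (and its history) keeps growing and is now taken for granted in the field, which is exactly why it is important to review the caveats that apply when using corpora. The gist largely repeats earlier posts, but this time I will point out four caveats from Fischer et al. (14), the authors of an introductory survey of English historical syntax.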

(i) there can be tension between what is easily retrieved through corpus searches and what is thought to be linguistically most significant; a historical syntactic case in point involves patterns of co-reference of noun phrases . . . ; these have been largely neglected because they involve information status, which is currently not part of any standard annotation scheme;

(ii) when a data search yields large numbers of hits, there may be a temptation to interpret corpus results merely as numbers, which is a severely reductive approach; in cases of grammaticalization, for example, changes in frequency may act as tell-tale signs . . . , but an exclusive quantitative focus will mean that one is ignoring the changes in meaning and context that form the core of the process;

(iii) the substantial amounts of data that can be collected from a corpus can also blind researchers to the dangers of making generalizations about the language as a whole on the basis of a partial view of it; this is a particularly relevant problem for diachronic research, because we only have very incomplete evidence for the state of the language in any historical period . . . ;

(iv) trying to achieve greater representativeness by collecting and comparing data from various corpora can also be tricky: principles guiding text inclusion vary widely, there is little standardization in user interfaces, and they can require a significant time investment to learn to operate.

  Paraphrased freely in my own words, the four points come out as follows.

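  (i) Beware that the questions a corpus makes easy to pursue are not necessarily the linguistically meaningful ones.
  (ii) Corpora serve quantitatively oriented research well, but there is a risk that the qualitative perspective gets overlooked.
  (iii) Even a huge corpus is not fully representative (the so-called "bad-data problem" of historical linguistics).
  (iv) Grasp the assumptions of the corpus compilers and the intentions of the interface designers, and take care to master how each corpus is used.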

　・ Fischer, Olga, Hendrik De Smet, and Wim van der Wurff. A Brief History of English Syntax. Cambridge: CUP, 2017.

Period	shew forms	show forms	Total words
1710--1780	335	1,545	10,480,431
1780--1850	159	3,100	11,285,587
1850--1920	92	5,118	12,620,207
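
Raw hit counts like those in the table above are hard to compare across periods, because the three sub-corpora differ in size. The following is a minimal Python sketch that normalizes the counts per million words and computes the share of shew-type spellings. It assumes that the columns are hits for shew-type spellings, hits for show-type spellings, and the total word count of each period. The names `PeriodCounts`, `per_million`, and `shew_share` are illustrative and do not come from the original post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodCounts:
    """One row of the table: raw hits and sub-corpus size for a period."""
    period: str
    shew: int         # hits for shew-type spellings (assumed meaning of the column)
    show: int         # hits for show-type spellings (assumed meaning of the column)
    total_words: int  # size of the sub-corpus in words


# Figures copied from the table above.
ROWS = [
    PeriodCounts("1710--1780", 335, 1545, 10_480_431),
    PeriodCounts("1780--1850", 159, 3100, 11_285_587),
    PeriodCounts("1850--1920", 92, 5118, 12_620_207),
]


def per_million(hits: int, total_words: int) -> float:
    """Normalize a raw hit count to a rate per million words."""
    return hits / total_words * 1_000_000


def shew_share(row: PeriodCounts) -> float:
    """Proportion of shew-type spellings among all shew/show hits."""
    return row.shew / (row.shew + row.show)


if __name__ == "__main__":
    for row in ROWS:
        print(
            f"{row.period}: "
            f"shew {per_million(row.shew, row.total_words):.1f} pmw, "
            f"show {per_million(row.show, row.total_words):.1f} pmw, "
            f"shew share {shew_share(row):.1%}"
        )
```

Figures of this kind are only a starting point. As caveat (ii) above warns, normalized frequencies say nothing by themselves about the texts and contexts in which each spelling occurs.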

	C12b	C13a	C13b	C14a	Total
N	0 (0.000%)	362 (0.062)	0 (0.000)	52,883 (9.083)	53,245 (9.146)
NEM	11,342 (1.948)	0 (0.000)	3,980 (0.684)	2,344 (0.403)	17,666 (3.034)
NWM	0 (0.000)	58,332 (10.019)	16,173 (2.778)	0 (0.000)	74,505 (12.797)
SEM	40,082 (6.885)	26,722 (4.590)	21,921 (3.765)	31,408 (5.395)	120,133 (20.634)
SWM	1,030 (0.177)	90,400 (15.527)	106,981 (18.375)	108 (0.019)	198,519 (34.098)
SW	1,168 (0.201)	2,610 (0.448)	46,032 (7.907)	30,517 (5.242)	80,327 (13.797)
SE	0 (0.000)	4,043 (0.694)	3,199 (0.549)	30,561 (5.249)	37,803 (6.493)
Total	53,622 (9.210)	182,469 (31.341)	198,286 (34.058)	147,821 (25.390)	582,198 (100.000)

	C12b	C13a	C13b	C14a	Total
N	0 (0.00%)	1 (0.86)	0 (0.00)	7 (6.03)	8 (6.90)
NEM	1 (0.86)	0 (0.00)	5 (4.31)	2 (1.72)	8 (6.90)
NWM	0 (0.00)	9 (7.76)	5 (4.31)	0 (0.00)	14 (12.07)
SEM	4 (3.45)	7 (6.03)	14 (12.07)	7 (6.03)	32 (27.59)
SWM	2 (1.72)	13 (11.21)	17 (14.66)	1 (0.86)	33 (28.45)
SW	3 (2.59)	5 (4.31)	7 (6.03)	2 (1.72)	17 (14.66)
SE	0 (0.00)	2 (1.72)	1 (0.86)	1 (0.86)	4 (3.45)
Total	10 (8.62)	37 (31.90)	49 (42.24)	20 (17.24)	116 (100.00)

■ #3967. Caveats on using corpora (3) [corpus][methodology][representativeness]

■ #3372. A few notes on the limitations of Old English and Middle English sources [oe][me][philology][manuscript][statistics][representativeness][methodology][evidence]

■ #2865. Linguistic evidence that tends to survive and evidence that tends to vanish: hints from taphonomy [philology][writing][manuscript][representativeness][textual_transmission][evidence]

■ #2779. Corpora can be used for research on the history of English, and yet [hel_education][corpus][methodology][philology][representativeness]

■ #2598. Problems with Middle English corpora to keep in mind when investigating the influence and spread of Old Norse [old_norse][loan_word][me_dialect][representativeness][geography][lexical_diffusion][lexicology][methodology][laeme][corpus]

■ #2584. The representativeness of historical English corpora [representativeness][corpus][methodology][hc][register]

■ #2521. 113 variant spellings of "such" in Early Middle English [spelling][eme][laeme][corpus][scribe][me_dialect][representativeness]

■ #1739. AmE-BrE Diachronic Frequency Comparer [corpus][ame_bre][web_service][cgi][frequency][representativeness]

■ #1716. shew and show (3) [spelling][corpus][clmet][representativeness]

■ #1280. The representativeness of corpora [corpus][representativeness][variety][idiolect][methodology]

■ #1279. Strengths and weaknesses of the BNC [bnc][corpus][representativeness]

■ #1264. The limits of historical linguistics and ways to overcome them [methodology][uniformitarian_principle][writing][history][sociolinguistics][laeme][corpus][representativeness][evidence]

■ #1263. The representativeness of the LAEME Corpus (2) [laeme][corpus][representativeness]

■ #1262. The representativeness of the LAEME Corpus (1) [laeme][corpus][representativeness]

■ #1243. Toward diachronic research that takes word frequency into account [frequency][corpus][representativeness]

■ #773. A comparison of PPCMBE and COHA [corpus][coha][ppcmbe][lmode][adjective][comparison][inflection][representativeness]

■ #568. A definition of "corpus" and an introduction to English corpora [corpus][link][representativeness]

■ #531. Can the OED's quotation data be used as a corpus? [oed][corpus][representativeness]