#956. COCA N-Gram Search

2011-12-09

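In posts ##953, 954 and 955, I made use of the recently released n-gram database for COCA (Corpus of Contemporary American English). For the 2-grams, 3-grams, 4-grams and 5-grams that occur in COCA, it lists roughly the one million most frequent sequences of each length. With a local copy and a little ingenuity, one can search for collocations that are hard to retrieve through the COCA interface alone.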
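However, each n-gram database is a text file of several tens of megabytes, which is too heavy to search directly. I therefore loaded the data into an SQLite database and wrote a search program that accepts SQL queries. The CGI on this page outputs only the first 10 rows of each search result.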
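The loading step described above can be sketched roughly as follows. This is a minimal illustration, not the program behind the CGI: it assumes the n-gram text file is tab-separated in the order freq, word1, word2, pos1, pos2, uses two made-up sample rows in place of the real file, and the helper names `load_bigrams` and `run_select` are hypothetical. The table and field names follow the ones used in this post.

```python
import csv
import io
import sqlite3

# Made-up sample rows in the ASSUMED file layout:
# freq <TAB> word1 <TAB> word2 <TAB> pos1 <TAB> pos2
SAMPLE = "120\thandsome\tman\tjj\tnn1\n45\thandsome\tface\tjj\tnn1\n"


def load_bigrams(conn, lines):
    """Create the table "two" and fill it from tab-separated lines (hypothetical helper)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS two "
        "(freq INTEGER, word1 TEXT, word2 TEXT, pos1 TEXT, pos2 TEXT)"
    )
    # QUOTE_NONE keeps quotation marks in the corpus text from being read as CSV quoting.
    rows = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    conn.executemany("INSERT INTO two VALUES (?, ?, ?, ?, ?)", rows)
    # An index on word1 speeds up the typical "where word1 = ..." lookups.
    conn.execute("CREATE INDEX IF NOT EXISTS two_word1 ON two (word1)")
    conn.commit()


def run_select(conn, sql, limit=10):
    """Run a select statement and return at most `limit` rows (hypothetical helper)."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only select statements are accepted")
    return conn.execute(sql).fetchmany(limit)


conn = sqlite3.connect(":memory:")
load_bigrams(conn, io.StringIO(SAMPLE))
result = run_select(
    conn, "select * from two where word1 = 'handsome' order by freq desc;"
)
for row in result:
    print(row)
```

With the real data, `io.StringIO(SAMPLE)` would be replaced by an open file handle and `":memory:"` by a database file path. The prefix check on "select" is only a crude guard; opening the database read-only would be a sturdier way to enforce the same restriction.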

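Usage is as follows. Table names follow the value of "n" in n-gram: "two", "three", "four", "five". The distribution also includes a 1-gram database (in effect, a frequency list of the words that occur three or more times in COCA), which is accessible as table "one". Every table has a "freq" field (frequency). In addition, depending on the value of "n", there are the word forms "word1" through "word5" (case-sensitive) and the COCA part-of-speech tags "pos1" through "pos5". Only select statements are accepted. Some typical queries are given below as examples.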

# 1-grams: list prepositions in order of frequency (word forms are case-sensitive, so the counts need to be re-aggregated)
select * from one where pos1 like "i%" order by freq desc;
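The re-aggregation mentioned in the comment above could be done in SQL itself by grouping on the lowercased word form. The following is a sketch only: the rows and frequencies are invented for illustration, and the table is built in memory with the field names described in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE one (freq INTEGER, word1 TEXT, pos1 TEXT)")
# Invented counts; "of"/"Of" and "in"/"In" stand for case variants of one word.
conn.executemany(
    "INSERT INTO one VALUES (?, ?, ?)",
    [
        (900, "of", "io"),
        (100, "Of", "io"),
        (700, "in", "ii"),
        (250, "In", "ii"),
        (500, "the", "at"),
    ],
)

# Merge case variants by grouping on lower(word1) and summing their frequencies.
preps = conn.execute(
    "SELECT lower(word1) AS word, sum(freq) AS total "
    "FROM one WHERE pos1 LIKE 'i%' "
    "GROUP BY lower(word1) ORDER BY total DESC"
).fetchall()
print(preps)
```

Note that SQLite's built-in `lower()` folds ASCII letters only, which is normally sufficient for English word forms.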

# 2-grams: list the nouns described as "handsome" in order of frequency
select * from two where word1 = "handsome" and pos1 = "jj" and pos2 like "nn_" order by freq desc;

# 2-grams: list the adjectives intensified in "absolutely (adj.)" in order of frequency (see post [2011-03-12-1], "#684. semantic prosody and grammatical categories")
select * from two where word1 = "absolutely" and pos2 = "jj" order by freq desc;

# 3-grams: extract high-frequency as ... as expressions
select * from three where word1 = "as" and word3 = "as" order by freq desc;

# 4-grams: extract high-frequency from ... to ... expressions
select * from four where word1 = "from" and pos1 = "ii" and word3 = "to" and pos3 = "ii" order by freq desc;

# 5-grams: look into causes of death; observe the variation between "die of" and "die from"
select * from five where word1 in ("die", "dies", "died", "dying") and pos1 like "vv%" and word2 in ("of", "from") and pos2 like "i%" order by word3;
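One way to summarise the "die of" / "die from" variation is to total the frequencies per preposition rather than reading the rows one by one. The sketch below assumes the 5-gram table layout described in this post; the rows and counts are invented purely to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE five (freq INTEGER, "
    "word1 TEXT, word2 TEXT, word3 TEXT, word4 TEXT, word5 TEXT, "
    "pos1 TEXT, pos2 TEXT, pos3 TEXT, pos4 TEXT, pos5 TEXT)"
)
# Invented rows for illustration; the last one must not be counted.
rows = [
    (30, "died", "of", "a", "heart", "attack", "vvd", "io", "at1", "nn1", "nn1"),
    (12, "died", "from", "a", "heart", "attack", "vvd", "ii", "at1", "nn1", "nn1"),
    (8, "die", "of", "cancer", "in", "the", "vv0", "io", "nn1", "ii", "at"),
    (5, "dying", "from", "the", "effects", "of", "vvg", "ii", "at", "nn2", "io"),
    (99, "died", "in", "a", "car", "accident", "vvd", "ii", "at1", "nn1", "nn1"),
]
conn.executemany("INSERT INTO five VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Same conditions as the query above, but summed per preposition.
totals = dict(
    conn.execute(
        "SELECT word2, sum(freq) FROM five "
        "WHERE word1 IN ('die', 'dies', 'died', 'dying') AND pos1 LIKE 'vv%' "
        "AND word2 IN ('of', 'from') AND pos2 LIKE 'i%' "
        "GROUP BY word2"
    ).fetchall()
)
print(totals)
```

Because the database holds only the most frequent 5-grams, such totals cover the high-frequency sequences and not every occurrence in the corpus.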

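To get the most out of the n-gram database, further work along these lines will be needed, such as narrowing the conditions on the basis of the results obtained, or cross-checking the results of several queries against one another.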
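As one possible illustration of cross-checking results, the 2-gram counts for "absolutely" + adjective can be set against the overall 1-gram frequency of each adjective, so that adjectives which are merely common do not dominate the list. This is a sketch under assumptions: the rows are invented, and the "share" measure is chosen here only for illustration, not taken from the original program.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE one (freq INTEGER, word1 TEXT, pos1 TEXT)")
conn.execute(
    "CREATE TABLE two (freq INTEGER, word1 TEXT, word2 TEXT, pos1 TEXT, pos2 TEXT)"
)
# Invented counts for illustration only.
conn.executemany(
    "INSERT INTO one VALUES (?, ?, ?)",
    [(40000, "right", "jj"), (5000, "essential", "jj"), (1000, "Essential", "jj")],
)
conn.executemany(
    "INSERT INTO two VALUES (?, ?, ?, ?, ?)",
    [
        (400, "absolutely", "right", "rr", "jj"),
        (300, "absolutely", "essential", "rr", "jj"),
    ],
)

# Join the 2-gram hits with case-merged 1-gram totals for the adjective.
shares = conn.execute(
    "SELECT t.word2, t.freq, o.total, 1.0 * t.freq / o.total AS share "
    "FROM two AS t "
    "JOIN (SELECT lower(word1) AS w, sum(freq) AS total "
    "      FROM one WHERE pos1 = 'jj' GROUP BY lower(word1)) AS o "
    "  ON o.w = lower(t.word2) "
    "WHERE t.word1 = 'absolutely' AND t.pos2 = 'jj' "
    "ORDER BY share DESC"
).fetchall()
for row in shares:
    print(row)
```

In this toy data "right" has the higher raw count after "absolutely", but "essential" ranks first once its lower overall frequency is taken into account.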

Referrer (Inside): [2016-09-07-1] [2015-09-07-1] [2013-08-11-1] [2012-12-08-1]

