R-Grammi
R-GRAMMI is a simple n-gram / tf-idf text indexer and feature extractor written in Ruby. It recurses through a provided corpus (a directory of text files) and produces an output file in ARFF (Attribute-Relation File Format) containing the class of each document and its n-grams with their per-document frequencies.
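To make "n-grams with their frequencies per document" concrete, here is a minimal sketch of counting n-grams in a single document. It assumes word-level n-grams and a naive lowercase tokenizer; `ngram_counts` is a hypothetical helper written for illustration, not a method of R-GRAMMI.

```ruby
# Hypothetical helper (not part of R-GRAMMI): count word n-grams in one document.
def ngram_counts(text, n)
  # Naive tokenizer (an assumption): lowercase runs of letters and digits.
  tokens = text.downcase.scan(/[a-z0-9]+/)
  counts = Hash.new(0)
  # each_cons yields every window of n consecutive tokens.
  tokens.each_cons(n) { |gram| counts[gram.join(' ')] += 1 }
  counts
end

counts = ngram_counts('the cat sat on the mat the cat sat', 2)
counts.each { |gram, freq| puts "#{gram}\t#{freq}" }
```

A document with fewer than n tokens simply yields an empty hash.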
Each document is assigned to exactly one class (for example, implicitly through the folder structure of the corpus directory). The indexer also supports optional stemming of the documents and lower and upper frequency thresholds. The output file can be used for further processing in the WEKA machine learning toolkit, for example to train a classifier or as the basis of a text retrieval system.
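For readers who have not seen ARFF before, the following sketch renders per-document n-gram counts as a dense ARFF string with one numeric attribute per n-gram and a nominal class attribute. The helper `to_arff`, the input layout, and the choice of a dense rather than sparse layout are assumptions for illustration; the file R-GRAMMI actually writes may be laid out differently.

```ruby
# Hypothetical sketch (not R-GRAMMI's code): build a dense ARFF document.
def to_arff(relation, docs)
  grams   = docs.flat_map { |d| d[:counts].keys }.uniq.sort
  classes = docs.map { |d| d[:klass] }.uniq.sort

  lines = ["@relation #{relation}", '']
  # Attribute names containing spaces are quoted; n-grams that themselves
  # contain quotes would need escaping, which this sketch does not handle.
  grams.each { |g| lines << "@attribute '#{g}' numeric" }
  lines << "@attribute class {#{classes.join(',')}}"
  lines << '' << '@data'

  docs.each do |d|
    row = grams.map { |g| d[:counts].fetch(g, 0) } # 0 for absent n-grams
    lines << (row + [d[:klass]]).join(',')
  end
  lines.join("\n") + "\n"
end

docs = [
  { klass: 'sport', counts: { 'the match' => 2, 'final score' => 1 } },
  { klass: 'tech',  counts: { 'the match' => 1, 'open source' => 3 } }
]
arff = to_arff('output', docs)
puts arff
```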
Options & Use
- provide upper and lower bounds for the share of documents a certain n-gram must appear in (given as fractions, e.g. 0.4 for 40%)
- choose the n in n-grams
- use Porter stemming or not (English is the only supported stemming language)
- use stopwords or not (stopword files containing common stopwords for German, English, and Japanese are included)
- name of the output file
- name of the relation in the output file
- override method to retrieve class assignment
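The upper and lower bounds amount to filtering n-grams by document frequency. A minimal sketch of that idea follows, assuming the bounds are fractions of the corpus and are inclusive; `filter_by_doc_frequency` is a hypothetical helper, and the indexer's own handling of the boundaries may differ.

```ruby
# Hypothetical helper (not part of R-GRAMMI): keep n-grams whose document
# frequency, as a fraction of all documents, lies within [lower, upper].
def filter_by_doc_frequency(doc_counts, lower, upper)
  total = doc_counts.size.to_f
  df = Hash.new(0)
  # Count in how many documents each n-gram occurs at least once.
  doc_counts.each { |counts| counts.each_key { |gram| df[gram] += 1 } }
  df.select { |_gram, n| (n / total).between?(lower, upper) }.keys
end

docs = [
  { 'the' => 5, 'ruby' => 1 },
  { 'the' => 3, 'weka' => 2 },
  { 'the' => 4, 'ruby' => 2 },
  { 'the' => 1 }
]
kept = filter_by_doc_frequency(docs, 0.3, 0.6)
puts kept.inspect
```

Here 'the' occurs in every document and 'weka' in only a quarter of them, so both fall outside the 0.3 to 0.6 window and only 'ruby' is kept.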
These options can be set using the setter methods provided by the Indexer class. For example usage of the indexer, see the provided Rakefile:
# rake task to build the index
task :buildindex do
  # reopen Indexer to override the get_class method:
  # derive the class from the folder structure (the file's parent directory)
  Indexer.class_eval do
    def get_class(f)
      f.split('/')[-2]
    end
  end

  i = Indexer.new
  # working directory, root directory of the corpus
  i.wdir = Dir.getwd + '/corpora/my_text_corpus'
  # n in n-grams
  i.ngrams = 3
  # use stemming?
  i.stemming = true
  # use stopwords?
  i.stopwording = true
  # upper bound on the fraction of documents an n-gram may appear in (0.4 = 40%)
  i.upperbound = 0.4
  # lower bound on the fraction of documents an n-gram must appear in (0.01 = 1%)
  i.lowerbound = 0.01
  # name of the relation in the output file
  i.name = 'output'
  # file containing stopwords
  i.stw_file = 'stop_words_en.txt'
  # name of the index file
  i.out_file = Dir.getwd + '/output.arff'

  # create temporary data files containing n-grams and frequencies
  i.process_dir
  # build the output ARFF file from the temporary data
  i.build_arff
  # delete temporary files
  i.cleanup
end
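The expression used in the get_class override can be tried on its own to see which class a given file would receive. The paths below are hypothetical. Note that `[-2]` picks the file's immediate parent directory, so it matches the first level below the corpus root only when the corpus is exactly one folder deep.

```ruby
# Same expression as in the get_class override, applied to hypothetical paths.
class_of = ->(path) { path.split('/')[-2] }

puts class_of.call('/corpora/my_text_corpus/sci.space/60151.txt') # sci.space
puts class_of.call('/corpora/my_text_corpus/rec/autos/101.txt')   # autos

# An equivalent spelling using the standard File helpers:
puts File.basename(File.dirname('/corpora/my_text_corpus/sci.space/60151.txt'))
```

For a nested corpus, a different override (for example one that strips the corpus root first) would be needed to use the top-level folder as the class.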
Corpora
The program has been tested with the 20 Newsgroups corpus. More corpora can be found at Infochimps or the machine learning repository.