Introduction
This page describes a proposed virtual machine image for Natural Language Processing and Computational Linguistics. The machine itself will be created once the major packages for the first version have been identified. Initially the image should work with VMware Player, available from http://www.vmware.com/, but we should also consider providing a version that runs in the open source VirtualBox, from http://www.virtualbox.org, as some users may find that this works better for them.
The Virtual Machine
Once the virtual machine is configured, a link to the image (or to test versions) can be provided here. Access should be restricted if the image includes software that cannot be redistributed to third parties. We could also provide a second, "open" version that excludes packages and corpora with restrictive licences.
Proposed Software
Links are included only when a package is not already part of the standard Ubuntu repositories.
Guest operating system
- Ubuntu 10.04 LTS
- Long Term Support (LTS) release of Ubuntu, ideally configured for automatic updates and for simplified mounting of host file systems and networked file-space. Configured to use Gnome, although a version with a more lightweight desktop, such as LXDE, could also be made available.
Software available from the Ubuntu repositories
- R
- Statistics package, with a full complement of extensions for NLP. Some extensions may be missing from the Ubuntu repositories; if so, they should be documented below.
- Emacs-gtk
- Comprehensive editor and IDE, with GTK support.
- Eclipse
- Software development IDE.
- [Notepad++]
- [Graphical text editor, but not Linux native; requires Wine. Look into alternatives.]
- OpenJDK
- Full JDK, built using IcedTea.
- SWI Prolog
- Prolog implementation largely compatible with SICStus Prolog, with the XPCE graphical interface.
- Python
- Interpreted programming language, required by NLTK.
- Perl
- Scripting language.
- Chromium
- Lightweight web browser.
- Remmina
- Allows access to Microsoft Windows and X desktops via RDP, VNC or NX protocols. (Latest version to be installed by adding a PPA entry.)
- And …
- add your suggestions here.
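Once the image is built, a quick check that the command-line tools listed above are actually on the PATH may be useful. The following is a minimal Python sketch, not part of the proposal itself. The executable names in `EXPECTED_TOOLS` are assumptions about how the Ubuntu packages name their commands and should be checked against the real image. It relies on `shutil.which`, so it needs Python 3.3 or later.

```python
import shutil

# Assumed executable names for the packages listed above; these are
# illustrative guesses and may differ on the actual image.
EXPECTED_TOOLS = [
    "R",
    "emacs",
    "eclipse",
    "java",
    "swipl",
    "python",
    "perl",
    "chromium-browser",
    "remmina",
]


def missing_tools(tools, find=shutil.which):
    """Return the tools for which `find` reports no executable on the PATH."""
    return [tool for tool in tools if find(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(EXPECTED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All expected tools were found.")
```

The `find` parameter exists only so that the lookup can be replaced in tests; in normal use the default is sufficient.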
Software not thought to be included in the Ubuntu repositories
We need to double-check whether any of these packages are actually in the Ubuntu repositories, or in Ubuntu/Debian-compatible personal package archives (PPAs), which would allow them to be managed using the Synaptic/apt-get package management tools.
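One way to carry out this check is to ask APT whether it knows an installation candidate for each package. The sketch below, offered as a starting point only, parses the output of `apt-cache policy`. It assumes the usual `Candidate:` line in that command's output and requires Python 3.5 or later for `subprocess.run`.

```python
import subprocess


def parse_candidate(policy_output):
    """Extract the candidate version from `apt-cache policy` output.

    Returns None when there is no "Candidate:" line or when APT reports
    "(none)", i.e. the package cannot be installed from the configured
    repositories.
    """
    for line in policy_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Candidate:"):
            version = stripped[len("Candidate:"):].strip()
            return None if version in ("", "(none)") else version
    return None


def candidate_version(package):
    """Run `apt-cache policy` for one package and parse the result.

    This must be run on a system where apt-cache is installed.
    """
    result = subprocess.run(
        ["apt-cache", "policy", package],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return parse_candidate(result.stdout)
```

On the image, `candidate_version("python-nltk")` (a hypothetical package name) would return a version string if the configured repositories or PPAs provide the package, and `None` otherwise.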
- NLTK
- Python-based natural language toolkit. http://www.nltk.org
- jLSA and SVDlib
- Tools for performing Latent Semantic Analysis. ?link? These may no longer be available; for alternatives, see http://en.wikipedia.org/wiki/Latent_semantic_analysis. Another Java LSA package, which includes a Java port of the SVD libraries, is available from http://code.google.com/p/airhead-research/downloads/list. See http://code.google.com/p/airhead-research/wiki/LatentSemanticAnalysis for more details.
- Random Indexing
- LSI/PLSI-like dimensionality reduction, as provided by the Semantic Vectors package. See http://code.google.com/p/semanticvectors/
- Semantic Engine
- Recall search engine C++ library, supports document similarity clustering. http://code.google.com/p/semantic-engine/
- GATE
- NLP infrastructure. http://gate.ac.uk/
- UIMA
- NLP infrastructure. http://uima.apache.org/
- ROUGE
- Evaluation metrics for summarisation. http://berouge.com/default.aspx
- CLUTO
- Clustering software (http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview) and its visualisation front end, gCLUTO (http://glaros.dtc.umn.edu/gkhome/cluto/gcluto/download).
- Brill Tagger
- The original Brill tagger (http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/parsing/taggers/brill/0.html) and the GPoSTTL Enhanced Brill Tagger (http://sourceforge.net/projects/gposttl/).
- Stanford Software
- Including the Stanford Parser, among other tools. http://nlp.stanford.edu/software/lex-parser.shtml
- Other software
- Packages mentioned at http://nlp.stanford.edu/links/statnlp.html
- Corpora
- Smaller corpora for NLP and CL, ?links?, and documentation for adding larger corpora, or corpora that have a restrictive licence. See the resources page.
- And …
- add your suggestions here.