A project to provide unsupervised classification of large numbers of substantial (book-length) documents into the appropriate Book Industry Standards and Communications (BISAC) Subject Headings used throughout publishing and information science. Named in honor of the deeply knowledgeable bookstore employees and innovative information technologists at Borders Store #1 at 316 S. State Street in Ann Arbor, Michigan, which pioneered so many wonderful aspects of the modern bookbuying experience.
High-level requirements:
* freely scaling unsupervised classification of large collections of arbitrarily chosen documents of substantial length (somewhat like Rodents of Unusual Size, but in this case more often known as Books) into the BISAC subject categories. http://
* free
* open source
* updatable (BISAC categories change every year)
* configurable (not everyone cares equally about all categories)
* extensible (BISAC isn't enough for everyone)
* interoperable (plays nicely with others, MARC, etc.)
Immediate practical requirements (for Nimble Combinatorial Publishing, which is building version 0.1):
* "as good as a graduate student" (or Borders clerk!) classification accuracy for documents that are supplied without metadata, i.e. 85% or better (need not be perfect, but should be able to get better over time)
* Linux (specifically Ubuntu 10.04 LTS), Apache, Solr/Nutch, Python
* 2012 BISG categories
* accepts txt and, ultimately, html, epub, pdf format documents
* documents of up to 5 million words right off the bat (NCP already has some individual user-generated Explorers of ~2M)
* works adequately with initially small collections of documents (thousands) and a mostly sparse tree.
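The 85% accuracy target above can be checked with a very small harness once a hand-labelled sample exists. The following is only a sketch: the function names, the `TARGET` constant, and the idea of scoring exact matches on full BISAC codes are assumptions for illustration, not part of the project.

```python
# Sketch: score predicted BISAC codes against hand-assigned ones.
# TARGET mirrors the "85% or better" requirement; everything else here
# (names, exact-match scoring) is a hypothetical choice.
TARGET = 0.85


def accuracy(predicted, expected):
    """Return the fraction of documents whose predicted code matches."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("need at least one labelled document")
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return hits / len(expected)


def meets_target(predicted, expected, target=TARGET):
    """True when accuracy reaches the target threshold."""
    return accuracy(predicted, expected) >= target
```

Exact matching is the strictest possible scoring. With a mostly sparse tree it may be more informative to also count a prediction as correct when it lands on the right top-level subject, but that is a design decision the project has not stated.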
Architectural vision:
* use API crawls of metadata and open source text to assemble "text buckets" for each BISAC node
* use inexpensive text mining software for proof of concept (tmsk)
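The "text bucket" idea above can be sketched in a few lines of standard-library Python. This is not the tmsk-based proof of concept; it swaps in plain TF-IDF weighting with cosine similarity, and assigns a document to whichever bucket it most resembles. The function names, the toy bucket texts, and the two sample codes are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def weight(counts, idf):
    """Turn raw term counts into a unit-length TF-IDF vector (a dict)."""
    vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0:
        return {}
    return {t: v / norm for t, v in vec.items()}


def build_model(buckets):
    """buckets maps a BISAC code to the text gathered for that node."""
    counts = {code: Counter(tokenize(text)) for code, text in buckets.items()}
    n = len(counts)
    df = Counter()
    for c in counts.values():
        df.update(c.keys())  # document frequency: one count per bucket
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = {code: weight(c, idf) for code, c in counts.items()}
    return idf, vectors


def classify(text, idf, vectors):
    """Return (best_code, scores) using cosine similarity to each bucket."""
    doc = weight(Counter(tokenize(text)), idf)
    scores = {
        code: sum(w * vec.get(t, 0.0) for t, w in doc.items())
        for code, vec in vectors.items()
    }
    return max(scores, key=scores.get), scores


# Toy buckets; real ones would be assembled from crawled metadata and text.
BUCKETS = {
    "CKB000000": "bake bread flour oven recipe simmer sauce butter",
    "SCI004000": "telescope galaxy orbit planet star nebula comet",
}

idf, vectors = build_model(BUCKETS)
best, scores = classify("Simmer the sauce and bake the bread", idf, vectors)
print(best)
```

Words that never appear in any bucket are ignored, so a document with no overlap scores zero everywhere and the returned code is then arbitrary. A real system would want a "no confident match" outcome for that case.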
Development team vision:
* For this to work beyond my own needs, I am going to need some strong leaders from academic, library, and commercial backgrounds who have experience with and access to very large corpora.
Project information
- Maintainer: Fred Zimmerman
- Driver: Not yet selected
- Licence: Apache Licence, I don't know yet
Series and milestones
trunk series is the current focus of development.
Packages in Distributions
- ann source package in Xenial: Version 1.1.2+doc-5 uploaded
- ann source package in Trusty: Version 1.1.2+doc-4.1 uploaded
- ann source package in Mantic: Version 1.1.2+doc-9 uploaded
- ann source package in Lunar: Version 1.1.2+doc-9 uploaded
- ann source package in Kinetic: Version 1.1.2+doc-7build1 uploaded