Registered by Fred Zimmerman

A project to provide unsupervised classification of large numbers of substantial (book-length) documents into the appropriate Book Industry Standards and Communications (BISAC) subject headings used throughout publishing and information science. Named in honor of the deeply knowledgeable bookstore employees and innovative information technologists at Borders Store #1 at 316 S. State Street in Ann Arbor, Michigan, which pioneered so many wonderful aspects of the modern book-buying experience.

High level requirements:

* freely scaling unsupervised classification of large collections of arbitrarily chosen documents of substantial length (somewhat like Rodents of Unusual Size, but in this case more often known as Books) into the BISAC subject categories (http://www.bisg.org/activities-programs/activity.php?n=d&id=47&cid=20)
* free
* open source
* updatable (BISAC categories change every year)
* configurable (not everyone cares equally about all categories)
* extensible (BISAC isn't enough for everyone)
* interoperable (plays nicely with others, MARC, etc.)

Immediate practical requirements (for Nimble Combinatorial Publishing, which is building version 0.1):

* "as good as a graduate student" (or Borders clerk!) classification accuracy for documents that are supplied without metadata, i.e. 85% or better (need not be perfect, but should be able to get better over time)
* Linux, specifically Ubuntu 10.04 LTS, Apache, Solr/Nutch, Python
* 2012 BISG categories
* accepts TXT and, ultimately, HTML, EPUB, and PDF documents
* documents of up to 5 million words right off the bat (NCP already has some individual user-generated Explorers of ~2M words)
* works adequately with initially small collections of documents (thousands) and a mostly sparse tree.
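
The 85% accuracy target above could be checked against a small hand-labelled sample along these lines. This is only a sketch: the function names `accuracy` and `meets_target` and the constant `TARGET_ACCURACY` are hypothetical and not part of the project, and exact-match scoring on a single BISAC code per document is an assumption.

```python
# Hypothetical helper names; exact-match scoring is an assumption.
TARGET_ACCURACY = 0.85  # "as good as a graduate student" threshold from the requirements


def accuracy(predicted, expected):
    """Return the fraction of documents whose predicted code equals the expected code."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one labelled document is required")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def meets_target(predicted, expected, target=TARGET_ACCURACY):
    """Return True when accuracy on the labelled sample reaches the target."""
    return accuracy(predicted, expected) >= target
```

Scoring on exact matches is deliberately strict; a real evaluation might also give partial credit when the predicted code shares a parent with the expected one in the BISAC tree.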

Architectural vision:

* use API crawls of metadata and open source text to assemble "text buckets" for each BISAC node
* use inexpensive text mining software for proof of concept (tmsk)
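
The "text bucket" idea above might be prototyped roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: it swaps in plain term counts and cosine similarity (standard library only) rather than tmsk, and the function names `tokenize`, `build_buckets`, `cosine`, and `classify` are all hypothetical. The two BISAC codes in the usage comment are only illustrative labels.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_buckets(labelled_texts):
    """Merge all text gathered for each BISAC code into one term-count bucket.

    labelled_texts is an iterable of (bisac_code, text) pairs, for example
    metadata or open source text collected by an API crawl.
    """
    buckets = {}
    for code, text in labelled_texts:
        buckets.setdefault(code, Counter()).update(tokenize(text))
    return buckets


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(document, buckets):
    """Return the BISAC code whose bucket is most similar to the document."""
    doc = Counter(tokenize(document))
    return max(buckets, key=lambda code: cosine(doc, buckets[code]))


# Illustrative usage with two tiny buckets:
# buckets = build_buckets([
#     ("CKB000000", "recipe oven flour butter bake sauce"),
#     ("COM000000", "software program computer code algorithm"),
# ])
# classify("bake the flour and butter in the oven", buckets)
```

Raw counts keep the sketch short; weighting terms (for example with TF-IDF) and restricting the candidate codes to a configurable subset would be natural next steps for a sparse tree.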

Development team vision:

* For this to work beyond my own needs I am going to need some strong leaders from academia, library, and commercial backgrounds who have experience with and access to very large corpora.

Project information

Maintainer:
Fred Zimmerman
Driver:
Not yet selected
Licence:
Apache Licence, I don't know yet

Series and milestones

trunk series is the current focus of development.

Downloads

State Street Book Shelver does not have any download files registered with Launchpad.