A Simple Text Categorizer

By: Jeff Clark    Date: Sat, 29 Jul 2006

I've created a simple tool for categorizing text. I started by defining a topic hierarchy. I was hoping to reuse an existing one but had no luck finding one I thought was suitable. The end of this entry shows what I came up with.

For each topic in the hierarchy I grabbed the text of the corresponding Wikipedia entry and produced from it a word-based feature vector indicative of the topic. This excluded stop words (a, the, and, or, of, ...) as well as any words that appeared in more than 30% of the topic entries. I also manually tuned it by passing through all the topics and deleting words that I thought were too generic.
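As a rough illustration of that filtering step, here is a minimal Python sketch. It is not the actual tool: the function names, the tiny stop-word list, the tokenizer, and the use of raw word counts as vector weights are all assumptions made for the example. Only the stop-word exclusion and the 30% cutoff come from the description above.

```python
import re
from collections import Counter

# Assumed, abbreviated stop-word list; the real list is not given in the post.
STOP_WORDS = {"a", "the", "and", "or", "of", "in", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_feature_vectors(topic_texts, max_doc_fraction=0.30):
    """Map each topic name to a Counter of word counts.

    Drops stop words and any word that occurs in more than
    max_doc_fraction of the topic entries.
    """
    tokens = {
        topic: [w for w in tokenize(text) if w not in STOP_WORDS]
        for topic, text in topic_texts.items()
    }
    # Document frequency: number of topic entries that contain each word.
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))
    limit = max_doc_fraction * len(topic_texts)
    return {
        topic: Counter(w for w in words if doc_freq[w] <= limit)
        for topic, words in tokens.items()
    }
```

The manual tuning pass would then amount to deleting further keys from the returned vectors by hand.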

I wrote a little tool that calculates a similarity score for each node in the topic tree. The score for a node with descendants is the maximum score of all its descendants. The scores are used to assign the best representative topic at each level for a given piece of text. So, for example, I can input a weblog post and automatically assign it to Arts, Economy, Recreation, Science, Society, or Technology at the top level. If the score is very low for all the topics then it gets labeled as None.
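The scoring can be sketched roughly as follows. This is only an illustration under several assumptions: cosine similarity stands in for the similarity score (the measure actually used is not stated here), the tree is a plain nested dict, and the 0.05 cutoff for None is an arbitrary placeholder. The names cosine, score_tree, and best_path are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors (assumed measure)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_tree(node, text_vec):
    """Store a score on every node and return the score of this node.

    A leaf is scored against the text directly; a node with children
    takes the maximum score found among its descendants.
    """
    children = node.get("children", [])
    if children:
        node["score"] = max(score_tree(child, text_vec) for child in children)
    else:
        node["score"] = cosine(node["vector"], text_vec)
    return node["score"]


def best_path(root, threshold=0.05):
    """Return the best topic at each level, or ["None"] if all scores are low."""
    path = []
    node = root
    while node.get("children"):
        node = max(node["children"], key=lambda child: child["score"])
        if node["score"] < threshold:  # assumed cutoff for "very low"
            break
        path.append(node["name"])
    return path or ["None"]


# Tiny made-up tree and post, only to show the call pattern.
tree = {"name": "root", "children": [
    {"name": "Recreation", "children": [
        {"name": "Fishing", "vector": Counter({"fish": 2, "rod": 1})},
        {"name": "Hunting", "vector": Counter({"rifle": 1, "deer": 1})},
    ]},
    {"name": "Science", "children": [
        {"name": "Physics", "vector": Counter({"energy": 1, "particle": 1})},
    ]},
]}
post = Counter({"fish": 1, "rod": 1, "lake": 1})
score_tree(tree, post)
print(best_path(tree))
```

Because a parent's score is the maximum over its descendants, following the highest-scoring child at each level always leads down to the single best-matching leaf.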

This is a fairly simplistic approach but seems to work pretty well. I will be presenting some results based on this categorizer in future posts over the next little while.

Topic Hierarchy

1 Arts
    2 Design Arts
        3 Architecture
        3 Fashion
        3 Crafts
        3 Sculpture
    2 Performing and Literary Arts
        3 Theater
        3 Music
        3 Motion Pictures
        3 Dance
        3 Literature
    2 Visual Arts
        3 Painting
        3 Photography
        3 Drawing
        3 Animation
        3 Comics
        3 Television
1 Economy
    2 Business
        3 Employment
        3 Marketing
        3 Intellectual Property
        3 Management
    2 Industry
        3 Manufacturing
        3 Services
        3 Trade
        3 Utilities
    2 Finance
        3 Accounting
        3 Investment
        3 Taxation
        3 Banking
1 Recreation
    2 Sports
        3 Racing
        3 Basketball
        3 Hockey
        3 Football
        3 Soccer
    2 Games
        3 Computer Games
        3 Game of Skill
        3 Strategy Game
        3 Game of Chance
    2 Outdoors
        3 Fishing
        3 Hunting
        3 Sailing
1 Science
    2 Natural Science
        3 Mathematics
        3 Biology
        3 Medicine
        3 Physics
        3 Chemistry
        3 Astronomy
    2 Social Science
        3 Education
        3 History
        3 Geography
        3 Linguistics
        3 Psychology
1 Society
    2 Family
        3 Interpersonal relationship
        3 Marriage
        3 Children
        3 House
    2 Government
        3 Politics
        3 Law
        3 Crime
        3 Military
    2 Daily Life
        3 Religion
        3 Food
        3 Holidays
        3 Pets
1 Technology
    2 Information Technology
        3 Computers
        3 Software
        3 World Wide Web
        3 Internet
        3 Communications
    2 Engineering
        3 Civil Engineering
        3 Aerospace Engineering
        3 Electrical Engineering
        3 Mechanical Engineering
        3 Materials Engineering
        3 Industrial Engineering

 

