SEC Document Clustering

David Masad & Sanjay Nayar

Each dot below represents a corporation. The closer two dots are in a given year, the more similar the text of the Management's Discussion and Analysis sections of their SEC 10-K filings for that year. Click and drag to select companies in one year, and the same companies will be highlighted in all other years.

Text distance between firms' 10-K filings, 1995-2010


Every American public corporation is required to file a 10-K report annually with the Securities and Exchange Commission. The structure of these reports is mandated by statute and includes Item 7, Management's Discussion and Analysis, which describes the firm's past, present and future operations and performance.

Kogan et al. have demonstrated that the language used in Item 7 is salient to company operations and can be used to predict the volatility of the company's stock. We build on this and hypothesize that the similarity of Item 7 text between firms indicates an underlying connection within the market.


We analyze 10-K filings for the firms of the S&P 1500, a combination of large-, mid- and small-cap companies that provides a good sample of the entire market. Electronic filings from 1993 to the present are available online from the SEC's EDGAR system. We downloaded the files using Amazon EC2 instances and stored them on Amazon S3. We parsed the files in parallel using Python code running on PiCloud, and analyzed the resulting text using a mixture of Apache Mahout, running both locally and on Amazon EMR, and custom Python code on PiCloud. This visualization contains the subset of firms filing 10-Ks, excluding 10-K/A and other variant forms.

Our analysis treated the text as a bag of words, looking at unigrams and bigrams. Document distance was computed using Jaccard similarity. The visualization above was created via dimensionality reduction of the distance matrix for each year. Points are placed on two arbitrary dimensions, one per company, such that the Euclidean distance between any two points approximates the distance between the companies' documents. The data was processed in Python and R, and visualized using D3.js.
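The pipeline described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not our production code: the tokenizer is a simple lowercase regex, distance is taken as one minus Jaccard similarity over the set of unigrams and bigrams, and classical (Torgerson) multidimensional scaling stands in for the dimensionality reduction step. The function names (`term_set`, `jaccard_distance`, `distance_matrix`, `classical_mds_2d`) are hypothetical, and the eigenvectors are found by power iteration only to keep the sketch dependency-free.

```python
import math
import re


def term_set(text):
    """Unigrams and bigrams of a text, as a set (bag-of-words view)."""
    words = re.findall(r"[a-z]+", text.lower())
    bigrams = {" ".join(pair) for pair in zip(words, words[1:])}
    return set(words) | bigrams


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def distance_matrix(texts):
    """Symmetric matrix of pairwise Jaccard distances between texts."""
    sets = [term_set(t) for t in texts]
    return [[jaccard_distance(a, b) for b in sets] for a in sets]


def _dominant_eigenpair(m, iterations=1000):
    """Power iteration; assumes m is symmetric positive semidefinite."""
    v = [float(i + 1) for i in range(len(m))]  # fixed, non-constant start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(x * y for x, y in zip(row, v)) for row in m]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(x * y for x, y in zip(row, v)) for row in m]
    return sum(x * y for x, y in zip(v, mv)), v  # Rayleigh quotient, vector


def classical_mds_2d(dist):
    """Embed points in 2-D so Euclidean distances approximate `dist`.

    Assumes `dist` is a non-empty, symmetric matrix with a zero diagonal.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    # A Gershgorin shift makes the matrix positive semidefinite, so power
    # iteration finds the algebraically largest eigenvalues of B.
    shift = max(sum(abs(x) for x in row) for row in b)
    m = [[b[i][j] + (shift if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    coords = [[0.0, 0.0] for _ in range(n)]
    for axis in range(2):
        value, vec = _dominant_eigenpair(m)
        scale = math.sqrt(max(value - shift, 0.0))
        for i in range(n):
            coords[i][axis] = vec[i] * scale
        # Deflate: remove the component just found before the next axis.
        m = [[m[i][j] - value * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return coords
```

Calling `classical_mds_2d(distance_matrix(texts))` on one year's Item 7 texts would give one `[x, y]` pair per firm. The orientation and sign of the axes carry no meaning, which is why only the distances between points should be interpreted.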


We note that the U-shaped projection emerges as the number of firms in the data grows. This indicates a robust overall structure, particularly since the shape is consistent despite individual firms' changing positions relative to one another. We also note that clusters of firms within a specific year's projection tend to appear in similar positions in the years immediately preceding or following, and to grow apart in more distant years. This strongly suggests that firm position is not random, since otherwise we would not see cluster consistency between years. However, the fact that firm clusters are not constant suggests that the clustering reflects specific firm and market conditions, rather than inherent similarity in sector or other largely fixed attributes.
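One hedged way to put a number on this observation is to pick a group of firms that cluster in one year and track how spread out the same firms are in every other year. The sketch below is not part of our analysis. It assumes a hypothetical `layouts` mapping from year to `{firm: (x, y)}` coordinates, and uses mean pairwise distance because that quantity is unaffected by the arbitrary orientation of each year's axes.

```python
import math
from itertools import combinations


def group_spread(coords, firms):
    """Mean pairwise Euclidean distance among `firms` present in `coords`."""
    present = [f for f in firms if f in coords]
    pairs = list(combinations(present, 2))
    if not pairs:
        return None  # fewer than two of the selected firms filed that year
    return sum(math.dist(coords[a], coords[b]) for a, b in pairs) / len(pairs)


def spread_by_year(layouts, firms):
    """Map each year to the spread of the selected firms in that year."""
    return {year: group_spread(coords, firms)
            for year, coords in sorted(layouts.items())}
```

If the pattern described above holds, the spread for a cluster selected in a given year should be smallest in that year, stay low in adjacent years, and rise in years further away.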