Analysing Corporate Communications

Detecting the structure of annual financial reports and extracting their contents for further corpus analysis has never been easier. The UCREL Corporate Financial Information Environment (CFIE) project and CASS’ Corporate Communications sub-project have now released the CFIE-FRSE Final Report Structure Extractor: a desktop application that detects the structure of UK annual reports and extracts their contents at section level. This extraction step is vital for the analysis of UK reports, which adopt a much more flexible structure than the equivalent US 10-Ks. The CFIE-FRSE tool is a desktop version of our CFIE-FRSE Web tool.

The tool provides batch extraction and analysis of the content of English-language PDF annual reports. Crucially, our approach preserves the structure of the underlying report (as represented by the document table of contents). It therefore offers clear delineation between the narrative and financial statement components, and facilitates analysis of the narrative component on a schedule-by-schedule basis.

The tool was trained on more than 10,000 UK annual reports, and its extraction accuracy exceeds 95% against manual validations. Large-sample tests confirm that extracted content varies predictably with economic and regulatory factors.

Accessing the tool:

The tool is available for direct download from the GitHub link below:

GitHub Repository:

The CFIE-FRSE tool:

  • Detects the structure of UK annual reports by identifying the key sections and their start and end pages, and extracts their contents.
  • Extracts the text of each section in plain text file format.
  • Splits the text of each section into sentences using the Stanford Sentence Splitter.
  • Provides a Section Classification mechanism to detect the type of the extracted section.
  • Each extracted section is annotated with a number between 0 and 8, as follows:

    Header Type   Header
    1             Chairman’s Statement
    2             CEO Review
    3             Corporate Governance Report
    4             Directors’ Remuneration Report
    5             Business Review
    6             Financial Review
    7             Operating Review
    8             Highlights
    0             A section that is none of the above

  • The tool uses Levenshtein distance, other similarity metrics and synonyms for section classification. For example, “Chairman’s letter” and “letter to shareholders” can still be detected as Type 1 sections (Chairman’s Statement).
  • The analysis results for the uploaded files or reports can be found in a subdirectory that follows the pattern “FileName_Analysis”.
    • For example, if you upload a file called XYZCompany.pdf, the results will be in a subdirectory called XYZCompany_Analysis.
    • Analysis outputs are saved in Comma Separated Value (CSV) file format which can be opened using any spreadsheet editor.
    • The tool provides additional fields in the Sections_Frequencies.csv file, which can be found in the Analysis subdirectory. The new fields are:
      • Start and end pages of each section.
      • Readability of each extracted section, and of the whole report, using the Fog and Flesch readability metrics.
      • Keyword frequencies, using a preloaded set of keywords for Forward Looking, Positivity, Negativity and Uncertainty.
      • Report Year: this only works if the year is part of the file name, e.g. “XYZCompany_2015.pdf”.
      • Performance Flag: shows 1 if a section is a performance section, 0 otherwise.
      • Strategy Flag: shows 1 if a section is a strategy section, 0 otherwise.
      • Booklet Flag: shows 1, 2 or 3 if a booklet layout is detected, 0 otherwise. The tool is unable to process booklet annual reports (reports where two pages are combined into one PDF page). The values 1 to 3 indicate how confident the system is: 1 means a booklet layout is suspected, 3 means the layout is definitely a booklet.
    • The keyword lists (Forward Looking, Uncertainty, Positivity and Negativity) have been updated to eliminate duplicates and encoding errors.
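
To make the section classification step more concrete, here is a minimal Python sketch of how a section header could be mapped to one of the type codes above using Levenshtein distance and a synonym list. It is an illustration only, not the tool's actual implementation: the function names, the normalised-distance threshold of 0.25 and the contents of the SYNONYMS dictionary (beyond the two examples mentioned above) are assumptions made for this example.

```python
from typing import Dict, List, Optional

# Section type codes, taken from the list above.
SECTION_TYPES: Dict[int, str] = {
    1: "Chairman's Statement",
    2: "CEO Review",
    3: "Corporate Governance Report",
    4: "Directors Remuneration Report",
    5: "Business Review",
    6: "Financial Review",
    7: "Operating Review",
    8: "Highlights",
}

# Hypothetical synonym list; only the two Type 1 examples come from the text.
SYNONYMS: Dict[int, List[str]] = {
    1: ["Chairman's letter", "Letter to shareholders"],
}


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalise(header: str) -> str:
    """Lower-case, unify apostrophes and collapse whitespace."""
    return " ".join(header.lower().replace("\u2019", "'").split())


def classify_header(header: str, max_ratio: float = 0.25) -> int:
    """Return the closest section type, or 0 if nothing is close enough.

    max_ratio is an assumed cut-off on edit distance divided by the
    length of the longer string.
    """
    text = normalise(header)
    best_type = 0
    best_ratio: Optional[float] = None
    for type_id, name in SECTION_TYPES.items():
        for candidate in [name] + SYNONYMS.get(type_id, []):
            cand = normalise(candidate)
            ratio = levenshtein(text, cand) / max(len(text), len(cand), 1)
            if ratio <= max_ratio and (best_ratio is None or ratio < best_ratio):
                best_type, best_ratio = type_id, ratio
    return best_type
```

With these assumptions, classify_header("Letter to Shareholders") returns 1, while a header that resembles none of the known names, such as "Notes to the Accounts", falls through to type 0.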

How to run the software:

  • [MS Windows]: Clone the repository to your machine, place your PDF annual reports in the pdfs directory and run (double-click) the runnable.bat file.
  • [Linux/Ubuntu]: Clone the repository to your machine and place your PDF annual reports in the pdfs directory. Then cd to the directory where the tool's run script is located and run it by typing ./ followed by the script name.
  • [Unix/Mac]: Clone the repository to your machine and place your PDF annual reports in the pdfs directory. Then cd to the directory where the tool's run script is located and run it by typing sh or bash followed by the script name.
  • The analysis output directories (one for each PDF file) will be found in the pdfs directory.
  • Please do not delete any of the files or directories or change their structure.
  • You can add or delete PDF files in the pdfs directory. You can also edit the userKeywords.txt file to include your own keyword list: simply empty the file and insert one keyword (or key phrase) on each line.
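
As an illustration of the keyword file format described above (one keyword or key phrase per line), the following Python sketch parses such a list and counts whole-word, case-insensitive occurrences in a piece of section text. It is not the tool's own code: the function names and the matching rules (case folding, whole-word matching) are assumptions made for this example, and the tool itself may count keywords differently.

```python
import re
from typing import Dict, List


def load_keywords(file_contents: str) -> List[str]:
    """Parse keyword-file contents: one keyword or key phrase per line.

    Blank lines are skipped and entries are lower-cased (an assumption).
    """
    return [line.strip().lower()
            for line in file_contents.splitlines()
            if line.strip()]


def count_keywords(section_text: str, keywords: List[str]) -> Dict[str, int]:
    """Count whole-word occurrences of each keyword or phrase in the text."""
    lowered = section_text.lower()
    counts: Dict[str, int] = {}
    for keyword in keywords:
        # Lookarounds stop "risk" from matching inside "risks" or "brisk".
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        counts[keyword] = len(re.findall(pattern, lowered))
    return counts


if __name__ == "__main__":
    # Example contents in the userKeywords.txt layout (hypothetical keywords).
    keyword_file = "expect\ngoing forward\n\nrisk\n"
    section = ("We expect growth going forward. "
               "Going forward, risk remains; risks are monitored.")
    print(count_keywords(section, load_keywords(keyword_file)))
```

Whole-word matching is a deliberate choice in this sketch: it keeps a short keyword from being counted inside longer words, at the cost of missing inflected forms unless they are listed on their own lines.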

Related Papers:

  • El-Haj, Mahmoud, Rayson, Paul, Young, Steven, and Walker, Martin. “Detecting Document Structure in a Very Large Corpus of UK Financial Reports”. In The 9th edition of the Language Resources and Evaluation Conference, 26-31 May 2014, Reykjavik, Iceland.
    Available at:


The tool is available under the GNU General Public License.

More about the CFIE research:

For more information about the projects’ output, web-tools, resources and contact information, please visit our page below: