Analysing Corporate Communications

Detecting the structure of annual financial reports and extracting their contents for further corpus analysis has never been easier. The UCREL Corporate Financial Information Environment (CFIE) project and CASS’ Corporate Communications sub-project have now released the CFIE-FRSE Final Report Structure Extractor: a desktop application that detects the structure of UK Annual Reports and extracts the reports’ contents at section level. This extraction step is vital for the analysis of UK reports, which adopt a much more flexible structure than the equivalent US 10-Ks. The CFIE-FRSE tool works as a desktop version of our CFIE-FRSE Web tool.

The tool provides batch extraction and analysis of PDF annual report content for English. Crucially, our approach preserves the structure of the underlying report (as represented by the document table of contents) and therefore offers clear delineation between the narrative and financial statement components, as well as facilitating analysis of the narrative component on a schedule-by-schedule basis.

The tool was trained using more than 10,000 UK annual reports. Extraction accuracy exceeds 95% against manual validations, and large-sample tests confirm that extracted content varies predictably with economic and regulatory factors.

Accessing the tool:

The tool is available for direct download from the GitHub link below:

GitHub Repository:

The CFIE-FRSE tool:

  • Detects the structure of UK Annual Reports by identifying the key sections and their start and end pages, and extracts their contents.
  • Extracts the text of each section in a plain text file format.
  • Splits the text of each section into sentences using Stanford Sentence Splitter.
  • Provides a Section Classification mechanism to detect the type of the extracted section.
  • Each extracted section will be annotated with a header type between 0 and 8, as follows:

    Header Type   Header
    1             Chairman’s Statement
    2             CEO Review
    3             Corporate Governance Report
    4             Directors’ Remuneration Report
    5             Business Review
    6             Financial Review
    7             Operating Review
    8             Highlights
    0             A section that is none of the above

  • The tool uses Levenshtein distance, other similarity metrics and synonyms for section classification. For example, “Chairman’s letter” and “letter to shareholders” can still be detected as a Type 1 section (Chairman’s Statement).
  • The analysis results of the uploaded files or reports can be found in a subdirectory that follows the pattern “FileName_Analysis”.
    • For example, if you are uploading a file called XYZCompany.pdf, the results will be in a subdirectory called XYZCompany_Analysis.
    • Analysis outputs are saved in Comma Separated Value (CSV) file format, which can be opened using any spreadsheet editor.
    • The tool provides more fields in the Sections_Frequencies.csv file, which can be found in the Analysis subdirectory. The new fields are:
      • Start and end pages of each section.
      • Readability of the extracted sections, in addition to the whole report, using the Fog and Flesch readability metrics.
      • Keyword frequencies, using a preloaded set of keywords for Forward Looking, Positivity, Negativity and Uncertainty.
      • Report Year: this will only work if the year is part of the file name, e.g. “XYZCompany_2015.pdf”.
      • Performance Flag: Shows 1 if a section is a performance section, 0 otherwise.
      • Strategy Flag: Shows 1 if a section is a strategic section, 0 otherwise.
      • Booklet Flag: Shows 1, 2 or 3 if a booklet layout is detected, 0 otherwise. Our tool is unable to process booklet annual reports (those reports where two pages are combined into one PDF page). The numbers 1-3 indicate how confident the system is: 1 means a booklet layout is suspected, 3 means the layout is definitely a booklet.
    • The keyword lists (Forward Looking, Uncertainty, Positivity and Negativity) have been updated to eliminate duplicates and encoding errors.
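
The section classification step described above can be illustrated with a minimal Python sketch. It combines only Levenshtein distance with a small synonym table; the tool's other similarity metrics are not reproduced here. The names `SECTION_SYNONYMS`, `classify_header` and the `max_ratio` threshold are hypothetical choices made for illustration, not taken from the CFIE-FRSE code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical synonym table: header type -> wordings treated as equivalent.
SECTION_SYNONYMS = {
    1: ["chairman's statement", "chairman's letter", "letter to shareholders"],
    2: ["ceo review", "chief executive's review"],
    6: ["financial review"],
    8: ["highlights"],
}


def normalise(header: str) -> str:
    """Lower-case, unify apostrophes and collapse whitespace."""
    return " ".join(header.lower().replace("’", "'").split())


def classify_header(header: str, max_ratio: float = 0.25) -> int:
    """Return the closest header type, or 0 if nothing is similar enough.

    max_ratio is an assumed threshold on distance / longer string length.
    """
    text = normalise(header)
    best_type, best_ratio = 0, max_ratio
    for section_type, names in SECTION_SYNONYMS.items():
        for name in names:
            ratio = levenshtein(text, name) / max(len(text), len(name))
            if ratio <= best_ratio:
                best_type, best_ratio = section_type, ratio
    return best_type


print(classify_header("Letter to Shareholders"))        # synonym of type 1
print(classify_header("Chairmans Statment"))            # misspelt type 1
print(classify_header("Independent Auditor's Report"))  # no close match
```

Normalising the distance by the length of the longer string lets a single threshold apply to short headers such as "Highlights" and long ones alike; a real system would probably tune this value against manually labelled headers.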

How to run the software:

  • [MS Windows]: To run the tool, simply clone the repository to your machine, place your pdf annual reports in the pdfs directory and run (double click) the runnable.bat file.
  • [Linux/Ubuntu]: To run the tool, simply clone the repository to your machine, place your pdf annual reports in the pdfs directory and run the shell script. Simply cd to the directory where the script is located and type ./ followed by the script file name.
  • [Unix/Mac]: To run the tool, simply clone the repository to your machine, place your pdf annual reports in the pdfs directory and run the shell script. Simply cd to the directory where the script is located and type sh or bash followed by the script file name.
  • The analysis output directory (a directory for each PDF file) will be found in the PDF directory.
  • Please do not delete any of the files or directories or change their structure.
  • You can add or delete PDF files from the PDF directory, and you can also edit the userKeywords.txt file to include your own keyword list: simply empty the file and insert one keyword (or keyphrase) on each line.
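
As a rough illustration of the keyword list format described above (one keyword or keyphrase per line), the following Python sketch parses such a list and counts occurrences in a piece of extracted text. The function names `parse_keywords` and `keyword_frequencies` are hypothetical, and whole-word, case-insensitive matching is an assumption; the tool's own matching rules may differ.

```python
import re
from collections import Counter
from typing import List


def parse_keywords(file_content: str) -> List[str]:
    """Read one keyword or keyphrase per line, ignoring blank lines."""
    return [line.strip().lower()
            for line in file_content.splitlines()
            if line.strip()]


def keyword_frequencies(section_text: str, keywords: List[str]) -> Counter:
    """Count whole-word, case-insensitive matches of each keyword or keyphrase."""
    text = section_text.lower()
    counts = Counter()
    for keyword in keywords:
        # Allow any run of whitespace between the words of a keyphrase.
        body = r"\s+".join(re.escape(word) for word in keyword.split())
        counts[keyword] = len(re.findall(r"\b" + body + r"\b", text))
    return counts


# Example content in the style of userKeywords.txt (illustrative keywords only).
keywords = parse_keywords("expect\n\ngoing forward\nrisk\n")
sample = "We expect growth. Going forward, we expect lower risk; risks remain."
print(keyword_frequencies(sample, keywords))
```

With whole-word matching, "risks" is not counted as an occurrence of "risk"; adding inflected forms to the list as separate lines is one way to capture them.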

Related Papers:

  • El-Haj, Mahmoud, Rayson, Paul, Young, Steven, and Walker, Martin. “Detecting Document Structure in a Very Large Corpus of UK Financial Reports”. In The 9th edition of the Language Resources and Evaluation Conference, 26-31 May 2014, Reykjavik, Iceland.
    Available at:


The tool is available under the GNU General Public License.

More about the CFIE research:

For more information about the projects’ output, web-tools, resources and contact information, please visit our page below:


Recent Research into CEO Compensation

On Wednesday 18th January, the CFA Society United Kingdom (CFA UK) hosted a breakfast meeting at Innholders’ Court (London, EC4R 2RH) to discuss the findings of a recently completed CFA UK-funded research project examining CEO compensation across the FTSE-350 from 2003 to 2015. CFA UK represents the interests of around 12,000 investment professionals in the UK, and the report received widespread press coverage over the Christmas period, including coverage from the BBC, The Times, The Guardian, and the Financial Times.

The report (co-authored with Dr Weijia Li, Lancaster University Management School, and available to download) contributes to the executive remuneration debate by providing independent statistical evidence highlighting a limited association between economic value creation and executive pay.

Among other findings, the research suggests that despite relentless pressure from regulators and governance reformers over the last two decades to ensure closer alignment between executive pay and performance, the association between CEO pay and fundamental value creation in the UK remains weak at best.

At the heart of the problem is the disconnect between the performance measures that are widely employed in executive remuneration contracts, such as earnings per share (EPS) growth and total shareholder return (TSR), and the extent to which these metrics provide reliable information on periodic value creation. Economic theory clearly demonstrates that EPS growth and TSR provide poor proxies for value creation, and this insight is confirmed in the data: correlations below 30% are documented between these measures and more sophisticated value-based performance metrics, such as residual income and economic profit, that include an explicit charge for invested capital.

The work also reveals that mandatory pay-related annual report disclosures designed to enhance the transparency of executive remuneration arrangements have become increasingly complicated and hard to read (measured by the Fog index), to the extent that even relatively sophisticated consumers of firms’ published reports struggle to identify basic information such as total compensation paid to the CEO during the reporting period.
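
For readers unfamiliar with the Fog index mentioned above, the following Python sketch shows the standard Gunning Fog formula: 0.4 × (average words per sentence + percentage of words with three or more syllables). The syllable counter is a crude vowel-group heuristic assumed here for illustration, and the function names are hypothetical; this is not the implementation used in the research.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, dropping a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fog_index(text: str) -> float:
    """Gunning Fog index; higher values indicate harder-to-read text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))


easy = "The cat sat on the mat. The dog ran."
hard = ("Remuneration committees systematically evaluate "
        "discretionary incentive arrangements.")
print(fog_index(easy), fog_index(hard))
```

The score is often read as the approximate years of formal education needed to understand the text on a first reading, so dense remuneration disclosures with long sentences and many polysyllabic terms score much higher than plain prose.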

Attendees at the event comprised representatives from a range of City institutions including CFA UK, The Investment Association, SVM Asset Management, RPMI Railpen, Schroders, PIRC, Aberdeen Asset Management, JP Morgan Asset Management, Kepler Cheuvreux, Legal & General Investment Management, Fidelity International, Willis Towers Watson, and the Pensions and Lifetime Savings Association.

Will Goodhart (Chief Executive, CFA UK) welcomed attendees and Natalie Winterfrost (Aberdeen Asset Management) provided context for the research. After a brief summary of the research purpose, methodology and main findings, plus follow-up comments from steering committee members Prof Brian Main (Edinburgh University), James Cooke (SVM Asset Management), and Alasdair Wood (Willis Towers Watson), attendees engaged in a lively discussion concerning the report’s conclusions and their implications for executive compensation policy in the UK. The discussions will help CFA UK to formulate its engagement strategy with companies and institutional investors to improve the degree of alignment between pay and value generation.

Mahmoud El-Haj has recently joined CASS, working on the ESRC-funded project “Understanding Corporate Communications”.

The project is a comprehensive analysis of the form, content and impact of communications between large, publicly traded corporations and their key stakeholder groups concerning the following three key aspects of corporate governance: i) compliance with governance requirements and recommendations (e.g. The Combined Code in the UK); ii) executive remuneration; and iii) senior management turnover.

Mahmoud is a Senior Research Associate at Lancaster University. His main research interests are natural language processing, corpus linguistics, information extraction, machine learning and computational linguistics. In his research he has worked with multidisciplinary, multilingual big data including financial narratives, news articles, medical journals, and data from the social sciences and humanities. Mahmoud is also working with the School of Computing and Communications at Lancaster University on a UCREL-funded project on VardSourcing and SenseSourcing: the use of crowdsourcing to build lexicons and check spelling variation in historical data.

Recent publications and presentations related to this project include:

El-Haj, M., Rayson, P., Young, S., and Walker, M. “Detecting Document Structure in a Very Large Corpus of UK Financial Reports”. In The 9th edition of the Language Resources and Evaluation Conference, 26-31 May 2014, Reykjavik, Iceland.

Athanasakou, V., El-Haj, M., Rayson, P., Young, S., and Walker, M. “Computer-based Analysis of the Strategic Content of UK Annual Report Narratives”. In American Accounting Association Annual Meeting, August 2-6, 2014, Atlanta, USA.

IR Group Glasgow University, 2015 / School of Computing Science: Analysing UK Annual Report Narratives using Text Analysis and Natural Language Processing, Glasgow, Scotland.

Bangor University: PhD Training Session at Bangor Business School: Analysing Annual Report Narratives (co-presented with Steve Young, LUMS, Lancaster University), 2014, Bangor, Wales.

The 8th LSE/LUMS/MBS Conference 2014 / London School of Economics: Natural Language Processing of UK Annual Report Narratives (co-presented with Paul Rayson, SCC, Lancaster University), London, England.