Analysing Corporate Communications

Detecting the structure of annual financial reports and extracting their contents for further corpus analysis has never been easier. The UCREL Corporate Financial Information Environment (CFIE) project and CASS’ Corporate Communications sub-project have now released the CFIE-FRSE Final Report Structure Extractor: a desktop application that detects the structure of UK annual reports and extracts the reports’ contents at the section level. This extraction step is vital for the analysis of UK reports, which adopt a much more flexible structure than their US equivalents (10-Ks). The CFIE-FRSE tool is a desktop version of our CFIE-FRSE web tool.

The tool provides batch extraction and analysis of English-language PDF annual report content. Crucially, our approach preserves the structure of the underlying report (as represented by the document’s table of contents) and therefore offers a clear delineation between the narrative and financial statement components, as well as facilitating analysis of the narrative component on a schedule-by-schedule basis.

The tool was trained on more than 10,000 UK annual reports. Extraction accuracy exceeds 95% against manual validation, and large-sample tests confirm that extracted content varies predictably with economic and regulatory factors.

Accessing the tool:

The tool is available for direct download from the GitHub link below:

GitHub Repository:

The CFIE-FRSE tool:

  • Detects the structure of UK annual reports by identifying the key sections and their start and end pages, and extracts their contents.
  • Extracts the text of each section to a plain-text file.
  • Splits the text of each section into sentences using the Stanford sentence splitter.
  • Provides a section classification mechanism to detect the type of each extracted section.
  • Each extracted section is annotated with a type number between 0 and 8, as follows:

    Type  Header
    1     Chairman’s Statement
    2     CEO Review
    3     Corporate Governance Report
    4     Directors’ Remuneration Report
    5     Business Review
    6     Financial Review
    7     Operating Review
    8     Highlights
    0     A section that is none of the above
  • The tool uses Levenshtein distance, other similarity metrics, and synonym lists for section classification. For example, “Chairman’s Letter” and “Letter to Shareholders” can still be detected as a Type 1 section (Chairman’s Statement).
  • The analysis results for the uploaded files or reports can be found in a subdirectory named following the pattern “FileName_Analysis”.
    • For example, if you upload a file called XYZCompany.pdf, the results will be in a subdirectory called XYZCompany_Analysis.
    • Analysis outputs are saved in Comma Separated Value (CSV) format, which can be opened with any spreadsheet editor.
    • The tool provides additional fields in the Sections_Frequencies.csv file, which can be found in the Analysis subdirectory. The new fields are:
    • Start and End pages of each section.
    • Readability scores for each extracted section, and for the whole report, using the Fog and Flesch readability metrics.
    • Keyword frequencies, using preloaded keyword lists for Forward Looking, Positivity, Negativity and Uncertainty.
    • Report Year: this only works if the year is part of the file name, e.g. “XYZCompany_2015.pdf”.
    • Performance Flag: Shows 1 if a section is a performance section, 0 otherwise.
    • Strategy Flag: Shows 1 if a section is a strategic section, 0 otherwise.
    • Booklet Flag: shows 1, 2 or 3 if the report uses a booklet layout (two pages combined into one PDF page), 0 otherwise. Our tool is unable to process booklet annual reports. The number indicates how confident the system is: 1 means a booklet layout is suspected, 3 means a booklet layout is definite.
    • The keyword lists (Forward Looking, Uncertainty, Positivity and Negativity) have been updated to eliminate duplicates and encoding errors.
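The fuzzy header matching described above can be illustrated with a short sketch. This is a hypothetical, minimal Python illustration, not the tool’s actual implementation: the real classifier combines Levenshtein distance with other similarity metrics and much larger synonym lists, and the `SECTION_TYPES` table below is abridged from the type table above.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Canonical headers and synonyms (type code -> names), abridged.
SECTION_TYPES = {
    1: ["chairman's statement", "chairman's letter", "letter to shareholders"],
    2: ["ceo review", "chief executive's review"],
    8: ["highlights", "financial highlights"],
}

def classify_header(header: str, threshold: float = 0.75) -> int:
    """Return the best-matching section type code, or 0 if no match is close enough."""
    header = header.lower().strip()
    best_type, best_score = 0, 0.0
    for code, names in SECTION_TYPES.items():
        for name in names:
            # Normalised similarity in [0, 1]: 1.0 means identical strings.
            sim = 1 - levenshtein(header, name) / max(len(header), len(name))
            if sim > best_score:
                best_type, best_score = code, sim
    return best_type if best_score >= threshold else 0

print(classify_header("Chairmans Statement"))    # -> 1 (typo still matches Type 1)
print(classify_header("Notes to the Accounts"))  # -> 0 (none of the known types)
```

A distance threshold like the 0.75 used here is an assumption for the sketch; in practice it would be tuned against validated headers.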

How to run the software:

  • [MS Windows]: Clone the repository to your machine, place your PDF annual reports in the pdfs directory, and run (double click) the runnable.bat file.
  • [Linux/Ubuntu]: Clone the repository to your machine, place your PDF annual reports in the pdfs directory, then cd to the directory where the launch script is located and run it with ./
  • [Unix/Mac]: Clone the repository to your machine, place your PDF annual reports in the pdfs directory, then cd to the directory where the launch script is located and run it with sh or bash.
  • The analysis output directories (one for each PDF file) will be created inside the pdfs directory.
  • Please do not delete any of the files or directories or change their structure.
  • You can add or delete PDF files in the pdfs directory, and you can also edit userKeywords.txt to include your own keyword list: simply empty the file and insert one keyword (or keyphrase) per line.
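As a sketch of how a one-keyword-per-line list like userKeywords.txt can be consumed, here is a hypothetical Python illustration of keyphrase frequency counting. The counting logic is our own minimal assumption for illustration, not the tool’s actual code:

```python
import re

def load_keywords(path: str) -> list[str]:
    """Read a userKeywords.txt-style list: one keyword or keyphrase per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip().lower() for line in f if line.strip()]

def keyword_frequencies(text: str, keywords: list[str]) -> dict[str, int]:
    """Count whole-word occurrences of each keyword/keyphrase in the text."""
    text = text.lower()
    freqs = {}
    for kw in keywords:
        # Word boundaries prevent 'growth' matching inside 'outgrowth'.
        pattern = r"\b" + re.escape(kw) + r"\b"
        freqs[kw] = len(re.findall(pattern, text))
    return freqs

section = "We expect growth next year. Growth forecasts remain uncertain."
print(keyword_frequencies(section, ["expect", "growth", "next year"]))
# -> {'expect': 1, 'growth': 2, 'next year': 1}
```

The same pattern applies to the preloaded Forward Looking, Positivity, Negativity and Uncertainty lists, with per-section counts written out to the CSV output.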

Related Papers:

  • El-Haj, Mahmoud, Rayson, Paul, Young, Steven, and Walker, Martin. “Detecting Document Structure in a Very Large Corpus of UK Financial Reports”. In Proceedings of the 9th Language Resources and Evaluation Conference (LREC 2014), 26–31 May 2014, Reykjavik, Iceland.
    Available at:


The tool is available under the GNU General Public License.

More about the CFIE research:

For more information about the project’s outputs, web tools, resources and contact information, please visit our page below:


Textual analysis training for European doctoral researchers in accounting

Professor Steve Young (Lancaster University Management School, and PI of the CASS ESRC-funded project Understanding Corporate Communications) was recently invited to the 6th Doctoral Summer Program in Accounting Research (SPAR) to deliver sessions on the textual analysis of financial reporting. The invitation reflects the increasing interest in narrative reporting among accounting researchers.

The summer program was held at WHU – Otto Beisheim School of Management (Vallendar, Germany) on 11–14 July 2016.

Professor Young was joined by Professors Mary Barth (Stanford University) and Wayne Landsman (University of North Carolina at Chapel Hill), whose sessions covered a range of current issues in empirical financial reporting research, including disclosure and the cost of capital, fair value accounting, and comparative international financial reporting. Students also benefitted from presentations by Prof. Dr. Andreas Barckow (President, Accounting Standards Committee of Germany) and Prof. Dr. Sven Hayn (Partner, EY Germany).

The annual SPAR training event was organised jointly by the Ludwig Maximilian University of Munich School of Management and the WHU – Otto Beisheim School of Management. The programme attracts the top PhD students in accounting from across Europe with the aim of introducing them to cutting-edge theoretical, methodological, and practical issues involved in conducting high-quality financial accounting research. This year’s cohort comprised 31 carefully selected students from Europe’s leading business schools.

Professor Young delivered four sessions on textual analysis. Sessions 1 & 2 focused on the methods currently applied in accounting research and the opportunities associated with applying more advanced approaches from computational linguistics and natural language processing. The majority of extant work in mainstream accounting research relies on bag-of-words methods (e.g., dictionaries, readability, and basic machine learning applications) to study the properties and usefulness of narrative aspects of financial communications. Significant opportunities therefore exist for accounting researchers to apply textual analysis methods that are mainstream in computational linguistics, including part-of-speech tagging, semantic analysis, topic models, summarization, text mining, and corpus methods.

Sessions 3 & 4 reviewed the extant literature on automated textual analysis in accounting and financial communication. Session 3 concentrated on earnings announcements and annual reports. Research reveals that narrative disclosures are incrementally informative beyond quantitative data for stock market investors, particularly where traditional accounting data provide an incomplete picture of firm performance and value. Nevertheless, evidence also suggests that managers use narrative commentaries opportunistically when the incentives to do so are high. Session 4 reviewed research on other aspects of financial communication, including regulatory information (e.g., surrounding mergers and acquisitions (M&A) and initial public offerings (IPOs)), conference calls, analysts’ reports, financial media, and social media. Evidence consistently indicates that financial narratives contain information that is not captured by quantitative results.

Slides for all four sessions are available here.

The event was a great success. Students engaged actively in all sessions (including presentations and discussions of published research using textual analysis methods). New research opportunities were explored involving the analysis of new financial reporting corpora and the application of more advanced computational linguistics methods. Students also received detailed feedback from faculty on their research projects, a significant number of which involved application of textual analysis methods. Special thanks go to Professor Martin Glaum and his team at WHU for organizing and running the summer program.

40th Anniversary of the Language and Computation Group


Recently I was given the chance to attend the 40th anniversary of the Language and Computation (LAC) group at the University of Essex. As an Essex alumnus I was invited to present my work with CASS on Financial Narrative Processing (FNP), part of the ESRC-funded project. Slides are available online here.

The event celebrated 40 years of the Language and Computation (LAC) group: an interdisciplinary group created to foster interaction between researchers working on Computational Linguistics within the University of Essex.

There were 16 talks by University of Essex alumni and connections, including Yorick Wilks, Patrick Hanks, Stephen Pulman and Anne de Roeck.

The two-day workshop started with Doug Arnold from the Department of Language and Linguistics at Essex. He began by presenting the history of the LAC group, which started with the arrival of Yorick Wilks in the late 70s, joined by others from Language and Linguistics including Stephen Pulman, Mike Bray, Ray Turner and Anne de Roeck. According to Doug, the introduction of the cognitive studies centre and the Eurotra project in the 80s led to the Computational Linguistics MA, paving the way for the emergence of Language and Computation. Something I had always wondered about.

The workshop covered the beginnings of some of the most influential conferences and associations in computational linguistics, such as COLING, EACL and ESSLLI. It also showed the influence of world events around that period and the struggles researchers and academics went through, especially during the cold war and the many university crises around the UK during the 80s and 90s. Having finished my PhD in 2012, it had never crossed my mind how difficult it must have been for researchers and academics to progress under such trying circumstances.

Doug went on to point out how the introduction of the World Wide Web in the mid 90s and the development of technology and computers helped to rapidly advance and reshape the field. This helped close the gap between computation and linguistics, and eased the problem of field identity between computational linguists coming from a computing or a linguistics background. We now live surrounded by fast technologies and solid network infrastructure, which makes communications and data processing a problem no more. I was astonished when Stephen Pulman mentioned how they used to wait a few days for the only machine in the department to compile a few lines of LISP code.

The rise of big data processing around 2010, and the rapid need for resourcing, crowd-sourcing and interpreting big data, added more challenges but also interesting opportunities for computational linguists. Something I very much agree with, considering the vast amount of data available online these days.

Doug ended his talk by pointing out that, in general, Computational Linguistics is a difficult field: computational linguists are expected to be experts in many areas, and so training computational linguists is deemed a challenging and difficult task. As a computational linguist this rings true. For example, as someone from a computing background, I find it difficult to understand how part-of-speech taggers work without being versed in the grammatical aspects of the language under study.

Doug’s talk was followed by compelling and very informative talks from Yorick Wilks, Mike Rosner and Patrick Hanks.

Yorick opened with “Linguistics is still an interesting topic”, narrating his experience of moving from Linguistics towards Computing and the challenge imposed by the UK system compared to other countries such as France, Russia and Italy, where Chomsky had little influence. This reminded me of Peter Norvig’s response to Chomsky’s criticism of empirical methods, in which he said: “I think Chomsky is wrong to push the needle so far towards theory over facts”.

In his talk, Yorick referred to Lancaster University and the remarkable work by Geoffrey Leech on building the CLAWS tagger, one of the earliest statistical taggers ever to reach the USA.

“What is meaning?” was the opening of Patrick Hanks’ talk, which went on to discuss word ambiguity: “most words are hopelessly ambiguous!”. Patrick briefly discussed the ‘double helix’ rule system, or the Theory of Norms and Exploitations (TNE), which enables creative use of language when speakers and writers make new meanings, while at the same time relying on a core of shared conventions for mutual understanding. His work on patterns and phraseology is of great interest in attempting to answer the question “why does this perfectly valid English sentence fit a single pattern?”.

This was followed by interesting talks from ‘Essexians’ working in different universities and firms across the globe, covering recent work in Computational Linguistics (CL), Natural Language Processing (NLP) and Machine Learning (ML). One of these was collaborative work between Essex University and Signal, a startup company in London.

The event closed with more socialising, drinks and dinner at a Nepalese restaurant in Colchester, courtesy of the LAC group.

In general I found the event very interesting, well organised, and rich in historical evidence on the beginnings of Language and Computation. It was also of great interest to hear about current work and the state of the art in CL, NLP and ML presented by the event’s attendees.

I would very much like to thank the Language and Computation group at the University of Essex for the invitation, and for their time and effort in organising this wonderful event.

Mahmoud El-Haj

Senior Research Associate

CASS, Lancaster University