One million news project
(and counting)

We have just released a first batch of News Inflation Pressure Indices and News Volume Indices. Here is a quick explanation of how the data is compiled.

Data sources

The input news stories are sourced from news websites and blogs. We use several APIs and our own scraping algorithms to build up history. Overall, our models have processed over a million news stories potentially dealing with inflation in the last three years.

Of these 1 million news stories, our main classification model has selected around 56 thousand English-language stories for their relevance to the near-term inflation outlook. We are in the process of expanding both the languages, to cover more countries, and the history, to go further back than 2017.

See also this high-level description of all the language models mentioned in this article.

Which news stories are deemed relevant? The criterion is relevance to the near-term inflation forecast. We have trained state-of-the-art language models to replicate, in their own way, what an inflation forecaster does. The idea is to catch utility price changes, airfares, phone plan adjustments, etc.: all those events which will have an effect on the near-term inflation forecast. This type of news makes the inflation forecast, even more so than plain time series do.

Tidying up

Data cleaning is not a minor step when we deal with unstructured data.

Here are just a couple of data cleaning operations we have found important:

Taking self-referencing stories out

A significant share of inflation-related news deals with official releases. These stories report recent developments in CPI, import prices, producer prices and other data published by the statistical offices.

We have trained our language models to take these stories out.

The whole point of the exercise is to focus on new news that could have some impact on near-term inflation. We do not want lagging information - and even less so an outright spurious indicator.

Duplicate removal

We collect information from many different sources. Inevitably, the same news story occasionally reaches us from slightly different URLs, sometimes with a slightly different title or source name.

It is critical to be able to distinguish between unwanted duplicates and genuine news propagation across sources. We also have language models specifically trained to do that in real time.
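To make the duplicate problem concrete, here is a minimal sketch of near-duplicate detection. It is not the language-model approach described above: it swaps in a much simpler technique, Jaccard similarity over word shingles, and it only flags near-identical text rather than separating duplicates from genuine propagation. The function names, the shingle size and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased, punctuation-free text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Very short texts: use the whole word sequence as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.8):
    """Flag two texts as near-duplicates (threshold is an illustrative assumption)."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```

Two copies of a headline that differ only in case or punctuation share all their shingles and are flagged, while unrelated headlines share none.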

Adding color: the sign, sector and location detection

The selected news stories are processed through a number of additional algorithms which detect the sign of the news (does it mean inflation will go up, down or stay the same?), the sector (utilities, food, airfares, etc.) and the location. To do that, we have trained our own classification and Named Entity Recognition models.

NIPI: News Inflation Pressure Indices

Now that we are able to identify relevant news stories and process them, we can aggregate and quantify this information.

The NIPIs are a balance of positive and negative news, in a given region and/or sector.

They can be read like a PMI (or ISM) index. The indices are normalised based on historical news volumes. A value of 50 corresponds to an equal volume of positive and negative news, while a value close to 100 (or 0) indicates that the one-way volume of positive (or negative) news is at its historical maximum.

From a statistical point of view, the NIPIs are effectively a balance of entropy measures: the sum of the probabilities of positive news minus the sum of the probabilities of negative news.

The NIPI on a given date aggregates the news of the previous 30 days up to that date.
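The description above can be sketched in a few lines of code. This is one possible reading, not the production formula: the names `ScoredStory`, `nipi` and `historical_max` are hypothetical, "positive" is assumed to mean upward inflation pressure, the window is assumed to exclude the day exactly 30 days back, and the linear mapping 50 + 50 * balance / historical_max is an assumption consistent with 50 being neutral and 0 or 100 being the historical one-way extremes.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ScoredStory:
    published: date
    p_positive: float  # assumed: model probability of upward inflation pressure
    p_negative: float  # assumed: model probability of downward inflation pressure


def nipi(stories, as_of, historical_max, window_days=30):
    """Balance of positive and negative news over the trailing window, on a 0-100 scale.

    historical_max is the largest one-way balance seen in history (an assumed
    normalisation constant); it must be strictly positive.
    """
    if historical_max <= 0:
        raise ValueError("historical_max must be positive")
    start = as_of - timedelta(days=window_days)
    window = [s for s in stories if start < s.published <= as_of]
    balance = sum(s.p_positive for s in window) - sum(s.p_negative for s in window)
    value = 50.0 + 50.0 * balance / historical_max
    # Clamp so that a new record volume does not leave the 0-100 range.
    return max(0.0, min(100.0, value))
```

With one clearly positive and one clearly negative story in the window, the balance is zero and the index sits at 50; stories older than the window do not contribute.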

NVI: News Volume Indices

While the NIPI provides information on the sign of the news flow, the NVI indicates the volume of the news flow.

The NVIs are simply the total number of news stories associated with a theme and/or location, scaled by the historical volume. A value of 100 means the news volume is 3 standard deviations above the historical average.
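As a rough illustration of this scaling, the sketch below maps a news count to an index through its z-score. Only one point of the scale is stated above (100 at three standard deviations above the historical average); placing the historical average at 0, flooring the index there, and using the population standard deviation are assumptions made for this example, and the function name `nvi` is hypothetical.

```python
from statistics import mean, pstdev


def nvi(current_volume, historical_volumes):
    """Scale a news count so that 3 standard deviations above average maps to 100.

    Assumptions (not from the article): the historical average maps to 0,
    values below average are floored at 0, and values above 100 are allowed.
    """
    mu = mean(historical_volumes)
    sigma = pstdev(historical_volumes)
    if sigma == 0:
        raise ValueError("historical volumes have zero variance")
    z_score = (current_volume - mu) / sigma
    return max(0.0, 100.0 * z_score / 3.0)
```

For a history with mean 10 and standard deviation 2, a count of 16 is three standard deviations above average and gives an index of 100.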

Data availability

The data is produced daily. Subscribers to our dataset can access it daily through an API, or weekly through emails and auto-generated dashboards.

We have just released this sample of available data.

Our indices are currently available for the following regions: Global, Advanced economies, Large EM economies, US, UK, Australia, Canada, India and a few others.

They cover the following sectors: headline, core, food, utilities, airfares, indirect taxes.

Don't hesitate to reach out for a demo or a free trial.