Sentiment indicators based on some Machine Learning algorithms are fashionable these days, for good reasons. There are, however, some major flaws in this approach when not implemented with a clear strategy.
I come from the Economics field, where one uses data to test a theory.
Data science mostly works the other way around: one builds a model, then tries to make sense of it.
There is no absolute right or wrong way of doing things, but we need to be aware of the pitfalls inherent to each approach.
"Sentiment" indicators are very popular in quantitative finance these days. They often rely on Machine Learning and other statistical techniques to process large volumes of data. I am a big fan of the technology, Natural Language Processing in particular, but only when it comes with a clear sense of purpose.
The single most important thing in a Machine Learning application (in the fields I know at least) is to establish the use case. Whose "sentiment" is this? What does it add to well-known market prices? Are robots better at "sentiment" than humans now, really?
Too often, an indicator is called "sentiment" for lack of a better word and, sometimes, for lack of proper thinking about the concept and the goal.
Let's take an example
Someone builds a "sentiment" indicator aggregating all sorts of news about a few companies, using Natural Language Processing. It correlates well with stock prices.
Now, looking at the data a bit more closely, it appears that a good portion of these news items are about the stock price itself, like:
- "Company X rose 5% yesterday on the back of..."
- "Traders turned negative on Company Y".
There may or may not be a new external reason mentioned in the article.
That can be a recipe for disaster. The whole "correlation" with stock prices may well be coming from this type of news. In our experience, such items can represent a surprisingly large share of the sample.
These news items really need to be taken out of the sample. Then, if there is still a good "correlation" between what's left of the sentiment and stock prices, we have got something to work with.
If not, "sentiment" is just a lagging indicator of stock prices, by about a day or so. The correlation with stocks is a spurious relationship.
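To make the check above concrete, here is a minimal Python sketch, not a description of any production system. It assumes each news item comes as a (day, headline, score) tuple and that daily returns sit in a dictionary. The regular expression is a deliberately naive stand-in for a proper trained classifier, and all names (PRICE_RECAP, is_price_recap, daily_mean_scores, sentiment_return_correlation) are hypothetical.

```python
import re
from collections import defaultdict
from math import sqrt

# Hypothetical patterns for headlines that only recap a price move.
# A real system would more likely rely on a trained classifier.
PRICE_RECAP = re.compile(
    r"\b(rose|fell|gained|dropped|jumped|slid)\s+\d+(\.\d+)?\s*%"
    r"|\btraders turned\b",
    re.IGNORECASE,
)


def is_price_recap(headline):
    """True if the headline looks like a report on the price move itself."""
    return PRICE_RECAP.search(headline) is not None


def daily_mean_scores(items, drop_recaps):
    """Average sentiment score per day; items are (day, headline, score)."""
    by_day = defaultdict(list)
    for day, headline, score in items:
        if drop_recaps and is_price_recap(headline):
            continue  # leave price-recap news out of the sample
        by_day[day].append(score)
    return {day: sum(scores) / len(scores) for day, scores in by_day.items()}


def pearson(xs, ys):
    """Plain Pearson correlation; raises ZeroDivisionError if a series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def sentiment_return_correlation(items, returns, drop_recaps):
    """Correlate daily sentiment with same-day returns (dict: day -> return)."""
    scores = daily_mean_scores(items, drop_recaps)
    days = sorted(day for day in scores if day in returns)
    return pearson([scores[d] for d in days], [returns[d] for d in days])
```

Calling sentiment_return_correlation once with drop_recaps=False and once with drop_recaps=True gives the two numbers to compare: if the correlation collapses once the recaps are gone, the indicator was mostly echoing the price.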
When we aggregate inflation news in the News Inflation Pressure Indices, we use a dedicated model trained to detect news about official inflation releases, so that we can take them out of our sample.
It's not a nice-to-have feature: it must be done. To put numbers on it: these news items represent around 22% of our sample. We have 1.3 million news items related to inflation over a three-year period. Of these, just over 100k are selected as inflation-relevant. And of these, 22k need to be taken out because they relate to inflation releases.
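The arithmetic behind these figures can be checked in a few lines of Python. This is only a sketch: "just over 100k" is rounded down to 100,000, so the shares are approximate, and the variable names are ours.

```python
# Figures quoted in the text; "just over 100k" is rounded down to 100_000.
inflation_news = 1_300_000   # news items related to inflation
selected = 100_000           # items selected as inflation-relevant (approximate)
release_news = 22_000        # items about official inflation releases

share_selected = selected / inflation_news   # fraction kept by the relevance filter
share_releases = release_news / selected     # fraction of the sample to remove
kept = selected - release_news               # items left after removing releases

print(f"selected as relevant: {share_selected:.1%}")
print(f"release-related share: {share_releases:.0%}")
print(f"items left after filtering: {kept:,}")
```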
If we were not to do that, we'd have a spurious indicator.
What's the bottom line?
Using Machine Learning does not exempt us from thinking about a model strategy.
Data Science should start with... knowing the data, which requires field expertise. To build a useful Machine Learning model, the data scientist and the analyst need to work together.