As definitions are no longer being updated, the test suite for them
can be removed. The ML trainer already reports accuracy statistics in
both training mode and model creation mode, and has an interactive
mode for testing new messages.

Remove the global section, which filtered Free Company ads and RP
ads. Prevent reporting of messages that were filtered by
definitions. Make the ML mode default and mark definitions mode as
obsolete.

Add a message normalisation step to the ML pipeline itself. Computed
properties now run on the raw data (which is actually partially
normalised by the compute context) before that step. Previously,
properties that rely on symbols (e.g. "B>") could not work properly
because normalisation happened before they had access to the input.
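
The ordering described above can be illustrated with a minimal Python
sketch. It is not the plugin's or trainer's actual code: the function
names, the "contains_buy_symbol" property, and the exact normalisation
rule are all assumptions made for illustration. The only point it shows
is that a symbol-based property must read the input before
normalisation strips the symbol.

```python
import re


def normalise(text: str) -> str:
    # Hypothetical normalisation rule: replace every character that is
    # not a word character or whitespace with a space, then collapse
    # runs of whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    return " ".join(cleaned.split())


def contains_buy_symbol(raw: str) -> bool:
    # Hypothetical computed property that depends on a symbol.
    return "B>" in raw


def featurise(raw: str) -> dict:
    # Computed properties run first, on the raw message...
    features = {"contains_buy_symbol": contains_buy_symbol(raw)}
    # ...and normalisation happens afterwards, as a pipeline step.
    features["text"] = normalise(raw)
    return features
```

With this ordering, featurise("B> cheap gil") still detects the "B>"
marker, whereas running the same property on already-normalised text
would not.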

Certain symbols are now replaced with a single space so the model sees
multiple words instead of one. Previously "[RP]Hi" would turn into
"RPHi" and be a single token. Now it turns into "RP" and "Hi",
counting as two tokens. This change increased the model's accuracy.
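
A small Python sketch of the difference between removing symbols and
replacing them with a space. The set of symbols below is an assumption
chosen for illustration (this message does not list the real one), and
both function names are hypothetical.

```python
# Assumed set of separator symbols; the real list may differ.
SEPARATOR_SYMBOLS = "[]()<>{}"


def tokenise_removing_symbols(text: str) -> list[str]:
    # Old behaviour as described: symbols are deleted outright, so
    # neighbouring words fuse into one token.
    table = str.maketrans("", "", SEPARATOR_SYMBOLS)
    return text.translate(table).split()


def tokenise_spacing_symbols(text: str) -> list[str]:
    # New behaviour as described: each symbol becomes one space, so the
    # words on either side stay separate tokens.
    table = str.maketrans({symbol: " " for symbol in SEPARATOR_SYMBOLS})
    return text.translate(table).split()
```

For the input "[RP]Hi", the first function yields the single token
"RPHi" and the second yields "RP" and "Hi".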

Also make "18", "http", "https", and LGBT-related words into stop
words (meaning they're ignored). Each of these stop words made the
model more accurate and reduced unwanted bias.
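
As a rough illustration of what a stop word does, the Python sketch
below drops listed tokens before they reach the model. Only the three
literal stop words named above are included; the remaining words are
not enumerated here. The function name and the case-insensitive match
are assumptions.

```python
# Only the stop words spelled out in this message; the full list used
# for the model is longer.
STOP_WORDS = frozenset({"18", "http", "https"})


def remove_stop_words(tokens: list[str]) -> list[str]:
    # Assumption: matching ignores case. Tokens in the stop list are
    # dropped, so they cannot influence the model's prediction.
    return [token for token in tokens if token.lower() not in STOP_WORDS]
```

For example, the tokens "join", "https", "discord", "18" would be
reduced to "join" and "discord".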

Messages destined for ML are now normalised by the plugin in the same
way as the model's training input. This should bring the plugin's
results closer to what is expected from training.