14 Commits

Author SHA1 Message Date
c038adc4e9 fix(trainer): replace newlines automatically 2021-02-24 20:01:35 -05:00
2181649b22 feat: add "come" and "join" as stop words 2021-02-21 15:50:05 -05:00
0dc0c2ef00 feat(data): add more data
Also pull out stop words into field.
2021-02-20 19:25:15 -05:00
c3df0a1f8e feat: add normalisation to pipeline
Add a step to normalise messages to the ML pipeline. This ensures
computed properties run on the raw data (which is actually partially
normalised by the compute context). This prevents properties which
rely on symbols (e.g. "B>") from being unable to work properly when
normalisation happens before they have access to the input.
2021-02-17 21:45:09 -05:00
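
The ordering described in c3df0a1f8e can be sketched roughly as follows. This is an illustration only, not the repository's code: has_buy_marker, normalise, and run_pipeline are hypothetical names, and the normalisation rule (replace non-alphanumeric runs with a space) is an assumption. The point is that a property relying on a symbol such as "B>" must read the raw message before normalisation strips that symbol.

```python
import re


def has_buy_marker(raw: str) -> bool:
    # Hypothetical computed property that depends on the "B>" symbol.
    return "B>" in raw


def normalise(raw: str) -> str:
    # Assumed normalisation: runs of non-alphanumeric characters become
    # a single space, then surrounding whitespace is collapsed.
    return " ".join(re.sub(r"[^0-9A-Za-z\s]+", " ", raw).split())


def run_pipeline(raw: str) -> dict:
    # Step 1: computed properties see the raw message.
    properties = {"has_buy_marker": has_buy_marker(raw)}
    # Step 2: normalisation happens afterwards, inside the pipeline.
    return {"text": normalise(raw), **properties}


result = run_pipeline("B> item")
# Normalising first would lose the marker the property relies on.
wrong_order = has_buy_marker(normalise("B> item"))
```

Here result keeps the marker as a property while the text itself is normalised; wrong_order shows what happens if normalisation runs before the property is computed.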
d00b3b0845 feat: better handle punctuation
Certain symbols are turned into one space so the model sees multiple
words instead of one. Previously "[RP]Hi" would turn into "RPHi" and
be its own token. Now it turns into "RP" and "Hi", counting as two
tokens. This change increased the model's accuracy.

Also make "18", "http", "https", and LGBT-related words into stop
words (meaning they're ignored). Each of these stop words made the
model more accurate and reduced unwanted bias.

Messages destined for ML are now normalised by the plugin in the same
way the model's training input is. This should bring the results
closer to what is expected.
2021-02-17 20:01:34 -05:00
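
A minimal sketch of the tokenisation change described in d00b3b0845, under stated assumptions: the commit does not list which symbols are affected, so the SYMBOLS set below is hypothetical, as are the function names. The stop-word list includes only the words the commit names explicitly.

```python
import re

# Hypothetical symbol set; the commit only says "certain symbols".
SYMBOLS = re.compile(r"[\[\](){}<>:/+|]+")

# Stop words named in the commit message (the real list is longer).
STOP_WORDS = {"18", "http", "https"}


def tokenize_old(message: str) -> list[str]:
    # Previous behaviour: symbols are removed outright, so adjacent
    # words fuse into a single token.
    return SYMBOLS.sub("", message).split()


def tokenize(message: str) -> list[str]:
    # New behaviour: each run of symbols becomes one space, so the
    # words on either side stay separate tokens.
    tokens = SYMBOLS.sub(" ", message).split()
    # Stop words are dropped and never reach the model.
    return [t for t in tokens if t.lower() not in STOP_WORDS]


old_tokens = tokenize_old("[RP]Hi")
new_tokens = tokenize("[RP]Hi")
filtered = tokenize("visit https://example.com 18+")
```

With these assumptions, old_tokens holds the single fused token "RPHi", while new_tokens holds "RP" and "Hi" as two tokens, matching the example in the commit message.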
Anna
87c5602319 feat: use separate process for classifying 2021-01-30 16:02:37 -05:00
df66d397ed fix(trainer): use LF newlines for real 2021-01-02 17:28:17 -05:00
081e670da4 fix(trainer): use LF newlines 2021-01-02 16:59:40 -05:00
9f15bb7d0d feat(trainer): have trainer sort data automatically 2021-01-02 16:59:00 -05:00
2f7761b9b0 chore(trainer): only save model on full run 2021-01-02 07:31:34 -05:00
753e0f710e refactor(trainer): use correct schema, though it shouldn't matter 2020-12-28 22:04:50 -05:00
1b8f7806f5 refactor: put computation in interface
This basically undoes the benefits of the previous commit. May end up being reverted.
2020-12-28 21:48:31 -05:00
effe41a345 refactor(training): compute properties in pipeline
Hopefully this no longer requires the data structure to be updated when new computed properties are added. It should also reduce duplication and make it easier to make bigger changes to the model without needing to update the plugin.
2020-12-28 21:01:35 -05:00
bd05abb5e0 feat(trainer): add trainer to actual repo 2020-12-28 20:14:19 -05:00