SVM (support vector machine) is another supervised learning algorithm; it works by maximizing the margin between data points of different classes. SVM does a good job of tolerating individual outliers but doesn't perform well when there is too much noise in the data. Compared to Naive Bayes on an author-identification task, it is much slower but somewhat more accurate. On a small email dataset (~16,000 emails) from Enron, Naive Bayes took 1.5 seconds to train, while SVM took a whopping 185 seconds. Under the hood, this time difference can be attributed to the fact that SVM training runs in roughly cubic time while Naive Bayes runs in linear time. Even with the much longer training time, the final accuracies were 97.32% for Naive Bayes and 98.41% for SVM. At this point, it is important to note that different algorithms are appropriate for different tasks, and there are many parameters to tune to increase an algorithm's performance. For example, SVM is well suited to complex datasets because it uses the kernel trick to construct hyperplanes that produce non-linear decision boundaries. Naive Bayes simply creates a linear decision boundary to separate the two classes, but depending on the kernel function and other parameters, SVM can generate all kinds of boundaries around the classified data. As a result, SVM may need to be tuned to the dataset to get accurate predictions.
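The comparison above can be sketched in a few lines. This is a minimal sketch assuming scikit-learn; since the Enron emails aren't reproduced here, a synthetic two-moons dataset stands in as a "complex" dataset where a non-linear boundary helps.

```python
# Sketch: Naive Bayes vs. SVM with the kernel trick on a non-linear dataset.
# Assumes scikit-learn; make_moons is a stand-in for the Enron email data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Naive Bayes: very fast to train, but limited to a simple decision boundary.
nb = GaussianNB().fit(X_train, y_train)

# SVM with an RBF kernel: the kernel trick yields a non-linear boundary.
# C and gamma are the kind of parameters that typically need tuning per dataset.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print(f"Naive Bayes accuracy: {nb.score(X_test, y_test):.3f}")
print(f"SVM (RBF) accuracy:   {svm.score(X_test, y_test):.3f}")
```

On data like this, the RBF-kernel SVM usually beats Naive Bayes on accuracy while taking longer to train, mirroring the trade-off described above.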
Decision trees ask a question at each level of the tree and assign a class at each leaf node. If you were to visualize a decision tree for a feature that gives you tips on how to dress based on the weather, the nodes along the tree would ask questions such as: is the temperature below average? Is it snowing? Is it raining? The edges of the tree would route the input down to a final clothing class, such as "dress warmly", and so on. As one can see, each question can represent a linear decision boundary, where the algorithm selects a region of the feature space based on the answer. But how does a decision tree build the tree in the first place, given only features and labels? It chooses splits that maximize "information gain", which is calculated from the entropy of the parent node and its children. Overall, decision trees can be prone to overfitting, so tree growth needs to be stopped when appropriate by using parameters such as a maximum depth or a minimum number of samples per split. Decision trees are also used in ensemble methods, which combine multiple models to obtain better predictive performance than a single algorithm.
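The entropy-based splitting criterion can be sketched directly. Below is a minimal sketch, assuming scikit-learn for the tree itself; the weather features and clothing labels are hypothetical stand-ins for the example above.

```python
# Sketch: information gain from entropy, plus a tiny decision tree.
# Assumes scikit-learn; the weather/clothing data below is made up.
from math import log2
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def entropy(labels):
    """Shannon entropy of a label list: sum over classes of -p * log2(p)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["warm", "warm", "cold", "cold"]
# A pure split of a 50/50 parent yields the maximum gain of 1 bit...
print(information_gain(parent, [["warm", "warm"], ["cold", "cold"]]))  # 1.0
# ...while a split that leaves each child 50/50 yields no gain at all.
print(information_gain(parent, [["warm", "cold"], ["warm", "cold"]]))  # 0.0

# Hypothetical features: [temperature_f, is_snowing, is_raining]
X = [[20, 1, 0], [30, 0, 1], [75, 0, 0], [85, 0, 0]]
y = ["dress warmly", "dress warmly", "dress lightly", "dress lightly"]

# criterion="entropy" selects splits by information gain; parameters like
# min_samples_split and max_depth are how tree growth is stopped in practice.
tree = DecisionTreeClassifier(criterion="entropy", min_samples_split=2)
tree.fit(X, y)
print(tree.predict([[25, 1, 0]]))
```

The two `information_gain` calls show why the tree prefers questions that separate the classes cleanly: the pure split earns a full bit of gain, while the uninformative split earns none.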