E11 Decision Tree
17341015 Hongzheng Chen
Contents
1 Datasets
2 Decision Tree
  2.1 ID3
  2.2 C4.5 and CART
3 Tasks
4 Codes and Results
1 Datasets
The UCI repository (http://archive.ics.uci.edu/ml/index.php) is the most widely used collection of datasets for machine learning. If you are interested in datasets from other areas, you can refer to https://www.zhihu.com/question/63383992/answer/222718972.
Today's experiment is conducted with the Adult Data Set, which can be found at http://archive.ics.uci.edu/ml/datasets/Adult.
You can also find 3 related files in the current folder: adult.names is the description of the Adult Data Set, adult.data is the training set, and adult.test is the test set. The class label takes two values, >50K and <=50K, and there are 14 attributes in this dataset:
1. age: continuous.
2. workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov,
State-gov, Without-pay, Never-worked.
3. fnlwgt: continuous.
4. education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
5. education-num: continuous.
6. marital-status: Married-civ-spouse, Divorced, Never-married, Separated,
Widowed, Married-spouse-absent, Married-AF-spouse.
7. occupation: Tech-support, Craft-repair, Other-service, Sales,
Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct,
Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
8. relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9. race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10. sex: Female, Male.
11. capital-gain: continuous.
12. capital-loss: continuous.
13. hours-per-week: continuous.
14. native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany,
Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras,
Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France,
Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala,
Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
The prediction task is to determine whether a person makes over 50K a year.
2 Decision Tree
2.1 ID3
ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding for each node (i.e., in a greedy manner) the categorical feature that yields the largest information gain for categorical targets. Trees are grown to their maximum size, and then a pruning step is usually applied to improve the tree's ability to generalise to unseen data. The ID3 algorithm:
1. Begin with the original set S as the root node.
2. Calculate the entropy of every attribute a of the data set S.
3. Partition the set S into subsets using the attribute for which the resulting entropy after splitting is minimized; or, equivalently, information gain is maximum.
4. Make a decision tree node containing that attribute.
5. Recurse on each subset using the remaining attributes. Recursion stops when one of the following holds:
• every element in the subset belongs to the same class; in which case the node is turned into a leaf node and labelled with the class of the examples.
• there are no more attributes to be selected, but the examples still do not belong to the same class. In this case, the node is made a leaf node and labelled with the most common class of the examples in the subset.
• there are no examples in the subset, which happens when no example in the parent set was found to match a specific value of the selected attribute.
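The steps and stopping conditions above can be sketched in Python. This is a minimal sketch for categorical attributes only, using the entropy and information-gain formulas defined below; the function names and dict-based tree representation are illustrative, not part of any particular implementation:

```python
import math
from collections import Counter

def entropy(labels):
    # H(S) = sum over classes x of -p(x) * log2(p(x))
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attributes):
    """rows: list of dicts mapping attribute -> categorical value;
    labels: the class of each row; attributes: names still available."""
    # Stop: every element belongs to the same class -> leaf with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left -> leaf with the most common class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Greedily pick the attribute with maximum information gain
    # (equivalently, minimum entropy after splitting).
    def gain(a):
        g = entropy(labels)
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            g -= len(sub) / len(labels) * entropy(sub)
        return g

    best = max(attributes, key=gain)
    rest = [a for a in attributes if a != best]
    majority = Counter(labels).most_common(1)[0][0]
    children = {}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        # Stop: an empty subset becomes a leaf labelled with the
        # parent's majority class.
        children[v] = id3(sub_rows, sub_labels, rest) if sub_rows else majority
    return (best, children)
```

Internal nodes come out as an (attribute, children) pair and leaves as bare class labels, so prediction is a simple walk down the dict.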
ID3 shortcomings:
• ID3 does not guarantee an optimal solution.
• ID3 can overfit the training data.
• ID3 is harder to use on continuous data.
Entropy:
Entropy H(S) is a measure of the amount of uncertainty in the set S.
H(S) = \sum_{x \in X} -p(x) \log_2 p(x)
where
• S is the current dataset for which entropy is being calculated
• X is the set of classes in S
• p(x) is the proportion of the number of elements in class x to the number of elements in set S.
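The definition can be checked directly against the class proportions; a minimal sketch (the class labels mirror those of the Adult dataset, purely for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = sum over classes x of -p(x) * log2(p(x))."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

# A 50/50 split has maximum uncertainty for two classes:
assert entropy(['>50K', '<=50K']) == 1.0
# A pure set has no uncertainty:
assert entropy(['<=50K', '<=50K']) == 0.0
```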
Information gain:
Information gain IG(A) is the measure of the difference in entropy from before to after the set S is split on an attribute A. In other words, how much uncertainty in S was reduced after splitting set S on attribute A.
IG(S, A) = H(S) - \sum_{t \in T} p(t) H(t) = H(S) - H(S \mid A)
where
• H(S) is the entropy of set S
• T is the set of subsets created from splitting set S by attribute A, such that S = \cup_{t \in T} t
• p(t) is the proportion of the number of elements in t to the number of elements in set S
• H(t) is the entropy of subset t.
2.2 C4.5 and CART
C4.5 is the successor to ID3. It removed the restriction that features must be categorical by dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous attribute values into a discrete set of intervals. C4.5 converts the trained trees (i.e., the output of the ID3 algorithm) into sets of if-then rules. The accuracy of each rule is then evaluated to determine the order in which they should be applied. Pruning is done by removing a rule's precondition if the accuracy of the rule improves without it.
C5.0 is Quinlan's latest version, released under a proprietary license. It uses less memory and builds smaller rulesets than C4.5 while being more accurate.
CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numerical target variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold that yield the largest information gain at each node.
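The continuous-attribute handling that C4.5 relies on (and the bi-partition method mentioned in the tasks below) can be sketched as follows: candidate thresholds are the midpoints between consecutive distinct sorted values, and the binary split with the largest information gain is kept. This is an illustrative sketch under those assumptions, not the exact procedure of any particular implementation:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Bi-partition a continuous attribute: for every midpoint t between
    consecutive distinct sorted values, split S into {v <= t} and {v > t}
    and keep the threshold with maximum information gain."""
    n = len(labels)
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        gain = (base
                - len(left) / n * entropy(left)
                - len(right) / n * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Toy example: everyone under 30 earns <=50K, everyone over earns >50K,
# so the best cut is the midpoint of 28 and 45 with gain 1.0.
ages   = [22, 25, 28, 45, 50, 60]
labels = ['<=50K', '<=50K', '<=50K', '>50K', '>50K', '>50K']
t, g = best_threshold(ages, labels)
assert t == 36.5 and g == 1.0
```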
3 Tasks
• Given the training dataset adult.data and the testing dataset adult.test, please accomplish the prediction task of determining whether a person in adult.test makes over 50K a year, using the ID3 (or C4.5, CART) algorithm (in C++ or Python), and compute the accuracy.
1. You can process continuous data with the bi-partition method.
2. You can use pre-pruning or post-pruning to avoid the overfitting problem.
3. You can assign probability weights to handle missing attribute values.
• Please finish the experimental report, named E11 YourNumber.pdf, and send it to ai 201901@foxmail.com.
4 Codes and Results
I first ran my code in a Jupyter notebook (decision_tree.ipynb), then generated the Python script dt.py. Since it is hard to insert my code in LaTeX, please refer directly to my source file dt.py. Only pandas and numpy are used for my decision tree, as listed in requirements.txt.
Some preprocessing techniques are applied: I remove the fnlwgt attribute since it does not give more information, and I fill each ? with the most frequent value in that column.
Moreover, pre-pruning is used to avoid overfitting. I hold out 10% of the training data as a validation set for the pre-pruning test.
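These preprocessing steps amount to something like the following pandas sketch. The tiny stand-in frame, the column names, and the random 90/10 split below are illustrative only; see dt.py for the actual code:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for reading adult.data with
# pd.read_csv('adult.data', names=columns, skipinitialspace=True).
df = pd.DataFrame({
    'age':       [39, 50, 38, 53],
    'workclass': ['State-gov', 'Private', '?', 'Private'],
    'fnlwgt':    [77516, 83311, 215646, 234721],
    'income':    ['<=50K', '<=50K', '<=50K', '>50K'],
})

# fnlwgt does not give more information for this task -> drop it.
df = df.drop(columns=['fnlwgt'])

# Fill each '?' with the most frequent value of its column,
# computed over the non-missing entries.
for col in df.columns:
    known = df.loc[df[col] != '?', col]
    df[col] = df[col].replace('?', known.mode()[0])

# Pre-pruning: hold out 10% of the training data as a validation set.
rng = np.random.default_rng(0)
mask = rng.random(len(df)) < 0.9
train, valid = df[mask], df[~mask]
```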
The results are shown below. The best run gives 84.23% accuracy on the test set, which is within the range of the official accuracies reported in adult.names. The running log can be found in *.log.