Journal for the Class "English Learning Skills in Academic Contexts"
This month, I participated in a data mining contest held by Alibaba Corp. During the contest, I learned a lot about algorithms and models, so I would like to write down the experience and summarize it.
The theme of the competition was recommending products or services to consumers based on user behavior data produced on mobile devices. Alibaba Corp. offered two data sets: one is a subset of goods, and the other is the behavior data of 10,000 consumers. The data spans one month, including Double 12 (December 12th), which is a promotion day. The final goal is to predict which goods each consumer will buy on the day following that month. The evaluation standard is the F1 score, which is calculated from precision and recall: F1 = 2 * precision * recall / (precision + recall).
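To make the formula concrete, here is a minimal Python sketch of how precision, recall, and F1 could be computed for a set of predicted purchases. It is not the contest's official scorer. The function name `f1_score`, the representation of a prediction as a `(user_id, item_id)` pair, and the toy data are all assumptions made for illustration.

```python
def f1_score(predicted, actual):
    """Return (precision, recall, f1) for two collections of (user_id, item_id) pairs."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)  # pairs that were predicted and really bought
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(predicted)  # share of predictions that were correct
    recall = hits / len(actual)        # share of real purchases that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: four predicted pairs, three real purchases, two hits.
predicted = {("u1", "a"), ("u1", "b"), ("u2", "c"), ("u3", "d")}
actual = {("u1", "a"), ("u2", "c"), ("u2", "e")}
precision, recall, f1 = f1_score(predicted, actual)
print(precision, recall, f1)
```

Because F1 combines precision and recall, predicting too many pairs hurts the score almost as much as predicting too few.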
At first, I looked over the dataset and tried to find patterns or outliers. I realized that the geography data was incomplete, which may be caused by the consumers’ network environment, so I had to be careful about using it in my model. Then I began to build my model. I thought logistic regression would be a good choice because it is not complex but is practical in many situations. Since the geography data was incomplete, I did not use it; I selected only consumer behavior and time as the features of the model. I separated the data into 4 groups. What’s more, December 12th is a special day on which many promotional activities were held on Alibaba’s e-commerce platform. Consumer behavior on that day can be seen as irrational, so I had to deal with it carefully.
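The idea of logistic regression on behavior and time features can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the model I actually submitted: the three features (click count, whether the item was added to the cart, and a normalized time since the last interaction), the toy data, and the learning rate are all hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that the (user, item) pair leads to a purchase."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical features per (user, item) pair:
# [clicks, added_to_cart (0/1), normalized time since last interaction]
X = [[5, 1, 0.1], [1, 0, 0.9], [4, 1, 0.2], [0, 0, 1.0], [3, 1, 0.3], [1, 0, 0.7]]
y = [1, 0, 1, 0, 1, 0]  # 1 = bought on the target day (toy labels)

w, b = train_logistic_regression(X, y)
print(predict_proba(w, b, [5, 1, 0.1]), predict_proba(w, b, [0, 0, 1.0]))
```

Pairs whose predicted probability exceeds a chosen threshold would then form the prediction set, and the threshold trades precision against recall.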
In the last paragraph, I mentioned the outliers on December 12th. Because behavior on that day may have been influenced by the promotion, I thought clustering could be a good choice: I decided to cluster those records and lower their weight. However, the effect was not ideal, and I had no time to explore more effective algorithms.
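The down-weighting step can be illustrated with a short Python sketch. It deliberately replaces the clustering I used with a much simpler date rule, so it only shows how lower weights enter a loss function. The weight value 0.3, the `(month, day)` representation of dates, and both function names are assumptions made for this example.

```python
import math

PROMO_DAY = (12, 12)  # Double 12, written as (month, day)


def sample_weights(days, promo_day=PROMO_DAY, promo_weight=0.3):
    """Give records from the promotion day a lower weight than ordinary days."""
    return [promo_weight if day == promo_day else 1.0 for day in days]


def weighted_log_loss(y_true, y_prob, weights, eps=1e-12):
    """Weighted average log-loss; low-weight records influence training less."""
    total = 0.0
    for yt, p, wt in zip(y_true, y_prob, weights):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -wt * (yt * math.log(p) + (1 - yt) * math.log(1.0 - p))
    return total / sum(weights)


# Hypothetical records: two ordinary days and one promotion-day record.
days = [(12, 11), (12, 12), (12, 13)]
weights = sample_weights(days)
print(weights)
print(weighted_log_loss([1, 0, 1], [0.8, 0.6, 0.7], weights))
```

With weights like these, a badly predicted promotion-day record costs the model less than the same mistake on an ordinary day.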
It is worth mentioning that I also tried a collaborative filtering algorithm in the contest. Unfortunately, the result was poor, because the prediction set becomes too large when we try to find similarities among so many consumers. Although Amazon is said to have used this algorithm to increase its sales significantly, I think the result set was too inaccurate here. Maybe they mix several algorithms together, or maybe it is difficult to produce a satisfactory result with only one algorithm.
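A small user-based collaborative filtering sketch in Python shows why the prediction set can grow so quickly. This is one common variant using Jaccard similarity, not necessarily the exact algorithm I ran in the contest. The function names, the neighbor count `k`, and the toy purchase data are hypothetical.

```python
def jaccard(a, b):
    """Similarity of two item sets: size of the overlap divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases, k=2):
    """Recommend items bought by the k users most similar to the target user."""
    own = purchases[target]
    neighbors = sorted(
        (u for u in purchases if u != target),
        key=lambda u: jaccard(own, purchases[u]),
        reverse=True,
    )[:k]
    scores = {}
    for u in neighbors:
        sim = jaccard(own, purchases[u])
        if sim == 0.0:
            continue  # users with nothing in common contribute nothing
        for item in purchases[u] - own:  # only items the target has not bought
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))


# Hypothetical purchase history: user -> set of item ids.
purchases = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "d"},
    "u3": {"b", "c", "e"},
    "u4": {"x", "y"},
}
print(recommend("u1", purchases))
```

Every similar neighbor adds all of its unseen items as candidates, so with thousands of consumers the candidate list grows fast unless it is cut off by a score threshold or a top-N limit, which would explain the low precision I observed.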
There are still many algorithms that I did not have enough time to try. I think SVM and decision trees are both viable choices for this task.
It is a pity that I didn’t qualify for the second round of the contest. I am curious about the cloud computing platform developed by Alibaba Corp., and I hope I can do better in next year’s contest.