
What kinds of groups can we cluster young people into?

2019-11-06 08:26:30

Data source: https://www.kaggle.com/miroslavsabo/young-people-survey

```{r, include=FALSE}
library(readr)
library(ggplot2)
library(gridExtra)
library(stringr)
library(plyr)     # mapvalues
library(mice)
library(VIM)
library(cluster)  # clusplot
library(NbClust)
library(DMwR)
library(mclust)
library(fpc)
```

1 Purpose of Clustering

This dataset is a textbook case for clustering: the responses are quantitative, the sample is large enough, and the topics are varied and interesting. There must be some interesting stories about these young people waiting to be dug out. So I am going to cluster several topics, explain what the resulting groups mean, and describe the social characteristics of each group.

2 Data Preparation and Exploration

2.1 Data preparation

```{r}
resp <- read_csv("~/Desktop/Über R/Excesise/Young People Survey/responses.csv")
```

The way the variables are named in this dataset is a real pain for R users. The spaces and special symbols (such as "-"), which produce names like "Left - right handed", get in the way when I try to graph with ggplot. So I deal with all the variable names first.

```{r}
old_name <- names(resp)
new_name <- str_replace_all(old_name, "\\s", "_")  # replace spaces
new_name <- str_replace_all(new_name, "-", "vs")   # replace the hyphen
new_name <- str_replace_all(new_name, ",", "")
new_name <- str_replace_all(new_name, "/", "_or_")
new_name <- str_replace_all(new_name, "'", "")
colnames(resp) <- new_name
```

2.2 Exploring the demographics of the young people

First, cut the demographics columns out of the full data frame.

```{r}
dem <- resp[,141:150]
```

Let's take a quick look at the demographic information of these young people.

```{r}
# reusable function for the exploration: bar chart with proportion labels
pile.bar <- function(x, xn, Freq, title = NULL){
  z <- as.data.frame(table(x))
  colnames(z) <- c(xn, Freq)
  pr <- prop.table(z$Freq)
  p.bar <- ggplot(z, aes_string(x = xn, y = Freq)) +
    geom_bar(stat = "identity", position = "stack", width = 0.9) +
    geom_text(aes(label = round(pr, 2)), colour = "white", vjust = 1, size = 4) +
    ggtitle(title)
  return(p.bar)
}

age_hist <- pile.bar(dem$Age, "Age", "Freq")
edu_hist <- pile.bar(dem$Education, "Education", "Freq")
grid.arrange(age_hist, edu_hist, nrow = 2)
```

From these basic variables we can see that the sample consists mainly of secondary-school and college students, with ages concentrated between 18 and 21.

```{r}
gender_hist <- pile.bar(dem$Gender, "Gender", "Freq")
hand_hist <- pile.bar(dem$Left_vs_right_handed, "Left_vs_right_handed", "Freq")
sibl_hist <- pile.bar(dem$Only_child, "Only_child", "Freq")
area_hist <- pile.bar(dem$Village_vs_town, "Village_vs_town", "Freq")
hou.k_hist <- pile.bar(dem$House_vs_block_of_flats, "House_vs_block_of_flats", "Freq")
grid.arrange(gender_hist, hand_hist, sibl_hist, area_hist, hou.k_hist, ncol = 2)
```

On the other hand, most of these young people have siblings and live in a block of flats in a city.

3 Data Clustering

In this part, I will apply at least two clustering methods to a dataset and compare them to judge which performs better. Then I will use the better-performing method to explain what the resulting clusters mean.

3.1 Movie Preferences Cluster

Let's start with a dataset that has a manageable number of variables.

3.1.1 Hierarchical clustering

Since all the variables are on the same scale, I don't need to normalize the data and can fit the clustering directly.

```{r}
movie <- resp[,20:31]
# use agglomerative hierarchical clustering
hc_mo <- hclust(dist(movie, method = "euclidean"), method = "ward.D")
plot(hc_mo, hang = -0.01, cex = 0.7)
```

I use Ward's method in this section. Our dataset has many observations and a fair number of variables, and Ward's method tends to produce clusters of reasonable size. On the other hand, it can also be sensitive to outliers. From the tree above, I'm not sure how many clusters there should be, so I will get help from the NbClust function.

```{r}
nc <- NbClust(movie, distance = "euclidean", min.nc = 2, max.nc = 15, method = "ward.D")
table(nc$Best.n[1,])
```

Okay, the number of clusters I get here is 3. The next step is to cut the tree into the different clusters.

```{r}
fit_mo <- cutree(hc_mo, k = 3)
table(fit_mo)
plot(hc_mo)
rect.hclust(hc_mo, k = 3, border = "red")
# segment inspection
aggregate(movie, by = list(cluster = fit_mo), mean)
clusplot(movie, fit_mo, color = TRUE, shade = TRUE, labels = 4, lines = 0,
         main = "hclust plot")
```

The tree is cut cleanly, but the segment description shows that this clustering is somewhat odd. The three groups are very unevenly sized, and the first variable, the general Movies rating, takes essentially a single value across them.

3.1.2 K-means cluster

Decide on the number of clusters before fitting k-means.

```{r}
nc2 <- NbClust(movie, min.nc = 2, max.nc = 15, method = "kmeans")
table(nc2$Best.n[1,])
```

It's still 3 with k-means. K-means is sensitive to NAs, so I have to deal with the missing values in the dataset first. The mice function provides a good missing-value imputation method.

```{r}
matrixplot(movie)  # missing value distribution
movie_na <- movie
movie_fm <- mice(movie)
movie_complete <- complete(movie_fm)
matrixplot(movie_complete)
set.seed(101)
km_mo <- kmeans(movie_complete, 3)
aggregate(movie_complete, by = list(cluster = km_mo$cluster), mean)
clusplot(movie_complete, km_mo$cluster, color = TRUE, shade = TRUE, labels = 4,
         lines = 0, main = "K-means cluster plot")
```

The segment description from k-means is complete and, compared to hclust, there is less overlap between the three clusters. It's apparent that the k-means result is better than the hierarchical one.
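To back up that visual impression with a number, here is a minimal sketch comparing the average silhouette width of the two 3-cluster solutions; it assumes the objects movie_complete, km_mo and fit_mo created in the chunks above.

```{r}
# Average silhouette width (closer to 1 = tighter, better-separated clusters),
# computed on the imputed data so both solutions are measured on the same points.
d_mo <- dist(movie_complete, method = "euclidean")
sil_km <- silhouette(km_mo$cluster, d_mo)  # k-means labels
sil_hc <- silhouette(fit_mo, d_mo)         # hierarchical (Ward) labels
mean(sil_km[, "sil_width"])
mean(sil_hc[, "sil_width"])
```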

3.1.3 Application of the cluster result

According to the k-means result, I can roughly name the three clusters as:

Cluster 3: movie enthusiasts, who love every kind of movie except romantic and fantasy/fairy tales (・∀・)
Cluster 1: ordinary movie consumers, who focus on romantic, fantasy/fairy tales and comedy, and have no feeling for horror, thriller, war or western
Cluster 2: lovers of specific genres, who focus on war, thriller and action

Let's go further and compare the demographic characteristics of the three clusters. Before the comparison, I apply some transformations to make the groups easier to compare.

```{r}
# convert the categorical variables to numeric
dems <- dem
dems$Gender <- as.numeric(mapvalues(dems$Gender, from = c("female","male"), to = 1:2))
dems$Left_vs_right_handed <- as.numeric(mapvalues(dems$Left_vs_right_handed,
                                                  from = c("left handed","right handed"), to = 1:2))
dems$Education <- as.numeric(mapvalues(dems$Education,
                                       from = c("currently a primary school pupil","primary school",
                                                "secondary school","college/bachelor degree",
                                                "masters degree","doctorate degree"),
                                       to = 1:6))
dems$Only_child <- as.numeric(mapvalues(dems$Only_child, from = c("yes","no"), to = 1:2))
dems$Village_vs_town <- as.numeric(mapvalues(dems$Village_vs_town, from = c("city","village"), to = 1:2))
dems$House_vs_block_of_flats <- as.numeric(mapvalues(dems$House_vs_block_of_flats,
                                                     from = c("block of flats","house/bungalow"), to = 1:2))
# impute the missing values
dems_fm <- mice(dems)
dems_co <- complete(dems_fm)
# dems_co is the dataset with missing values imputed
aggregate(dems_co[,1:4], by = list(km_mo$cluster), mean)
round(prop.table(table(km_mo$cluster, dem$Gender), 1), 2)
round(prop.table(table(dem$Education, km_mo$cluster), 2), 2)
round(prop.table(table(km_mo$cluster, dem$Village_vs_town), 1), 2)
round(prop.table(table(km_mo$cluster, dem$House_vs_block_of_flats), 1), 2)
```

As you can see, the biggest difference between them is gender:

Cluster 3: the movie enthusiasts are mainly guys
Cluster 1: the ordinary movie consumers are mostly girls
Cluster 2: the lovers of specific genres are a bit older but fairly balanced in gender

3.2 Music Preferences Cluster

After using k-means and hclust, I'd like to try a different approach: model-based clustering.

In contrast to hierarchical clustering and k-means clustering, which use a heuristic approach and do not depend on a formal model, model-based clustering techniques assume varieties of data models, apply an EM algorithm to obtain the most likely model, and further use the model to infer the most likely number of clusters. ——Machine Learning With R Cookbook

3.2.1 Model-based cluster

```{r}
# deal with missing values
music <- resp[,1:19]
music_fm <- mice(music)
music_co <- complete(music_fm)
# mclust
mc_mu <- Mclust(music_co)
summary(mc_mu)
```

It tells us there are only 2 clusters (components), but that doesn't help much in understanding the differences within this dataset. So I check the data with NbClust() as before, which suggests that 3 clusters is better, and then force a 3-cluster solution on Mclust():

```{r}
nc3 <- NbClust(music_co, min.nc = 2, max.nc = 15, method = "kmeans")
table(nc3$Best.n[1,])
# the NbClust result is 3 (it takes too long, so I don't rerun it here), so force the model to 3 clusters
mc_mu3 <- Mclust(music_co, G = 3)
summary(mc_mu3)
```

When I force a 3-cluster solution, the BIC (Bayesian information criterion) goes down; since mclust reports BIC on a scale where higher is better, the 2-component model is formally preferred, but I keep the 3-cluster solution because NbClust supports it and it is easier to compare with the movie segments.
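To see how close the two solutions are on the BIC scale, here is a short sketch of how mclust's BIC surface can be inspected directly; it assumes the imputed music_co data from the chunk above.

```{r}
# BIC across candidate numbers of components and covariance models
# (in mclust, larger BIC values are preferred).
bic_mu <- mclustBIC(music_co)
summary(bic_mu)  # top models by BIC
plot(bic_mu)     # BIC against the number of components for each model
```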

```{r}
clusplot(music_co, mc_mu3$classification, color = TRUE, shade = TRUE, labels = 4,
         lines = 0, main = "Mclust plot")
aggregate(music_co, by = list(cluster = mc_mu3$classification), mean)
```

As the result shows, with the 3-cluster solution the three segments have clearly different means and share the same pattern as the movie preferences. But the clusplot shows quite a lot of overlap among the three segments, which means the model does not perform very well on this dataset. Also, clusters 2 and 3 overlap more with each other than with cluster 1.

Cluster 3: still the music enthusiasts, who love almost every kind of music, although their pop and hip-hop ratings are no higher than cluster 1's and their metal/hardrock and punk ratings are lower than cluster 2's
Cluster 1: ordinary music consumers, who like the kinds of music that are popular with everyone
Cluster 2: lovers of specific genres, who focus on the more special kinds such as metal/hardrock and punk

I also run the cross-tables with the demographic data again, to check whether these 3 segments are the same groups as in the movie part.

```{r}
# dems_co is the dataset with missing values imputed
aggregate(dems_co[,1:4], by = list(mc_mu3$classification), mean)
round(prop.table(table(mc_mu3$classification, dem$Gender), 1), 2)
round(prop.table(table(dem$Education, mc_mu3$classification), 2), 2)
round(prop.table(table(mc_mu3$classification, dem$Village_vs_town), 1), 2)
round(prop.table(table(mc_mu3$classification, dem$House_vs_block_of_flats), 1), 2)
```

The answer is yes. The music segments are roughly the same groups as the movie ones, with slight differences on some variables.

3.3 Personality traits, views on life, & opinions

In this section, I'll try PAM clustering, which is similar to k-means but has the following characteristic:

Partitioning Around Medoids (PAM) algorithm. PAM uses a greedy search which may not find the optimum solution, but it is faster than exhaustive search.

It works better on mixed data types and isn't limited to continuous variables, and this dataset contains 3 categorical variables.
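As a quick illustration of why that matters, here is a toy sketch (the small data frame is made up purely for demonstration) of how the Gower dissimilarity used by daisy() combines a numeric column with a factor column:

```{r}
# Hypothetical toy data: Gower rescales the numeric column to [0, 1],
# uses simple matching for the factor, and averages the per-variable distances.
toy <- data.frame(score = c(1, 5, 3),
                  habit = factor(c("never", "often", "never")))
daisy(toy, metric = "gower")
```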

3.3.1 PAM cluster

With the experience from the last two parts, I cluster this dataset into 3 segments from the start.

```{r}
pers <- resp[,77:133]
pers_fm <- mice(pers)
pers_co <- complete(pers_fm)
# convert the categorical variables to factors
pers_co$Punctuality <- as.factor(pers_co$Punctuality)
pers_co$Lying <- as.factor(pers_co$Lying)
pers_co$Internet_usage <- as.factor(pers_co$Internet_usage)
# PAM clustering on the Gower dissimilarity
pm_pe <- pam(daisy(pers_co, metric = "gower"), k = 3)
clusplot(pers_co, pm_pe$clustering, color = TRUE, shade = TRUE, labels = 4,
         lines = 0, main = "PAM cluster plot")
aggregate(pers_co, by = list(cluster = pm_pe$clustering), mean)
```

The means of the 3 segments still follow the same pattern, but the plot looks different: there is more overlap between clusters 3 and 1.
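To put a number on that overlap, a minimal sketch (assuming pers_co and pm_pe from the chunk above) checks the average silhouette width of this fit and sweeps a few values of k:

```{r}
# Average silhouette width of the 3-cluster PAM fit (pam stores it in silinfo),
# plus the same statistic for k = 2..6 on the Gower dissimilarity.
gower_d <- daisy(pers_co, metric = "gower")
pm_pe$silinfo$avg.width
sapply(2:6, function(k) pam(gower_d, k = k)$silinfo$avg.width)
```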

4 Conclusion

Across these datasets, k-means gave a better clustering performance than the other three methods. We can roughly split these young people into 3 segments:

Cluster 3: mostly male secondary-school and college students
Cluster 1: mostly female secondary-school and college students
Cluster 2: mostly older youth with master's or doctorate degrees, with a roughly balanced gender mix

Clusters 3 and 2 share the same trend in movie and music preferences, with cluster 2 showing even more distinctive tastes in those two topics, while cluster 1 leans toward the mainstream preferences; this suggests gender is an important factor there. Clusters 3 and 1 share the same trend in personality and values, while cluster 2 sits quite far from both, which suggests age or education matters more for that topic.

