
Mahout Learning Summary


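I have recently been studying recommender systems, mainly with Mahout. I collected some material online, and most of it covers much the same ground. The classic references for beginners are:
1. Building a Social Recommendation Engine with Apache Mahout (基于 Apache Mahout 构建社会化推荐引擎)
2. Exploring the Inner Workings of Recommendation Engines, Part 1: A First Look at Recommendation Engines (探索推荐引擎内部的秘密,第 1 部分)
3. Exploring the Inner Workings of Recommendation Engines, Part 2: Recommendation Algorithms in Depth: Collaborative Filtering (探索推荐引擎内部的秘密,第 2 部分)
4. Exploring the Inner Workings of Recommendation Engines, Part 3: Recommendation Algorithms in Depth: Clustering (探索推荐引擎内部的秘密,第 3 部分)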

At work I mainly deal with recommending articles based on their content, news articles in particular.

First, the choice of approach.

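1. Content-based recommender systems
Model users (User) and items (Item) separately.
Compute the similarity between each item's model and the user's model.
Recommend to the user the items whose models are most similar to the user's model.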
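The three steps above can be sketched in a few lines of plain Java. This is only an illustration under my own assumptions: users and items are both modelled as sparse keyword-weight vectors, cosine similarity is used as the matching function, and the class name `ContentBasedSketch` and the sample data are made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class ContentBasedSketch {

    /** Cosine similarity between two sparse feature vectors (feature -> weight). */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty model matches nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns up to howMany item IDs, most similar to the user model first. */
    static List<Long> recommend(Map<String, Double> userModel,
                                Map<Long, Map<String, Double>> itemModels,
                                int howMany) {
        List<Long> ids = new ArrayList<>(itemModels.keySet());
        ids.sort(Comparator.comparingDouble(
                (Long id) -> cosine(userModel, itemModels.get(id))).reversed());
        return new ArrayList<>(ids.subList(0, Math.min(howMany, ids.size())));
    }

    /** Hypothetical user model: keyword weights derived from what the user has read. */
    static Map<String, Double> sampleUser() {
        return Map.of("football", 0.8, "transfer", 0.6);
    }

    /** Hypothetical item models: keyword weights for three articles. */
    static Map<Long, Map<String, Double>> sampleItems() {
        return Map.of(
                1L, Map.of("football", 1.0, "transfer", 1.0),
                2L, Map.of("football", 1.0, "election", 1.0),
                3L, Map.of("election", 1.0, "budget", 1.0));
    }

    public static void main(String[] args) {
        // Article 1 shares both keywords with the user, article 2 only one, article 3 none.
        System.out.println(recommend(sampleUser(), sampleItems(), 2));
    }
}
```

Nothing here depends on access records, so a brand-new article can be scored as soon as its keyword vector exists.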

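2. Collaborative-filtering recommender systems
Independent of the system's business domain.
Similarity is mined from users' access records rather than from the attributes of the users and items themselves.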

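Advantages of collaborative filtering
1. Independent of the business domain
2. The algorithm and the collection of the underlying data are relatively simple
3. Widely adopted in industry, for example by e-commerce sites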

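Given the characteristics of news, I gave up on collaborative filtering:
1. Collaborative filtering recommends on the basis of access records, so only articles that have already been visited can be recommended. For news, where timeliness matters a great deal, that is a serious defect.
2. News has a very short life cycle, which makes the access records extremely sparse and makes it hard to compute similarity from them.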
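To make the first point concrete, here is a minimal sketch of similarity computed purely from access records. It assumes a Jaccard (Tanimoto) coefficient over the sets of visitors, which is just one common choice; the class name `VisitSimilaritySketch`, the article IDs, and the user IDs are all made up.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class VisitSimilaritySketch {

    /** Article ID -> IDs of the users who visited it. */
    private final Map<Long, Set<Long>> visitors = new HashMap<>();

    void recordVisit(long userId, long articleId) {
        visitors.computeIfAbsent(articleId, k -> new HashSet<>()).add(userId);
    }

    /** Jaccard similarity of the two visitor sets; 0 if either article has no visits. */
    double similarity(long articleA, long articleB) {
        Set<Long> a = visitors.getOrDefault(articleA, Collections.emptySet());
        Set<Long> b = visitors.getOrDefault(articleB, Collections.emptySet());
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        Set<Long> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return (double) common.size() / union;
    }

    /** Hypothetical log: articles 100 and 200 have visitors, article 300 is brand new. */
    static VisitSimilaritySketch sample() {
        VisitSimilaritySketch s = new VisitSimilaritySketch();
        for (long user : new long[] {1, 2, 3}) {
            s.recordVisit(user, 100L);
        }
        for (long user : new long[] {2, 3, 4}) {
            s.recordVisit(user, 200L);
        }
        return s;
    }

    public static void main(String[] args) {
        VisitSimilaritySketch s = sample();
        System.out.println("100 vs 200: " + s.similarity(100L, 200L));
        System.out.println("100 vs 300: " + s.similarity(100L, 300L));
    }
}
```

Article 300 has no visitors yet, so its similarity to everything is zero and it can never be picked, however relevant its content is.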

I therefore went with content-based recommendation.

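The similarity between articles can be computed with the TF-IDF algorithm, or obtained through Lucene. For reference, see the post "elasticsearch moreLikeThis查询应用" (using the elasticsearch moreLikeThis query).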
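As a rough illustration of the TF-IDF route, the sketch below weights each term by (term frequency) x log(N / document frequency) and compares two articles by the cosine of their vectors. It assumes the articles are already tokenized (Chinese text would need a word segmenter first), and it uses one of several common TF-IDF variants; the class name `TfIdfSketch` and the sample documents are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

final class TfIdfSketch {

    /** Builds one TF-IDF vector (term -> weight) per tokenized document. */
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> vec = new HashMap<>();
            for (String term : doc) {
                vec.merge(term, 1.0, Double::sum); // raw term count
            }
            for (Map.Entry<String, Double> e : vec.entrySet()) {
                double tf = e.getValue() / doc.size();
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                e.setValue(tf * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    /** Cosine similarity between two sparse vectors. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Hypothetical corpus: two sports headlines and one finance headline. */
    static List<List<String>> sampleDocs() {
        return List.of(
                List.of("team", "wins", "cup", "final"),
                List.of("team", "loses", "cup", "final"),
                List.of("bank", "raises", "interest", "rate"));
    }

    public static void main(String[] args) {
        List<Map<String, Double>> v = tfIdf(sampleDocs());
        System.out.println("doc0 vs doc1: " + cosine(v.get(0), v.get(1)));
        System.out.println("doc0 vs doc2: " + cosine(v.get(0), v.get(2)));
    }
}
```

A score like this, scaled to the range Mahout expects, is the kind of value that can be fed into `GenericItemSimilarity.ItemItemSimilarity`.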

Here I computed the article similarities beforehand and built a simple demo on top of them.

Content-based article recommendation

 
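	import java.io.File;
	import java.util.ArrayList;
	import java.util.List;
	import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
	import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
	import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
	import org.apache.mahout.cf.taste.model.DataModel;
	import org.apache.mahout.cf.taste.recommender.RecommendedItem;
	import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

	public static void ff() throws Exception {
		// 2645769 is the first item ID (ItemID1), 2389682 the second (ItemID2),
		// and 31.78/100 is the similarity between the two articles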
		GenericItemSimilarity.ItemItemSimilarity similarity = new GenericItemSimilarity.ItemItemSimilarity(2645769,2389682,31.78/100);
 
		List<GenericItemSimilarity.ItemItemSimilarity> similarities = new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
		similarities.add(similarity);
		// 28.002708 (apparently the unrounded score for this pair)
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2645769,2679041, 28.00/100);
		similarities.add(similarity);
		// 26.309313
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2645769,2656591, 26.30/100);
		similarities.add(similarity);
		// 23.620678
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2645769,2686496, 23.65/100);
		similarities.add(similarity);
		// 22.648125
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2645769,1129187, 22.64/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2645767,2599815, 22.63/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2645767,2540110, 22.62/100);
		similarities.add(similarity);
 
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2645767,2700438, 22.61/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2645767,2767605, 22.60/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2645767,2529194, 22.54/100);
		similarities.add(similarity);
 
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2389682,2679041, 28.01/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2389682,2656591, 28.04/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2389682,2686496, 28.06/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2389682,1129187, 27.06/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2389682,2599815, 25.06/100);
		similarities.add(similarity);
 
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2389682,2540110, 25.04/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2389682,2700438, 25.01/100);
		similarities.add(similarity);
 
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2686496,2700438, 25.06/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2686496,2679041, 25.03/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2686496,1129187, 25.04/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2686496,2599815, 25.07/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2686496,2540110, 25.08/100);
		similarities.add(similarity);
		// note: the pair (2686496, 2679041) was already added above with 25.03/100
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2686496,2679041, 28.01/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(1129187,2700438, 25.01/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2599815,2700438, 25.01/100);
		similarities.add(similarity);
		similarity = new GenericItemSimilarity.ItemItemSimilarity(2529194,2700438, 25.01/100);
		similarities.add(similarity);
 
 
 
		ItemSimilarity itemSimilarity = new GenericItemSimilarity(similarities);

		// use an absolute path for the data file
		DataModel model = new FileDataModel(new File("e:\\recommend\\usernews.txt"));
		GenericItemBasedRecommender recommender = new GenericItemBasedRecommender(model, itemSimilarity);

		// recommend 2 items for user 10
		List<RecommendedItem> recommendations = recommender.recommend(10, 2);
		for (RecommendedItem recommendation : recommendations) {
			System.out.println(recommendation.getItemID());
			System.out.println(recommendation.getValue());
		}

		// the 2 items most similar to item 2645769
		recommendations = recommender.mostSimilarItems(2645769, 2);
		for (RecommendedItem recommendation : recommendations) {
			System.out.println(recommendation.getItemID());
			System.out.println(recommendation.getValue());
		}
	}

User-based recommendation

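	// needed in addition to the imports used by ff()
	import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
	import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
	import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
	import org.apache.mahout.cf.taste.impl.similarity.SpearmanCorrelationSimilarity;
	import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
	import org.apache.mahout.cf.taste.recommender.Recommender;
	import org.apache.mahout.cf.taste.similarity.UserSimilarity;

	public static void user() throws Exception {
		// use an absolute path for the data file
		DataModel model = new FileDataModel(new File("e:\\recommend\\test.txt"));
		UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
		final UserSimilarity simi = new SpearmanCorrelationSimilarity(model);
		// the 2 nearest neighbours are picked with the Spearman similarity ...
		UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, simi, model);
		// ... while the recommender weights their preferences with the Pearson similarity
		Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
		// recommend 1 item for user 1
		List<RecommendedItem> recommendations = recommender.recommend(1, 1);
		for (RecommendedItem recommendation : recommendations) {
			System.out.println(recommendation);
		}
	}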
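To get a feel for what a Pearson-based user similarity measures, the following sketch computes the textbook Pearson correlation over the items two users have both rated, using rows from test.txt. It is a hand-rolled illustration, not Mahout's implementation, and the class name `PearsonSketch` is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class PearsonSketch {

    // Ratings of users 1, 2 and 5 copied from test.txt (item ID -> preference).
    static final Map<Long, Double> USER_1 = Map.of(101L, 5.0, 102L, 3.0, 103L, 2.5);
    static final Map<Long, Double> USER_2 = Map.of(101L, 2.0, 102L, 2.5, 103L, 5.0, 104L, 2.0);
    static final Map<Long, Double> USER_5 =
            Map.of(101L, 4.0, 102L, 3.0, 103L, 2.0, 104L, 4.0, 105L, 3.5, 106L, 4.0);

    /** Pearson correlation over co-rated items; NaN when it is undefined. */
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        List<double[]> pairs = new ArrayList<>();
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                pairs.add(new double[] {e.getValue(), other});
            }
        }
        int n = pairs.size();
        if (n < 2) {
            return Double.NaN; // not enough overlap to correlate
        }
        double meanA = 0.0, meanB = 0.0;
        for (double[] p : pairs) {
            meanA += p[0];
            meanB += p[1];
        }
        meanA /= n;
        meanB /= n;
        double cov = 0.0, varA = 0.0, varB = 0.0;
        for (double[] p : pairs) {
            double da = p[0] - meanA;
            double db = p[1] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        if (varA == 0.0 || varB == 0.0) {
            return Double.NaN; // one user gave identical ratings to all shared items
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        System.out.println("user 1 vs user 5: " + pearson(USER_1, USER_5));
        System.out.println("user 1 vs user 2: " + pearson(USER_1, USER_2));
    }
}
```

User 5 rates items 101 to 103 in the same order of preference as user 1 and comes out strongly positive, while user 2 has the opposite taste and comes out negative.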

Recommendation with the Slope One algorithm:

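	import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
	import org.apache.mahout.cf.taste.recommender.Recommender;

	// Note: SlopeOneRecommender is not shipped with newer Mahout releases, so this needs an older version
	public static void slopeOne() throws Exception {
		DataModel model = new FileDataModel(new File("e:\\recommend\\test.txt"));
		Recommender recommender = new SlopeOneRecommender(model);
		// recommend 1 item for user 1
		List<RecommendedItem> recommendations = recommender.recommend(1, 1);
		for (RecommendedItem recommendation : recommendations) {
			System.out.println(recommendation);
		}
	}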
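For readers who have not met Slope One before, here is a small self-contained sketch of the weighted Slope One scheme applied to the test.txt data: for every item the user has rated, take the average rating difference between the target item and that item across all users who rated both, and combine the resulting estimates weighted by how many users contributed. It is a plain-Java illustration that may differ in detail from what `SlopeOneRecommender` does internally; the class name `SlopeOneSketch` is made up.

```java
import java.util.HashMap;
import java.util.Map;

final class SlopeOneSketch {

    /** The rows of test.txt: userID,itemID,preference. */
    static final String[] TEST_TXT = {
        "1,101,5", "1,102,3", "1,103,2.5",
        "2,101,2", "2,102,2.5", "2,103,5", "2,104,2",
        "3,101,2.5", "3,104,4", "3,105,4.5", "3,107,5",
        "4,101,5", "4,103,3", "4,104,4.5", "4,106,4",
        "5,101,4", "5,102,3", "5,103,2", "5,104,4", "5,105,3.5", "5,106,4"
    };

    /** Parses the rows into userID -> (itemID -> preference). */
    static Map<Long, Map<Long, Double>> load(String... lines) {
        Map<Long, Map<Long, Double>> ratings = new HashMap<>();
        for (String line : lines) {
            String[] f = line.trim().split(",");
            ratings.computeIfAbsent(Long.parseLong(f[0]), k -> new HashMap<>())
                   .put(Long.parseLong(f[1]), Double.parseDouble(f[2]));
        }
        return ratings;
    }

    /** Weighted Slope One estimate of userId's preference for targetItem; NaN if impossible. */
    static double predict(Map<Long, Map<Long, Double>> ratings, long userId, long targetItem) {
        Map<Long, Double> own = ratings.get(userId);
        if (own == null) {
            return Double.NaN;
        }
        double numerator = 0.0;
        long denominator = 0;
        for (Map.Entry<Long, Double> mine : own.entrySet()) {
            long j = mine.getKey();
            if (j == targetItem) {
                continue;
            }
            // Sum of (rating of target - rating of j) over users who rated both items.
            double diffSum = 0.0;
            int count = 0;
            for (Map<Long, Double> other : ratings.values()) {
                Double rt = other.get(targetItem);
                Double rj = other.get(j);
                if (rt != null && rj != null) {
                    diffSum += rt - rj;
                    count++;
                }
            }
            if (count > 0) {
                // Same as count * (own rating of j + average difference).
                numerator += count * mine.getValue() + diffSum;
                denominator += count;
            }
        }
        return denominator == 0 ? Double.NaN : numerator / denominator;
    }

    public static void main(String[] args) {
        // User 1 has not rated item 104; estimate the preference.
        System.out.println(predict(load(TEST_TXT), 1L, 104L));
    }
}
```

Worked by hand for user 1 and item 104: the pairs (104,101), (104,102) and (104,103) are supported by 4, 2 and 3 users with average differences 0.25, 0.25 and about 0.167, giving (4 x 5.25 + 2 x 3.25 + 3 x 2.667) / 9, roughly 3.94.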

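NearestNUserNeighborhood: for each user, takes a fixed number N of nearest neighbours.
ThresholdUserNeighborhood: for each user, takes as neighbours all users whose similarity falls within a given threshold.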
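The difference between the two strategies is easy to see on a toy example. The sketch below is plain Java, not the Mahout classes: it takes the similarities of the other users to user 1 (approximate Pearson values for test.txt, with NaN where the correlation is undefined) and selects neighbours both ways. The class name `NeighborhoodSketch`, the `>=` comparison, and the thresholds are my own choices for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class NeighborhoodSketch {

    /** Fixed-size neighbourhood: the n most similar users, best first. */
    static List<Long> nearestN(Map<Long, Double> similarities, int n) {
        return similarities.entrySet().stream()
                .filter(e -> !e.getValue().isNaN())
                .sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** Threshold neighbourhood: every user whose similarity is at least the threshold. */
    static List<Long> withinThreshold(Map<Long, Double> similarities, double threshold) {
        return similarities.entrySet().stream()
                .filter(e -> !e.getValue().isNaN() && e.getValue() >= threshold)
                .sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** Approximate Pearson similarity of users 2..5 to user 1 in test.txt. */
    static Map<Long, Double> sampleSimilarities() {
        return Map.of(2L, -0.764, 3L, Double.NaN, 4L, 1.0, 5L, 0.945);
    }

    public static void main(String[] args) {
        Map<Long, Double> sims = sampleSimilarities();
        System.out.println("nearest 2:         " + nearestN(sims, 2));
        System.out.println("threshold 0.95:    " + withinThreshold(sims, 0.95));
        System.out.println("threshold -1.0:    " + withinThreshold(sims, -1.0));
    }
}
```

With a fixed N the neighbourhood always has the same size, even if some neighbours are poor matches; with a threshold its size varies from user to user and can be empty.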

test.txt data:

1,101,5  
1,102,3  
1,103,2.5  
2,101,2  
2,102,2.5  
2,103,5  
2,104,2  
3,101,2.5  
3,104,4  
3,105,4.5  
3,107,5  
4,101,5  
4,103,3  
4,104,4.5  
4,106,4  
5,101,4  
5,102,3  
5,103,2  
5,104,4  
5,105,3.5  
5,106,4

usernews.txt data (lines beginning with # are skipped by FileDataModel as comments):

1,2645769,1813
2,2645769,1815
3,2645769,1816
4,2645769,1817
5,2645769,1818
6,2645769,1819
7,2645769,1810
#8,2645769,1823
9,2645769,1843
10,2645769,1853
 
1,2645767,1903
2,2645767,1904
3,2645767,1913
4,2645767,1923
5,2645767,1973
#6,2645767,1933
7,2645767,1905
8,2645767,1908
9,2645767,1963
10,2645767,1933
 
1,2679041,1890
2,2679041,1290
3,2679041,1690
4,2679041,1891
#5,2679041,1892
6,2679041,1893
7,2679041,1894
8,2679041,1895
9,2679041,1896
10,2679041,1897
 
1,2389682,1988
2,2389682,1989
3,2389682,1987
4,2389682,1986
#5,2389682,1985
6,2389682,1984
7,2389682,1918
8,2389682,1928
9,2389682,1938
10,2389682,1948
 
1,2656591,2011
2,2656591,2012
3,2656591,2013
#4,2656591,2014
5,2656591,2015
6,2656591,2016
7,2656591,2017
8,2656591,2018
9,2656591,2019
10,2656591,2010
 
1,2686496,3123
2,2686496,3122
3,2686496,3121
4,2686496,3120
5,2686496,3124
#6,2686496,3125
7,2686496,3125
8,2686496,3723
9,2686496,3128
10,2686496,3129
 
1,1129187,4571
2,1129187,4572
3,1129187,4573
#4,1129187,4531
5,1129187,4541
6,1129187,4551
7,1129187,4171
8,1129187,4271
9,1129187,4371
10,1129187,4471
 
1,2599815,6732
2,2599815,6731
3,2599815,6730
4,2599815,6712
5,2599815,6722
#6,2599815,6742
7,2599815,6132
#8,2599815,6232
9,2599815,6332
10,2599815,6432
 
1,2540110,1341
2,2540110,1342
3,2540110,1325
4,2540110,1315
5,2540110,1365
6,2540110,1347
7,2540110,1348
8,2540110,1349
9,2540110,1145
#10,2540110,1245
 
1,2700438,4531
2,2700438,4532
3,2700438,4533
4,2700438,4512
5,2700438,4522
6,2700438,4542
7,2700438,4552
8,2700438,4562
9,2700438,4572
#10,2700438,4132
 
1,2767605,6531
2,2767605,6532
3,2767605,6533
#4,2767605,6534
5,2767605,6512
6,2767605,6522
7,2767605,6132
8,2767605,6132
9,2767605,6332
#10,2767605,6432
 
1,2529194,8761
2,2529194,8762
3,2529194,8763
4,2529194,8715
5,2529194,8725
#6,2529194,8735
7,2529194,8745
8,2529194,8755
#9,2529194,8165
10,2529194,8265
 
11,2767605,6532
12,2529194,8765

Permalink: http://www.chepoo.com/mahout-learning-summary.html | IT技术精华网
