Apriori algorithm: C implementation
A. What algorithms are there in machine learning?
1. Linear regression
In statistics and machine learning, linear regression is probably one of the best-known and easiest-to-understand algorithms.
2. Logistic regression
Logistic regression is another technique machine learning borrowed from statistics. It is the method of choice for binary classification problems.
3. Linear discriminant analysis
Logistic regression is a traditional classification algorithm whose use is limited to two-class problems. If you have more than two classes, linear discriminant analysis (LDA) is the preferred linear classification technique.
4. Classification and regression trees
Decision trees are an important class of predictive modelling algorithms in machine learning.
5. Naive Bayes
Naive Bayes is a simple yet powerful predictive modelling algorithm.
6. K-nearest neighbours
The k-nearest neighbours (KNN) algorithm is very simple and effective. The model representation of KNN is the entire training data set.
7. Learning vector quantization
A drawback of KNN is that you have to keep and work with the entire training data set.
8. Support vector machines
Support vector machines (SVM) are probably among the most popular and most discussed machine learning algorithms today.
9. Bagging and random forests
Random forests are among the most popular and most powerful machine learning algorithms; they are an ensemble method.
B. A MATLAB Apriori association-rule mining program, complete with detailed comments
The following is the part of an Apriori program that goes from the frequent 2-itemsets to the frequent k-itemsets. I have two questions about it:
1. The k of the while loop seems to stay fixed forever, i.e., it is always the number of frequent 2-itemsets. Once the frequent 3-itemsets have been found, shouldn't k change? Where is that reflected?
2. The program has two big nested for loops, but as soon as one frequent 3-itemset is found the inner for loop effectively stops, even though there should be other frequent 3-itemsets. Shouldn't the for loop run unconditionally up to k? In my run k was 15, yet when the program stopped i=2 and j=3, and j never went on to 4 and the rest of the range up to k. What is the reason? Could an expert please point it out? It's urgent...
while (k > 0)
    le = length(candidate{1});
    num = 2;
    nl = 0;
    for i = 1:k-1
        for j = i+1:k
            x1 = candidate{i}; % candidate initially holds the frequent 2-itemsets; x1 is its i-th itemset
            x2 = candidate{j};
            c = intersect(x1, x2);
            M = 0;
            r = 1;
            nn = 0;
            l1 = 0;
            if (length(c) == le-1) & (sum(c == x1(1:le-1)) == le-1)
                houxuan = union(x1(1:le), x2(le));
                % tree pruning: if any (K-1)-subset of a candidate is infrequent, prune the candidate
                sub_set = subset(houxuan);
                % generate all (K-1)-subsets of this candidate
                NN = length(sub_set);
                % check whether all of these (K-1)-subsets are frequent
                while (r & M < NN)
                    M = M + 1;
                    r = in(sub_set{M}, candidate);
                end
                if M == NN
                    nl = nl + 1;
                    % candidate k-itemset
                    cand{nl} = houxuan;
                    % count how many transactions contain this candidate k-itemset
                    le = length(cand{1}); % note: le is reassigned here, so the join test above changes for the rest of this pass
                    for i = 1:m % note: this inner loop reuses the outer loop index i
                        s = cand{nl};
                        x = X(i,:);
                        if sum(x(s)) == le
                            nn = nn + 1;
                        end
                    end
                end
            end
            % pick the frequent itemsets out of the candidate set
            if nn >= th
                ll = ll + 1;
                candmid{nl} = cand{nl};
                pfxj(nl).element = cand{nl};
                pfxj(nl).time = nn;
                disp('Frequent itemset found:')
                result = (candmid{nl});
                disp(result);
            end
        end
    end
    % note: candidate and k are not updated before the next while iteration
end
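For reference, the level-wise step that the loop above is trying to perform can be sketched in a few lines of Python. This is an illustrative sketch only, not the asker's MATLAB code; the function names and the toy data are invented: join frequent (k-1)-itemsets whose union has size k, prune candidates that have an infrequent (k-1)-subset, then count support.

from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Join frequent (k-1)-itemsets, then prune by the Apriori property."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k:
                # prune: every (k-1)-subset must itself be frequent
                if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
                    candidates.add(union)
    return candidates

def count_support(candidates, transactions):
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:          # candidate is a subset of the transaction
                counts[c] += 1
    return counts

# toy usage
transactions = [frozenset(t) for t in (['1','3','4'], ['2','3','5'], ['1','2','3','5'], ['2','5'])]
f2 = {frozenset(p) for p in (['1','3'], ['2','3'], ['2','5'], ['3','5'])}
c3 = generate_candidates(f2, 3)
print(count_support(c3, transactions))   # {frozenset({'2', '3', '5'}): 2}

Note how the level advances only because the result of one pass is fed back in as prev_frequent for the next; the MATLAB snippet above never re-assigns candidate or k, which is why it cannot move past the 2-itemset level.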
C. How to do association rule analysis
Association rules capture the interdependence and association between one item and others. They are widely used in the recommendation systems of physical stores and online retailers: by mining association rules from the customer purchase database, the goal is to discover common purchasing habits across the customer base, for example the probability that a customer who buys product A also buys product B. Based on the mining results you can adjust shelf layout and design promotional bundles to lift sales; the classic case is the "beer and diapers" story.
The key concepts in association rule analysis are support, confidence and lift. Let us first briefly review these three indicators:
Support
Support is the proportion of the total number of transactions N in which the two items appear together (A∩B), i.e., the probability that A and B are bought in the same transaction.
Formula: Support(A→B) = Freq(A∩B) / N
Example:
Suppose a supermarket recorded 1,000,000 transactions in 2016, of which 200,000 contain both cola and chips and 100,000 contain both cola and bread. The support of the cola-and-chips rule is then 20%, and the support of cola-and-bread is 10%.
Confidence
Confidence is the conditional probability of buying B given that A was bought. Put simply, it is the share of the intersection A∩B within A; the larger this share, the more likely a customer who buys A is to also buy B.
Formula: Confidence(A→B) = Freq(A∩B) / Freq(A) = Support(A∩B) / Support(A)
Example:
Suppose cola was bought in 400,000 transactions in 2016, cola together with chips in 300,000, and cola together with bread in 100,000. Then the confidence of cola→chips is 75% and of cola→bread is 25%, which shows the association between cola and chips is stronger than between cola and bread, so a bundled promotion of cola and chips makes sense.
Lift
Lift measures how much buying A raises the probability of buying B, and is used to judge whether a rule has real value: does applying the rule make B appear in baskets more often than B appears on its own? A lift greater than 1 means the rule is effective; less than 1 means it is not.
Formula: Lift(A→B) = Confidence(A→B) / Support(B) = Support(A∩B) / (Support(A) × Support(B))
Example:
If the support of the cola-and-chips rule is 20%, the support of cola is 40% and the support of chips is 37.5%, the lift is 0.20 / (0.40 × 0.375) ≈ 1.33 > 1, so the rule A→B does lift sales of item B.
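To make the three definitions concrete, here is a small Python sketch (not part of the original article; the basket data is invented) that computes support, confidence and lift for one rule A → B from a list of transactions:

def rule_metrics(transactions, a, b):
    """Support, confidence and lift of the rule a -> b, where a and b are sets of items."""
    n = len(transactions)
    freq_a  = sum(1 for t in transactions if a <= t)
    freq_b  = sum(1 for t in transactions if b <= t)
    freq_ab = sum(1 for t in transactions if (a | b) <= t)
    support    = freq_ab / n
    confidence = freq_ab / freq_a if freq_a else 0.0
    lift       = confidence / (freq_b / n) if freq_b else 0.0
    return support, confidence, lift

baskets = [{"cola", "chips"}, {"cola", "bread"}, {"cola", "chips", "bread"}, {"milk"}]
print(rule_metrics(baskets, {"cola"}, {"chips"}))   # (0.5, 0.666..., 1.333...)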
The theory is simple, but putting it into practice runs into all sorts of difficulties, confirming the saying that "data analysts spend 50%–80% of their time just preparing the data". For example, raw POS transaction detail typically looks like the table shown below:
To compute support, confidence and lift we first need the values Freq(A∩B), Freq(A), Freq(B) and the total number of transactions, which means enumerating the pairwise combinations of items.
So we want to transform the data into the form below; for example, for sales ID 000001 with 4 items there are C(4,2) = 6 two-item combinations:
If a single receipt (sales ID) contains 30 items, the number of combinations reaches C(30,2) = 30 × 29 / 2 = 435.
On top of that, the visualization layer has to show association-rule analysis for every subsidiary, city and store in the group, by month, quarter or year. Doing all of this with traditional tools would be like looking for a needle in a haystack.
Below we look at how to implement the Apriori idea and carry out association rule analysis in BDP:
The first step, forming the two-item combinations, can be done by encoding the items numerically: number the items within each sales order in increasing order (starting from 1), compute the maximum code per order, take the difference to obtain an array, and then unfold that array from rows into columns so that every pair of items in the order becomes one row.
Figure: logic for the pairwise item combinations within one sales order
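Outside of BDP, the same pairwise-combination step can be sketched with itertools in Python (illustrative only; the rows and item names are made up):

from itertools import combinations
from collections import Counter

# one row per (sales_id, item), as in the POS detail table
pos_rows = [("000001", "cola"), ("000001", "chips"), ("000001", "bread"), ("000001", "milk"),
            ("000002", "cola"), ("000002", "chips")]

orders = {}
for sales_id, item in pos_rows:
    orders.setdefault(sales_id, set()).add(item)

# every unordered pair (item A, item B) within each order, then count pairs across orders
pair_counts = Counter(pair for items in orders.values()
                      for pair in combinations(sorted(items), 2))
print(pair_counts[("chips", "cola")])   # 2: this pair occurs in both orders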
Step ①:
[Worksheet] → [Create merged table] → [Create via SQL]
Figure: item encoding
In the figure above the codes are converted into a date form; the main purpose is to prepare for the next array-to-column step so that the explode() function can be used. Note that the [日期] (date) field shown is a self-defined date: it can be changed to any date and has no real date meaning.
Figure: result of the item combinations
The key functions used above are FILL_DATES([日期1],[日期2]) and explode(). The combinations are starting to take shape; only the name of the second item is missing, so the [下一日期] (next date) field is LEFT JOINed to bring in the name of item B.
Step ②:
[Worksheet] → [Create merged table] → [Multi-table join], used to create table joins (LEFT / INNER / FULL JOIN)
Figure: pairwise item combinations implemented
The figure above shows that the logic for combining item A with item B is complete; on top of this table we can already do the cross-sell analysis.
Computing Freq(A), Freq(B) and the total transaction count is not described in detail here; the idea is roughly to count the frequency of item A and of item B over all sales and then, through table joins, consolidate everything into one wide table.
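As a rough illustration of that consolidation step, here is a plain-Python sketch (not the BDP recipe; the toy rows and field names are invented) that builds one wide table holding Freq(A), Freq(B), Freq(A∩B), the total order count and the three indicators:

from itertools import combinations
from collections import Counter, defaultdict

# toy POS detail: one (sales_id, item) row per purchased item
pos_rows = [("1", "cola"), ("1", "chips"), ("1", "bread"),
            ("2", "cola"), ("2", "chips"), ("3", "cola")]

orders = defaultdict(set)
for sales_id, item in pos_rows:
    orders[sales_id].add(item)

n_orders  = len(orders)                                                    # total number of receipts
item_freq = Counter(item for items in orders.values() for item in items)  # Freq(A), Freq(B)
pair_freq = Counter(pair for items in orders.values()
                    for pair in combinations(sorted(items), 2))            # Freq(A ∩ B)

table = []
for (a, b), freq_ab in pair_freq.items():
    support    = freq_ab / n_orders
    confidence = freq_ab / item_freq[a]
    lift       = confidence / (item_freq[b] / n_orders)
    table.append({"item_a": a, "item_b": b, "freq_ab": freq_ab,
                  "support": support, "confidence": confidence, "lift": lift})

for row in table:
    print(row)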
Step ③:
[Worksheet] → [Create merged table] → [Append/union]: union the total order count, the order count of item A, the order count of item B and the joint A∩B order count
Figure: logic of the append/union step
The append/union step lines up rows with the same item fields, which makes it easy to compute the three indicators (support, confidence, lift) and convenient to visualize.
Step ④:
Visualization: [BDP] → [Dashboard]
Figure: overall dashboard view
Note: to show the visualization better, the charts in this part are not built from the test data above or from any particular company's data.
Build three charts for the basket analysis:
Figure: top 20 item pairs by co-purchase count
The chart above shows the item pairs bought together most often in the quarter. Frequently co-purchased items are attractive to customers and highly "sticky"; the top-20 pairs can also be viewed per subsidiary. Based on the results you can design suitable promotions, for example buy two get one free.
Figure: metrics for the item combinations
High confidence means the two items are tightly linked and customers are keen to buy them together. Watch the support as well: high support means large demand, whereas a rule with low support but high confidence has only a limited effect on the market.
Figure: basket analysis details
Looking at item combinations through unit price, support, confidence and lift together helps to spot high-value associated items and raise the average basket value. Lift matters too: if the lift is below 1 the uplift is limited, so it is better to spend effort on combinations whose lift is above 1.
Similarly, can combinations of three items be handled? Clearly yes: once the process above is understood, three-item associations can also be implemented in BDP. Author: Xiong Hui. Reprinted from the BDP business data platform; reprinting requires authorization.
D. How to implement the Apriori algorithm in Python
class Apriori(object):
    def __init__(self, filename, min_support, item_start, item_end):
        self.filename = filename
        self.min_support = min_support  # minimum support
        self.min_confidence = 50
        self.line_num = 0  # number of data rows
        self.item_start = item_start  # which columns the items are taken from
        self.item_end = item_end
        self.location = [[i] for i in range(self.item_end - self.item_start + 1)]
        self.support = self.sut(self.location)
        self.num = list(sorted(set([j for i in self.location for j in i])))  # record the items
        self.pre_support = []  # keep the previous support, location, num
        self.pre_location = []
        self.pre_num = []
        self.item_name = []  # item names
        self.find_item_name()
        self.loop()
        self.confidence_sup()
    def deal_line(self, line):
        "Extract the required fields from a line"
        return [i.strip() for i in line.split(' ') if i][self.item_start - 1:self.item_end]

    def find_item_name(self):
        "Extract item_name from the first line"
        with open(self.filename, 'r') as F:
            for index, line in enumerate(F.readlines()):
                if index == 0:
                    self.item_name = self.deal_line(line)
                    break
    def sut(self, location):
        """
        Input: [[1,2,3],[2,3,4],[1,3,5]...]
        Output: the support count of each position set, e.g. [123,435,234...]
        """
        with open(self.filename, 'r') as F:
            support = [0] * len(location)
            for index, line in enumerate(F.readlines()):
                if index == 0: continue
                # extract the fields of this record
                item_line = self.deal_line(line)
                for index_num, i in enumerate(location):
                    flag = 0
                    for j in i:
                        if item_line[j] != 'T':
                            flag = 1
                            break
                    if not flag:
                        support[index_num] += 1
            self.line_num = index  # total number of rows, excluding the first item_name row
        return support
    def select(self, c):
        "Return the candidate positions"
        stack = []
        for i in self.location:
            for j in self.num:
                if j in i:
                    if len(i) == c:
                        stack.append(i)
                else:
                    stack.append([j] + i)
        # deduplicate the nested list
        import itertools
        s = sorted([sorted(i) for i in stack])
        location = list(s for s, _ in itertools.groupby(s))
        return location
    def del_location(self, support, location):
        "Remove candidates that do not satisfy the conditions"
        # drop candidates below the minimum support
        for index, i in enumerate(support):
            if i < self.line_num * self.min_support / 100:
                support[index] = 0
        # Apriori pruning rule: drop candidates that have an infrequent subset
        for index, j in enumerate(location):
            sub_location = [j[:index_loc] + j[index_loc + 1:] for index_loc in range(len(j))]
            flag = 0
            for k in sub_location:
                if k not in self.location:
                    flag = 1
                    break
            if flag:
                support[index] = 0
        # delete the unused positions
        location = [i for i, j in zip(location, support) if j != 0]
        support = [i for i in support if i != 0]
        return support, location
    def loop(self):
        "Iterate over the level-s frequent itemsets"
        s = 2
        while True:
            print '-' * 80
            print 'The', s - 1, 'loop'
            print 'location', self.location
            print 'support', self.support
            print 'num', self.num
            print '-' * 80
            # generate the next level of candidates
            location = self.select(s)
            support = self.sut(location)
            support, location = self.del_location(support, location)
            num = list(sorted(set([j for i in location for j in i])))
            s += 1
            if location and support and num:
                self.pre_num = self.num
                self.pre_location = self.location
                self.pre_support = self.support
                self.num = num
                self.location = location
                self.support = support
            else:
                break
    def confidence_sup(self):
        "Compute the confidence"
        if sum(self.pre_support) == 0:
            print 'min_support error'  # the very first iteration already failed
        else:
            for index_location, each_location in enumerate(self.location):
                del_num = [each_location[:index] + each_location[index + 1:] for index in range(len(each_location))]  # generate the previous-level itemsets
                del_num = [i for i in del_num if i in self.pre_location]  # drop subsets that are not previous-level frequent itemsets
                del_support = [self.pre_support[self.pre_location.index(i)] for i in del_num if i in self.pre_location]  # look up their supports from the previous level
                # print del_num
                # print self.support[index_location]
                # print del_support
                for index, i in enumerate(del_num):  # compute the support and confidence of each association rule
                    index_support = 0
                    if len(self.support) != 1:
                        index_support = index
                    support = float(self.support[index_location]) / self.line_num * 100  # support
                    s = [j for index_item, j in enumerate(self.item_name) if index_item in i]
                    if del_support[index]:
                        confidence = float(self.support[index_location]) / del_support[index] * 100
                        if confidence > self.min_confidence:
                            print ','.join(s), '->>', self.item_name[each_location[index]], 'min_support:', str(support) + '%', 'min_confidence:', str(confidence) + '%'
def main():
    c = Apriori('basket.txt', 14, 3, 13)
    d = Apriori('simple.txt', 50, 2, 6)

if __name__ == '__main__':
    main()
Apriori(filename, min_support, item_start, item_end)
Parameters
filename: file name (path)
min_support: minimum support
item_start: start position of the items
item_end: end position of the items
import apriori
c = apriori.Apriori('basket.txt', 11, 3, 13)
Output:
E. Code for the basic Apriori algorithm in C
The key to implementing the Apriori algorithm is building the right data model. When I did this as a homework assignment, I designed the following data structures:
#include<stdio.h>
#include<string.h>
#include<malloc.h>
#define ITEM_NAME_LENGTH 20
#define MIN_SUPPORT 2
// itemset structure
struct ITEMSET
{
char itemName[ITEM_NAME_LENGTH];
struct ITEMSET *next;
};
// transaction database structure
struct TRANSACTION
{
unsigned int tranID;
struct ITEMSET *itemPoint;
struct TRANSACTION *next;
};
// large (frequent) itemset structure
struct BIGITEMSET
{
struct ITEMSET *itemPoint;
unsigned int count;
struct BIGITEMSET *next;
};
// the transaction database
char *tran1[3]={"1","3","4"};
char *tran2[3]={"2","3","5"};
char *tran3[4]={"1","2","3","5"};
char *tran4[2]={"2","5"};
// variable declarations
struct TRANSACTION *tranHead;
struct BIGITEMSET *bigHead;
struct BIGITEMSET *test;
struct BIGITEMSET *subSetHeadC1,*subSetHeadC2;
Once you really understand the algorithm, writing the program itself is not hard.
F. Urgently looking for Apriori algorithm code implemented in C
http://www.csc.liv.ac.uk/~frans/Notes/KDD/AssocRuleMine/apriori.html
.h
====================================
/*----------------------------------------------------------------------
File : apriori.h
Contents: apriori algorithm for finding frequent item sets
(specialized version for FIMI 2003 workshop)
Author : Christian Borgelt
History : 15.08.2003 file created from normal apriori.c
16.08.2003 parameter for transaction filtering added
18.08.2003 dynamic filtering decision based on times added
21.08.2003 transaction sort changed to heapsort
20.09.2003 output file made optional
----------------------------------------------------------------------*/
/*
Modified by : Frédéric Flouvat
Modifications : store the positive and negative border into
an input trie for ABS
process statistical information on the dataset to stop
the apriori classical iterations
Author : Frédéric Flouvat
----------------------------------------------------------------------*/
#ifndef APRIRORI_H
#define APRIRORI_H
#include <iostream>
using namespace std;
#define MAXIMAL
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <string.h>
#include <time.h>
#include <assert.h>
#include "tract.h"
#include "istree.h"
#include "Application.h"
/*----------------------------------------------------------------------
Preprocessor Definitions
----------------------------------------------------------------------*/
#define PRGNAME "fim/apriori"
#define DESCRIPTION "frequent item sets miner for FIMI 2003"
#define VERSION "version 1.7 (2003.12.02) " \
"(c) 2003 Christian Borgelt"
/* --- error codes --- */
#define E_OPTION (-5) /* unknown option */
#define E_OPTARG (-6) /* missing option argument */
#define E_ARGCNT (-7) /* too few/many arguments */
#define E_SUPP (-8) /* invalid minimum support */
#define E_NOTAS (-9) /* no items or transactions */
#define E_UNKNOWN (-18) /* unknown error */
#ifndef QUIET /* if not quiet version */
#define MSG(x) x /* print messages */
#else /* if quiet version */
#define MSG(x) /* suppress messages */
#endif
#define SEC_SINCE(t) ((clock()-(t)) /(double)CLOCKS_PER_SEC)
#define RECCNT(s) (tfs_reccnt(is_tfscan(s)) \
+ ((tfs_delim(is_tfscan(s)) == TFS_REC) ? 0 : 1))
#define BUFFER(s) tfs_buf(is_tfscan(s))
/*----------------------------------------------------------------------
Constants
----------------------------------------------------------------------*/
#ifndef QUIET /* if not quiet version */
/* --- error messages --- */
static const char *errmsgs[] = {
/* E_NONE 0 */ "no error\n",
/* E_NOMEM -1 */ "not enough memory\n",
/* E_FOPEN -2 */ "cannot open file %s\n",
/* E_FREAD -3 */ "read error on file %s\n",
/* E_FWRITE -4 */ "write error on file %s\n",
/* E_OPTION -5 */ "unknown option -%c\n",
/* E_OPTARG -6 */ "missing option argument\n",
/* E_ARGCNT -7 */ "wrong number of arguments\n",
/* E_SUPP -8 */ "invalid minimal support %d\n",
/* E_NOTAS -9 */ "no items or transactions to work on\n",
/* -10 to -15 */ NULL, NULL, NULL, NULL, NULL, NULL,
/* E_ITEMEXP -16 */ "file %s, record %d: item expected\n",
/* E_DUPITEM -17 */ "file %s, record %d: duplicate item %s\n",
/* E_UNKNOWN -18 */ "unknown error\n"
};
#endif
/*----------------------------------------------------------------------
Global Variables
----------------------------------------------------------------------*/
#ifndef QUIET
static char *prgname; /* program name for error messages */
#endif
static ITEMSET *itemset = NULL; /* item set */
static TASET *taset = NULL; /* transaction set */
static TATREE *tatree = NULL; /* transaction tree */
static ISTREE *istree = NULL; /* item set tree */
static FILE *in = NULL; /* input file */
static FILE *out = NULL; /* output file */
extern "C" TATREE * apriori( char*fn_in, char*fn_out, int supp, int & level,
Trie * bdPapriori, Trie * bdn, set<Element> * relist, double ratioNfC, double & eps, int ismax,
vector< unsigned int > * stat, int & maxBdP, bool & generatedFk, bool verbose ) ;
#endif
.c
============================================
/*----------------------------------------------------------------------
File : apriori.c
Contents: apriori algorithm for finding frequent item sets
(specialized version for FIMI 2003 workshop)
Author : Christian Borgelt
History : 15.08.2003 file created from normal apriori.c
16.08.2003 parameter for transaction filtering added
18.08.2003 dynamic filtering decision based on times added
21.08.2003 transaction sort changed to heapsort
20.09.2003 output file made optional
----------------------------------------------------------------------*/
/*
Modified by : Frédéric Flouvat
Modifications : store the positive and negative border into
an input trie for ABS
process statistical information on the dataset to stop
the apriori classical iterations
Author : Frédéric Flouvat
----------------------------------------------------------------------*/
#include "apriori.h"
/*----------------------------------------------------------------------
Main Functions
----------------------------------------------------------------------*/
static void error (int code, ...)
{ /* --- print an error message */
#ifndef QUIET /* if not quiet version */
va_list args; /* list of variable arguments */
const char *msg; /* error message */
assert(prgname); /* check the program name */
if (code < E_UNKNOWN) code = E_UNKNOWN;
if (code < 0) { /* if to report an error, */
msg = errmsgs[-code]; /* get the error message */
if (!msg) msg = errmsgs[-E_UNKNOWN];
fprintf(stderr, "\n%s: ", prgname);
va_start(args, code); /* get variable arguments */
vfprintf(stderr, msg, args);/* print error message */
va_end(args); /* end argument evaluation */
}
#endif
#ifndef NDEBUG /* if debug version */
if (istree) ist_delete(istree);
if (tatree) tat_delete(tatree);
if (taset) tas_delete(taset, 0);
if (itemset) is_delete(itemset);
if (in) fclose(in); /* clean up memory */
if (out) fclose(out); /* and close files */
#endif
exit(code); /* abort the program */
} /* error() */
/*--------------------------------------------------------------------*/
TATREE * apriori( char*fn_in, char*fn_out, int supp, int & level, Trie * bdPapriori,
Trie * bdn , set<Element> * relist , double ratioNfC, double & eps,int ismax,
vector< unsigned int > * stat, int & maxBdP, bool & generatedFk, bool verbose )
{
int i, k, n; /* loop variables, counters */
int tacnt = 0; /* number of transactions */
int max = 0; /* maximum transaction size */
int empty = 1; /* number of empty item sets */
int *map, *set; /* identifier map, item set */
char *usage; /* flag vector for item usage */
clock_t t, tt, tc, x; /* timer for measurements */
double actNfC = 1 ;
double avgNfC = 0 ;
int nbgen = 0 ;
int nbfreq = 0 ;
level = 1 ;
bool endApriori = false ; // boolean to stop the initial classial apriori approach
int bdnsize = 0 ; // number of itemsets found infrequent
/* --- create item set and transaction set --- */
itemset = is_create(); /* create an item set and */
if (!itemset) error(E_NOMEM); /* set the special characters */
taset = tas_create(itemset); /* create a transaction set */
if (!taset) error(E_NOMEM); /* to store the transactions */
if( verbose ) MSG(fprintf(stderr, "\n")); /* terminate the startup message */
/* --- read transactions --- */
if( verbose )MSG(fprintf(stderr, "reading %s ... ", fn_in));
t = clock(); /* start the timer and */
in = fopen(fn_in, "r"); /* open the input file */
if (!in) error(E_FOPEN, fn_in);
for (tacnt = 0; 1; tacnt++) { /* transaction read loop */
k = is_read(itemset, in); /* read the next transaction */
if (k < 0) error(k, fn_in, RECCNT(itemset), BUFFER(itemset));
if (k > 0) break; /* check for error and end of file */
k = is_tsize(itemset); /* update the maximal */
if (k > max) max = k; /* transaction size */
if (taset && (tas_add(taset, NULL, 0) != 0))
error(E_NOMEM); /* add the loaded transaction */
} /* to the transaction set */
fclose(in); in = NULL; /* close the input file */
n = is_cnt(itemset); /* get the number of items */
if( verbose ) MSG(fprintf(stderr, "[%d item(s),", n));
if( verbose ) MSG(fprintf(stderr, " %d transaction(s)] done ", tacnt));
if( verbose ) MSG(fprintf(stderr, "[%.2fs].\n", SEC_SINCE(t)));
/* --- sort and recode items --- */
if( verbose ) MSG(fprintf(stderr, "sorting and recoding items ... "));
t = clock(); /* start the timer */
map = (int*)malloc(is_cnt(itemset) *sizeof(int));
if (!map) error(E_NOMEM); /* create an item identifier map */
n = is_recode(itemset, supp, 2, map); /* 2: sorting mode */
tas_recode(taset, map, n); /* recode the loaded transactions */
max = tas_max(taset); /* get the new maximal t.a. size */
// use in the other part of the implementation to have the corresponding
// identifiant to an internal id
stat->reserve( n+2 ) ;
stat->push_back( 0 ) ;
for(int j= 0; j< n ; j++ )
{
stat->push_back( 0 ) ;
relist->insert( Element( atoi( is_name( itemset, j ) ) ,j) );
}
if( verbose ) MSG(fprintf(stderr, "[%d item(s)] ", n));
if( verbose ) MSG(fprintf(stderr, "done [%.2fs].\n", SEC_SINCE(t)));
/* --- create a transaction tree --- */
if( verbose ) MSG(fprintf(stderr, "creating transaction tree ... "));
t = clock(); /* start the timer */
tatree = tat_create(taset,1); /* create a transaction tree */
if (!tatree) error(E_NOMEM); /* (compactify transactions) */
tt = clock() -t; /* note the construction time */
if( verbose ) MSG(fprintf(stderr, "done [%.2fs].\n", SEC_SINCE(t)));
/* --- create an item set tree --- */
if( verbose ) MSG(fprintf(stderr, "checking subsets of size 1"));
t = clock(); tc = 0; /* start the timer and */
istree = ist_create(n, supp); /* create an item set tree */
if (!istree) error(E_NOMEM);
for (k = n; --k >= 0; ) /* set single item frequencies */
ist_setcnt(istree, k, is_getfrq(itemset, k));
ist_settac(istree, tacnt); /* set the number of transactions */
usage = (char*)malloc(n *sizeof(char));
if (!usage) error(E_NOMEM); /* create a item usage vector */
/* --- check item subsets --- */
while (ist_height(istree) < max && ( ( ismax == -1 && endApriori == false )
|| ist_height(istree) < ismax )
)
{
nbgen = 0 ;
nbfreq = 0 ;
level ++ ;
i = ist_check(istree,usage);/* check current item usage */
if (i < max) max = i; /* update the maximum set size */
if (ist_height(istree) >= i) break;
k = ist_addlvl(istree, nbgen); /* while max. height is not reached, */
if (k < 0) error(E_NOMEM); /* add a level to the item set tree */
if (k != 0) break; /* if no level was added, abort */
if( verbose ) MSG(fprintf(stderr, " %d", ist_height(istree)));
if ((i < n) /* check item usage on current level */
&& (i *(double)tt < 0.1 *n *tc)) {
n = i; x = clock(); /* if items were removed and */
tas_filter(taset, usage); /* the counting time is long enough, */
tat_delete(tatree); /* remove unnecessary items */
tatree = tat_create(taset, 1);
if (!tatree) error(E_NOMEM);
tt = clock() -x; /* rebuild the transaction tree and */
} /* note the new construction time */
x = clock(); /* start the timer */
ist_countx(istree, tatree, nbfreq, istree->supp ); /* count the transaction tree */
tc = clock() -x; /* in the item set tree */
actNfC = 1-double(nbfreq)/double(nbgen) ;
avgNfC = avgNfC + actNfC ;
if( verbose )
{
cout<<" \t Fk : "<<nbfreq<<" Ck : "<<nbgen<<" NFk/Ck "<<actNfC<<" avg NFk/Ck "<<avgNfC/(level-1)<<endl;
}
bdnsize += nbgen - nbfreq ;
if( level >=4 && ( bdnsize / nbgen < 1.5 ) && ( bdnsize > 100 ) )
{
if( actNfC < ratioNfC )
{
eps = 0 ;
endApriori = true ;
}
else if( actNfC > 0.25 )
endApriori = true ;
}
} /* and note the new counting time */
if( verbose ) MSG(fprintf(stderr, " done [%.2fs].\n", SEC_SINCE(t)));
/* --- filter item sets --- */
t = clock(); /* start the timer */
#ifdef MAXIMAL /* filter maximal item sets */
if( verbose ) MSG(fprintf(stderr, "filtering maximal item sets ... "));
if( ratioNfC == 0 || nbgen < k+1 || ist_height(istree)>= max )
ist_filter2(istree, IST_MAXFRQ, 0);
else
ist_filter2(istree, IST_MAXFRQ, bdn);
if( verbose ) MSG(fprintf(stderr, " done [%.2fs].\n", SEC_SINCE(t)));
empty = (n <= 0) ? 1 : 0; /* check whether the empty item set */
#endif /* is maximal */
#ifdef CLOSED /* filter closed item sets */
if( verbose ) MSG(fprintf(stderr, "filtering closed item sets ... "));
ist_filter(istree, IST_CLOSED);
if( verbose ) MSG(fprintf(stderr, " done [%.2fs].\n", SEC_SINCE(t)));
for (k = n; --k >= 0; ) /* check for an item in all t.a. */
if (is_getfrq(itemset, k) == tacnt) break;
empty = (k <= 0) ? 1 : 0; /* check whether the empty item set */
#endif /* is closed */
/* --- print item sets --- */
for (i = ist_height(istree); --i >= 0; )
map[i] = 0; /* clear the item set counters */
if( verbose ) MSG(fprintf(stderr, "writing %s ... ", (fn_out) ? fn_out : "<none>"));
t = clock(); /* start the timer and */
if (fn_out) { /* if an output file is given, */
out = fopen(fn_out, "w"); /* open the output file */
if (!out) error(E_FOPEN, fn_out);
if (empty) fprintf(out, " (%d)\n", tacnt);
} /* report empty item set */
ist_init(istree); /* init. the item set extraction */
set = is_tract(itemset); /* get the transaction buffer */
for (n = empty; 1; n++) { /* extract item sets from the tree */
k = ist_set(istree, set, &supp);
if (k <= 0) break; /* get the next frequent item set */
map[k-1]++; /* count the item set */
if (fn_out) { /* if an output file is given */
for (i = 0; i < k; i++) { /* traverse the items */
fputs(is_name(itemset, set[i]), out);
fputc(' ', out); /* print the name of the next item */
} /* followed by a separator */
fprintf(out, "(%d)\n", supp);
} /* print the item set's support */
else
{
short unsigned * is = new short unsigned[k] ;
for (i = 0; i < k; i++) /* traverse the items */
{
is[i] = set[i] ;
}
if( k < level || nbgen < k+1 || ist_height(istree)>= max )
{
bdPapriori->insert(is, k ,supp ) ;
(*stat)[ 0 ] ++;
(*stat)[ k+1 ]++;
if( maxBdP < k )
maxBdP = k ;
}
else
{
generatedFk = true ;
}
delete[] is;
}
}
if (fn_out) { /* if an output file is given */
if (fflush(out) != 0) error(E_FWRITE, fn_out);
if (out != stdout) fclose(out);
out = NULL; /* close the output file */
}
if( verbose ) MSG(fprintf(stderr, "[%d set(s)] done ", n));
if( verbose ) MSG(fprintf(stderr, "[%.2fs].\n", SEC_SINCE(t)));
/* --- print item set statistics --- */
k = ist_height(istree); /* find last nonzero counter */
if ((k > 0) && (map[k-1] <= 0)) k--;
if( verbose ){
printf("%d\n", empty); /* print the numbers of item sets */
for (i = 0; i < k; i++) printf("%d\n", map[i]);
}
/* --- clean up --- */
#ifndef NDEBUG /* if this is a debug version */
free(usage); /* delete the item usage vector */
free(map); /* and the identifier map */
ist_delete(istree); /* delete the item set tree, */
if (taset) tas_delete(taset, 0); /* the transaction set, */
is_delete(itemset); /* and the item set */
#endif
return tatree ;
}
G. Let c be a candidate itemset in Ck produced by the Apriori algorithm. In the prune step, how many subsets of length k-1 have to be checked?
2. The idea behind the FAA algorithm
2.1 Definition and construction of the linked-list array. Definition: the array is a one-dimensional array of n pointers, P[n], corresponding to the frequent items I1, I2, ..., In in the database; the array length n is the number of frequent items. The nodes are transaction nodes with three fields: a transaction field, a count field and a pointer field. The transaction field holds the encoding of a transaction whose suffix is the given frequent item, the count field holds how many times that encoding occurs, and the pointer field points to the next node.
Encoding method: suppose the database has n frequent items I1, I2, ..., In. The encoding of a transaction t is a 0/1 bit string of length n: the positions of the items that occur in t are set to 1, all other positions to 0. For example, with four frequent items a, b, c, d, a transaction containing a and c is mapped to 1010.
The linked-list array is constructed as follows: (1) scan the transaction database to obtain all frequent 1-itemsets and their support counts, sort them by support count in descending order, and generate the FI-List; (2) scan the database again, delete from each record the items that do not meet the minimum support count, and re-sort the remaining items according to the FI-List. Let the resulting sequence be {m1, m2, ..., mn}; for each k (1 ≤ k ≤ n) take the prefix {m1, m2, ..., mk}, encode it, create a transaction node for it, and append the node to the list in P[n] that corresponds to the last item of that prefix.
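As a small illustration of the encoding step only (a sketch written from the description above, not the FAA authors' code), in Python:

def encode(transaction, ordered_items):
    """Encode a transaction as a 0/1 bit string over the ordered frequent items."""
    return ''.join('1' if item in transaction else '0' for item in ordered_items)

ordered_items = ['a', 'b', 'c', 'd']          # FI-List order
print(encode({'a', 'c'}, ordered_items))      # '1010', as in the example above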
H. Which Python package implements Apriori?
You do not need a special package to use Apriori; you only need a .py file that implements the Apriori functionality. Put that file in the same directory as the file that calls it, and then use from apriori import * to use it.
Download address for apriori.py: https://pan.baidu.com/s/1XpKHfUXGDIXj7CcYJL0anA
I. How to implement the Apriori algorithm in Java
from operator import and_
from itertools import combinations

class AprioriAlgorithm(object):
    def __init__(self, inputfile):
        self.transactions = []
        self.itemSet = set([])
        inf = open(inputfile, 'rb')
        for line in inf.readlines():
            elements = set(filter(lambda entry: len(entry) > 0, line.strip().split(',')))
            if len(elements) > 0:
                self.transactions.append(elements)
                for element in elements:
                    self.itemSet.add(element)
        inf.close()
        self.toRetItems = {}
        self.associationRules = []
    def getSupport(self, itemcomb):
        if type(itemcomb) != frozenset:
            itemcomb = frozenset([itemcomb])
        within_transaction = lambda transaction: reduce(and_, [(item in transaction) for item in itemcomb])
        count = len(filter(within_transaction, self.transactions))
        return float(count) / float(len(self.transactions))

    def runApriori(self, minSupport=0.15, minConfidence=0.6):
        itemCombSupports = filter(lambda freqpair: freqpair[1] >= minSupport,
                                  map(lambda item: (frozenset([item]), self.getSupport(item)), self.itemSet))
        currentLset = set(map(lambda freqpair: freqpair[0], itemCombSupports))
        k = 2
        while len(currentLset) > 0:
            currentCset = set([i.union(j) for i in currentLset for j in currentLset if len(i.union(j)) == k])
            currentItemCombSupports = filter(lambda freqpair: freqpair[1] >= minSupport,
                                             map(lambda item: (item, self.getSupport(item)), currentCset))
            currentLset = set(map(lambda freqpair: freqpair[0], currentItemCombSupports))
            itemCombSupports.extend(currentItemCombSupports)
            k += 1
        for key, supportVal in itemCombSupports:
            self.toRetItems[key] = supportVal
        self.calculateAssociationRules(minConfidence=minConfidence)
    def calculateAssociationRules(self, minConfidence=0.6):
        for key in self.toRetItems:
            subsets = [frozenset(item) for k in range(1, len(key)) for item in combinations(key, k)]
            for subset in subsets:
                confidence = self.toRetItems[key] / self.toRetItems[subset]
                if confidence > minConfidence:
                    self.associationRules.append([subset, key - subset, confidence])
The Scala version is also only about sixty-odd lines:
import scala.io.Source
import scala.collection.immutable.List
import scala.collection.immutable.Set
import java.io.File
import scala.collection.mutable.Map

class AprioriAlgorithm(inputFile: File) {
  var transactions: List[Set[String]] = List()
  var itemSet: Set[String] = Set()
  for (line <- Source.fromFile(inputFile).getLines()) {
    val elementSet = line.trim.split(',').toSet
    if (elementSet.size > 0) {
      transactions = transactions :+ elementSet
      itemSet = itemSet ++ elementSet
    }
  }
  var toRetItems: Map[Set[String], Double] = Map()
  var associationRules: List[(Set[String], Set[String], Double)] = List()
  def getSupport(itemComb: Set[String]): Double = {
    def withinTransaction(transaction: Set[String]): Boolean = itemComb
      .map(x => transaction.contains(x))
      .reduceRight((x1, x2) => x1 && x2)
    val count = transactions.filter(withinTransaction).size
    count.toDouble / transactions.size.toDouble
  }
  def runApriori(minSupport: Double = 0.15, minConfidence: Double = 0.6) = {
    var itemCombs = itemSet.map(word => (Set(word), getSupport(Set(word))))
      .filter(wordSupportPair => (wordSupportPair._2 > minSupport))
    var currentLSet: Set[Set[String]] = itemCombs.map(wordSupportPair => wordSupportPair._1).toSet
    var k: Int = 2
    while (currentLSet.size > 0) {
      val currentCSet: Set[Set[String]] = currentLSet.map(wordSet => currentLSet.map(wordSet1 => wordSet | wordSet1))
        .reduceRight((set1, set2) => set1 | set2)
        .filter(wordSet => (wordSet.size == k))
      val currentItemCombs = currentCSet.map(wordSet => (wordSet, getSupport(wordSet)))
        .filter(wordSupportPair => (wordSupportPair._2 > minSupport))
      currentLSet = currentItemCombs.map(wordSupportPair => wordSupportPair._1).toSet
      itemCombs = itemCombs | currentItemCombs
      k += 1
    }
    for (itemComb <- itemCombs) {
      toRetItems += (itemComb._1 -> itemComb._2)
    }
    calculateAssociationRule(minConfidence)
  }
  def calculateAssociationRule(minConfidence: Double = 0.6) = {
    toRetItems.keys.foreach(item =>
      item.subsets.filter(wordSet => (wordSet.size < item.size & wordSet.size > 0))
        .foreach(subset => {
          associationRules = associationRules :+ (subset, item diff subset,
            toRetItems(item).toDouble / toRetItems(subset).toDouble)
        }
      )
    )
    associationRules = associationRules.filter(rule => rule._3 > minConfidence)
  }
}
I would not recommend Java for this; use a language like Python or Scala instead. In Python the code is only about 50 lines, and you can imagine how much more complicated the Java version would look. See the code above.
J. Which UCI data sets can be used to experiment with the Apriori algorithm?
You can choose any of the data sets whose Default Task is Recommender-Systems.
For example: the Anonymous Microsoft Web Data Data Set.
This data set was sampled from the access logs of the official Microsoft website during one week of February 1998.
It contains the records of 38,000 anonymous users visiting pages of the www.microsoft.com site over that week.
People have used this data set for collaborative-filtering recommender systems, and the Apriori algorithm can of course be applied to it as well.
I have also uploaded the data file for you.
Source:
Creators:
Jack S. Breese, David Heckerman, Carl M. Kadie
Microsoft Research, Redmond WA, 98052-6399, USA
breese'@'microsoft.com,heckerma'@'microsoft.com,carlk'@'microsoft.com
Donors:
Breese, Heckerman, & Kadie
Data Set Information:
We created the data by sampling and processing the www.microsoft.com logs. The data records the use of www.microsoft.com by 38000 anonymous, randomly-selected users. For each user, the data lists all the areas of the web site (Vroots) that user visited in a one week timeframe.
Users are identified only by a sequential number, for example, User #14988, User #14989, etc. The file contains no personally identifiable information. The 294 Vroots are identified by their title (e.g. "NetShow for PowerPoint") and URL (e.g. "/stream"). The data comes from one week in February, 1998.
Attribute Information:
Each attribute is an area ("vroot") of the www.microsoft.com web site.
The datasets record which Vroots each user visited in a one-week timeframe in February 1998.
Relevant Papers:
J. Breese, D. Heckerman., C. Kadie _Empirical Analysis of Predictive Algorithms for Collaborative Filtering_ Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, July, 1998.
[Web Link]
Also, expanded as Microsoft Research Technical Report MSR-TR-98-12. The papers are available on-line at: [Web Link]
References:
http://archive.ics.uci.edu/ml/datasets/Anonymous+Microsoft+Web+Data
http://archive.ics.uci.edu/ml/machine-learning-databases/anonymous/
Download:
http://archive.ics.uci.edu/ml/machine-learning-databases/anonymous/anonymous-msweb.data
Detailed description:
http://archive.ics.uci.edu/ml/machine-learning-databases/anonymous/anonymous-msweb.info
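If you want to feed this data set to an Apriori implementation, a minimal parsing sketch in Python could look like the following. It assumes the A/C/V record layout described in anonymous-msweb.info (A = attribute/Vroot definition, C = case/user, V = a visit by the current case), so check it against the .info file before relying on it:

import csv

def load_msweb(path):
    """Group the V (visit) records under each C (case) record into one transaction per user."""
    vroot_titles = {}       # attribute id -> Vroot title
    transactions = []       # one set of visited attribute ids per user
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if not row:
                continue
            if row[0] == 'A':           # attribute (Vroot) definition
                vroot_titles[int(row[1])] = row[3]
            elif row[0] == 'C':         # a new user/case starts here
                transactions.append(set())
            elif row[0] == 'V':         # a page visit belonging to the current case
                transactions[-1].add(int(row[1]))
    return vroot_titles, transactions

# vroots, transactions = load_msweb('anonymous-msweb.data')
# print(len(transactions))   # number of users in the file

Each transaction is then a set of visited page areas, which can be fed to an Apriori implementation as its list of transactions.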