# IMP

**Repository Path**: pilaoda/IMP

## Basic Information

- **Project Name**: IMP
- **Description**: IMP (Influence Maximization Problem): implementation of the original CELF-AB algorithm
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2017-12-03
- **Last Updated**: 2020-12-19

## Categories & Tags

- **Categories**: Uncategorized
- **Tags**: None

## README

# IMP (Influence Maximization Problem)

----------

## 1. Preliminaries

- ISE: Monte Carlo method
- IMP: Approximate Bayes Inference with CELF (CELF-AB)

----------

## 2. Methodology

- Important variables:
  - $n$: number of vertices in $G$
  - $m$: number of edges in $G$
  - $k$: number of seeds to be selected
  - $S$: the set of currently selected seeds
  - $u$: a specific vertex
  - $t$: a specific vertex
  - $L(S)$: influence of the seed set $S$
  - $sim[u][t]$: two-dimensional matrix of the probability that $u$ activates $t$

- Approximate Bayes Inference:

First, the inference needs the influence probability of every single vertex taken as a seed. We call this matrix $sim$ (single influence map), where $sim[u][t] = P(t|u)$ is the probability that vertex $t$ is activated when $u$ is the only seed. Let $L(S)$ denote the influence of the seed set $S$. Obviously, $L(S = \{u\}) = \sum_{t=1}^{n} sim[u][t]$. The matrix $sim$ is the key to the approximate Bayes inference.

In the IC model, the influence on a vertex $t$ from the existing seeds $S$ combined with a possible new seed $u$ can be depicted simply as in the picture below:

![p1](./report/p1.jpg)

$W_{St}$ denotes the probability that, once $S$ is activated, $t$ is activated through a path **p** that does not contain $u$. It is necessary to point out that $W_{St}$ depends on $u$: if $u$ changes, $W_{St}$ changes as well. Thus we obtain the following system of equations:

$$
\begin{cases}
P(u|S) = W_{Su} + W_{St}W_{tu}(1 - W_{Su}) \\
P(t|S) = W_{St} + W_{Su}W_{ut}(1 - W_{St}) \\
P(u|t) = W_{tu} + W_{tS}W_{Su}(1 - W_{tu}) \\
P(t|u) = W_{ut} + W_{uS}W_{St}(1 - W_{ut}) \\
P(S|u) = W_{uS} + W_{ut}W_{tS}(1 - W_{uS}) \\
P(S|t) = W_{tS} + W_{tu}W_{uS}(1 - W_{tS}) \\
\end{cases}
$$

There are six unknown variables in this system, and once it is solved, it is easy to obtain:

$$P(t|S, u) = W_{ut} + W_{St} - W_{ut}W_{St}$$

Then we get the influence formula $L(S \cup \{u\}) = \sum_{t=1}^{n} P(t|S, u)$, which is central to the **CELF** algorithm. When $|S| = 1$, we can read $P(S|u)$ and $P(S|t)$ directly from the matrix $sim$. Unfortunately, when $|S| > 1$ we cannot get $P(S|u)$ and $P(S|t)$ from $sim$, and obtaining them with the Monte Carlo method would be very costly (a minimal sketch of such a Monte Carlo estimator is given below).
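For concreteness, the following is a minimal sketch of such a Monte Carlo estimator (ISE) under the IC model. The adjacency representation (`graph` as a dict of `(neighbor, probability)` lists), the function names `monte_carlo_activation` and `build_sim`, and the `rounds` parameter (playing the role of the round count $R$ mentioned in Section 3) are assumptions of this sketch, not the repository's actual code.

```python
import random
from collections import defaultdict

def monte_carlo_activation(graph, seeds, rounds=100):
    """Estimate P(t | seeds) for every vertex t under the IC model by
    running `rounds` independent diffusions from `seeds` and counting
    how often each vertex ends up activated.

    graph: dict mapping u -> list of (v, p) pairs, where p is the
           activation probability of the directed edge (u, v).
    """
    hits = defaultdict(int)
    for _ in range(rounds):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            u = frontier.pop()
            # Each newly activated vertex gets one attempt per out-edge.
            for v, p in graph.get(u, []):
                if v not in active and random.random() < p:
                    active.add(v)
                    frontier.append(v)
        for t in active:
            hits[t] += 1
    return {t: count / float(rounds) for t, count in hits.items()}

def build_sim(graph, vertices, rounds=100):
    """sim[u][t] ~ P(t|u): run the estimator with every vertex as a single seed."""
    return {u: monte_carlo_activation(graph, [u], rounds) for u in vertices}
```

Summing the values of the returned map gives an ISE estimate of $L(S)$, and calling the estimator with a single-vertex seed list yields one row $sim[u]$ of the single influence map.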
Therefore, we simply assume $W_{uS} = 0$ and $W_{tS} = 0$, which gives the approximate system illustrated in the picture below:

![p2](./report/p2.jpg)

$$
\begin{cases}
P(u|S) = W_{Su} + W_{St}W_{tu}(1 - W_{Su}), \\
P(t|S) = W_{St} + W_{Su}W_{ut}(1 - W_{St}), \\
P(u|t) = W_{tu}, \\
P(t|u) = W_{ut} \\
\end{cases}
$$

**Solution of the above system** (the expression for $W_{St}$ follows from that for $W_{Su}$ by swapping $u$ and $t$):

$$
\begin{cases}
W_{Su} = \frac{s + (P(u|t) - P(u|S))P(t|u) + P(u|t)P(t|S) - 1}{(2P(u|t) - 2)P(t|u)}, \\
W_{St} = \frac{s + (P(t|u) - P(t|S))P(u|t) + P(t|u)P(u|S) - 1}{(2P(t|u) - 2)P(u|t)}, \\
\end{cases}
$$

where

$$
\begin{aligned}
s = \Bigl[\, & P(u|t)^2P(t|u)^2 + P(u|S)^2P(t|u)^2 + P(u|t)^2P(t|S)^2 \\
& + 2P(u|t)P(t|u)\bigl(2P(t|S) + 2P(u|S) - P(t|S)P(u|S) - P(u|t)P(t|S) - P(t|u)P(u|S)\bigr) \\
& - 2\bigl(P(u|t)P(t|u) + P(t|u)P(u|S) + P(u|t)P(t|S)\bigr) + 1 \,\Bigr]^{1/2}.
\end{aligned}
$$

Then,

$$
P(t|S, u) = \frac{1 - s - P(t|u)P(u|S)}{2P(u|t)}, \quad P(u|t) \ne 0.
$$

Moreover, when $P(u|t) = W_{tu} = 0$, we need another picture and another solution:

![p3](./report/p3.jpg)

$$
\begin{cases}
P(u|S) = W_{Su}, \\
P(t|S) = W_{St} + W_{Su}W_{ut}(1 - W_{St}), \\
P(t|u) = W_{ut} \\
\end{cases}
$$

**Solution of the above system:**

$$
P(t|S,u) =
\begin{cases}
\frac{P(t|S) + P(t|u) - P(t|u)P(u|S) - P(t|S)P(t|u)}{1 - P(t|u)P(u|S)}, & P(u|t) = 0,\ P(t|u)P(u|S) < 1 \\
1, & P(t|u)P(u|S) = 1
\end{cases}
$$

Actually, I have tried two kinds of Bayes calculation.

1. The first is simple and only considers the situation of the third picture, ignoring $W_{tu}$. Let us name it **BayesCal1**:

$$
P(t|S, u) =
\begin{cases}
P(t|S) + P(t|u) - P(t|u)P(u|S) - P(t|S)P(t|u), & P(t|u)P(u|S) < 1 \\
1, & P(t|u)P(u|S) = 1
\end{cases}
$$

2. The second also considers the second situation, where $W_{tu}$ may be greater than 0. Let us name it **BayesCal2**:

$$
\begin{aligned}
s = \Bigl[\, & P(u|t)^2P(t|u)^2 + P(u|S)^2P(t|u)^2 + P(u|t)^2P(t|S)^2 \\
& + 2P(u|t)P(t|u)\bigl(2P(t|S) + 2P(u|S) - P(t|S)P(u|S) - P(u|t)P(t|S) - P(t|u)P(u|S)\bigr) \\
& - 2\bigl(P(u|t)P(t|u) + P(t|u)P(u|S) + P(u|t)P(t|S)\bigr) + 1 \,\Bigr]^{1/2}
\end{aligned}
$$

$$
P(t|S, u) =
\begin{cases}
\frac{1 - s - P(t|u)P(u|S)}{2P(u|t)}, & P(u|t) \ne 0 \\
P(t|S) + P(t|u) - P(t|u)P(u|S) - P(t|S)P(t|u), & P(u|t) = 0,\ P(t|u)P(u|S) < 1 \\
1, & P(t|u)P(u|S) = 1
\end{cases}
$$

According to the experiments, the total influence computed by **BayesCal1** and by **BayesCal2** differs by an absolute error of less than **0.001**, while **BayesCal2** costs more than **20 times** as much time as **BayesCal1**, so only **BayesCal1** is used in my algorithm. The experiments also show that the relative error between **BayesCal1** and the **Monte Carlo method** is always less than **0.030** in the **IC** model and less than **0.022** in the **LT** model. Thus we cannot completely replace the **Monte Carlo method** with **BayesCal1**, but we can fall back on **BayesCal1** when time is very limited: once $sim$ is available, it handles seed sets of any size quickly, with time complexity $O(n)$ per evaluation and no further Monte Carlo simulation.

Combining the above ideas, we obtain an **Approximate Bayes Inference** version of **CELF**: **CELF-AB**.

----------

### **Algorithm CELF-AB**

- Require: $G$, $k$
- Ensure: seed set $S$

```
S ← ∅; Q ← ∅;
cur_infl = 0; cur_infl_map = null;
for each u ∈ V do
    sim[u] = MonteCarlo([u]);
    u.mg1 = sum(sim[u]);          // = L({u})
    u.flag = 0;
    Add u to Q.
while |S| < k do
    u = top (root) element in Q.
    if u.flag == |S| then
        S ← S ∪ {u}; Q ← Q − {u};
        if u.flag == 0 then
            cur_infl_map = sim[u]
        else if time limited then
            cur_infl_map = BayesCal1(S)
        else
            cur_infl_map = MonteCarlo(S)
        cur_infl = sum(cur_infl_map)
        continue;
    else if time limited then
        u.mg1 = sum(BayesCal1(S, u)) - cur_infl
    else
        u.mg1 = sum(MonteCarlo(S ∪ {u})) - cur_infl
    u.flag = |S|;
    Reinsert u into Q and heapify.
```

----------

The variable ***cur_infl_map*** is a probability map like $sim[u]$, but extended to a seed set $S$ of any size, and ***cur_infl*** is the influence of the current seed set $S$.
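As a concrete illustration of the BayesCal1 step used above, here is a minimal Python sketch. The dict-based influence maps, the function name `bayes_cal1`, and the convention of setting the new seed's own activation probability to 1 are assumptions of this sketch, not necessarily the repository's actual implementation.

```python
def bayes_cal1(cur_infl_map, sim_u, u, vertices):
    """Approximate P(t | S ∪ {u}) for every vertex t (BayesCal1).

    cur_infl_map[t] ~ P(t|S) for the current seed set S,
    sim_u[t]        ~ P(t|u), i.e. one row of the matrix sim.
    Returns the new influence map for S ∪ {u}.
    """
    p_u_S = cur_infl_map.get(u, 0.0)          # P(u|S)
    new_map = {}
    for t in vertices:
        if t == u:
            # Assumption of this sketch: a vertex that becomes a seed
            # counts as activated with probability 1.
            new_map[t] = 1.0
            continue
        p_t_S = cur_infl_map.get(t, 0.0)      # P(t|S)
        p_t_u = sim_u.get(t, 0.0)             # P(t|u)
        if p_t_u * p_u_S >= 1.0:              # degenerate case P(t|u)P(u|S) = 1
            new_map[t] = 1.0
        else:
            new_map[t] = p_t_S + p_t_u - p_t_u * p_u_S - p_t_S * p_t_u
    return new_map

# Marginal gain of adding u, as used in the CELF-AB loop when time is limited:
# u.mg1 = sum(bayes_cal1(cur_infl_map, sim[u], u, V).values()) - cur_infl
```

One pass over all vertices costs $O(n)$, which is why BayesCal1 can evaluate candidate seeds without any further Monte Carlo simulation once $sim$ has been built.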
----------

## 3. Empirical Verification

- Environment:
  - Windows 7 64-bit, 8 CPU cores, Python 2.7.14, PyCharm, NumPy
- Data: NetHEPT.
- $T(sim)$ denotes the time needed to build the matrix $sim$, which depends on the graph size and on the number of rounds $R$ of the Monte Carlo method. When $R = 100$, $T(sim)$ is about 206 s.
- **CELF-ABC** means doing Approximate Bayes Inference **completely**: it performs no Monte Carlo simulation after $sim$ is obtained, which is equivalent to **time limited** being true all the time.

### IC model

| Algorithm | Influence (k = 10) | Time (k = 10) | Influence (k = 50) | Time (k = 50) |
| -------- | -----: | -----: | -----: | -----: |
| CELF | $510$ | $T(sim) + 112s$ | $1280$ | $T(sim) + 3325s$ |
| **CELF-ABC** | $510$ | $T(sim) + 1s$ | $1273$ | $T(sim) + 1.3s$ |

### LT model

| Algorithm | Influence (k = 10) | Time (k = 10) | Influence (k = 50) | Time (k = 50) |
| -------- | -----: | -----: | -----: | -----: |
| CELF | $637$ | $T(sim) + 145s$ | $1671$ | $T(sim) + 4304s$ |
| **CELF-ABC** | $637$ | $T(sim) + 1s$ | $1657$ | $T(sim) + 1.2s$ |

According to these results, **CELF-ABC** performs as well as **CELF** when $k$ is small and loses only a little influence for larger $k$, while its running time is far lower than that of the original **CELF** as $k$ grows.

## 4. References

> [1] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. S. Glance. Cost-effective outbreak detection in networks. In *Proceedings of the 13th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*, pages 420–429, 2007.
>
> [2] Wei Chen, Yajun Wang, and Siyu Yang. Efficient influence maximization in social networks. In *Proceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*, 2009.