# IMP

**Repository Path**: pilaoda/IMP

## Basic Information

- **Project Name**: IMP
- **Description**: IMP (Influence Maximization Problem): implementation of the original CELF-AB algorithm
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2017-12-03
- **Last Updated**: 2020-12-19

## Categories & Tags

- **Categories**: Uncategorized
- **Tags**: None

## README

# IMP (Influence Maximization Problem)

----------

## 1. Preliminaries

- ISE: Monte Carlo method
- IMP: Approximate Bayes Inference with CELF (CELF-AB)

----------

## 2. Methodology

- Important variables:
  - $n$: number of vertices in $G$
  - $m$: number of edges in $G$
  - $k$: number of seeds to be selected
  - $S$: the set of currently selected seeds
  - $u$: a specific vertex
  - $t$: a specific vertex
  - $L(S)$: influence of the seed set $S$
  - $sim[u][t]$: two-dimensional matrix of the probability that $u$ activates $t$

- Approximate Bayes Inference:

First, the inference needs the influence probability of every single vertex taken as a seed. We call this matrix $sim$ (single influence map), where $sim[u][t] = P(t|u)$ is the probability that vertex $t$ is activated when $u$ is the only seed. Let $L(S)$ denote the influence of the seed set $S$. Obviously, $L(S = \{u\}) = \sum_{t=1}^{n} sim[u][t]$. The matrix $sim$ is the key to the approximate Bayes inference.

In the IC model, the influence on a vertex $t$ from the existing seeds $S$ combined with a possible new seed $u$ can be depicted simply as in the picture below:

![p1](./report/p1.jpg)

$W_{St}$ denotes the probability that, once $S$ is activated, $t$ is activated through a path **p** that does not contain $u$. It is necessary to point out that $W_{St}$ depends on $u$: if $u$ changes, $W_{St}$ changes as well. Thus we obtain the following system of equations:

$$
\begin{cases}
P(u|S) = W_{Su} + W_{St}W_{tu}(1 - W_{Su}) \\
P(t|S) = W_{St} + W_{Su}W_{ut}(1 - W_{St}) \\
P(u|t) = W_{tu} + W_{tS}W_{Su}(1 - W_{tu}) \\
P(t|u) = W_{ut} + W_{uS}W_{St}(1 - W_{ut}) \\
P(S|u) = W_{uS} + W_{ut}W_{tS}(1 - W_{uS}) \\
P(S|t) = W_{tS} + W_{tu}W_{uS}(1 - W_{tS}) \\
\end{cases}
$$

There are six unknown variables in this system, and once it is solved, it is easy to obtain:

$$P(t|S, u) = W_{ut} + W_{St} - W_{ut}W_{St}$$

Then we get the influence formula $L(S \cup \{u\}) = \sum_{t=1}^{n} P(t|S, u)$, which is central to the **CELF** algorithm. When $|S| = 1$, we can read $P(S|u)$ and $P(S|t)$ directly from the matrix $sim$. Unfortunately, when $|S| > 1$ we cannot get $P(S|u)$ and $P(S|t)$ from $sim$, and obtaining them with the Monte Carlo method would be very costly (a minimal sketch of such a Monte Carlo estimator is given below).
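For concreteness, the following is a minimal sketch of such a Monte Carlo estimator (ISE) under the IC model. The adjacency representation (`graph` as a dict of `(neighbor, probability)` lists), the function names `monte_carlo_activation` and `build_sim`, and the `rounds` parameter (playing the role of the round count $R$ mentioned in Section 3) are assumptions of this sketch, not the repository's actual code.

```python
import random
from collections import defaultdict

def monte_carlo_activation(graph, seeds, rounds=100):
    """Estimate P(t | seeds) for every vertex t under the IC model by
    running `rounds` independent diffusions from `seeds` and counting
    how often each vertex ends up activated.

    graph: dict mapping u -> list of (v, p) pairs, where p is the
           activation probability of the directed edge (u, v).
    """
    hits = defaultdict(int)
    for _ in range(rounds):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            u = frontier.pop()
            # Each newly activated vertex gets one attempt per out-edge.
            for v, p in graph.get(u, []):
                if v not in active and random.random() < p:
                    active.add(v)
                    frontier.append(v)
        for t in active:
            hits[t] += 1
    return {t: count / float(rounds) for t, count in hits.items()}

def build_sim(graph, vertices, rounds=100):
    """sim[u][t] ~ P(t|u): run the estimator with every vertex as a single seed."""
    return {u: monte_carlo_activation(graph, [u], rounds) for u in vertices}
```

Summing the values of the returned map gives an ISE estimate of $L(S)$, and calling the estimator with a single-vertex seed list yields one row $sim[u]$ of the single influence map.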
Therefore, we simply assume $W_{uS} = 0$ and $W_{tS} = 0$, which gives the approximate system illustrated in the picture below:

![p2](./report/p2.jpg)

$$
\begin{cases}
P(u|S) = W_{Su} + W_{St}W_{tu}(1 - W_{Su}), \\
P(t|S) = W_{St} + W_{Su}W_{ut}(1 - W_{St}), \\
P(u|t) = W_{tu}, \\
P(t|u) = W_{ut} \\
\end{cases}
$$

**Solution of the above system** (the expression for $W_{St}$ follows from that for $W_{Su}$ by swapping $u$ and $t$):

$$
\begin{cases}
W_{Su} = \frac{s + (P(u|t) - P(u|S))P(t|u) + P(u|t)P(t|S) - 1}{(2P(u|t) - 2)P(t|u)}, \\
W_{St} = \frac{s + (P(t|u) - P(t|S))P(u|t) + P(t|u)P(u|S) - 1}{(2P(t|u) - 2)P(u|t)}, \\
\end{cases}
$$

where

$$
\begin{aligned}
s = \Bigl[\, & P(u|t)^2P(t|u)^2 + P(u|S)^2P(t|u)^2 + P(u|t)^2P(t|S)^2 \\
& + 2P(u|t)P(t|u)\bigl(2P(t|S) + 2P(u|S) - P(t|S)P(u|S) - P(u|t)P(t|S) - P(t|u)P(u|S)\bigr) \\
& - 2\bigl(P(u|t)P(t|u) + P(t|u)P(u|S) + P(u|t)P(t|S)\bigr) + 1 \,\Bigr]^{1/2}.
\end{aligned}
$$

Then,

$$
P(t|S, u) = \frac{1 - s - P(t|u)P(u|S)}{2P(u|t)}, \quad P(u|t) \ne 0.
$$

Moreover, when $P(u|t) = W_{tu} = 0$, we need another picture and another solution:

![p3](./report/p3.jpg)

$$
\begin{cases}
P(u|S) = W_{Su}, \\
P(t|S) = W_{St} + W_{Su}W_{ut}(1 - W_{St}), \\
P(t|u) = W_{ut} \\
\end{cases}
$$

**Solution of the above system:**

$$
P(t|S,u) =
\begin{cases}
\frac{P(t|S) + P(t|u) - P(t|u)P(u|S) - P(t|S)P(t|u)}{1 - P(t|u)P(u|S)}, & P(u|t) = 0,\ P(t|u)P(u|S) < 1 \\
1, & P(t|u)P(u|S) = 1
\end{cases}
$$

Actually, I have tried two kinds of Bayes calculation.

1. The first is simple and only considers the situation of the third picture, ignoring $W_{tu}$. Let us name it **BayesCal1**:

$$
P(t|S, u) =
\begin{cases}
P(t|S) + P(t|u) - P(t|u)P(u|S) - P(t|S)P(t|u), & P(t|u)P(u|S) < 1 \\
1, & P(t|u)P(u|S) = 1
\end{cases}
$$

2. The second also considers the second situation, where $W_{tu}$ may be greater than 0. Let us name it **BayesCal2**:

$$
\begin{aligned}
s = \Bigl[\, & P(u|t)^2P(t|u)^2 + P(u|S)^2P(t|u)^2 + P(u|t)^2P(t|S)^2 \\
& + 2P(u|t)P(t|u)\bigl(2P(t|S) + 2P(u|S) - P(t|S)P(u|S) - P(u|t)P(t|S) - P(t|u)P(u|S)\bigr) \\
& - 2\bigl(P(u|t)P(t|u) + P(t|u)P(u|S) + P(u|t)P(t|S)\bigr) + 1 \,\Bigr]^{1/2}
\end{aligned}
$$

$$
P(t|S, u) =
\begin{cases}
\frac{1 - s - P(t|u)P(u|S)}{2P(u|t)}, & P(u|t) \ne 0 \\
P(t|S) + P(t|u) - P(t|u)P(u|S) - P(t|S)P(t|u), & P(u|t) = 0,\ P(t|u)P(u|S) < 1 \\
1, & P(t|u)P(u|S) = 1
\end{cases}
$$

According to the experiments, the total influence computed by **BayesCal1** and by **BayesCal2** differs by an absolute error of less than **0.001**, while **BayesCal2** costs more than **20 times** as much time as **BayesCal1**, so only **BayesCal1** is used in my algorithm. The experiments also show that the relative error between **BayesCal1** and the **Monte Carlo method** is always less than **0.030** in the **IC** model and less than **0.022** in the **LT** model. Thus we cannot completely replace the **Monte Carlo method** with **BayesCal1**, but we can fall back on **BayesCal1** when time is very limited: once $sim$ is available, it handles seed sets of any size quickly, with time complexity $O(n)$ per evaluation and no further Monte Carlo simulation.

Combining the above ideas, we obtain an **Approximate Bayes Inference** version of **CELF**: **CELF-AB**.

----------

### **Algorithm CELF-AB**

- Require: $G$, $k$
- Ensure: seed set $S$

```
S ← ∅; Q ← ∅;
cur_infl = 0; cur_infl_map = null;
for each u ∈ V do
    sim[u] = MonteCarlo([u]);
    u.mg1 = sum(sim[u]);          // = L({u})
    u.flag = 0;
    Add u to Q.
while |S| < k do
    u = top (root) element in Q.
    if u.flag == |S| then
        S ← S ∪ {u}; Q ← Q − {u};
        if u.flag == 0 then
            cur_infl_map = sim[u]
        else if time limited then
            cur_infl_map = BayesCal1(S)
        else
            cur_infl_map = MonteCarlo(S)
        cur_infl = sum(cur_infl_map)
        continue;
    else if time limited then
        u.mg1 = sum(BayesCal1(S, u)) - cur_infl
    else
        u.mg1 = sum(MonteCarlo(S ∪ {u})) - cur_infl
    u.flag = |S|;
    Reinsert u into Q and heapify.
```

----------

The variable ***cur_infl_map*** is a probability map like $sim[u]$, but extended to a seed set $S$ of any size, and ***cur_infl*** is the influence of the current seed set $S$.
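As a concrete illustration of the BayesCal1 step used above, here is a minimal Python sketch. The dict-based influence maps, the function name `bayes_cal1`, and the convention of setting the new seed's own activation probability to 1 are assumptions of this sketch, not necessarily the repository's actual implementation.

```python
def bayes_cal1(cur_infl_map, sim_u, u, vertices):
    """Approximate P(t | S ∪ {u}) for every vertex t (BayesCal1).

    cur_infl_map[t] ~ P(t|S) for the current seed set S,
    sim_u[t]        ~ P(t|u), i.e. one row of the matrix sim.
    Returns the new influence map for S ∪ {u}.
    """
    p_u_S = cur_infl_map.get(u, 0.0)          # P(u|S)
    new_map = {}
    for t in vertices:
        if t == u:
            # Assumption of this sketch: a vertex that becomes a seed
            # counts as activated with probability 1.
            new_map[t] = 1.0
            continue
        p_t_S = cur_infl_map.get(t, 0.0)      # P(t|S)
        p_t_u = sim_u.get(t, 0.0)             # P(t|u)
        if p_t_u * p_u_S >= 1.0:              # degenerate case P(t|u)P(u|S) = 1
            new_map[t] = 1.0
        else:
            new_map[t] = p_t_S + p_t_u - p_t_u * p_u_S - p_t_S * p_t_u
    return new_map

# Marginal gain of adding u, as used in the CELF-AB loop when time is limited:
# u.mg1 = sum(bayes_cal1(cur_infl_map, sim[u], u, V).values()) - cur_infl
```

One pass over all vertices costs $O(n)$, which is why BayesCal1 can evaluate candidate seeds without any further Monte Carlo simulation once $sim$ has been built.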
----------

## 3. Empirical Verification

- Environment:
  - Windows 7 64-bit, 8 CPU cores, Python 2.7.14, PyCharm, NumPy
- Data: NetHEPT.
- $T(sim)$ denotes the time needed to build the matrix $sim$, which depends on the graph size and on the number of rounds $R$ of the Monte Carlo method. When $R = 100$, $T(sim)$ is about 206 s.
- **CELF-ABC** means doing Approximate Bayes Inference **completely**: it performs no Monte Carlo simulation after $sim$ is obtained, which is equivalent to **time limited** being true all the time.

### IC model

| Algorithm | Influence (k = 10) | Time (k = 10) | Influence (k = 50) | Time (k = 50) |
| -------- | -----: | -----: | -----: | -----: |
| CELF | $510$ | $T(sim) + 112s$ | $1280$ | $T(sim) + 3325s$ |
| **CELF-ABC** | $510$ | $T(sim) + 1s$ | $1273$ | $T(sim) + 1.3s$ |

### LT model

| Algorithm | Influence (k = 10) | Time (k = 10) | Influence (k = 50) | Time (k = 50) |
| -------- | -----: | -----: | -----: | -----: |
| CELF | $637$ | $T(sim) + 145s$ | $1671$ | $T(sim) + 4304s$ |
| **CELF-ABC** | $637$ | $T(sim) + 1s$ | $1657$ | $T(sim) + 1.2s$ |

According to these results, **CELF-ABC** performs as well as **CELF** when $k$ is small and loses only a little influence for larger $k$, while its running time is far lower than that of the original **CELF** as $k$ grows.

## 4. References

> [1] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. S. Glance. Cost-effective outbreak detection in networks. In *Proceedings of the 13th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*, pages 420–429, 2007.
>
> [2] Wei Chen, Yajun Wang, and Siyu Yang. Efficient influence maximization in social networks. In *Proceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*, 2009.