# ToolSeg

**Repository Path**: w3STeam/ToolSeg

## Basic Information

- **Project Name**: ToolSeg
- **Description**: We propose a toolbox, implemented as an open-source Java package called ToolSeg, that generates realistic DNA copy number profiles with known truth. Five typical methods have been assessed on synthetic data and real copy number profiles: HMM, CBS, PCF, Lasso, and DBS.
- **Primary Language**: Java
- **License**: Apache-2.0
- **Default Branch**: Justin
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2016-06-17
- **Last Updated**: 2022-05-30

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# ToolSeg 1.2 Software User Manual

## 1. Introduction

ToolSeg is an open-source, cross-platform Java application that integrates simulated data generation and segmentation. The methods in the system include HMM, PCF, FastPCF, CBS, CLT and Lasso. It can be used both to compare these methods and to meet practical segmentation needs. Most importantly, processing is organized in stages, and the parameters of each segmentation method can be adjusted to help users track down problems. The input to be segmented can be data generated by the user or an input file in the required format. To meet the diverse needs and applications of the cancer research community, ToolSeg outputs not only a .txt file of segmentation results that can be used for further analysis, such as estimating purity and ploidy, but also graphical results that give an intuitive view of the segmentation. The software and source code of ToolSeg can be downloaded from <https://gitee.com/w3STeam/ToolSeg>. A usage example is provided.

## 2. Software System Requirement

The Java Runtime Environment (JRE) version 1.8 or higher must be
installed and configured properly. The JRE can be downloaded from
<http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html>.

## 3. Installation 

**Step 1:** Download a Java IDE and an up-to-date Java Development Kit (JDK).
Java SE 1.8 can be downloaded from
<http://www.oracle.com/technetwork/java/javase/downloads/index.html>, as
shown in Figure 1.

**Step 2:** Install both using their installation wizards. It is
recommended to install the JDK first.

![](media/Image1.png)

Figure 1. Java SE download

**Step 3:** Download the two ZIP packages ToolSeg.zip and test_data.zip from <https://gitee.com/w3STeam/ToolSeg/attach_files>, then unzip them to a local folder.

**Step 4:** Enter the folder and double-click `start.bat` to run the program on Windows, as
shown in Figure 2. On Linux, run the command `java -jar ToolSeg.jar` instead.

![](media/Image2.png)

Figure 2. Run the project

**Step 5:** Fast run.

1. Select an input path
2. Load the data
3. Select an algorithm
4. Segment

![](media/fast-run.png)

## 4. Usage

![](media/Image3.png)

Figure 3. The GUI

The GUI contains two modules: Input Selection and Algorithms, as shown in Figure 3.

### 1) Input Selection Module
The first module is “Input Selection”, where users choose the input data they need.

Users can choose between two data sources.

The first data source is “Data Generate”, with which users can generate the
data they need. Data can be generated randomly, by multi-template, or by
template shift, and users can set the format of the data. As shown in
Figure 4, we define a data set with four segments. Their lengths are 300,
300, 300 and 300, their segment values are 1, 2, 3 and 4, their segment
variances are 0.4, 0.4, 0.4 and 0.4, and their contaminations are all 0.5.
Click “Generate” to generate the data, as shown in Figure 5.

Click “Show” to plot the data, as shown in Figure 6. Click “Clear” to clear
the generated data. Click “Save” to save the generated data as a .txt file
in the path “..//data”.

![](./media/image8.png)

Figure 4. Input Selection Module

![](./media/image9.png)

Figure 5. Data generated

![](./media/image10.png)

Figure 6. Plot of the generated data
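The generation settings described above can be illustrated with a minimal sketch. It assumes each segment is drawn from a Gaussian distribution with the given value as mean and the given variance; the class `ProfileSimulator` and its `simulate` method are hypothetical names and not part of ToolSeg, and the contamination setting is left out because its exact meaning is not specified here.

```java
import java.util.Random;

/** Hypothetical helper, not part of ToolSeg: simulates a piecewise-constant profile. */
final class ProfileSimulator {

    /**
     * Draws lengths[k] Gaussian samples with mean means[k] and variance variances[k]
     * for each segment k, and concatenates the segments in order.
     */
    static double[] simulate(int[] lengths, double[] means, double[] variances, long seed) {
        if (lengths.length != means.length || lengths.length != variances.length) {
            throw new IllegalArgumentException("need one length, mean and variance per segment");
        }
        int total = 0;
        for (int len : lengths) {
            total += len;
        }
        Random rng = new Random(seed);  // fixed seed makes the profile reproducible
        double[] profile = new double[total];
        int pos = 0;
        for (int k = 0; k < lengths.length; k++) {
            double sd = Math.sqrt(variances[k]);
            for (int i = 0; i < lengths[k]; i++) {
                profile[pos++] = means[k] + sd * rng.nextGaussian();
            }
        }
        return profile;
    }

    public static void main(String[] args) {
        // The four-segment example from the text: lengths 300, values 1..4, variance 0.4.
        double[] y = simulate(new int[] {300, 300, 300, 300},
                new double[] {1, 2, 3, 4},
                new double[] {0.4, 0.4, 0.4, 0.4}, 42L);
        System.out.println(y.length + " values generated");
    }
}
```

Because the true segment boundaries are known (here at positions 300, 600 and 900), a profile like this can serve as ground truth when comparing segmentation results.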

The other source is data supplied by the user. Users can choose the input
path and the output path, as shown in Figure 7. They can also choose whether
the input data is used for testing. The input file should be a .txt file in
the following format:

>   chrId    Loci    CNValue
>
>   1        0       1.123846
>
>   1        1       1.057979
>
>   1        2       1.061475.....

chrId is the serial number of the chromosome, Loci is the position of the
copy number measurement, and CNValue is the copy number value.

![](media/a5f7312b657054664db3ca2c50c92300.png)

Figure 7. Input data
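The three-column input format described above could be read with a sketch like the following. It assumes whitespace-separated columns, an optional `chrId Loci CNValue` header line, and integer chromosome numbers; `CopyNumberReader` and `Row` are hypothetical names and do not come from the ToolSeg source.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for the chrId / Loci / CNValue text format; not ToolSeg code. */
final class CopyNumberReader {

    /** One data row of the input file. */
    static final class Row {
        final int chrId;
        final long loci;
        final double cnValue;

        Row(int chrId, long loci, double cnValue) {
            this.chrId = chrId;
            this.loci = loci;
            this.cnValue = cnValue;
        }
    }

    /** Parses the lines of an input file, skipping blank lines and the header line. */
    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] fields = trimmed.split("\\s+");  // tabs and spaces both separate columns
            if (fields.length != 3) {
                throw new IllegalArgumentException("expected 3 columns: " + line);
            }
            if (fields[0].equalsIgnoreCase("chrId")) {
                continue;  // header line
            }
            rows.add(new Row(Integer.parseInt(fields[0]),
                    Long.parseLong(fields[1]),
                    Double.parseDouble(fields[2])));
        }
        return rows;
    }
}
```

The lines of a file can be obtained with `java.nio.file.Files.readAllLines(Paths.get(path))` and passed to `parse`.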


### 2) Algorithms Module

The other module is “Algorithms”, where users choose the algorithm they
want to run, as shown in Figure 8.

![](media/66d0c79b437d8fb856bc66a295ed141d.png)

Figure 8. Algorithms Module

The top part is “Preprocess”. On its left is “outliers handling”: you can
choose whether to handle outliers and set the range used. On its right is
“Transformation”, where exactly one of three transformations can be
selected. “Bypass” means the data are processed directly without any
transformation. “Log2” takes the base-2 logarithm of the data, and “Pow2”
applies a power-of-2 transformation.
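As an illustration of these preprocessing options, here is a minimal sketch. It rests on two assumptions that the manual does not confirm: outlier handling is shown as clamping values into the chosen range, and “Pow2” is read as 2 raised to the power of each value, the inverse of “Log2”. The class `Preprocess` is a hypothetical name.

```java
/** Hypothetical preprocessing helpers; the exact behavior in ToolSeg may differ. */
final class Preprocess {

    /** Assumed outlier handling: clamps every value into the range [low, high]. */
    static double[] clamp(double[] data, double low, double high) {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = Math.min(high, Math.max(low, data[i]));
        }
        return out;
    }

    /** "Log2": base-2 logarithm of each value (values must be positive). */
    static double[] log2(double[] data) {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = Math.log(data[i]) / Math.log(2.0);
        }
        return out;
    }

    /** One reading of "Pow2": 2 raised to each value, which undoes log2. */
    static double[] pow2(double[] data) {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = Math.pow(2.0, data[i]);
        }
        return out;
    }
}
```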

The bottom part lists the algorithms. Seven algorithms can be chosen:
“CBS”, “PCF”, “FastPCF”, “BACOM”, “HMM”, “Lasso” and “DBS”. Each has its own
parameters. Click an algorithm to set its parameters.

**For CBS and BACOM:** As shown in Figures 9 and 10, “minLength” is the
minimum segment length. It affects both the speed and the accuracy of the
algorithm: a smaller minLength gives a more accurate result but slows the
algorithm down. “minStep” is the step length used when traversing the copy
number sequence. CBS involves a permutation test, so users can adjust
“permutation times”. BACOM relies on the Central Limit Theorem, and “pvalue” is
the threshold used to judge whether the data violate it.

![](media/a6d71d8c3211d16131d5979b1d23ece2.png)

Figure 9. CBS

![](media/54c84470e88f7d8415eec24f0521628a.png)

Figure 10. BACOM
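To make the roles of “minLength”, “minStep” and “permutation times” more concrete, the following is a simplified sketch of a permutation test for a single change point. It is not the CBS implementation used in ToolSeg (CBS proper searches for circular segments with two breakpoints); the class `PermutationSplitTest` and the choice of test statistic are assumptions made for illustration.

```java
import java.util.Random;

/** Simplified single-change-point permutation test; an illustration, not ToolSeg's CBS. */
final class PermutationSplitTest {

    /**
     * Largest scaled difference between left and right means over the candidate
     * split points. Both sides of a split must hold at least minLength values,
     * and candidates are visited every minStep positions.
     */
    static double maxSplitStatistic(double[] y, int minLength, int minStep) {
        if (minLength < 1 || minStep < 1) {
            throw new IllegalArgumentException("minLength and minStep must be at least 1");
        }
        int n = y.length;
        double[] prefix = new double[n + 1];
        for (int i = 0; i < n; i++) {
            prefix[i + 1] = prefix[i] + y[i];
        }
        double best = 0.0;
        for (int i = minLength; i <= n - minLength; i += minStep) {
            double leftMean = prefix[i] / i;
            double rightMean = (prefix[n] - prefix[i]) / (n - i);
            double stat = Math.abs(leftMean - rightMean) * Math.sqrt((double) i * (n - i) / n);
            best = Math.max(best, stat);
        }
        return best;
    }

    /** Permutation p-value: share of shuffled sequences scoring at least the observed value. */
    static double pValue(double[] y, int minLength, int minStep, int permutations, long seed) {
        double observed = maxSplitStatistic(y, minLength, minStep);
        Random rng = new Random(seed);
        double[] shuffled = y.clone();
        int atLeastAsExtreme = 0;
        for (int p = 0; p < permutations; p++) {
            // Fisher-Yates shuffle destroys any positional structure in the data.
            for (int i = shuffled.length - 1; i > 0; i--) {
                int j = rng.nextInt(i + 1);
                double tmp = shuffled[i];
                shuffled[i] = shuffled[j];
                shuffled[j] = tmp;
            }
            if (maxSplitStatistic(shuffled, minLength, minStep) >= observed) {
                atLeastAsExtreme++;
            }
        }
        return (atLeastAsExtreme + 1.0) / (permutations + 1.0);
    }
}
```

The sketch shows the trade-off mentioned above: a smaller `minLength` or `minStep` examines more candidate split points, and more permutations give a finer p-value, each at the cost of running time.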

**For PCF and FastPCF:** As shown in Figures 11 and 12, “gama” is the fixed
penalty γ. PCF uses penalized least-squares regression to determine a
piecewise constant fit to the data. Introducing a fixed penalty γ > 0 for any
difference in the fitted values of two neighboring observations induces an
optimal solution of particular relevance to copy number data.

![](media/9bc0e16598d63d20a781f12b5007fc23.png)

Figure 11. PCF

![](media/3f1f405f80c2235ac3b0a01742448bf6.png)

Figure 12. FastPCF
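The penalized least-squares idea behind PCF can be sketched with a straightforward dynamic program. The code below is an O(n²) illustration, not the ToolSeg implementation; `PenalizedSegmentation` is a hypothetical name. It charges γ per segment, which differs from charging γ per breakpoint only by a constant and therefore yields the same segmentation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative penalized least-squares segmentation; not the ToolSeg PCF code. */
final class PenalizedSegmentation {

    /**
     * Minimizes (sum of squared errors around each segment mean) + gamma * (number of segments)
     * and returns the start index of every segment, beginning with 0.
     */
    static List<Integer> segmentStarts(double[] y, double gamma) {
        int n = y.length;
        double[] sum = new double[n + 1];
        double[] sumSq = new double[n + 1];
        for (int i = 0; i < n; i++) {
            sum[i + 1] = sum[i] + y[i];
            sumSq[i + 1] = sumSq[i] + y[i] * y[i];
        }
        double[] best = new double[n + 1];  // best[j]: minimal cost of fitting y[0..j)
        int[] lastStart = new int[n + 1];   // start of the last segment in that optimum
        for (int j = 1; j <= n; j++) {
            best[j] = Double.POSITIVE_INFINITY;
            for (int i = 0; i < j; i++) {
                double s = sum[j] - sum[i];
                // Squared error of y[i..j) around its own mean, from prefix sums.
                double sse = (sumSq[j] - sumSq[i]) - s * s / (j - i);
                double cost = best[i] + sse + gamma;
                if (cost < best[j]) {
                    best[j] = cost;
                    lastStart[j] = i;
                }
            }
        }
        // Walk back from the end to recover the segment boundaries.
        List<Integer> starts = new ArrayList<>();
        for (int j = n; j > 0; j = lastStart[j]) {
            starts.add(lastStart[j]);
        }
        Collections.reverse(starts);
        return starts;
    }
}
```

A small γ allows many short segments, while a large γ merges neighboring segments until a single constant fit remains.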

**For HMM:** As shown in Figure 13, “Center Probability” is the probability, in
the transition probability matrix, that a state stays in its current state. The
total probability is divided into ten parts, and you only need to set how many
of them the center probability takes.

![](media/8328d95ce233f26972d54122dca38b5b.png)

Figure 13. HMM
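One way to picture the “Center Probability” setting is the sketch below, which builds a transition matrix whose diagonal holds the chosen number of tenths. How ToolSeg distributes the remaining probability is not documented here; spreading it evenly over the other states is an assumption, and `TransitionMatrix` is a hypothetical name.

```java
/** Hypothetical construction of an HMM transition matrix from a center probability. */
final class TransitionMatrix {

    /**
     * @param states      number of hidden states (at least 2)
     * @param centerParts how many of the ten parts are given to staying in the same state (0..10)
     * @return matrix a where a[i][j] is the probability of moving from state i to state j
     */
    static double[][] build(int states, int centerParts) {
        if (states < 2 || centerParts < 0 || centerParts > 10) {
            throw new IllegalArgumentException("need states >= 2 and 0 <= centerParts <= 10");
        }
        double stay = centerParts / 10.0;
        // Assumption: the remaining probability is shared equally by the other states.
        double move = (1.0 - stay) / (states - 1);
        double[][] a = new double[states][states];
        for (int i = 0; i < states; i++) {
            for (int j = 0; j < states; j++) {
                a[i][j] = (i == j) ? stay : move;
            }
        }
        return a;
    }
}
```

With three states and eight parts, for example, each state keeps its value with probability 0.8 and moves to each of the other two states with probability 0.1. A higher center probability makes state changes rarer and so tends to produce fewer, longer segments.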

**For Lasso:** As shown in Figure 14, “lamda” is the constant for the L1
regression. “tolerance” is the convergence threshold: the calculation finishes
when the difference between the results of two successive iterations is less
than tolerance. “maxIter” is the maximum number of iterations of the algorithm.

![](media/9a90a93238e2067025800a2aa023799d.png)

Figure 14. Lasso
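The interplay of “tolerance” and “maxIter” can be sketched as a generic stopping rule. The solver that ToolSeg uses for the L1 problem is not described here, so `IterativeSolver` is a hypothetical name and the update step is left as a parameter. The soft-thresholding function is included only as the standard shrinkage step found in many L1-penalized solvers, with `lambda` playing the role of “lamda”.

```java
import java.util.function.UnaryOperator;

/** Generic iteration with a tolerance and an iteration cap; an illustration only. */
final class IterativeSolver {

    /**
     * Applies the update repeatedly and stops as soon as the largest absolute change
     * between two successive results drops below tolerance, or after maxIter updates.
     */
    static double[] solve(UnaryOperator<double[]> update, double[] start,
                          double tolerance, int maxIter) {
        double[] current = start.clone();
        for (int iter = 0; iter < maxIter; iter++) {
            double[] next = update.apply(current);
            double change = 0.0;
            for (int i = 0; i < next.length; i++) {
                change = Math.max(change, Math.abs(next[i] - current[i]));
            }
            current = next;
            if (change < tolerance) {
                break;  // converged
            }
        }
        return current;
    }

    /** Soft-thresholding: shrinks value toward zero by lambda, giving zero if |value| <= lambda. */
    static double softThreshold(double value, double lambda) {
        return Math.signum(value) * Math.max(Math.abs(value) - lambda, 0.0);
    }
}
```

A tighter tolerance or a larger iteration cap gives a more precise solution at the cost of running time, while a larger `lambda` shrinks more coefficients to exactly zero.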

**For DBS:** As shown in Figure 15, “minLength” is the minimum segment
length. “pvalue” is the threshold used to judge whether the theory is violated.
“lamda” is the constant for the L1 regression.

![](media/54f96151d2e5f3a025e4d565f54804f8.png)

Figure 15. DBS

When the data, the algorithm and its parameters are all prepared, you can
choose “Segment” or “TestAll”, as shown in Figure 16. “Segment” runs the
currently selected algorithm, and “TestAll” runs all algorithms.

![](media/107cb127f8a77aa09bf696abe5ea20dd.png)

Figure 16. Segment and TestAll

After segmentation, the result of each algorithm is saved as a .png picture
in the path ..//Pictures. For example, the result of HMM is shown in
Figure 17. With “TestAll”, the results of all algorithms are also saved as a
.png picture in the path ..//Result, as shown in Figure 18. All results are
also printed to the console, as shown in Figure 19.

![](./media/image21.png)

Figure 17. Picture of HMM

![](./media/image22.png)

Figure 18. Picture of all results

![](./media/image23.png)

Figure 19. Picture of console

### 3) Batch processing

When the data source is user input, users can choose whether to process files in batch, as shown in Figure 20. For batch processing, the input file should be a .txt file that lists one data file path per line, like this:

>   .\data\testdata2\LTestdata_20161026_151708.txt
>
>   .\data\testdata2\LTestdata_20161026_152635.txt
>
>   .\data\testdata2\R_Testdata_20161026_151627.txt
>
>   .\data\testdata2\R_Testdata_20161026_153220.txt

Each listed .txt file must be an input file in the format described above.

![](./media/image24.png)

Figure 20. Picture of batch processing
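A batch list of this kind could be read with a sketch like the following, which assumes one path per line and ignores blank lines. `BatchList` is a hypothetical name and not part of ToolSeg.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for a batch list file: one data file path per line. */
final class BatchList {

    /** Returns the non-empty lines with surrounding whitespace removed. */
    static List<String> parse(List<String> lines) {
        List<String> paths = new ArrayList<>();
        for (String line : lines) {
            String path = line.trim();
            if (!path.isEmpty()) {
                paths.add(path);
            }
        }
        return paths;
    }
}
```

The lines of the list file can be obtained with `java.nio.file.Files.readAllLines`, after which each returned path is handled like a single input file.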