# PDB-analyze

**Repository Path**: kahsolt/pdb-analyze

## Basic Information

- **Project Name**: PDB-analyze
- **Description**: So, what exactly are you up to?
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-01-07
- **Last Updated**: 2025-01-25

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# PDB-analyze

    Shallow data analysis of the PDB (Protein Data Bank) database, just for fun :(

----

### Change Log

- 2023/01/06: add scripts to process raw PDB data & cd-hit data
- 2022/12/02: add [Research](#Research) section
- 2022/11/18: add scripts to process ARIP data

### Quick Start

⚪ starting from ARIP processed data

- put `*.csv` under the `rdata` folder
- check the constants in `data.py`
- run `python preprocess_arip.py`
- run `python pca.py` and `python pca.py -M tsne`
- run any algorithm
  - `knn.py`: kNN
  - `knn_color.py`: semi-supervised kNN
  - `bayes.py`: Naive Bayes
  - `lr.py`: Logistic Regression

⚪ starting from cd-hit data

- put `Single.fas.clstr` and `Multi.fas.clstr` under the `rdata` folder
- run `python preprocess_cdhit.py`

⚪ starting from raw PDB data

- uncompress `NMR_PDB.zip` under the `rdata` folder
- run `python preprocess_pdb.py` (~12min)
- run `python plot_pdb.py`
- run any app
  - `python mlm_seqs.py`
  - `python cls_seqs.py`
  - `python clf_cfms_flex.py`

### Research Points

Investigate what kinds of structures (interactions) support the flexible parts (those that bend easily) and the stable parts (those that cannot be crushed) of a polypeptide chain, and whether the two differ significantly.

Visual intuition: α-helices and β-sheets are stable, while the thin-line regions of the ribbon model are all unstable.

#### Target Problem

Find the most flexible (flex) and most stable (stab) Pairs within each Kind.

- Conclusions we hope to reach:
  - D-R is stable
  - Phob_D-R is stable
  - Within Phob_D-R, as Dist goes from short to long, the stability does blabla
- Conclusions like the following are meaningless:
  - flex_Arom_I_F-F_1MPE21_A33F-B26F is flexible

#### Current Problem

Turn the 2-class classification into a 3-class classification.

- Background
  - A new column Prop3 was added, in which a small number of samples were manually labeled stable (St), flexible (Fl), or other (Ot); the rest are Undefined
  - Sampling only the rows without Undefined first, kNN reaches a prediction accuracy of 99.724%, which suggests that the St/Fl/Ot division really is discriminative... right? [Correct!]
- [*] Current goal
  - Train kNN on the samples without Undefined so that it can predict which class each Undefined sample belongs to
  - In practice the prediction works and yields one large list of 0, 1, 2 values, but I don't know how to consolidate the prediction results or how to analyze them (e.g. which visualization method to use)

#### Division Basis

Criteria for dividing samples into stable (St), flexible (Fl), and other (Ot), V0.3 [So you just damn well don't know how to use pd.groupBy(), huh?]

- St:
  - Ave 0.4~0.7 and CV <= 0.1 and Nump >= 0.9
  - Ave >= 0.8 and Nump >= 0.9
- Fl:
  - Ave <= 0.3 and CV >= 0.5 and Nump <= 0.6
- Ot:
  - Ave 0.4~0.7 and CV 0.3~0.4 or Nump 0.7~0.8
  - Ave >= 0.8 and Nump <= 0.6
  - Ave <= 0.3 and CV >= 0.5 and Nump >= 0.9
  - Ave <= 0.3 and CV <= 0.1

### PDB basic info

```
>> found 12156 pdb files
>> found 12108 structures   // conformation ensembles
>> found 167803 models      // conformations (unit objects)
>> found 196567 chains (conformations)    // chains
  min len: 4
  max len: 828
  avg len: 86.80135526309095
  std len: 51.524116011982684
>> found 8352 uniq seqs     // amino-acid sequences
```

### requirements

- PDB database
  - [World Wide PDB](http://www.wwpdb.org/)
  - [RCSB](https://www.rcsb.org/)
- Biopython: [biopython](https://biopython.org/)
  - site: [https://biopython.org/](https://biopython.org/)
  - doc: [https://biopython-cn.readthedocs.io/](https://biopython-cn.readthedocs.io/)
- PDB visualization: [chimera](http://www.cgl.ucsf.edu/chimera/)

----

2022/11/18 by Armit
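
As a rough illustration of the V0.3 rules listed under Division Basis, the sketch below encodes them as a plain Python function and tallies the resulting labels per Kind. It is not taken from the repository's scripts. Several things are assumptions: the function names `classify` and `tally_by_kind` are made up, unlabeled rule lines are treated as alternatives for the label above them, "Ave 0.4~0.7 and CV 0.3~0.4 or Nump 0.7~0.8" is read as `Ave and (CV or Nump)`, ranges are taken as inclusive, and the demo rows are invented values rather than real data.

```python
from collections import Counter

EPS = 1e-9  # tolerance so that boundary values such as 0.7 survive float rounding


def in_range(x, lo, hi):
    """Inclusive range check lo <= x <= hi (inclusiveness is an assumption)."""
    return lo - EPS <= x <= hi + EPS


def classify(ave, cv, nump):
    """Return 'St', 'Fl', 'Ot' or 'Undefined' for one sample (hypothetical helper).

    Rules are checked in the order they are listed; the first match wins.
    """
    mid_ave = in_range(ave, 0.4, 0.7)

    # St: stable
    if mid_ave and cv <= 0.1 and nump >= 0.9:
        return "St"
    if ave >= 0.8 and nump >= 0.9:
        return "St"

    # Fl: flexible
    if ave <= 0.3 and cv >= 0.5 and nump <= 0.6:
        return "Fl"

    # Ot: other
    # ASSUMPTION: "A and B or C" is read as A and (B or C).
    if mid_ave and (in_range(cv, 0.3, 0.4) or in_range(nump, 0.7, 0.8)):
        return "Ot"
    if ave >= 0.8 and nump <= 0.6:
        return "Ot"
    if ave <= 0.3 and cv >= 0.5 and nump >= 0.9:
        return "Ot"
    if ave <= 0.3 and cv <= 0.1:
        return "Ot"

    # Anything not covered by a rule stays unlabeled.
    return "Undefined"


def tally_by_kind(rows):
    """Count labels per Kind; rows is an iterable of (kind, ave, cv, nump)."""
    table = {}
    for kind, ave, cv, nump in rows:
        table.setdefault(kind, Counter())[classify(ave, cv, nump)] += 1
    return table


# Invented demo rows; the Kind names and numbers are placeholders, not real data.
DEMO_ROWS = [
    ("KindA", 0.5, 0.05, 0.95),
    ("KindA", 0.2, 0.60, 0.50),
    ("KindB", 0.9, 0.20, 0.50),
    ("KindB", 0.5, 0.20, 0.95),
]

if __name__ == "__main__":
    for kind, counts in tally_by_kind(DEMO_ROWS).items():
        print(kind, dict(counts))
```

The same per-Kind tally could be applied to the list of kNN predictions for the Undefined samples, once the 0/1/2 codes are mapped back to St/Fl/Ot, as one possible way to consolidate them. With pandas, the equivalent would be a group-by on Kind followed by a value count.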