Building robust models for biologics discovery requires benchmark datasets. Generating a good benchmark dataset involves multiple steps, including data cleaning and careful data splitting. There is therefore a need for an automated, data-science-driven method that provides the following functions:

1. Data cleaning, including outlier detection
2. Data splitting designed to avoid data leakage in the benchmark dataset
     a. Compute a distance matrix between data points (from encoded sequence representations)
     b. Identify similar data points using clustering and/or dimensionality reduction algorithms
     c. Perform statistical analysis to assess skewness
3. Visual interpretation
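
Steps 2a and 2b can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the project's chosen method: sequences are encoded as sets of overlapping 3-mers, distance is Jaccard distance on those sets, "clustering" is simple threshold linkage, and the function names, the 0.5 threshold, and the toy sequences are all hypothetical. The point it shows is that whole clusters, not individual data points, are assigned to the train or test side.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Encode a sequence as the set of its overlapping k-mers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """1 - |A & B| / |A | B| on k-mer sets (0.0 means identical k-mer content)."""
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def distance_matrix(seqs, k=3):
    """Step 2a: symmetric pairwise distance matrix as a list of lists."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = jaccard_distance(seqs[i], seqs[j], k)
    return dist


def cluster_by_threshold(dist, threshold=0.5):
    """Step 2b: link every pair closer than `threshold` (union-find)."""
    parent = list(range(len(dist)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in combinations(range(len(dist)), 2):
        if dist[i][j] < threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(dist)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def cluster_split(clusters, test_fraction=0.2):
    """Assign whole clusters to the test set while they fit the target size."""
    target = test_fraction * sum(len(c) for c in clusters)
    train, test = [], []
    for members in sorted(clusters, key=len, reverse=True):
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)


# Made-up toy sequences: two families of near-duplicates plus one singleton.
seqs = ["ACDEFGHIK", "ACDEFGHIR", "ACDEFGHLK",
        "WYWYWYWYW", "WYWYWYWYA", "MNPQRSTVM"]
dist = distance_matrix(seqs)
clusters = cluster_by_threshold(dist, threshold=0.5)
train, test = cluster_split(clusters, test_fraction=0.34)
print("clusters:", clusters)
print("train:", train, "test:", test)
```

Because similar sequences always land on the same side of the split, a model cannot score well on the test set merely by memorizing near-duplicates seen in training. A real pipeline would likely replace the toy encoding and linkage with a domain-appropriate sequence representation and clustering algorithm.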

Each of the steps above can be performed in multiple ways. We therefore propose to develop several methods and compare them, selecting the best one in terms of efficiency and data leakage in the resulting split.
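
One way to make the data leakage comparison concrete is to score each candidate split with a single number. The sketch below is only an assumption about how such a score could look: it defines a hypothetical `leakage_rate` as the fraction of test sequences that have at least one training sequence closer than a threshold, using the same illustrative 3-mer Jaccard distance on made-up toy sequences. The metric, the threshold, and the two example splits are not taken from the project description.

```python
def kmer_jaccard_distance(a, b, k=3):
    """Jaccard distance between the k-mer sets of two sequences."""
    sa = {a[i:i + k] for i in range(len(a) - k + 1)}
    sb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def leakage_rate(seqs, train_idx, test_idx, threshold=0.5):
    """Fraction of test items with a training neighbour closer than `threshold`."""
    if not test_idx:
        return 0.0
    leaked = 0
    for t in test_idx:
        nearest = min(
            (kmer_jaccard_distance(seqs[t], seqs[r]) for r in train_idx),
            default=1.0,  # an empty training set cannot leak
        )
        if nearest < threshold:
            leaked += 1
    return leaked / len(test_idx)


# Made-up toy sequences: two families of near-duplicates plus one singleton.
seqs = ["ACDEFGHIK", "ACDEFGHIR", "ACDEFGHLK",
        "WYWYWYWYW", "WYWYWYWYA", "MNPQRSTVM"]

# Two hypothetical candidate splits, given as (train indices, test indices).
splits = {
    "interleaved": ([0, 2, 4], [1, 3, 5]),      # ignores similarity
    "cluster-aware": ([0, 1, 2, 5], [3, 4]),    # keeps families together
}

scores = {name: leakage_rate(seqs, tr, te) for name, (tr, te) in splits.items()}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: leakage rate = {score:.2f}")
print("lowest leakage:", best)
```

Efficiency could be compared alongside this score by timing each splitting method on the same data, for example with `time.perf_counter` from the Python standard library.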

The final deliverable of the project will include the steps above and a user-friendly visualization in a Jupyter notebook. An API is preferred.

Benchmark Dataset Generation for Discovery Biologics - Spring 2023 Discovery Project
Term: Spring 2023
Topic: Data Visualizations, Industry/Economics
Technical Area(s): Machine Learning (ML)