Curating a high-quality, scientific pre-training corpus.