Language Data Scientist II, AWS AI Data | Transcribe
About the position
Responsibilities
• Translate business, modeling and ethical requirements in Health AI into executable data collection projects.
• Design human-in-the-loop evaluation tasks to measure the performance and usability of models in the medical domain.
• Develop the materials necessary to execute successful data collection efforts, such as guidelines, annotation interfaces, and quality assurance workflows.
• Support the sourcing and/or creation of high-quality language datasets and language artifacts for feature and language expansion.
• Analyze structured and unstructured data to provide actionable recommendations to improve data quality or model performance.
• Iterate and innovate on data collection methodologies to improve data turnaround time and reliability.
• Incorporate LLMs, prompt engineering, and ML techniques to automate repetitive annotation and data creation workflows.
Requirements
• 2+ years of data scientist experience.
• 3+ years of experience with data querying languages (e.g., SQL), scripting languages (e.g., Python), or statistical/mathematical software (e.g., R, SAS, Matlab).
• PhD in a field related to language and human behavior with a strong quantitative component (e.g., Cognitive Linguistics, Sociolinguistics, Human-Computer Interaction), or a Master's degree with 3+ years of field experience.
• Experience in data mining and cleaning for NLP machine learning model pipelines.
• Experience in language data collection for quantitative analysis, including guideline development and workflow design.
• Experience in research and experimental design involving human participants.
• Experience with statistical measures for data quality assessment and research hypothesis testing.
• Practical knowledge of data labeling tools and techniques (e.g., Amazon SageMaker Ground Truth, brat, ELAN).
• Excellent knowledge of semantics, pragmatics, conversation analysis, and/or discourse analysis.
• Ability to explain complex concepts and solutions in easy-to-understand terms.
Nice-to-haves
• Experience with LLMs, prompt engineering techniques, and other programmatic approaches to annotation, including weak supervision and active learning.
• Practical knowledge of version control systems (e.g. Git).
• Experience with spoken data collection, speech analysis, and speech transcription (from scratch or ASR-assisted).
• Experience working with clinical or medical data, such as medical transcriptions, clinical notes, or electronic health records (EHRs).
• Knowledge of healthcare terminology and medical ontologies (e.g., SNOMED CT, ICD, RxNorm).
Benefits
• Medical, financial, and/or other benefits including equity and sign-on payments.
• Flexible working culture to support work-life balance.
• Mentorship and career growth resources.
• Employee-led affinity groups fostering a culture of inclusion.