Data Pipeline



[1] Data Collection

Using the Python Selenium and Beautiful Soup libraries to build web scraping tools, our team collected over 600k unique resumes and 1.1 million salary records from the web, representing more than 5,500 cities across 30 years.

Link to Code
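
The collection code follows the usual Selenium-plus-Beautiful-Soup pattern sketched below. This is only a minimal illustration: the URL argument and the CSS class names ("resume-card", "job-title", "salary") are hypothetical placeholders, not the actual sources or selectors used.

# Minimal scraping sketch: Selenium renders the page, Beautiful Soup parses it.
# The URL and CSS selectors below are hypothetical placeholders.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def scrape_listing_page(url: str) -> list[dict]:
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        html = driver.page_source
    finally:
        driver.quit()

    soup = BeautifulSoup(html, "html.parser")
    records = []
    # "resume-card", "job-title", and "salary" are illustrative class names.
    for card in soup.select("div.resume-card"):
        title = card.select_one(".job-title")
        salary = card.select_one(".salary")
        records.append({
            "job_title": title.get_text(strip=True) if title else None,
            "salary": salary.get_text(strip=True) if salary else None,
        })
    return records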


[2] Data Transformation

Data ingestion pipelines were created to standardize the data inputs from the various sources.


Job Title

Job titles vary widely from resume to resume, so we created a heuristic to process each job title into a standard format. To group different levels of the same job, we created a list of “experience qualifiers” for each job and removed those words from the job title. Experience qualifiers were manually determined and include words like Lead, Senior, Junior, I, and Level 2.

Transformation: Senior Software Engineer II → ['software engineer', ['senior', '2']]

Link to Code
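
A minimal sketch of the qualifier-stripping heuristic is shown below. The qualifier and level sets here are small illustrative subsets of the manually determined lists; the full lists and the exact heuristic live in the linked code.

import re

# Illustrative subset of the manually determined experience qualifiers.
QUALIFIERS = {"lead", "senior", "junior", "sr", "jr", "principal", "staff"}
# Level markers (roman or arabic numerals) normalized to digits.
LEVELS = {"i": "1", "ii": "2", "iii": "3", "iv": "4",
          "1": "1", "2": "2", "3": "3", "4": "4"}

def normalize_job_title(raw_title: str) -> list:
    """Strip experience qualifiers from a raw title and return both parts."""
    tokens = re.sub(r"[^\w\s]", " ", raw_title.lower()).split()
    base, qualifiers = [], []
    for token in tokens:
        if token in QUALIFIERS:
            qualifiers.append(token)
        elif token in LEVELS:
            qualifiers.append(LEVELS[token])
        else:
            base.append(token)
    return [" ".join(base), qualifiers]

print(normalize_job_title("Senior Software Engineer II"))
# -> ['software engineer', ['senior', '2']]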


Job Description

Job descriptions were preprocessed and used to build various classification models. Preprocessing included lemmatization, stop-word removal, special-character removal, contraction expansion, and tokenization.

Transformation: • Balance daily workflow of prospecting, responding to leads, scheduling and conducting appointments and maintain phone-based client relationships • Conduct dynamic sales presentations for potential customers • Use product knowledge and sales skills to recommend suggested products or services to fulfill customers' needs. → ['balance daily workflow prospecting responding lead scheduling conducting appointment maintain phonebased client relationship conduct dynamic sale presentation potential customer use product knowledge sale skill recommend suggest product service fulfill customer need']

Link to Code
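
A minimal preprocessing sketch using NLTK is shown below. The contraction map is a small illustrative subset, and simple whitespace splitting stands in for tokenization; the linked code may use different tools for these steps.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

# Small illustrative contraction map; the real pipeline covers many more.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess_description(text: str) -> str:
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():   # expand contractions
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop special characters
    tokens = text.split()                                 # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]    # lemmatize
    return " ".join(tokens)

print(preprocess_description(
    "Conduct dynamic sales presentations for potential customers."))
# -> conduct dynamic sale presentation potential customer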


Education Title

We created rules-based logic to separate the degree type from the degree subject and processed all education information.

Transformation: B.S. in Business Administration and Management → [['business administration and management'], ['bachelors']]

Link to Code
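
A minimal sketch of the rules-based split is shown below. The degree patterns listed are a small illustrative subset; the actual rules cover many more abbreviations and phrasings.

import re

# Illustrative degree-type rules.
DEGREE_RULES = [
    (r"\b(b\.?s\.?|b\.?a\.?|bachelor(?:'s)?)\b", "bachelors"),
    (r"\b(m\.?s\.?|m\.?a\.?|mba|master(?:'s)?)\b", "masters"),
    (r"\b(ph\.?d\.?|doctorate)\b", "doctorate"),
    (r"\b(a\.?s\.?|a\.?a\.?|associate(?:'s)?)\b", "associates"),
]

def parse_education_title(raw: str) -> list:
    """Split an education title into [[subject], [degree type]]."""
    text = raw.lower()
    degree_types = []
    for pattern, label in DEGREE_RULES:
        if re.search(pattern, text):
            degree_types.append(label)
            text = re.sub(pattern, " ", text)
    # Whatever remains after removing the degree type is treated as the subject.
    subject = re.sub(r"^\s*(in|of)\s+", "", text.strip(" .,-"))
    return [[subject.strip()], degree_types]

print(parse_education_title("B.S. in Business Administration and Management"))
# -> [['business administration and management'], ['bachelors']]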


Salary

Salary information was cleaned and standardized.

Transformation: 50000/YR → 50000

Link to Code
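
A minimal cleaning sketch is shown below. The hourly-to-annual conversion factor (40 hours × 52 weeks) is an assumption made for illustration and is not necessarily the rule used in the linked code.

import re

HOURS_PER_YEAR = 40 * 52  # assumed full-time schedule, for illustration only

def standardize_salary(raw: str):
    """Normalize a raw salary string such as '50000/YR' to an annual number."""
    match = re.search(r"[\d,]+(?:\.\d+)?", raw.replace("$", ""))
    if match is None:
        return None
    amount = float(match.group(0).replace(",", ""))
    # Treat hourly figures as annual using the assumed schedule above.
    if re.search(r"hr|hour", raw, re.IGNORECASE):
        amount *= HOURS_PER_YEAR
    return amount

print(standardize_salary("50000/YR"))   # -> 50000.0
print(standardize_salary("$25.00/hr"))  # -> 52000.0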


Location

Locations were parsed into city and state.

Transformation: AKRON, OH → [['akron'], ['Ohio']]

Link to Code
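
A minimal parsing sketch is shown below. The state-abbreviation map is truncated to a few entries for illustration; the full mapping covers all states and territories.

# Illustrative subset of the state-abbreviation map.
STATE_NAMES = {"OH": "Ohio", "PA": "Pennsylvania", "NY": "New York"}

def parse_location(raw: str) -> list:
    """Split a 'CITY, ST' string into [[city], [state name]]."""
    city, _, state = raw.partition(",")
    abbrev = state.strip().upper()
    return [[city.strip().lower()], [STATE_NAMES.get(abbrev, abbrev)]]

print(parse_location("AKRON, OH"))
# -> [['akron'], ['Ohio']]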


[3] Next Steps

  1. Collect more data.

  2. Identify new methods to group job titles.

  3. Identify new methods to separate degree type from degree subject.

  4. Categorize degrees as STEM or non-STEM.
