Data Pipeline
By iCareer Climber Team on 6 July 2018
[1] Data Collection
Using web scraping tools built with the Python libraries Selenium and Beautiful Soup, our team collected over 600k unique resumes and 1.1 million salary records, representing over 5,500 cities across 30 years.
[2] Data Transformation
Data ingestion pipelines were created to standardize the data inputs from the various sources.
Job Title
Job titles vary widely from resume to resume, so we created a heuristic to process each job title into a standard format. To group different levels of the same job, we created a list of “experience qualifiers” for each job and removed those words from the job title. Experience qualifiers were manually determined and include words like Lead, Senior, Junior, I, and Level 2.
Transformation: Senior Software Engineer II
→ ['software engineer',['senior','2']]
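A minimal sketch of how such a heuristic might look. The qualifier table and the function name normalize_job_title are hypothetical; the actual list of experience qualifiers was built by hand and is not reproduced here.

```python
import re

# Hypothetical qualifier table (an assumption, not the team's actual list):
# maps a qualifier token to the normalized form kept alongside the title.
QUALIFIERS = {
    "lead": "lead",
    "senior": "senior", "sr": "senior",
    "junior": "junior", "jr": "junior",
    "i": "1", "ii": "2", "iii": "3",
    "1": "1", "2": "2", "3": "3",
}


def normalize_job_title(title):
    """Split a raw job title into [base_title, [experience qualifiers]]."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    base, found = [], []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        # Drop the filler word in phrases like "Level 2"; the number is kept.
        if tok == "level" and nxt.isdigit():
            continue
        if tok in QUALIFIERS:
            found.append(QUALIFIERS[tok])
        else:
            base.append(tok)
    return [" ".join(base), found]


print(normalize_job_title("Senior Software Engineer II"))
# ['software engineer', ['senior', '2']]
```

Keeping the removed qualifiers next to the base title, rather than discarding them, preserves the seniority signal for later use.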
Job Description
Job descriptions were preprocessed and then used to build several classification models. Preprocessing included lemmatization, stop-word removal, special-character removal, contraction expansion, and tokenization.
Transformation: .• Balance daily workflow of prospecting, responding to leads, scheduling and conducting appointments and maintain phone-based client relationships .• Conduct dynamic sales presentations for potential customers .• Use product knowledge and sales skills to recommend suggested products or services to fulfill customers' needs.
→ [' balance daily workflow prospecting responding lead scheduling conducting appointment maintain phonebased client relationship conduct dynamic sale presentation potential customer use product knowledge sale skill recommend suggest product service fulfill customer need ']
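The preprocessing steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the pipeline we ran: the contraction, stop-word, and lemma tables below are tiny hypothetical stand-ins, and a real pipeline would typically rely on an NLP library for lemmatization and stop words.

```python
import re

# Tiny hypothetical lookup tables, for illustration only.
CONTRACTIONS = {"can't": "cannot", "don't": "do not", "it's": "it is"}
STOP_WORDS = {"a", "an", "and", "for", "of", "or", "the", "to"}
LEMMAS = {
    "leads": "lead",
    "appointments": "appointment",
    "customers": "customer",
    "presentations": "presentation",
    "sales": "sale",
}


def preprocess(text):
    """Return a list of cleaned, lemmatized tokens."""
    text = text.lower()
    # Expand contractions before punctuation is stripped.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove special characters (anything that is not a letter or whitespace).
    text = re.sub(r"[^a-z\s]", "", text)
    # Tokenize on whitespace, drop stop words, then apply the lemma lookup.
    tokens = text.split()
    return [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]


print(preprocess("Conduct dynamic sales presentations for potential customers."))
# ['conduct', 'dynamic', 'sale', 'presentation', 'potential', 'customer']
```

Because special characters are deleted rather than replaced with spaces, hyphenated words collapse into one token, as with "phonebased" in the example above.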
Education Title
We created rule-based logic to separate the degree type from the degree subject, then applied it to all education information.
Transformation: B.S. in Business Administration and Management
→ [['business administration and management'],['bachelors']]
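One way such rules might be written is shown below. The degree lookup table and the assumption that type and subject are joined by " in " are illustrative only; real education strings vary far more than this sketch handles.

```python
import re

# Hypothetical mapping from cleaned degree prefixes to a standard degree type.
DEGREE_TYPES = {
    "bs": "bachelors", "ba": "bachelors",
    "bachelor": "bachelors", "bachelors": "bachelors",
    "ms": "masters", "ma": "masters", "mba": "masters",
    "master": "masters", "masters": "masters",
    "phd": "doctorate",
}


def parse_education(title):
    """Split an education title into [[subject], [degree type]]."""
    cleaned = title.lower().strip()
    # Assumed rule: the degree type precedes " in ", the subject follows it.
    head, sep, tail = cleaned.partition(" in ")
    key = re.sub(r"[^a-z ]", "", head).strip()  # "b.s." -> "bs"
    degree = DEGREE_TYPES.get(key)
    if sep and degree:
        return [[tail.strip()], [degree]]
    # No recognizable degree type: keep the whole string as the subject.
    return [[cleaned], []]


print(parse_education("B.S. in Business Administration and Management"))
# [['business administration and management'], ['bachelors']]
```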
Salary
Salary information was cleaned and standardized to a plain numeric value.
Transformation: 50000/YR
→ 50000
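A hedged sketch of this cleaning step follows. The function name parse_salary and the regular expression are assumptions; since only the yearly format is shown here, the sketch returns None for any other pay period instead of guessing at a conversion.

```python
import re

# Assumed input shape: optional "$", an amount, "/", then a period code.
SALARY_RE = re.compile(r"^\s*\$?\s*(\d[\d,]*(?:\.\d+)?)\s*/\s*([A-Za-z]+)\s*$")


def parse_salary(raw):
    """Return the yearly salary as an int, or None if it cannot be parsed."""
    match = SALARY_RE.match(raw)
    if match is None:
        return None
    amount = float(match.group(1).replace(",", ""))
    period = match.group(2).upper()
    if period != "YR":
        # Other pay periods are left unhandled in this sketch.
        return None
    return int(round(amount))


print(parse_salary("50000/YR"))
# 50000
```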
Location
Locations were parsed into city and state.
Transformation: AKRON, OH
→ [['akron'],['Ohio']]
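The location step might look like the sketch below. The state table is a deliberately partial, hypothetical subset; a full version would cover every state abbreviation.

```python
# Partial, illustrative lookup table (an assumption, not the full mapping).
STATE_NAMES = {
    "CA": "California",
    "NY": "New York",
    "OH": "Ohio",
    "TX": "Texas",
}


def parse_location(raw):
    """Split "CITY, ST" into [[city], [state name]]; None if no comma."""
    city, sep, state = raw.partition(",")
    if not sep:
        return None
    code = state.strip().upper()
    # Fall back to the raw code when the abbreviation is not in the table.
    return [[city.strip().lower()], [STATE_NAMES.get(code, code)]]


print(parse_location("AKRON, OH"))
# [['akron'], ['Ohio']]
```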
[3] Next Steps
Collect more data.
Identify new methods to group job titles.
Identify new methods to parse degree from subject.
Categorize degrees as STEM or non-STEM.