Data Pipeline



[1] Data Collection

Using the Python Selenium and Beautiful Soup libraries to build web scraping tools, our team collected over 600k unique resumes and 1.1 million salary records from the web, representing more than 5,500 cities across 30 years.

Link to Code
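
The collection code follows the usual Selenium-plus-Beautiful-Soup pattern sketched below. This is only a minimal illustration: the URL argument and the CSS class names ("resume-card", "job-title", "salary") are hypothetical placeholders, not the actual sources or selectors used.

# Minimal scraping sketch: Selenium renders the page, Beautiful Soup parses it.
# The URL and CSS selectors below are hypothetical placeholders.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def scrape_listing_page(url: str) -> list[dict]:
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        html = driver.page_source
    finally:
        driver.quit()

    soup = BeautifulSoup(html, "html.parser")
    records = []
    # "resume-card", "job-title", and "salary" are illustrative class names.
    for card in soup.select("div.resume-card"):
        title = card.select_one(".job-title")
        salary = card.select_one(".salary")
        records.append({
            "job_title": title.get_text(strip=True) if title else None,
            "salary": salary.get_text(strip=True) if salary else None,
        })
    return records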


[2] Data Transformation

Data ingestion pipelines were created to standardize the data inputs from the various sources.


Job Title

Job titles vary widely from resume to resume, so we created a heuristic to process each job title into a standard format. To group different levels of the same job, we created a list of “experience qualifiers” for each job and removed those words from the job title. Experience qualifiers were manually determined and include words like Lead, Senior, Junior, I, and Level 2.

Transformation: Senior Software Engineer II → ['software engineer', ['senior', '2']]

Link to Code
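
A minimal sketch of the qualifier-stripping heuristic is shown below. The qualifier and level sets here are small illustrative subsets of the manually determined lists; the full lists and the exact heuristic live in the linked code.

import re

# Illustrative subset of the manually determined experience qualifiers.
QUALIFIERS = {"lead", "senior", "junior", "sr", "jr", "principal", "staff"}
# Level markers (roman or arabic numerals) normalized to digits.
LEVELS = {"i": "1", "ii": "2", "iii": "3", "iv": "4",
          "1": "1", "2": "2", "3": "3", "4": "4"}

def normalize_job_title(raw_title: str) -> list:
    """Strip experience qualifiers from a raw title and return both parts."""
    tokens = re.sub(r"[^\w\s]", " ", raw_title.lower()).split()
    base, qualifiers = [], []
    for token in tokens:
        if token in QUALIFIERS:
            qualifiers.append(token)
        elif token in LEVELS:
            qualifiers.append(LEVELS[token])
        else:
            base.append(token)
    return [" ".join(base), qualifiers]

print(normalize_job_title("Senior Software Engineer II"))
# -> ['software engineer', ['senior', '2']]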


Job Description

Job descriptions were preprocessed and used to build various classification models. Preprocessing included lemmatization, stop-word removal, special-character removal, contraction expansion, and tokenization.

Transformation: • Balance daily workflow of prospecting, responding to leads, scheduling and conducting appointments and maintain phone-based client relationships • Conduct dynamic sales presentations for potential customers • Use product knowledge and sales skills to recommend suggested products or services to fulfill customers' needs. → ['balance daily workflow prospecting responding lead scheduling conducting appointment maintain phonebased client relationship conduct dynamic sale presentation potential customer use product knowledge sale skill recommend suggest product service fulfill customer need']

Link to Code
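
A minimal preprocessing sketch using NLTK is shown below. The contraction map is a small illustrative subset, and simple whitespace splitting stands in for tokenization; the linked code may use different tools for these steps.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

# Small illustrative contraction map; the real pipeline covers many more.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess_description(text: str) -> str:
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():   # expand contractions
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop special characters
    tokens = text.split()                                 # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]    # lemmatize
    return " ".join(tokens)

print(preprocess_description(
    "Conduct dynamic sales presentations for potential customers."))
# -> conduct dynamic sale presentation potential customer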


Education Title

We created rules-based logic to separate the degree type from the degree subject and processed all education information.

Transformation: B.S. in Business Administration and Management → [['business administration and management'], ['bachelors']]

Link to Code
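
A minimal sketch of the rules-based split is shown below. The degree patterns listed are a small illustrative subset; the actual rules cover many more abbreviations and phrasings.

import re

# Illustrative degree-type rules.
DEGREE_RULES = [
    (r"\b(b\.?s\.?|b\.?a\.?|bachelor(?:'s)?)\b", "bachelors"),
    (r"\b(m\.?s\.?|m\.?a\.?|mba|master(?:'s)?)\b", "masters"),
    (r"\b(ph\.?d\.?|doctorate)\b", "doctorate"),
    (r"\b(a\.?s\.?|a\.?a\.?|associate(?:'s)?)\b", "associates"),
]

def parse_education_title(raw: str) -> list:
    """Split an education title into [[subject], [degree type]]."""
    text = raw.lower()
    degree_types = []
    for pattern, label in DEGREE_RULES:
        if re.search(pattern, text):
            degree_types.append(label)
            text = re.sub(pattern, " ", text)
    # Whatever remains after removing the degree type is treated as the subject.
    subject = re.sub(r"^\s*(in|of)\s+", "", text.strip(" .,-"))
    return [[subject.strip()], degree_types]

print(parse_education_title("B.S. in Business Administration and Management"))
# -> [['business administration and management'], ['bachelors']]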


Salary

Salary information was cleaned and standardized.

Transformation: 50000/YR → 50000

Link to Code
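
A minimal cleaning sketch is shown below. The hourly-to-annual conversion factor (40 hours × 52 weeks) is an assumption made for illustration and is not necessarily the rule used in the linked code.

import re

HOURS_PER_YEAR = 40 * 52  # assumed full-time schedule, for illustration only

def standardize_salary(raw: str):
    """Normalize a raw salary string such as '50000/YR' to an annual number."""
    match = re.search(r"[\d,]+(?:\.\d+)?", raw.replace("$", ""))
    if match is None:
        return None
    amount = float(match.group(0).replace(",", ""))
    # Treat hourly figures as annual using the assumed schedule above.
    if re.search(r"hr|hour", raw, re.IGNORECASE):
        amount *= HOURS_PER_YEAR
    return amount

print(standardize_salary("50000/YR"))   # -> 50000.0
print(standardize_salary("$25.00/hr"))  # -> 52000.0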


Location

Locations were parsed into city and state.

Transformation: AKRON, OH → [['akron'], ['Ohio']]

Link to Code
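
A minimal parsing sketch is shown below. The state-abbreviation map is truncated to a few entries for illustration; the full mapping covers all states and territories.

# Illustrative subset of the state-abbreviation map.
STATE_NAMES = {"OH": "Ohio", "PA": "Pennsylvania", "NY": "New York"}

def parse_location(raw: str) -> list:
    """Split a 'CITY, ST' string into [[city], [state name]]."""
    city, _, state = raw.partition(",")
    abbrev = state.strip().upper()
    return [[city.strip().lower()], [STATE_NAMES.get(abbrev, abbrev)]]

print(parse_location("AKRON, OH"))
# -> [['akron'], ['Ohio']]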


[3] Next Steps

  1. Collect more data.

  2. Identify new methods to group job titles.

  3. Identify new methods to separate degree type from degree subject.

  4. Categorize degrees as STEM or non-STEM.
