Structure data for import
July 31, 2014
Real world data can come in a tremendous variety of shapes and sizes. Protobi can currently import data in two standard forms:
- Comma-separated values (CSV)
- SPSS (.sav)
Depending on the form of your data, it might take a bit of prep to shape the data to streamline your work in Protobi. In the professional plans, our expert analysts can typically take care of this for you. If you'd like to do it on your own, here's what you need to know.
General structure
Protobi can view "rectangular flat files", which are tables with one column per field, one row per record. Think of a single spreadsheet with column headers, and no random tables or calculations off to the side.Every column has to have a unique name, which Protobi uses as the "key" to look up data values internally, and as the default title for elements.
Transformations -- joining tables, deleting rows, defining new columns -- generally you'll do that externally in other software like Excel, R, SAS or SPSS.
Comma separated values
Protobi accepts CSV data in IETF RFC-4180 standard form, a widely accepted standard. In a nutshell, this means:- Rows end with line breaks
- Values are separated by commas
- Values containing commas, line breaks or double quotes must be surrounded by double quotes
- Double quotes are escaped as two double quotes (eg.
""
) - The first row has column names (separated by commas)
E.g.
id,name
1,Alice
2,Bob
3,"John ""Jack"" Smith, Jr."
Protobi will interpret the column header as the variable name or "key".
In Excel, you can select "Save as..." from the file menu, and choose "Comma Separated Values (.csv)".
CSV files are great in that nearly any software package can export tables in CSV format. However, they are a challenge because they don't easily contain other information useful for analysis, like variable labels and value formats.
SPSS Data file
SPSS files are great because they include not only the data, but also includes metadata such as variable labels and value formats. The metadata is useful to translate cryptic raw data values into human-readable formats. Most survey vendors and software packages can export data as SPSS data files with embedded formats.For instance in a typical survey we may get a field named q1 and may have the the label "Q1. What is your age?" That question may have a set of radio buttons, corresponding to age ranges, which may be coded in the data as:
1
= "Under 18"2
= "18 to 34"3
= "35 to 50"4
= "51 to 65"5
= "Over 65"
SPSS also allows you to specify a field as the weight field, and Protobi will use that to count each record. SPSS also allows you to specify multiple-response sets, which Protobi will recognize and intelligently condense.
Complex structure
Sometimes questions in a survey are grouped intelligently. Such as a Q13. Pets? Check all that apply:
a. Dog
b Cat
c. Other
Such as file may be encoded with a field for each option:
"q13a: Dog","q13b: Cat","q13c
1,0,0
0,0,0
0,1,1
...
Neither SPSS nor CSV, (or any other major standard) seems to encode this structure well.
In the SPSS data file, survey engines often export the data with value labels that encode the metadata, as:
Variable | Label | Format |
---|---|---|
q13a | "Q13a. Dog: Pets? Check all that apply" | 1="Yes", 2="No" |
q13b | "Q13b. Cat: Pets? Check all that apply" | 1="Yes", 2="No" |
q13c | "Q13c. Other: Pets? Check all that apply" | 1="Yes", 2="No" |
Protobi can recognize this pattern, and extract the common text into the parent group using the Extract Common Text feature.