From survey participants to agents
In the social sciences, truly relational datasets are relatively scarce, while individual-centered survey data are far more common. Pop2net provides methods for generating networks from such survey data by first creating actors from survey participants and then connecting them based on their empirical attributes. This approach is particularly effective with datasets that include quasi-network information—for example, household membership data found in surveys like the German Socio-Economic Panel (SOEP).
The following basic example illustrates how Pop2net can be used to construct networks from survey data.
Let’s begin by creating an artificial example dataset:
[1]:
import pop2net as p2n
from pop2net.data_fakers.soep import soep_faker
df_soep = soep_faker.soep(size=100, seed=1)
df_soep.head(20)
[1]:
| | age | gender | work_hours_day | nace2_division | hid | pid |
---|---|---|---|---|---|---|
0 | 20.0 | male | 0.000000 | -2 | 2201 | 583 |
1 | 44.0 | female | 0.000000 | -2 | 2201 | 868 |
2 | 32.0 | female | 0.000000 | -2 | 1033 | 262 |
3 | 98.0 | female | 0.000000 | -2 | 1033 | 121 |
4 | 95.0 | female | 12.613089 | 86 | 8117 | 780 |
5 | 38.0 | male | 0.000000 | -2 | 8117 | 461 |
6 | 52.0 | male | 6.528070 | 64 | 8117 | 484 |
7 | 80.0 | female | 0.000000 | -2 | 6219 | 808 |
8 | 44.0 | male | 1.889185 | 88 | 6219 | 215 |
9 | 81.0 | female | 0.000000 | -2 | 6219 | 97 |
10 | 4.0 | male | 0.000000 | -2 | 6219 | 500 |
11 | 48.0 | male | 0.000000 | -2 | 464 | 915 |
12 | 20.0 | female | 5.524712 | 86 | 464 | 856 |
13 | 65.0 | female | 0.000000 | -2 | 464 | 400 |
14 | 66.0 | male | 0.000000 | -2 | 464 | 444 |
15 | 85.0 | male | 0.000000 | -2 | 464 | 623 |
16 | 32.0 | female | 0.000000 | -2 | 34 | 713 |
17 | 49.0 | female | 7.113968 | 84 | 34 | 457 |
18 | 42.0 | male | 0.000000 | -2 | 34 | 273 |
19 | 97.0 | female | 0.000000 | -2 | 3748 | 606 |
[2]:
env = p2n.Environment()  # holds the actors and locations of the model
creator = p2n.Creator(env)  # creates actors and locations inside env
Using the creator, we can draw a sample from the dataset above. While doing so, we can specify a column to serve as the sampling unit. In this example, we use the household identifier “hid” as the sample_level, so that all members of a household are drawn together. It is also possible to oversample the data, that is, to generate more sampled actors than there are original rows. Here, we set the number of samples to 500.
[3]:
df_sample = creator.draw_sample(
    df=df_soep,  # set the dataset to be sampled from
    n=500,  # set the number of lines to be sampled
    sample_level="hid",  # set the unit of sampling
    sample_weight=None,  # if needed, set a column as the sampling weight
)
df_sample.head(20)
[3]:
| | age | gender | work_hours_day | nace2_division | hid | pid | hid_original |
---|---|---|---|---|---|---|---|
0 | 95.0 | female | 12.613089 | 86 | 1 | 780 | 8117 |
1 | 38.0 | male | 0.000000 | -2 | 1 | 461 | 8117 |
2 | 52.0 | male | 6.528070 | 64 | 1 | 484 | 8117 |
3 | 36.0 | female | 0.000000 | -2 | 2 | 562 | 6014 |
4 | 65.0 | female | 9.102887 | 99 | 2 | 720 | 6014 |
5 | 38.0 | other | 6.834517 | 86 | 2 | 795 | 6014 |
6 | 51.0 | male | 8.000000 | 69 | 2 | 691 | 6014 |
7 | 53.0 | female | 0.000000 | -2 | 2 | 756 | 6014 |
8 | 28.0 | female | 3.963627 | 84 | 2 | 384 | 6014 |
9 | 97.0 | female | 0.000000 | -2 | 3 | 606 | 3748 |
10 | 2.0 | female | 0.000000 | -2 | 3 | 968 | 3748 |
11 | 32.0 | male | 0.000000 | -2 | 4 | 450 | 1416 |
12 | 37.0 | female | 0.000000 | -2 | 4 | 680 | 1416 |
13 | 61.0 | male | 5.049022 | 64 | 4 | 521 | 1416 |
14 | 53.0 | male | 8.000000 | 99 | 5 | 945 | 9116 |
15 | 5.0 | male | 0.000000 | -2 | 5 | 658 | 9116 |
16 | 53.0 | male | 0.000000 | -2 | 6 | 924 | 1674 |
17 | 90.0 | male | 0.000000 | -2 | 6 | 326 | 1674 |
18 | 67.0 | male | 0.000000 | -2 | 7 | 520 | 6915 |
19 | 49.0 | female | 0.000000 | -2 | 7 | 850 | 6915 |
Note that draw_sample overwrites the column specified as the sample level (here, “hid”) with new, unique values. The original values are preserved in a separate column (here, “hid_original”). This ensures that the sample level can still be used to connect actors without introducing artifacts caused by duplicated samples: a household that is drawn twice becomes two separate households rather than one household of double size. If desired, this behavior can be disabled by setting the replace_sample_level_column argument to False.
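To make the effect of this relabelling concrete, here is a minimal, library-independent sketch of sampling whole households with replacement. It is not Pop2net's implementation: the helper `sample_households`, its stopping rule, and the toy rows are assumptions made purely for illustration. Each draw of a household receives a fresh identifier, while the original identifier is kept in `hid_original`.

```python
import random
from collections import defaultdict


def sample_households(rows, n, seed=None):
    """Draw whole households with replacement until at least n rows exist.

    Hypothetical helper for illustration only; not part of Pop2net.
    """
    rng = random.Random(seed)

    # Group the rows by their household identifier.
    households = defaultdict(list)
    for row in rows:
        households[row["hid"]].append(row)
    hids = sorted(households)

    sample = []
    new_hid = 0
    while len(sample) < n:
        original_hid = rng.choice(hids)
        new_hid += 1  # every draw gets a fresh, unique household id
        for row in households[original_hid]:
            sample.append({**row, "hid": new_hid, "hid_original": original_hid})
    return sample


# Toy data: two households with two members each.
rows = [
    {"pid": 583, "hid": 2201},
    {"pid": 868, "hid": 2201},
    {"pid": 262, "hid": 1033},
    {"pid": 121, "hid": 1033},
]
sample = sample_households(rows, n=10, seed=1)
for row in sample:
    print(row)
```

Because only two original households exist, some of them are necessarily drawn more than once. Grouping by the new `hid` still yields households of the original size, whereas grouping by `hid_original` would merge the repeated draws into artificially large households.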
Now we create the actors from the sampled data:
[4]:
_ = creator.create_actors(df=df_sample)
Now let’s use two types of locations (households and work places) to connect the created actors based on their empirical attributes:
[5]:
class Household(p2n.LocationDesigner):
    def split(self, actor):
        """Build one household location for each hid."""
        return actor.hid


class Work(p2n.LocationDesigner):
    n_actors = 10  # set the number of actors per work place to 10

    def split(self, actor):
        """Build work places separated by industry."""
        return actor.nace2_division

    def filter(self, actor):
        """Connect only actors with valid industry codes."""
        return actor.nace2_division > 0

    def weight(self, actor):
        """Weight the connection between the actor and the location
        by the empirical working time."""
        return actor.work_hours_day
[6]:
_ = creator.create_locations(location_designers=[Household, Work])
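Conceptually, the methods of the Work designer answer three questions: who may join (filter), which actors belong together (split), and how strongly each actor is tied to the location (weight). The following library-independent sketch mimics this logic on plain dictionaries. It is a simplified illustration under stated assumptions, not Pop2net's internal algorithm: the helper `build_work_places` is hypothetical, and cutting each industry group into consecutive blocks of `n_actors` is only one possible way to limit the size of a work place.

```python
from collections import defaultdict


def build_work_places(actors, n_actors=10):
    """Group actors into work places by industry; illustration only."""
    # filter: keep only actors with a valid industry code
    employed = [a for a in actors if a["nace2_division"] > 0]

    # split: one group of actors per industry code
    by_industry = defaultdict(list)
    for actor in employed:
        by_industry[actor["nace2_division"]].append(actor)

    # n_actors: cut each industry group into work places of limited size
    work_places = []
    for division, members in sorted(by_industry.items()):
        for start in range(0, len(members), n_actors):
            chunk = members[start:start + n_actors]
            # weight: strength of the tie between actor and work place
            ties = {a["pid"]: a["work_hours_day"] for a in chunk}
            work_places.append({"nace2_division": division, "ties": ties})
    return work_places


# Toy actors; the values are made up for this example.
actors = [
    {"pid": 780, "nace2_division": 86, "work_hours_day": 12.6},
    {"pid": 461, "nace2_division": -2, "work_hours_day": 0.0},
    {"pid": 484, "nace2_division": 64, "work_hours_day": 6.5},
    {"pid": 856, "nace2_division": 86, "work_hours_day": 5.5},
]
work_places = build_work_places(actors, n_actors=10)
for place in work_places:
    print(place)
```

In this toy example the actor without a valid industry code is excluded, and the two actors in division 86 end up in the same work place, each tied to it by their daily working hours.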
[7]:
inspector = p2n.NetworkInspector(env)
inspector.plot_networks(location_color="label")