Technical Details
Specific Data Collected
We are interested in the unwanted traffic that scanners and attackers send to your closed ports. We are not interested in legitimate traffic between you and your users. As a result, we have specifically designed LightScope not to collect any "wanted" traffic, or any information that may identify you or your users. For example, if you are running a web server on port 80, LightScope will see that port 80 is open and will not collect any data going to or from that port. The University of Southern California Institutional Review Board reviewed our collection and storage methods and certified the study exempt (study UP-25-00124, "LightScope - Survey of unwanted traffic to large user populations"). Specifically, we collect the following fields from packets without modification; they cannot be used to identify participants in our study:
Time
Timestamp information for temporal analysis.
IPv4 Fields
These fields will not allow us to identify participants, and include:
- Version
- IHL (Internet Header Length)
- TOS (Type of Service)
- Len (Length)
- ID (Identification)
- Flags
- Frag (Fragment)
- TTL (Time to Live)
- Proto (Protocol)
- Chksum (Checksum)
- Src (Source)
- Options (including padding)
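As a rough illustration of what extracting these fields involves, the sketch below decodes a raw IPv4 header with Python's standard `struct` module. It is not LightScope's actual code: the function name `parse_ipv4_header` and the dictionary keys are hypothetical, chosen to mirror the list above. The destination address is deliberately left out of the result, since it is handled separately through anonymization.

```python
import socket
import struct


def parse_ipv4_header(packet: bytes) -> dict:
    """Decode the IPv4 header fields listed above (hypothetical helper)."""
    if len(packet) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes")
    (ver_ihl, tos, length, ident, flags_frag,
     ttl, proto, chksum, src, _dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    ihl = ver_ihl & 0x0F  # header length in 32-bit words
    if ihl < 5 or len(packet) < ihl * 4:
        raise ValueError("invalid or truncated IPv4 header")
    return {
        "version": ver_ihl >> 4,
        "ihl": ihl,
        "tos": tos,
        "len": length,
        "id": ident,
        "flags": flags_frag >> 13,     # top 3 bits
        "frag": flags_frag & 0x1FFF,   # low 13 bits (fragment offset)
        "ttl": ttl,
        "proto": proto,
        "chksum": chksum,
        "src": socket.inet_ntoa(src),
        "options": packet[20:ihl * 4],  # empty when ihl == 5
        # The destination (_dst) is intentionally not returned here.
    }
```

Calling `parse_ipv4_header` on the first bytes of a captured packet returns a plain dictionary that can be stored as-is.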
TCP Fields
These fields will not allow us to identify participants, and include:
- Sport (Source Port)
- Dport (Destination Port)
- Seq (Sequence Number)
- Ack (Acknowledgment Number)
- Dataofs (Data Offset)
- Reserved
- Flags
- Window
- Chksum (Checksum)
- Urgptr (Urgent Pointer)
- Options (including padding)
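The sketch below shows one way the TCP fields above could be decoded, combined with the rule stated earlier that traffic to open ports is not collected. It is an illustration under assumptions, not LightScope's implementation: `parse_tcp_header`, `should_record`, and the `open_ports` set are hypothetical names, and the reserved/flags split follows the RFC 9293 layout (4 reserved bits, 8 flag bits).

```python
import struct


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the TCP header fields listed above (hypothetical helper)."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    (sport, dport, seq, ack, off_flags,
     window, chksum, urgptr) = struct.unpack("!HHIIHHHH", segment[:20])
    dataofs = off_flags >> 12  # header length in 32-bit words
    if dataofs < 5 or len(segment) < dataofs * 4:
        raise ValueError("invalid or truncated TCP header")
    return {
        "sport": sport,
        "dport": dport,
        "seq": seq,
        "ack": ack,
        "dataofs": dataofs,
        "reserved": (off_flags >> 8) & 0x0F,
        "flags": off_flags & 0xFF,  # CWR, ECE, URG, ACK, PSH, RST, SYN, FIN
        "window": window,
        "chksum": chksum,
        "urgptr": urgptr,
        "options": segment[20:dataofs * 4],
    }


def should_record(fields: dict, open_ports: set) -> bool:
    """Keep only packets aimed at closed ports (assumed filtering rule)."""
    return fields["dport"] not in open_ports
```

With `open_ports = {80, 443}`, a SYN sent to port 23 would be recorded, while anything addressed to port 80 would be skipped.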
IPv6 Fields
These fields are required to analyze IPv6 traffic. They will not allow us to identify participants, and include:
- Version
- Traffic class
- Flow label
- Payload length
- Next header
- Hop Limit
- Source address
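For comparison with IPv4, here is a minimal sketch of reading the same fields from the fixed 40-byte IPv6 header. The function name `parse_ipv6_header` and the key names are hypothetical; the bit layout (4-bit version, 8-bit traffic class, 20-bit flow label) is the standard one. As with IPv4, the destination address is not returned.

```python
import ipaddress
import struct


def parse_ipv6_header(packet: bytes) -> dict:
    """Decode the IPv6 fixed-header fields listed above (hypothetical helper)."""
    if len(packet) < 40:
        raise ValueError("the IPv6 fixed header is 40 bytes")
    first_word, payload_len, next_header, hop_limit = struct.unpack(
        "!IHBB", packet[:8]
    )
    return {
        "version": first_word >> 28,
        "traffic_class": (first_word >> 20) & 0xFF,
        "flow_label": first_word & 0xFFFFF,
        "payload_length": payload_len,
        "next_header": next_header,
        "hop_limit": hop_limit,
        "src": str(ipaddress.IPv6Address(packet[8:24])),
        # Bytes 24..40 hold the destination, which is not returned here.
    }
```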
UDP Fields
These fields are required to analyze UDP traffic. They will not allow us to identify participants, and include:
- Source Port
- Destination Port
- Length
- Checksum
ICMP Messages
Internet Control Message Protocol (ICMP) messages can indicate which UDP ports are closed, and can themselves be another type of unwanted traffic. These messages will not allow us to identify participants.
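To make the link between ICMP and closed UDP ports concrete: a host that receives a UDP datagram on a closed port typically answers with an ICMP "destination unreachable" message (type 3) carrying the "port unreachable" code (code 3), and that message quotes the start of the offending datagram. The sketch below, with the hypothetical helper name `closed_udp_port`, reads the quoted UDP destination port from such a message. It covers ICMP over IPv4 only and is not taken from LightScope itself.

```python
import struct

ICMP_DEST_UNREACHABLE = 3
ICMP_PORT_UNREACHABLE = 3
IPPROTO_UDP = 17


def closed_udp_port(icmp: bytes):
    """Return the UDP port reported as closed, or None if not applicable."""
    if len(icmp) < 8:
        return None
    if (icmp[0], icmp[1]) != (ICMP_DEST_UNREACHABLE, ICMP_PORT_UNREACHABLE):
        return None
    inner = icmp[8:]  # quoted original IPv4 header plus start of its payload
    if len(inner) < 20:
        return None
    ihl_bytes = (inner[0] & 0x0F) * 4
    if ihl_bytes < 20 or inner[9] != IPPROTO_UDP:
        return None
    if len(inner) < ihl_bytes + 4:
        return None
    _sport, dport = struct.unpack("!HH", inner[ihl_bytes:ihl_bytes + 4])
    return dport
```

Any other ICMP message, such as an echo request, yields `None`.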
ARP Messages
Address Resolution Protocol (ARP) messages allow us to infer which machines are on the local network and which are remote. These messages will not allow us to identify participants.
Anonymization and Privacy
IP Destination Field Anonymization
We also collect the IP destination field from the packets, but only in anonymized form, because the unmodified value could be used to identify participants in our study. We anonymize by randomizing the IP addresses consistently: each IP address maps one-to-one to an anonymized value, and we cannot reverse this mapping.
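This page does not specify the anonymization algorithm, so the following is only a sketch of one common way to obtain a consistent mapping that is hard to reverse: keyed hashing with HMAC-SHA256. Everything here is an assumption for illustration, including the names `make_anonymizer` and `anonymize` and the idea that the key is generated locally and never shared. With a 256-bit output, two different addresses map to the same value only with negligible probability, so the mapping is one-to-one in practice; without the key, the original address cannot be recovered from the output.

```python
import hashlib
import hmac
import ipaddress
import secrets


def make_anonymizer(key: bytes):
    """Build a consistent IP pseudonymizer bound to a secret key (assumed design)."""
    def anonymize(ip: str) -> str:
        packed = ipaddress.ip_address(ip).packed  # 4 bytes (IPv4) or 16 (IPv6)
        return hmac.new(key, packed, hashlib.sha256).hexdigest()
    return anonymize


# Assumption: the key is random, stays on the local machine, and is never uploaded.
key = secrets.token_bytes(32)
anonymize = make_anonymizer(key)
```

The same address always yields the same pseudonym under one key, so traffic to one destination can still be grouped, while a different key yields unrelated pseudonyms.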
Network Type and Country Inference
We infer the Network Type and Country of the participants from their packets. This information cannot be used to identify participants in our study.
System Information
Machine Information
We collect information about the user's machine, which helps us determine the most likely use of the machines enrolled in our study. For example, we expect a web server and a laptop to have significantly different traffic profiles. We should also be able to deduce whether the user's machines are subject to network address translation (NAT), which should also significantly affect the observed traffic. These fields will not allow us to identify participants, and include:
- System Info (Operating system)
- Release Info
- Version Info
- Machine Info
- Total memory
- Processor
- Architecture
- Ports open
- Network interfaces used
- Whether the internal IP is private
- Whether the external IP is private
- Time
- Number of TCP packets LightScope inspected
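Several of the fields above can be gathered with Python's standard library alone, as the sketch below suggests. It is illustrative rather than a copy of LightScope's collector: `machine_info` and its key names are hypothetical, and total memory, open ports, and network interfaces are omitted because reading them portably usually requires platform-specific calls or a third-party package. Only booleans about the addresses are kept, not the addresses themselves.

```python
import ipaddress
import platform
import time


def machine_info(internal_ip: str, external_ip: str) -> dict:
    """Collect coarse, non-identifying machine details (hypothetical helper)."""
    uname = platform.uname()
    return {
        "system": uname.system,          # e.g. operating system name
        "release": uname.release,
        "version": uname.version,
        "machine": uname.machine,
        "processor": uname.processor,
        "architecture": platform.architecture()[0],
        # Store only whether each address is private, never the address.
        "internal_ip_is_private": ipaddress.ip_address(internal_ip).is_private,
        "external_ip_is_private": ipaddress.ip_address(external_ip).is_private,
        "time": time.time(),
    }
```

A private internal address paired with a public external address is a typical sign that the machine sits behind NAT.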
Network Classification
Lastly, we infer ASN, ASN type, and city of the participants from their network traffic. This information helps us categorize the type of network we are monitoring, which we believe should have an impact on the amount and type of traffic we observe. This information cannot be used to identify participants in our study.

This material is based upon work supported by the U.S. National Science Foundation under Grant No. 2313998. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the U.S. National Science Foundation.